基于机器学习方法的文本内容重写识别技术

IF 0.2 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Radio Electronics Computer Science Control Pub Date : 2022-12-13 DOI:10.15588/1607-3274-2022-4-11

N. Kholodna, V. Vysotska

{"title":"基于机器学习方法的文本内容重写识别技术","authors":"N. Kholodna, V. Vysotska","doi":"10.15588/1607-3274-2022-4-11","DOIUrl":null,"url":null,"abstract":"Context. Paraphrased textual content or rewriting is one of the difficult problems of detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism using paraphrasing remain unnoticed. \nObjective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods through the use of Siamese neural network based on recurrent and Transformer type – RoBERTa to analyze the level of similarity of sentences of text content. \nMethod. For this study, the following semantic similarity metrics or indicators were chosen as features: Jacquard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, prediction of two ML models: Siamese neural network based on recurrent and Transformer type - RoBERTa. \nResults. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other. \nConclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.","PeriodicalId":43783,"journal":{"name":"Radio Electronics Computer Science Control","volume":"76 1","pages":""},"PeriodicalIF":0.2000,"publicationDate":"2022-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS\",\"authors\":\"N. Kholodna, V. Vysotska\",\"doi\":\"10.15588/1607-3274-2022-4-11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Context. Paraphrased textual content or rewriting is one of the difficult problems of detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism using paraphrasing remain unnoticed. \\nObjective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods through the use of Siamese neural network based on recurrent and Transformer type – RoBERTa to analyze the level of similarity of sentences of text content. \\nMethod. For this study, the following semantic similarity metrics or indicators were chosen as features: Jacquard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, prediction of two ML models: Siamese neural network based on recurrent and Transformer type - RoBERTa. \\nResults. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other. \\nConclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.\",\"PeriodicalId\":43783,\"journal\":{\"name\":\"Radio Electronics Computer Science Control\",\"volume\":\"76 1\",\"pages\":\"\"},\"PeriodicalIF\":0.2000,\"publicationDate\":\"2022-12-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radio Electronics Computer Science Control\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.15588/1607-3274-2022-4-11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radio Electronics Computer Science Control","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.15588/1607-3274-2022-4-11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

上下文。文章内容的意译或改写是学术剽窃检测的难点之一。大多数抄袭检测系统的设计目的是检测常用词、语言单位序列和微小的变化，但无法检测重大的语义和结构变化。因此，大多数使用释义的抄袭案例都没有被注意到。本研究的目的是开发一种基于分类模型和机器学习方法的文本释义检测技术，通过使用基于recurrent和Transformer类型的Siamese神经网络- RoBERTa来分析文本内容句子的相似程度。方法。本研究选择以下语义相似度度量或指标作为特征:共享n -gram的Jacquard系数、句子向量表示之间的余弦距离、Word Mover’s distance、根据WordNet字典的距离、两种ML模型的预测:基于recurrent和Transformer类型的Siamese神经网络- RoBERTa。结果。提出了一种基于分类模型和机器学习方法的智能文本释义检测系统。开发的系统采用了模型叠加和特征工程的原理。附加特征表示句子的语义关联或公共n -gram的规范化数量。一个额外的微调RoBERTa神经网络(具有额外的全连接层)对不是相互转述的句子对不那么敏感。这种模型的特殊性可能会导致对剽窃的错误指控或用户生成内容的错误关联。额外的特征增加了整体分类的准确性和模型对不是彼此转述的句子对的敏感性。结论。所创建的模型在PAWS测试数据上显示出优异的分类结果:准确率- 93%，召回率- 92%，F1score - 92%，准确率- 92%。研究结果表明，transformer类型的神经网络可以成功地应用于检测一对文本中的释义，并且具有相当高的准确性，而无需额外的特征生成。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS

Context. Paraphrased textual content or rewriting is one of the difficult problems of detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but are unable to detect significant semantic and structural changes. Therefore, most cases of plagiarism using paraphrasing remain unnoticed. Objective of the study is to develop a technology for detecting paraphrasing in text based on a classification model and machine learning methods through the use of Siamese neural network based on recurrent and Transformer type – RoBERTa to analyze the level of similarity of sentences of text content. Method. For this study, the following semantic similarity metrics or indicators were chosen as features: Jacquard coefficient for shared N-grams, cosine distance between vector representations of sentences, Word Mover’s Distance, distances according to WordNet dictionaries, prediction of two ML models: Siamese neural network based on recurrent and Transformer type - RoBERTa. Results. An intelligent system for detecting paraphrasing in text based on a classification model and machine learning methods has been developed. The developed system uses the principle of model stacking and feature engineering. Additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. An additional fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This specificity of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. Additional features increase both the overall classification accuracy and the model’s sensitivity to pairs of sentences that are not paraphrases of each other. Conclusions. The created model shows excellent classification results on PAWS test data: precision – 93%, recall – 92%, F1score – 92%, accuracy – 92%. The results of the study showed that Transformer-type NNs can be successfully applied to detect paraphrasing in a pair of texts with fairly high accuracy without the need for additional feature generation.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Radio Electronics Computer Science Control COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-

自引率

20.00%

发文量

审稿时长

12 weeks