Rewriting Identification Technology for Text Content Based on Machine Learning Methods

Radio Electronics Computer Science Control (IF 0.2, Q4, Computer Science, Hardware & Architecture). Pub Date: 2022-12-13. DOI: 10.15588/1607-3274-2022-4-11
N. Kholodna, V. Vysotska
Citations: 0

Abstract

REWRITING IDENTIFICATION TECHNOLOGY FOR TEXT CONTENT BASED ON MACHINE LEARNING METHODS
Context. Paraphrasing, or rewriting, of textual content is one of the hardest problems in detecting academic plagiarism. Most plagiarism detection systems are designed to detect common words, sequences of linguistic units, and minor changes, but cannot detect significant semantic and structural changes. As a result, most cases of plagiarism by paraphrasing go unnoticed.

Objective. The study aims to develop a technology for detecting paraphrasing in text, based on a classification model and machine learning methods, that uses a Siamese neural network based on a recurrent architecture and a Transformer-type network, RoBERTa, to assess how similar the sentences of text content are.

Method. The following semantic similarity metrics were chosen as features: the Jaccard coefficient for shared N-grams, the cosine distance between vector representations of sentences, Word Mover's Distance, distances based on WordNet dictionaries, and the predictions of two ML models: a Siamese neural network based on a recurrent architecture and a Transformer-type network, RoBERTa.

Results. An intelligent system for detecting paraphrasing in text, based on a classification model and machine learning methods, has been developed. The system uses model stacking and feature engineering. The additional features indicate the semantic affiliation of the sentences or the normalized number of common N-grams. The fine-tuned RoBERTa neural network (with additional fully connected layers) is less sensitive to pairs of sentences that are not paraphrases of each other. This property of the model may contribute to incorrect accusations of plagiarism or incorrect association of user-generated content. The additional features increase both the overall classification accuracy and the model's sensitivity to pairs of sentences that are not paraphrases of each other.

Conclusions. The model shows strong classification results on the PAWS test data: precision 93%, recall 92%, F1 score 92%, accuracy 92%. The results show that Transformer-type neural networks can detect paraphrasing in a pair of texts with fairly high accuracy without additional feature generation.
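
Two of the features listed under Method, the Jaccard coefficient for shared N-grams and the cosine distance between sentence vectors, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the whitespace tokenizer, and the use of bag-of-words count vectors in place of the paper's (unspecified) sentence representations are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_ngram(a, b, n=2):
    """Jaccard coefficient |A & B| / |A | B| over word n-grams.

    Tokenization by lowercased whitespace split is an assumption.
    """
    grams_a = ngrams(a.lower().split(), n)
    grams_b = ngrams(b.lower().split(), n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def cosine_distance(a, b):
    """1 - cosine similarity of bag-of-words count vectors.

    Count vectors are a stand-in (assumption) for the sentence
    vectors mentioned in the abstract.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty sentence as maximally distant
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def pair_features(a, b):
    """Hypothetical feature vector for one sentence pair."""
    return [jaccard_ngram(a, b, 1), jaccard_ngram(a, b, 2), cosine_distance(a, b)]


if __name__ == "__main__":
    s1 = "flights from New York to Florida"
    s2 = "flights from Florida to New York"
    print(pair_features(s1, s2))
```

For this illustrative pair the two sentences contain exactly the same words, so the unigram Jaccard coefficient is 1.0 and the bag-of-words cosine distance is approximately 0, while the bigram Jaccard coefficient drops to 0.25 because the word order differs. Pairs like this show why overlap-based features alone can mislabel non-paraphrases, and why the abstract describes combining them with model predictions in a stacked classifier.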
Source journal
Radio Electronics Computer Science Control (Computer Science, Hardware & Architecture)
Self-citation rate: 20.00%
Articles published: 66
Review time: 12 weeks