通过高级自然语言处理和Smith-Waterman算法的E-BERT框架增强抄袭检测

IF 0.7 Q3 COMPUTER SCIENCE, THEORY & METHODS International Journal of Advanced Computer Science and Applications Pub Date : 2023-01-01 DOI:10.14569/ijacsa.2023.0140944

Franciskus Antonius, Myagmarsuren Orosoo, Aanandha Saravanan K, Indrajit Patra, Prema S

{"title":"通过高级自然语言处理和Smith-Waterman算法的E-BERT框架增强抄袭检测","authors":"Franciskus Antonius, Myagmarsuren Orosoo, Aanandha Saravanan K, Indrajit Patra, Prema S","doi":"10.14569/ijacsa.2023.0140944","DOIUrl":null,"url":null,"abstract":"Effective detection has been extremely difficult due to plagiarism's pervasiveness throughout a variety of fields, including academia and research. Increasingly complex plagiarism detection strategies are being used by people, making traditional approaches ineffective. The assessment of plagiarism involves a comprehensive examination encompassing syntactic, lexical, semantic, and structural facets. In contrast to traditional string-matching techniques, this investigation adopts a sophisticated Natural Language Processing (NLP) framework. The preprocessing phase entails a series of intricate steps ultimately refining the raw text data. The crux of this methodology lies in the integration of two distinct metrics within the Encoder Representation from Transformers (E-BERT) approach, effectively facilitating a granular exploration of textual similarity. Within the realm of NLP, the amalgamation of Deep and Shallow approaches serves as a lens to delve into the intricate nuances of the text, uncovering underlying layers of meaning. The discerning outcomes of this research unveil the remarkable proficiency of Deep NLP in promptly identifying substantial revisions. Integral to this innovation is the novel utilization of the Waterman algorithm and an English-Spanish dictionary, which contribute to the selection of optimal attributes. Comparative evaluations against alternative models employing distinct encoding methodologies, along with logistic regression as a classifier underscore the potency of the proposed implementation. The culmination of extensive experimentation substantiates the system's prowess, boasting an impressive 99.5% accuracy rate in extracting instances of plagiarism. This research serves as a pivotal advancement in the domain of plagiarism detection, ushering in effective and sophisticated methods to combat the growing spectre of unoriginal content.","PeriodicalId":13824,"journal":{"name":"International Journal of Advanced Computer Science and Applications","volume":"7 1","pages":"0"},"PeriodicalIF":0.7000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enhanced Plagiarism Detection Through Advanced Natural Language Processing and E-BERT Framework of the Smith-Waterman Algorithm\",\"authors\":\"Franciskus Antonius, Myagmarsuren Orosoo, Aanandha Saravanan K, Indrajit Patra, Prema S\",\"doi\":\"10.14569/ijacsa.2023.0140944\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Effective detection has been extremely difficult due to plagiarism's pervasiveness throughout a variety of fields, including academia and research. Increasingly complex plagiarism detection strategies are being used by people, making traditional approaches ineffective. The assessment of plagiarism involves a comprehensive examination encompassing syntactic, lexical, semantic, and structural facets. In contrast to traditional string-matching techniques, this investigation adopts a sophisticated Natural Language Processing (NLP) framework. The preprocessing phase entails a series of intricate steps ultimately refining the raw text data. The crux of this methodology lies in the integration of two distinct metrics within the Encoder Representation from Transformers (E-BERT) approach, effectively facilitating a granular exploration of textual similarity. Within the realm of NLP, the amalgamation of Deep and Shallow approaches serves as a lens to delve into the intricate nuances of the text, uncovering underlying layers of meaning. The discerning outcomes of this research unveil the remarkable proficiency of Deep NLP in promptly identifying substantial revisions. Integral to this innovation is the novel utilization of the Waterman algorithm and an English-Spanish dictionary, which contribute to the selection of optimal attributes. Comparative evaluations against alternative models employing distinct encoding methodologies, along with logistic regression as a classifier underscore the potency of the proposed implementation. The culmination of extensive experimentation substantiates the system's prowess, boasting an impressive 99.5% accuracy rate in extracting instances of plagiarism. This research serves as a pivotal advancement in the domain of plagiarism detection, ushering in effective and sophisticated methods to combat the growing spectre of unoriginal content.\",\"PeriodicalId\":13824,\"journal\":{\"name\":\"International Journal of Advanced Computer Science and Applications\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2023-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Advanced Computer Science and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.14569/ijacsa.2023.0140944\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Advanced Computer Science and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.14569/ijacsa.2023.0140944","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

由于剽窃在包括学术界和研究在内的各个领域的普遍存在，有效的检测非常困难。人们使用越来越复杂的抄袭检测策略，使得传统的方法失效。剽窃的评估涉及一个全面的检查，包括句法，词汇，语义和结构方面。与传统的字符串匹配技术相比，本研究采用了复杂的自然语言处理(NLP)框架。预处理阶段需要一系列复杂的步骤，最终精炼原始文本数据。该方法的关键在于在转换器的编码器表示(E-BERT)方法中集成两个不同的度量，有效地促进了对文本相似性的粒度探索。在NLP的领域内，深层和浅层方法的融合作为一个镜头，深入研究文本的复杂细微差别，揭示潜在的意义层次。这项研究的显著结果揭示了深度NLP在迅速识别实质性修订方面的卓越能力。这一创新的关键是对沃特曼算法和英语-西班牙语词典的新颖利用，这有助于选择最优属性。对采用不同编码方法的替代模型的比较评估，以及作为分类器的逻辑回归，强调了所提议实现的效力。大量实验的高潮证实了该系统的威力，在提取剽窃实例方面拥有令人印象深刻的99.5%的准确率。这项研究是剽窃检测领域的关键进步，为打击日益增长的非原创内容提供了有效而复杂的方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Enhanced Plagiarism Detection Through Advanced Natural Language Processing and E-BERT Framework of the Smith-Waterman Algorithm

Effective detection has been extremely difficult due to plagiarism's pervasiveness throughout a variety of fields, including academia and research. Increasingly complex plagiarism detection strategies are being used by people, making traditional approaches ineffective. The assessment of plagiarism involves a comprehensive examination encompassing syntactic, lexical, semantic, and structural facets. In contrast to traditional string-matching techniques, this investigation adopts a sophisticated Natural Language Processing (NLP) framework. The preprocessing phase entails a series of intricate steps ultimately refining the raw text data. The crux of this methodology lies in the integration of two distinct metrics within the Encoder Representation from Transformers (E-BERT) approach, effectively facilitating a granular exploration of textual similarity. Within the realm of NLP, the amalgamation of Deep and Shallow approaches serves as a lens to delve into the intricate nuances of the text, uncovering underlying layers of meaning. The discerning outcomes of this research unveil the remarkable proficiency of Deep NLP in promptly identifying substantial revisions. Integral to this innovation is the novel utilization of the Waterman algorithm and an English-Spanish dictionary, which contribute to the selection of optimal attributes. Comparative evaluations against alternative models employing distinct encoding methodologies, along with logistic regression as a classifier underscore the potency of the proposed implementation. The culmination of extensive experimentation substantiates the system's prowess, boasting an impressive 99.5% accuracy rate in extracting instances of plagiarism. This research serves as a pivotal advancement in the domain of plagiarism detection, ushering in effective and sophisticated methods to combat the growing spectre of unoriginal content.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Advanced Computer Science and Applications COMPUTER SCIENCE, THEORY & METHODS-

CiteScore

2.30

自引率

22.20%

发文量

519

期刊介绍： IJACSA is a scholarly computer science journal representing the best in research. Its mission is to provide an outlet for quality research to be publicised and published to a global audience. The journal aims to publish papers selected through rigorous double-blind peer review to ensure originality, timeliness, relevance, and readability. In sync with the Journal''s vision "to be a respected publication that publishes peer reviewed research articles, as well as review and survey papers contributed by International community of Authors", we have drawn reviewers and editors from Institutions and Universities across the globe. A double blind peer review process is conducted to ensure that we retain high standards. At IJACSA, we stand strong because we know that global challenges make way for new innovations, new ways and new talent. International Journal of Advanced Computer Science and Applications publishes carefully refereed research, review and survey papers which offer a significant contribution to the computer science literature, and which are of interest to a wide audience. Coverage extends to all main-stream branches of computer science and related applications