Title: Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection
Author: Gints Jēkabsons
Journal: Applied Computer Systems, volume 47 1, pages 178 - 182
Published: 2021-12-01 (Journal Article)
DOI: 10.2478/acss-2021-0022 (https://doi.org/10.2478/acss-2021-0022)
Journal metrics: impact factor 0.5, JCR Q4 (Computer Science, Theory & Methods)
Abstract
Generally, the process of plagiarism detection can be divided into two main stages: source retrieval and text alignment. The paper evaluates and compares the effectiveness of five fingerprint selection algorithms used during the source retrieval stage: Every p-th, 0 mod p, Winnowing, Frequency-biased Winnowing (FBW) and Modified FBW (MFBW). The algorithms are evaluated on a dataset containing plagiarism cases in Bachelor and Master theses written in English in the field of computer science. The best performance is reached by 0 mod p, Winnowing and MFBW. For these algorithms, reducing the fingerprint size from 100 % to about 20 % kept the effectiveness at approximately the same level. Moreover, MFBW sends fewer document pairs overall to the text alignment stage, thus also reducing the computational cost of the process. The software developed for this study is freely available at the author’s website http://www.cs.rtu.lv/jekabsons/.
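The abstract names the selection algorithms without defining them. As a rough illustration only, the sketch below follows the commonly described forms of the three simplest ones: Every p-th keeps every p-th hash by position, 0 mod p keeps hashes divisible by p, and Winnowing keeps the minimum hash of each sliding window, taking the rightmost one on ties. The function names, the word 5-gram chunking and the MD5-based hash are assumptions made for this example and are not taken from the paper or the author's software. FBW and MFBW are left out because the abstract does not describe how they work.

```python
import hashlib


def kgram_hashes(text, k=5):
    """Hash every overlapping k-gram of words (k=5 is an assumed value)."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Truncate the MD5 digest to 32 bits to get a compact integer hash.
    return [int(hashlib.md5(g.encode("utf-8")).hexdigest()[:8], 16) for g in grams]


def every_pth(hashes, p):
    """Keep the hash at every p-th position: 0, p, 2p, ..."""
    return {(i, h) for i, h in enumerate(hashes) if i % p == 0}


def zero_mod_p(hashes, p):
    """Keep every hash whose value is divisible by p."""
    return {(i, h) for i, h in enumerate(hashes) if h % p == 0}


def winnowing(hashes, w):
    """Keep the minimum hash of each window of w consecutive hashes.

    On ties the rightmost minimum is chosen. Returns an empty set when
    there are fewer than w hashes.
    """
    selected = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Position of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((start + offset, smallest))
    return selected


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog and runs far away"
    hashes = kgram_hashes(sample, k=3)
    for name, chosen in [
        ("every p-th", every_pth(hashes, 4)),
        ("0 mod p", zero_mod_p(hashes, 4)),
        ("winnowing", winnowing(hashes, 4)),
    ]:
        print(name, len(chosen), "of", len(hashes), "hashes kept")
```

Each function returns a set of (position, hash) pairs, so the fingerprint size discussed in the abstract corresponds to the size of that set relative to the number of hashes. With p or w around 5, roughly 20 % of the hashes would be expected to survive for Every p-th and 0 mod p, while Winnowing typically keeps somewhat more.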