检测文件匹配的cosin类似于溜离和溜离

Jurnal Linguistik Komputasional (JLK) Pub Date : 2019-03-26 DOI:10.26418/jlk.v2i1.17

Muhammad Zidny Naf’an, Auliya Burhanuddin, Ade Riyani

{"title":"检测文件匹配的cosin类似于溜离和溜离","authors":"Muhammad Zidny Naf’an, Auliya Burhanuddin, Ade Riyani","doi":"10.26418/jlk.v2i1.17","DOIUrl":null,"url":null,"abstract":"Plagiarism is the act of taking part or all of one's ideas in the form of documents or texts without including sources of information retrieval. This study aims to detect the similarity of text documents using the cosine similarity algorithm and weighting TF-IDF so that it can be used to determine the value of plagiarism. The document used for comparison of this text is an abstract of Indonesian. The results of the study, namely when stemming the similarity value is higher on average 10% than the stemming process is not done. This study produces a similarity value above 50% for documents with a high degree of similarity. Whereas documents with low similarity levels or no plagiarism produce similarity values below 40%. With the method used in the preprocessing consisting of folding cases, tokenizing, removeal stopwords, and stemming. After the preprocessing process, the next step is to calculate the weighting of TF-IDF and the similarity value using cosine similarity so that it gets a percentage similarity value. Based on the experimental results of the cosine similarity algorithm and weighting TF-IDF, it can produce similarity values from each comparative document","PeriodicalId":418646,"journal":{"name":"Jurnal Linguistik Komputasional (JLK)","volume":"6 4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Penerapan Cosine Similarity dan Pembobotan TF-IDF untuk Mendeteksi Kemiripan Dokumen\",\"authors\":\"Muhammad Zidny Naf’an, Auliya Burhanuddin, Ade Riyani\",\"doi\":\"10.26418/jlk.v2i1.17\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Plagiarism is the act of taking part or all of one's ideas in the form of documents or texts without including sources of information retrieval. This study aims to detect the similarity of text documents using the cosine similarity algorithm and weighting TF-IDF so that it can be used to determine the value of plagiarism. The document used for comparison of this text is an abstract of Indonesian. The results of the study, namely when stemming the similarity value is higher on average 10% than the stemming process is not done. This study produces a similarity value above 50% for documents with a high degree of similarity. Whereas documents with low similarity levels or no plagiarism produce similarity values below 40%. With the method used in the preprocessing consisting of folding cases, tokenizing, removeal stopwords, and stemming. After the preprocessing process, the next step is to calculate the weighting of TF-IDF and the similarity value using cosine similarity so that it gets a percentage similarity value. Based on the experimental results of the cosine similarity algorithm and weighting TF-IDF, it can produce similarity values from each comparative document\",\"PeriodicalId\":418646,\"journal\":{\"name\":\"Jurnal Linguistik Komputasional (JLK)\",\"volume\":\"6 4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-03-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Jurnal Linguistik Komputasional (JLK)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.26418/jlk.v2i1.17\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Jurnal Linguistik Komputasional (JLK)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.26418/jlk.v2i1.17","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 21

摘要

抄袭是指以文件或文本的形式采用部分或全部思想，而不包括信息检索来源的行为。本研究旨在使用余弦相似度算法和TF-IDF加权来检测文本文档的相似度，从而可以用来确定剽窃的价值。本文比较使用的文件是印尼语摘要。研究结果表明，当词干的相似度值平均比词干的相似度值高10%时，没有进行词干处理。对于高度相似的文档，本研究得出了50%以上的相似值。而相似度低或没有抄袭的文档的相似度值低于40%。在预处理中使用的方法包括折叠案例，标记化，删除停止词和词干。预处理过程结束后，下一步是使用余弦相似度计算TF-IDF和相似度值的权重，从而得到百分比相似度值。基于余弦相似度算法的实验结果，对TF-IDF进行加权，得到各比较文档的相似度值

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Penerapan Cosine Similarity dan Pembobotan TF-IDF untuk Mendeteksi Kemiripan Dokumen

Plagiarism is the act of taking part or all of one's ideas in the form of documents or texts without including sources of information retrieval. This study aims to detect the similarity of text documents using the cosine similarity algorithm and weighting TF-IDF so that it can be used to determine the value of plagiarism. The document used for comparison of this text is an abstract of Indonesian. The results of the study, namely when stemming the similarity value is higher on average 10% than the stemming process is not done. This study produces a similarity value above 50% for documents with a high degree of similarity. Whereas documents with low similarity levels or no plagiarism produce similarity values below 40%. With the method used in the preprocessing consisting of folding cases, tokenizing, removeal stopwords, and stemming. After the preprocessing process, the next step is to calculate the weighting of TF-IDF and the similarity value using cosine similarity so that it gets a percentage similarity value. Based on the experimental results of the cosine similarity algorithm and weighting TF-IDF, it can produce similarity values from each comparative document

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Jurnal Linguistik Komputasional (JLK)

自引率

0.00%

发文量