Application of Cosine Similarity and TF-IDF Weighting to Detect Document Similarity

Muhammad Zidny Naf’an, Auliya Burhanuddin, Ade Riyani
Jurnal Linguistik Komputasional (JLK), journal article, published 2019-03-26. DOI: 10.26418/jlk.v2i1.17 (https://doi.org/10.26418/jlk.v2i1.17)
Citations: 21

Abstract

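Plagiarism is the act of taking part or all of someone else's ideas, in the form of documents or texts, without citing the source. This study aims to detect the similarity of text documents using the cosine similarity algorithm with TF-IDF weighting, so that the result can be used to gauge the degree of plagiarism. The documents compared are Indonesian-language abstracts. Preprocessing consists of case folding, tokenizing, stopword removal, and stemming. After preprocessing, the TF-IDF weights are calculated and cosine similarity is computed to give a percentage similarity value for each pair of compared documents. The results show that similarity values are on average 10% higher when stemming is applied than when it is not. Documents with a high degree of similarity yield similarity values above 50%, whereas documents with low similarity or no plagiarism yield values below 40%.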
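The pipeline named in the abstract (case folding, tokenizing, stopword removal, stemming, TF-IDF weighting, cosine similarity) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the stopword list, the sample sentences, the smoothed IDF formula, and the identity stemmer are all assumptions made for the example. A real system for Indonesian text would plug in a proper stemmer and a full stopword list.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "ini", "itu"}


def preprocess(text, stem=lambda word: word):
    """Case folding, tokenizing, stopword removal, then stemming.

    `stem` defaults to the identity function (an assumption); pass an
    Indonesian stemmer here to reproduce the stemming step.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption): keeps terms that occur in every
    # document from getting a weight of zero.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample sentences, not data from the study.
docs = [
    "Sistem ini mendeteksi kemiripan dokumen teks",
    "Sistem mendeteksi kemiripan dokumen teks abstrak",
    "Harga saham naik tajam pekan lalu",
]
vectors = tfidf_vectors([preprocess(d) for d in docs])
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
print(f"near-duplicate pair: {sim_close:.1%}, unrelated pair: {sim_far:.1%}")
```

Multiplying the cosine value by 100 gives the percentage similarity the abstract refers to. The two unrelated sample sentences share no terms, so their similarity is zero, while the near-duplicate pair scores well above the 50% mark.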
Penerapan Cosine Similarity dan Pembobotan TF-IDF untuk Mendeteksi Kemiripan Dokumen