Document Similarity Detection Using Indonesian Language Word2vec Model

Nahda Rosa Ramadhanti, Siti Mariyah
2019 3rd International Conference on Informatics and Computational Sciences (ICICoS), published 2019-10-01
DOI: 10.1109/ICICoS48119.2019.8982432 (https://doi.org/10.1109/ICICoS48119.2019.8982432)
Citations: 4

Abstract

Most research on text duplication in Bahasa Indonesia uses the TF-IDF method, in which each word receives a different weight: the more frequently a word appears, the greater its weight. This study instead aims to detect document similarity by calculating the cosine similarity of word vectors. The corpus was built from a collection of Indonesian Wikipedia articles. The study proposes two techniques for calculating similarity: simultaneous comparison and partial comparison. Simultaneous comparison compares documents directly, without dividing them into chapters, while partial comparison divides each document into several chapters before calculating similarity. Partial comparison yields more accurate similarity results than simultaneous comparison. The TF-IDF method of the Unicheck application serves as the benchmark. The similarity results from Unicheck differ from those of this study because the methods differ: TF-IDF yields lower similarity scores than Word2vec because it cannot detect paraphrases. This study has two limitations: the Unicheck application used as a benchmark does not use the same method as this study, and the determination of the expected value is still subjective.
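The two comparison techniques described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how word vectors are pooled into a document vector or how chapter scores are combined, so averaging word vectors and averaging each chapter's best match are assumptions. The `TOY_VECTORS` table and all function names are hypothetical; a real system would load vectors from a Word2vec model trained on Indonesian Wikipedia.

```python
import math

# Made-up 3-dimensional vectors standing in for a trained Word2vec model.
TOY_VECTORS = {
    "mobil": [0.9, 0.1, 0.0],
    "kendaraan": [0.8, 0.2, 0.1],
    "cepat": [0.1, 0.9, 0.0],
    "laju": [0.2, 0.8, 0.1],
    "makan": [0.0, 0.1, 0.9],
    "nasi": [0.1, 0.0, 0.8],
}


def doc_vector(tokens, vectors):
    """Average the vectors of known tokens (assumed pooling choice)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def simultaneous_similarity(doc_a, doc_b, vectors):
    """Compare whole documents; each document is a list of chapters (token lists)."""
    vec_a = doc_vector([t for chapter in doc_a for t in chapter], vectors)
    vec_b = doc_vector([t for chapter in doc_b for t in chapter], vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def partial_similarity(doc_a, doc_b, vectors):
    """Compare chapter by chapter, averaging each chapter's best match (assumed)."""
    chapter_vecs_b = [doc_vector(chapter, vectors) for chapter in doc_b]
    chapter_vecs_b = [v for v in chapter_vecs_b if v is not None]
    best_scores = []
    for chapter in doc_a:
        vec_a = doc_vector(chapter, vectors)
        if vec_a is None or not chapter_vecs_b:
            continue
        best_scores.append(max(cosine(vec_a, vec_b) for vec_b in chapter_vecs_b))
    return sum(best_scores) / len(best_scores) if best_scores else 0.0


doc_a = [["mobil", "cepat"], ["makan", "nasi"]]
doc_b = [["kendaraan", "laju"], ["makan", "nasi"]]
print("simultaneous:", simultaneous_similarity(doc_a, doc_b, TOY_VECTORS))
print("partial:", partial_similarity(doc_a, doc_b, TOY_VECTORS))
```

In this toy data the first chapters share no words, so a purely term-based weighting such as TF-IDF would see no overlap there, while the vector comparison can still score them as close.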