EVALUATING OF EFFICACY SEMANTIC SIMILARITY METHODS FOR COMPARISON OF ACADEMIC THESIS AND DISSERTATION TEXTS

Ramadan T. Hassan, N. S. Ahmed
Journal: Science Journal of University of Zakho
Published: 2023-08-14
DOI: https://doi.org/10.25271/sjuoz.2023.11.3.1120

Abstract

Detecting semantic similarity between documents is vital in natural language processing (NLP) applications. One widely used approach to measuring the semantic similarity of text documents is embedding, which converts texts into numerical vectors using various NLP methods. This paper presents a comparative analysis of four embedding methods for detecting semantic similarity in theses and dissertations, namely Term Frequency–Inverse Document Frequency (TF-IDF), Document to Vector (Doc2Vec), Sentence Bidirectional Encoder Representations from Transformers (SBERT), and Bidirectional Encoder Representations from Transformers (BERT) with cosine similarity. The study used two datasets: 27 documents from Duhok Polytechnic University and 100 documents from ProQuest.com. The texts from these documents were pre-processed to make them suitable for semantic similarity analysis. The methods were evaluated on several metrics, including accuracy, precision, recall, F1 score, and processing time. The results showed that the traditional method, TF-IDF, outperformed the modern methods in embedding and detecting actual semantic similarity between documents, with processing time not exceeding a few seconds.
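
As an illustration of the kind of pipeline the abstract describes, the following is a minimal sketch of TF-IDF embedding followed by cosine similarity, plus the four classification metrics named above. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the sample documents, and the function names (`tfidf_vectors`, `cosine_similarity`, `classification_metrics`) are all assumptions made for this example, and the abstract does not state which preprocessing steps or similarity threshold the study used.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Assumptions for this sketch: lowercase whitespace tokenization and a
    smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary similar/not-similar labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical sample texts, not taken from either dataset in the study.
sample_docs = [
    "deep learning methods for image classification",
    "image classification using deep learning methods",
    "soil salinity effects on potato callus growth",
]
sample_vectors = tfidf_vectors(sample_docs)
print(cosine_similarity(sample_vectors[0], sample_vectors[1]))
print(cosine_similarity(sample_vectors[0], sample_vectors[2]))
```

In a setup like this, a pair of documents would be labelled similar when its cosine score exceeds a chosen threshold, and those predicted labels would then be compared against reference labels with `classification_metrics`. The threshold itself is a design choice that this sketch leaves open.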