Comparison of Pre-trained vs Custom-trained Word Embedding Models for Word Sense Disambiguation

ADCAIJ-Advances in Distributed Computing and Artificial Intelligence Journal (IF 1.7, Q3, Computer Science, Artificial Intelligence). Pub Date: 2023-11-01. DOI: 10.14201/adcaij.31084
Muhammad Farhat Ullah, Ali Saeed, Naveed Hussain
Citations: 0

Abstract

The primary objective of word sense disambiguation (WSD) is to build systems that can automatically recognize the intended meaning (sense) of an ambiguous word in a sentence. WSD can help with a range of NLP and HCI tasks. Researchers have explored a wide variety of methods for resolving sense ambiguity, but their focus has mainly been on English and a few other widely studied languages. Urdu, with more than 300 million users and a large amount of electronic text available on the web, remains largely unexplored. In recent years, word embedding methods have proven highly successful across a variety of natural language processing tasks. This study applies, evaluates, and compares a variety of word embedding approaches for Urdu WSD (both Lexical Sample and All-Words), including pre-trained models (Word2Vec, GloVe, and FastText) as well as custom-trained models (Word2Vec, GloVe, and FastText trained on the Ur-Mono corpus). Two benchmark corpora are used for the evaluation: (1) the UAW-WSD-18 corpus and (2) the ULS-WSD-18 corpus. For the Urdu All-Words WSD task, the best results (Accuracy=60.07 and F1=0.45) were achieved using pre-trained FastText. For the Lexical Sample WSD task, the best results (Accuracy=70.93 and F1=0.60) were achieved using the custom-trained GloVe word embedding method.
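The abstract does not say how the embeddings are turned into sense predictions. As a rough illustration only, the sketch below shows one common way embeddings can drive WSD: average the vectors of the context words and pick the sense whose example words are closest by cosine similarity. This is not the authors' method. The vectors, the English example word, the sense labels, and the names `EMBEDDINGS`, `SENSES`, `mean_vector`, `cosine`, and `disambiguate` are all invented for the example; a real system would load Word2Vec, GloVe, or FastText vectors instead.

```python
import math

# Toy 3-dimensional vectors, invented for illustration. Real vectors would
# come from a pre-trained or custom-trained embedding model.
EMBEDDINGS = {
    "river": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.1],
    "shore": [0.85, 0.05, 0.1],
    "money": [0.1, 0.9, 0.1],
    "loan": [0.0, 0.95, 0.2],
    "deposit": [0.2, 0.8, 0.1],
}

# Hypothetical sense inventory: each sense is described by a few example words.
SENSES = {
    "bank/river_side": ["river", "shore"],
    "bank/financial": ["money", "loan"],
}


def mean_vector(words):
    """Average the vectors of the known words; return None if none are known."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def disambiguate(context_words, sense_examples):
    """Return the sense whose centroid is most similar to the context vector."""
    context = mean_vector(context_words)
    if context is None:
        return None  # no context word has an embedding
    best_sense, best_score = None, float("-inf")
    for sense, example_words in sense_examples.items():
        centroid = mean_vector(example_words)
        if centroid is None:
            continue
        score = cosine(context, centroid)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


print(disambiguate(["deposit", "money"], SENSES))
```

Out-of-vocabulary context words are simply skipped here, which is one reason subword-based models such as FastText can behave differently from Word2Vec or GloVe on morphologically rich text; whether that explains the results reported above is not stated in the source.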
Source journal: CiteScore 1.40; self-citation rate 0.00%; articles published 22; review time 4 weeks.