Enhancing Word Sense Disambiguation for Amharic homophone words using Bidirectional Long Short-Term Memory network

Mequanent Degu Belete, Lijalem Getanew Shiferaw, Girma Kassa Alitasb, Tariku Sinshaw Tamir
Journal: Intelligent Systems with Applications, volume 23, Article 200417
Published: 2024-07-14
DOI: 10.1016/j.iswa.2024.200417
URL: https://www.sciencedirect.com/science/article/pii/S2667305324000917

Abstract

Amharic has many easily confused words because its script (fidel) includes duplicate homophone letters: ሀ, ሐ, and ኀ (all three pronounced HA), ሠ and ሰ (both pronounced SE), አ and ዐ (both pronounced AE), and ጸ and ፀ (both pronounced TSE). This work develops a Word Sense Disambiguation (WSD) model that uses deep learning to tackle this lexical ambiguity in Amharic. Because no Amharic WordNet is available, a total of 1756 examples of paired ambiguous homophonic words were collected: ድህነት (dhnet) and ድኅነት (dhnet), ምሁር (m'hur) and ምሑር (m'hur), በአል (be'al) and በዓል (be'al), and አቢይ (abiy) and ዐቢይ (abiy). After preprocessing, the text was vectorized using word2vec, fastText, Term Frequency-Inverse Document Frequency (TF-IDF), and bag of words (BoW), and the vectorized text was split into training and test data. The training data were then analysed using Naive Bayes (NB), K-nearest neighbour (KNN), logistic regression (LG), decision trees (DT), and random forests (RF), together with a random oversampling technique. Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) models improved accuracy to 99.99% even with the limited dataset.
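The classical part of the pipeline described in the abstract (bag-of-words vectorization, random oversampling, then a Naive Bayes classifier) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation: the toy sentences, the labels `sense_A` and `sense_B`, and the helper names `bag_of_words`, `random_oversample`, and `MultinomialNB` are all hypothetical, and the BiLSTM/BiGRU models are not covered.

```python
import math
import random
from collections import Counter, defaultdict


def bag_of_words(tokens):
    """Map a token list to a sparse count vector (token -> count)."""
    return Counter(tokens)


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are balanced."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for s, y in zip(samples, labels):
        by_label[y].append(s)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for y, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels


class MultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, vectors, labels):
        self.vocab = {t for v in vectors for t in v}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for v, y in zip(vectors, labels):
            self.token_counts[y].update(v)
        return self

    def predict(self, vector):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, n in self.class_counts.items():
            denom = sum(self.token_counts[y].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for token, count in vector.items():
                if token in self.vocab:  # ignore tokens never seen in training
                    score += count * math.log((self.token_counts[y][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Hypothetical toy data: context words (English stand-ins) for two senses
# of one homophone pair; the classes are deliberately imbalanced (3 vs 1).
train_sentences = [
    ("income famine hardship".split(), "sense_A"),
    ("hardship income village".split(), "sense_A"),
    ("famine village hardship".split(), "sense_A"),
    ("faith church prayer".split(), "sense_B"),
]
vectors = [bag_of_words(toks) for toks, _ in train_sentences]
labels = [y for _, y in train_sentences]

bal_vectors, bal_labels = random_oversample(vectors, labels)
model = MultinomialNB().fit(bal_vectors, bal_labels)
prediction = model.predict(bag_of_words("prayer faith".split()))
print(Counter(bal_labels), prediction)
```

In this sketch oversampling is applied only to the training examples, which is the usual way to keep duplicated samples from leaking into the test split; whether the paper follows the same order is not stated in the abstract.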
