Enhancing Word Sense Disambiguation for Amharic homophone words using Bidirectional Long Short-Term Memory network

Intelligent Systems with Applications (IF 4.3); Pub Date: 2024-09-01; Epub Date: 2024-07-14; DOI: 10.1016/j.iswa.2024.200417

Mequanent Degu Belete, Lijalem Getanew Shiferaw, Girma Kassa Alitasb, Tariku Sinshaw Tamir

Volume 23, Article 200417. Full text: https://www.sciencedirect.com/science/article/pii/S2667305324000917


Abstract

Amharic contains many confusable words because its script (fidel) includes redundant homophone letters: ሀ, ሐ, and ኀ (all three pronounced HA), ሠ and ሰ (both pronounced SE), አ and ዐ (both pronounced AE), and ጸ and ፀ (both pronounced TSE). This work develops a Word Sense Disambiguation (WSD) model that uses deep learning to tackle the resulting lexical ambiguity in Amharic. Because no Amharic WordNet is available, a total of 1756 examples were collected for four pairs of ambiguous homophonic words: ድህነት (dhnet) and ድኅነት (dhnet), ምሁር (m'hur) and ምሑር (m'hur), በአል (be'al) and በዓል (be'al), and አቢይ (abiy) and ዐቢይ (abiy). After preprocessing, the text was vectorized with word2vec, fastText, Term Frequency-Inverse Document Frequency (TF-IDF), and bag of words (BoW), and the vectorized text was split into training and test data. The training data were then used to train Naive Bayes (NB), K-nearest neighbour (KNN), logistic regression (LG), decision tree (DT), and random forest (RF) classifiers, together with a random oversampling technique. Bidirectional Gated Recurrent Unit (BiGRU) and Bidirectional Long Short-Term Memory (BiLSTM) networks raised accuracy to 99.99% even with the limited dataset.
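
The classical part of the pipeline described in the abstract (bag-of-words vectorization, random oversampling of the training data, then a Naive Bayes classifier) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the tokens `w1` to `w9`, the labels `sense_A` and `sense_B`, and all function and class names are hypothetical placeholders, the train/test split is omitted for brevity, and the neural models (BiGRU, BiLSTM) are not shown.

```python
import math
import random
from collections import Counter, defaultdict


def bag_of_words(tokens):
    """Map a token list to a sparse count vector (token -> count)."""
    return Counter(tokens)


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    target = max(len(xs) for xs in by_label.values())
    out_x, out_y = [], []
    for y, xs in by_label.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.vocab = {token for v in vectors for token in v}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for v, y in zip(vectors, labels):
            self.token_counts[y].update(v)
        return self

    def predict(self, vector):
        n = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, c in self.class_counts.items():
            total = sum(self.token_counts[y].values()) + len(self.vocab)
            score = math.log(c / n)  # log prior
            for token, count in vector.items():
                if token in self.vocab:  # tokens unseen in training are ignored
                    p = (self.token_counts[y][token] + 1) / total
                    score += count * math.log(p)
            if score > best_score:
                best_label, best_score = y, score
        return best_label


# Placeholder context windows around an ambiguous word (hypothetical data).
train_tokens = [
    ["w1", "w2", "w3"],
    ["w1", "w4", "w5"],
    ["w2", "w4", "w6"],
    ["w7", "w8", "w9"],
]
train_labels = ["sense_A", "sense_A", "sense_A", "sense_B"]

vectors = [bag_of_words(t) for t in train_tokens]
bal_x, bal_y = random_oversample(vectors, train_labels)  # training data only
model = MultinomialNB().fit(bal_x, bal_y)

print(Counter(bal_y))
print(model.predict(bag_of_words(["w7", "w8", "unseen"])))
```

In this sketch the oversampling is applied only to training data, so that duplicated minority examples cannot leak into a held-out test set; whether the paper follows the same order is not stated in the abstract.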


