Homograph Disambiguation through Selective Diacritic Restoration

Sawsan Alqahtani, Hanan Aldarmaki, Mona T. Diab
WANLP@ACL 2019. Published 2019-08-01. DOI: 10.18653/v1/W19-4606 (https://doi.org/10.18653/v1/W19-4606)
Citations: 10

Abstract

Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent in languages with diacritics that tend to be omitted in writing, such as Arabic. Omitting diacritics increases the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice, the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results: our strategies for selective diacritization lead to more balanced and consistent performance in downstream applications.
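The trade-off described in the abstract can be illustrated with a toy sketch. The abstract does not specify the paper's selection strategies, so the criterion below (keep diacritics only on words whose undiacritized form has more than one diacritized variant in the corpus) is an assumption made purely for illustration, not the authors' method. The function names and the three-word corpus are hypothetical.

```python
import re
from collections import defaultdict

# Arabic tanween, short vowels, shadda, and sukun: U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word: str) -> str:
    """Return the bare (undiacritized) form of a word."""
    return DIACRITICS.sub("", word)


def find_homographs(diacritized_tokens):
    """Return bare forms that correspond to more than one diacritized variant.

    Assumption: ambiguity is judged only from the variants seen in this corpus.
    """
    variants = defaultdict(set)
    for tok in diacritized_tokens:
        variants[strip_diacritics(tok)].add(tok)
    return {bare for bare, forms in variants.items() if len(forms) > 1}


def selectively_diacritize(diacritized_tokens, ambiguous):
    """Keep diacritics only on tokens whose bare form is marked ambiguous."""
    return [
        tok if strip_diacritics(tok) in ambiguous else strip_diacritics(tok)
        for tok in diacritized_tokens
    ]


# Hypothetical toy data, written with escapes to keep the code points explicit.
ILM = "\u0639\u0650\u0644\u0652\u0645"          # 'ilm, "knowledge"
ALAM = "\u0639\u064E\u0644\u064E\u0645"         # 'alam, "flag"
KITAB = "\u0643\u0650\u062A\u064E\u0627\u0628"  # kitab, "book"

corpus = [ILM, KITAB, ALAM, KITAB]
ambiguous = find_homographs(corpus)
output = selectively_diacritize(corpus, ambiguous)
```

Here the two words for "knowledge" and "flag" share one bare spelling, so they keep their diacritics, while the unambiguous word for "book" is stripped. The vocabulary therefore grows only where the extra symbols actually separate distinct words, which is the balance between sparsity and disambiguation that the abstract refers to.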