How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study

Impact factor: 3.7 | Zone 2 (Computer Science) | JCR Q2 (Computer Science, Artificial Intelligence) | Journal: Computational Linguistics | Pub Date: 2022-08-24 | DOI: 10.1162/coli_a_00456
Saeed Esmail, Kfir Bar, N. Dershowitz
Citations: 1

Abstract

We suggest a model for partial diacritization of deep orthographies. We focus on Arabic, where the optional indication of selected vowels by means of diacritics can resolve ambiguity and improve readability. Our partial diacritizer restores short vowels only when they contribute to the ease of understandability during reading a given running text. The idea is to identify those uncertainties of absent vowels that require the reader to look ahead to disambiguate. To achieve this, two independent neural networks are used for predicting diacritics, one that takes the entire sentence as input and another that considers only the text that has been read thus far. Partial diacritization is then determined by retaining precisely those vowels on which the two networks disagree, preferring the reading based on consideration of the whole sentence over the more naïve reading-order diacritization. For evaluation, we prepared a new dataset of Arabic texts with both full and partial vowelization. In addition to facilitating readability, we find that our partial diacritizer improves translation quality compared either to their total absence or to random selection. Lastly, we study the benefit of knowing the text that follows the word in focus toward the restoration of short vowels during reading, and we measure the degree to which lookahead contributes to resolving ambiguities encountered while reading.

L’Herbelot had asserted, that the most ancient Korans, written in the Cufic character, had no vowel points; and that these were first invented by Jahia–ben Jamer, who died in the 127th year of the Hegira.
“Toderini’s History of Turkish Literature,” Analytical Review (1789)
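
The selection rule described in the abstract (keep a vowel mark only where the two predictors disagree, and take the whole-sentence reading when they do) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partial_diacritize`, the per-letter list representation, and the use of an empty string for "no mark" are all assumptions made for illustration, and the two prediction lists stand in for the outputs of the two neural networks, which are not modeled here.

```python
from typing import List


def partial_diacritize(letters: List[str],
                       full_context: List[str],
                       reading_order: List[str]) -> str:
    """Hypothetical sketch of the disagreement rule.

    letters       -- the undiacritized letters of the text
    full_context  -- one predicted mark per letter from a predictor that
                     sees the whole sentence ("" means no mark)
    reading_order -- one predicted mark per letter from a predictor that
                     sees only the text read so far
    """
    if not (len(letters) == len(full_context) == len(reading_order)):
        raise ValueError("inputs must be aligned, one prediction per letter")

    out = []
    for letter, full, prefix in zip(letters, full_context, reading_order):
        out.append(letter)
        # Keep a mark only where the predictors disagree, and prefer
        # the whole-sentence prediction in that case.
        if full != prefix:
            out.append(full)
    return "".join(out)


# Toy example with Latin placeholders instead of Arabic letters:
# the predictors disagree only on the second letter.
marked = partial_diacritize(list("ktb"), ["a", "a", "a"], ["a", "u", "a"])
```

In this toy input only the second letter receives a mark, because that is the only position where looking ahead changes the predicted vowel.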
Source journal: Computational Linguistics
Category: Engineering and Technology, Computer Science: Interdisciplinary Applications
CiteScore: 15.80
Self-citation rate: 0.00%
Articles published: 45
Review time: >12 weeks
About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.