RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information

IF 2.3 3区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Natural Language Engineering Pub Date : 2022-08-26 DOI:10.1017/S1351324922000419
Beáta Lőrincz, E. Irimia, Adriana Stan, Verginica Barbu Mititelu
{"title":"RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information","authors":"Beáta Lőrincz, E. Irimia, Adriana Stan, Verginica Barbu Mititelu","doi":"10.1017/S1351324922000419","DOIUrl":null,"url":null,"abstract":"Abstract In this article, we introduce an extended, freely available resource for the Romanian language, named RoLEX. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. RoLEX includes over 330,000 curated entries with information regarding lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. The process of selecting the list of word entries and semi-automatically annotating the complete lexical information associated with each of the entries is thoroughly described. The dataset’s inherent knowledge is then evaluated in a task of concurrent prediction of syllabification, lexical stress marking and phonemic transcription. The evaluation looked into several dataset design factors, such as the minimum viable number of entries for correct prediction, the optimisation of the minimum number of required entries through expert selection and the augmentation of the input with morphosyntactic information, as well as the influence of each task in the overall accuracy. The best results were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. A word error rate of 3.08% and a character error rate of 1.08% were obtained this way. We show that using a carefully selected subset of entries for training can result in a similar performance to the performance obtained by a larger set of randomly selected entries (twice as many). In terms of prediction complexity, the lexical stress marking posed most problems and accounts for around 60% of the errors in the predicted sequence.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"720 - 745"},"PeriodicalIF":2.3000,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1017/S1351324922000419","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 1

Abstract

Abstract In this article, we introduce an extended, freely available resource for the Romanian language, named RoLEX. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. RoLEX includes over 330,000 curated entries with information regarding lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. The process of selecting the list of word entries and semi-automatically annotating the complete lexical information associated with each of the entries is thoroughly described. The dataset’s inherent knowledge is then evaluated in a task of concurrent prediction of syllabification, lexical stress marking and phonemic transcription. The evaluation looked into several dataset design factors, such as the minimum viable number of entries for correct prediction, the optimisation of the minimum number of required entries through expert selection and the augmentation of the input with morphosyntactic information, as well as the influence of each task in the overall accuracy. The best results were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. A word error rate of 3.08% and a character error rate of 1.08% were obtained this way. We show that using a carefully selected subset of entries for training can result in a similar performance to the performance obtained by a larger set of randomly selected entries (twice as many). In terms of prediction complexity, the lexical stress marking posed most problems and accounts for around 60% of the errors in the predicted sequence.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
RoLEX:一个扩展的罗马尼亚词汇数据集的发展及其评估在预测并发词汇信息
摘要在本文中,我们介绍了一个扩展的、免费提供的罗马尼亚语资源,名为RoLEX。该数据集主要是为语音处理应用程序开发的,但其适用性超出了该领域。RoLEX包括超过330000个精选条目,其中包含关于引理、形态句法描述、音节划分、词汇重音和音位转录的信息。全面描述了选择单词条目列表和半自动注释与每个条目相关联的完整词汇信息的过程。然后在同时预测音节划分、词汇重音标记和音位转录的任务中评估数据集的固有知识。该评估考察了几个数据集设计因素,如正确预测的最小可行条目数、通过专家选择优化所需最小条目数、用形态句法信息增加输入,以及每个任务对总体准确性的影响。当条目的拼写形式增加了完整的形态句法标签时,获得了最好的结果。以这种方式获得了3.08%的单词错误率和1.08%的字符错误率。我们表明,使用精心选择的条目子集进行训练可以获得与更大的随机选择条目集(两倍多)所获得的性能相似的性能。就预测复杂性而言,词汇重音标记带来了大多数问题,约占预测序列错误的60%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Natural Language Engineering
Natural Language Engineering COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-
CiteScore
5.90
自引率
12.00%
发文量
60
审稿时长
>12 weeks
期刊介绍: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multi modal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.
期刊最新文献
Start-up activity in the LLM ecosystem Anisotropic span embeddings and the negative impact of higher-order inference for coreference resolution: An empirical analysis Automated annotation of parallel bible corpora with cross-lingual semantic concordance How do control tokens affect natural language generation tasks like text simplification Emerging trends: When can users trust GPT, and when should they intervene?
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1