RoLEX: The development of an extended Romanian lexical dataset and its evaluation at predicting concurrent lexical information

Natural Language Engineering | Pub Date: 2022-08-26 | DOI: 10.1017/S1351324922000419 | IF 1.9 | Zone 3, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Beáta Lőrincz, E. Irimia, Adriana Stan, Verginica Barbu Mititelu

Journal: Natural Language Engineering, volume 29 1, pages 720-745. Publication type: Journal Article.

Citations: 1

Abstract

In this article, we introduce RoLEX, an extended, freely available lexical resource for the Romanian language. The dataset was developed mainly for speech processing applications, yet its applicability extends beyond this domain. RoLEX includes over 330,000 curated entries with information on lemma, morphosyntactic description, syllabification, lexical stress and phonemic transcription. We describe in detail how the word entries were selected and how the complete lexical information for each entry was semi-automatically annotated. The dataset's inherent knowledge is then evaluated on the task of concurrently predicting syllabification, lexical stress marking and phonemic transcription. The evaluation examines several dataset design factors: the minimum viable number of entries for correct prediction, the reduction of that number through expert selection of entries, the augmentation of the input with morphosyntactic information, and the influence of each task on overall accuracy. The best results, a word error rate of 3.08% and a character error rate of 1.08%, were obtained when the orthographic form of the entries was augmented with the complete morphosyntactic tags. We show that training on a carefully selected subset of entries can achieve performance similar to that of a randomly selected set twice as large. In terms of prediction complexity, lexical stress marking posed the most problems and accounts for around 60% of the errors in the predicted sequence.
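The abstract reports a word error rate and a character error rate but does not say how they are computed. As a rough sketch, one common convention counts an entry as wrong if its predicted sequence differs anywhere from the reference, and measures character errors as Levenshtein edit distance over the total reference length. That convention is an assumption here, not the authors' stated procedure. The function names and the toy strings below are hypothetical and do not reproduce RoLEX notation.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rates(refs, hyps):
    """Return (WER, CER) in percent.

    Assumed definitions (not taken from the paper):
    - WER: share of entries whose predicted sequence is not an exact match.
    - CER: summed edit distance divided by summed reference length.
    """
    if not refs or len(refs) != len(hyps):
        raise ValueError("refs and hyps must be non-empty and of equal length")
    wrong = sum(r != h for r, h in zip(refs, hyps))
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return 100.0 * wrong / len(refs), 100.0 * edits / chars


# Toy sequences in an invented notation, purely for illustration.
toy_refs = ["ab-cd", "e-fg"]
toy_hyps = ["ab-cd", "ef-g"]
toy_wer, toy_cer = error_rates(toy_refs, toy_hyps)
print(f"WER = {toy_wer:.2f}%, CER = {toy_cer:.2f}%")
```

Under these definitions a single misplaced stress or syllable mark makes the whole entry count as a word error while adding only one or two character edits, which is one way a word error rate can sit well above the character error rate.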

Source journal: Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 5.90

Self-citation rate: 12.00%

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. It publishes research articles on a broad range of topics, from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces. It also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.