检测土耳其语中与关键字相关的正字法错误

Recent Advances in Natural Language Processing Pub Date : 2019-10-22 DOI:10.26615/978-954-452-056-4_009

Uğurcan Arıkan, Onur Güngör, S. Uskudarli

{"title":"检测土耳其语中与关键字相关的正字法错误","authors":"Uğurcan Arıkan, Onur Güngör, S. Uskudarli","doi":"10.26615/978-954-452-056-4_009","DOIUrl":null,"url":null,"abstract":"For the spell correction task, vocabulary based methods have been replaced with methods that take morphological and grammar rules into account. However, such tools are fairly immature, and, worse, non-existent for many low resource languages. Checking only if a word is well-formed with respect to the morphological rules of a language may produce false negatives due to the ambiguity resulting from the presence of numerous homophonic words. In this work, we propose an approach to detect and correct the “de/da” clitic errors in Turkish text. Our model is a neural sequence tagger trained with a synthetically constructed dataset consisting of positive and negative samples. The model’s performance with this dataset is presented according to different word embedding configurations. The model achieved an F1 score of 86.67% on a synthetically constructed dataset. We also compared the model’s performance on a manually curated dataset of challenging samples that proved superior to other spelling correctors with 71% accuracy compared to the second-best (Google Docs) with and accuracy of 34%.","PeriodicalId":284493,"journal":{"name":"Recent Advances in Natural Language Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Detecting Clitics Related Orthographic Errors in Turkish\",\"authors\":\"Uğurcan Arıkan, Onur Güngör, S. Uskudarli\",\"doi\":\"10.26615/978-954-452-056-4_009\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For the spell correction task, vocabulary based methods have been replaced with methods that take morphological and grammar rules into account. However, such tools are fairly immature, and, worse, non-existent for many low resource languages. Checking only if a word is well-formed with respect to the morphological rules of a language may produce false negatives due to the ambiguity resulting from the presence of numerous homophonic words. In this work, we propose an approach to detect and correct the “de/da” clitic errors in Turkish text. Our model is a neural sequence tagger trained with a synthetically constructed dataset consisting of positive and negative samples. The model’s performance with this dataset is presented according to different word embedding configurations. The model achieved an F1 score of 86.67% on a synthetically constructed dataset. We also compared the model’s performance on a manually curated dataset of challenging samples that proved superior to other spelling correctors with 71% accuracy compared to the second-best (Google Docs) with and accuracy of 34%.\",\"PeriodicalId\":284493,\"journal\":{\"name\":\"Recent Advances in Natural Language Processing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-10-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Recent Advances in Natural Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.26615/978-954-452-056-4_009\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Recent Advances in Natural Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.26615/978-954-452-056-4_009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 6

摘要

对于拼写纠正任务，基于词汇的方法已经被考虑形态和语法规则的方法所取代。然而，这样的工具相当不成熟，更糟糕的是，对于许多低资源语言来说不存在。仅根据语言的形态规则检查一个词是否结构良好可能会产生假否定，因为存在大量同音词会产生歧义。在这项工作中，我们提出了一种检测和纠正土耳其语文本中“de/da”断句错误的方法。我们的模型是一个由正样本和负样本组成的合成数据集训练的神经序列标注器。根据不同的词嵌入配置，给出了模型在该数据集上的性能。该模型在综合构建的数据集上获得了86.67%的F1分数。我们还比较了该模型在人工整理的具有挑战性的样本数据集上的表现，该数据集的准确率为71%，优于其他拼写校正器，而第二好的(谷歌文档)的准确率为34%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Detecting Clitics Related Orthographic Errors in Turkish

For the spell correction task, vocabulary based methods have been replaced with methods that take morphological and grammar rules into account. However, such tools are fairly immature, and, worse, non-existent for many low resource languages. Checking only if a word is well-formed with respect to the morphological rules of a language may produce false negatives due to the ambiguity resulting from the presence of numerous homophonic words. In this work, we propose an approach to detect and correct the “de/da” clitic errors in Turkish text. Our model is a neural sequence tagger trained with a synthetically constructed dataset consisting of positive and negative samples. The model’s performance with this dataset is presented according to different word embedding configurations. The model achieved an F1 score of 86.67% on a synthetically constructed dataset. We also compared the model’s performance on a manually curated dataset of challenging samples that proved superior to other spelling correctors with 71% accuracy compared to the second-best (Google Docs) with and accuracy of 34%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Recent Advances in Natural Language Processing

自引率

0.00%

发文量