A hybrid part-of-speech tagger with annotated Kurdish corpus: advancements in POS tagging

IF: 1.1; Zone 3 (Literature); JCR: 0, HUMANITIES, MULTIDISCIPLINARY; Journal: Digital Scholarship in the Humanities; Pub Date: 2023-10-05; DOI: 10.1093/llc/fqad066

Dastan Maulud, Karwan Jacksi, Ismael Ali


Citations: 0

Abstract

With the rapid growth of online content written in Kurdish, there is an increasing need to make it machine-readable and processable. Part-of-speech (POS) tagging is a critical aspect of natural language processing (NLP), playing a significant role in applications such as speech recognition, natural language parsing, information retrieval, and multiword term extraction. This study details the creation of the DASTAN corpus, the first POS-annotated corpus for the Sorani Kurdish dialect. The corpus contains 74,258 words and thirty-eight tags. The tagger takes a hybrid approach, combining a bigram hidden Markov model with a Kurdish rule-based approach to POS tagging. This hybrid approach addresses two key problems that arise with rule-based approaches, namely misclassified words and ambiguity-related unanalyzed words. The proposed approach's accuracy was assessed by training and testing it on the DASTAN corpus, yielding a 96% accuracy rate. Overall, the findings demonstrate the effectiveness of the proposed hybrid approach and its potential to enhance NLP applications for Sorani Kurdish.
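To make the idea of combining a bigram hidden Markov model with hand-written rules more concrete, here is a minimal sketch in Python. It is not the authors' implementation. The toy English sentences, the tag names, the `rule_tag` rules, the add-one smoothing, and the way the two components are combined (a rule hit pins the tag, and Viterbi decoding over the bigram HMM fills in every word the rules leave unanalyzed) are all assumptions made for illustration. None of the data comes from the DASTAN corpus.

```python
import math
from collections import Counter, defaultdict

# Toy English data for readability only; NOT drawn from the DASTAN corpus.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
TEST = [[("the", "DET"), ("cat", "NOUN"), ("barks", "VERB")]]
RULE_TAGS = {"NUM", "ADV"}  # tags that only the hypothetical rules assign


def rule_tag(word):
    """Hypothetical hand-written rules; returns None when no rule fires."""
    if word.isdigit():
        return "NUM"
    if word.endswith("ly"):
        return "ADV"
    return None


def train(sentences):
    """Count tag-bigram transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in sentences:
        prev = "<s>"
        for word, tag_ in sent:
            trans[prev][tag_] += 1
            emit[tag_][word] += 1
            prev = tag_
    vocab = {w for sent in sentences for w, _ in sent}
    return trans, emit, sorted(emit), len(vocab)


def smoothed_log(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def tag(words, model):
    """Viterbi decoding; a rule hit pins the tag at that position."""
    trans, emit, hmm_tags, vocab_size = model
    n_tags = len(set(hmm_tags) | RULE_TAGS)
    best = {"<s>": (0.0, [])}  # tag -> (log score, best path ending in tag)
    for word in words:
        pinned = rule_tag(word)
        new_best = {}
        for t in ([pinned] if pinned else hmm_tags):
            # A pinned tag contributes no emission term; otherwise use the HMM.
            e = 0.0 if pinned else smoothed_log(emit[t], word, vocab_size + 1)
            score, path = max(
                ((s + smoothed_log(trans[prev], t, n_tags), p)
                 for prev, (s, p) in best.items()),
                key=lambda pair: pair[0],
            )
            new_best[t] = (score + e, path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


def accuracy(gold_sentences, model):
    """Token-level accuracy: share of words whose predicted tag matches gold."""
    correct = total = 0
    for sent in gold_sentences:
        predicted = tag([w for w, _ in sent], model)
        for (_, gold_tag), pred_tag in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total


MODEL = train(TRAIN)
print(tag(["the", "dog", "sleeps", "quietly"], MODEL))
print(accuracy(TEST, MODEL))
```

In this sketch, "quietly" never appears in the training data, so the suffix rule supplies its tag while the HMM handles the remaining words. The same rule would also mislabel a word such as "fly", which illustrates in miniature why a rule-only tagger can misclassify words. The `accuracy` function mirrors the train-then-test evaluation described in the abstract, although a three-token held-out sentence says nothing about real performance.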


Source journal

Digital Scholarship in the Humanities Multiple-

CiteScore

1.80

Self-citation rate

25.00%

Articles published

Journal description: DSH, or Digital Scholarship in the Humanities, is an international, peer-reviewed journal that publishes original contributions on all aspects of digital scholarship in the humanities, including, but not limited to, the field currently called the Digital Humanities. Long and short papers report on theoretical, methodological, experimental, and applied research. They include results of research projects, descriptions and evaluations of tools, techniques, and methodologies, and reports on work in progress. DSH also publishes reviews of books and resources. Digital Scholarship in the Humanities was previously known as Literary and Linguistic Computing.