Sinhala Named Entity Recognition Model: Domain-Specific Classes in Sports

2022 4th International Conference on Advancements in Computing (ICAC) | Pub Date: 2022-12-09 | DOI: 10.1109/ICAC57685.2022.10025148

W.M.S.K. Wijesinghe, Muditha Tissera


Citations: 0

Abstract


Named Entity Recognition (NER) is a crucial subtask of most Natural Language Processing (NLP) tasks. However, constructing an NER system for the Sinhala language is challenging because Sinhala is a low-resource language. The proposed approach therefore designs a mechanism to identify specific named entities in the sports domain. First, a domain-specific corpus was built from Sinhala sports e-news articles. Then a semi-automated, rule-based component named “Class_Label_Suggester” was built to annotate pre-defined named entities. After automatic annotation, the output was validated manually with little effort. Finally, the NER model was trained on the annotated data using Linear Perceptron, Stochastic Gradient Descent (SGD), Multinomial Naive Bayes (MNB), and Passive Aggressive classifiers. Although all of these Machine Learning (ML) algorithms showed approximately 98% accuracy, the MNB model demonstrated the highest accuracy for the identified class labels: 99.76% for ‘Ground’, 99.53% for ‘School’, 98.55% for ‘Tournament’, and 97.87% for ‘Other’. The precision values for these classes were 81%, 72%, 62%, and 98%, respectively. The main contributions of the study are an accurately annotated Sinhala dataset and the trained Sinhala NER model.
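The abstract reports per-class accuracy near 99% alongside noticeably lower precision for the rarer classes. A minimal sketch of how that gap can arise is given here, assuming the per-class figures are computed one-vs-rest over tokens (the abstract does not say how they were computed). The function name `per_class_accuracy_and_precision` and the toy label counts are hypothetical and are not taken from the paper's dataset.

```python
def per_class_accuracy_and_precision(gold, pred, label):
    """One-vs-rest accuracy and precision for a single class label."""
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if p == label and g == label:
            tp += 1      # predicted the class, and it was correct
        elif p == label:
            fp += 1      # predicted the class, but gold is something else
        elif g == label:
            fn += 1      # missed a gold instance of the class
        else:
            tn += 1      # correctly left the token out of the class
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision


# Hypothetical, heavily imbalanced toy data: 20 'Ground' tokens in 1000.
gold = ["Ground"] * 20 + ["Other"] * 980
# Assumed predictions: 17 hits, 3 misses, and 4 false alarms for 'Ground'.
pred = ["Ground"] * 17 + ["Other"] * 3 + ["Ground"] * 4 + ["Other"] * 976

for label in ("Ground", "Other"):
    acc, prec = per_class_accuracy_and_precision(gold, pred, label)
    print(f"{label}: accuracy={acc:.3f} precision={prec:.3f}")
```

In this toy setting the many true negatives keep the accuracy for 'Ground' above 99% even though roughly one in five 'Ground' predictions is wrong, which is why precision is the more informative figure for rare entity classes.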

