为低资源语言构建词典:无监督学习的挑战

D. Mati, Mentor Hamiti, Arsim Susuri, B. Selimi, Jaumin Ajdari
{"title":"为低资源语言构建词典:无监督学习的挑战","authors":"D. Mati, Mentor Hamiti, Arsim Susuri, B. Selimi, Jaumin Ajdari","doi":"10.33166/AETIC.2021.03.005","DOIUrl":null,"url":null,"abstract":"The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research conducted on unsupervised learning-the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. The majority of languages have their own dictionary, but languages with low resources suffer from a lack of resources. It facilitates the sharing of information and services for users and whole communities through natural language processing. The experimentation corpora for the Albanian language includes 250K sentences from different disciplines, with a proposal for a part-of-speech tagging tag set that can adequately represent the underlying linguistic phenomena. Contributing to the development of Albanian is the purpose of this paper. The results of experiments with the Albanian language corpus revealed that its use of articles and pronouns resembles that of more high-resource languages. According to this study, the total expected frequency as a means for correctly tagging words has been proven effective for populating the Albanian language dictionary.","PeriodicalId":36440,"journal":{"name":"Annals of Emerging Technologies in Computing","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning\",\"authors\":\"D. Mati, Mentor Hamiti, Arsim Susuri, B. Selimi, Jaumin Ajdari\",\"doi\":\"10.33166/AETIC.2021.03.005\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research conducted on unsupervised learning-the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. The majority of languages have their own dictionary, but languages with low resources suffer from a lack of resources. It facilitates the sharing of information and services for users and whole communities through natural language processing. The experimentation corpora for the Albanian language includes 250K sentences from different disciplines, with a proposal for a part-of-speech tagging tag set that can adequately represent the underlying linguistic phenomena. Contributing to the development of Albanian is the purpose of this paper. The results of experiments with the Albanian language corpus revealed that its use of articles and pronouns resembles that of more high-resource languages. According to this study, the total expected frequency as a means for correctly tagging words has been proven effective for populating the Albanian language dictionary.\",\"PeriodicalId\":36440,\"journal\":{\"name\":\"Annals of Emerging Technologies in Computing\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Annals of Emerging Technologies in Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.33166/AETIC.2021.03.005\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Computer Science\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Annals of Emerging Technologies in Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.33166/AETIC.2021.03.005","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 2

摘要

近年来,阿尔巴尼亚语自然语言处理资源的开发稳步增长。本文介绍了一项关于无监督学习的研究——与建立阿尔巴尼亚语词典和创建词性标注模型相关的挑战。大多数语言都有自己的词典,但资源少的语言却缺乏资源。它通过自然语言处理促进了用户和整个社区的信息和服务共享。阿尔巴尼亚语的实验语料库包括来自不同学科的250K个句子,并提出了一个词性标记标签集,可以充分代表潜在的语言现象。为阿尔巴尼亚语的发展做出贡献是本文的目的。对阿尔巴尼亚语语料库的实验结果显示,其冠词和代词的使用与其他资源丰富的语言相似。根据这项研究,总期望频率作为正确标记单词的一种手段已被证明是有效的填充阿尔巴尼亚语词典。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning
The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research conducted on unsupervised learning-the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. The majority of languages have their own dictionary, but languages with low resources suffer from a lack of resources. It facilitates the sharing of information and services for users and whole communities through natural language processing. The experimentation corpora for the Albanian language includes 250K sentences from different disciplines, with a proposal for a part-of-speech tagging tag set that can adequately represent the underlying linguistic phenomena. Contributing to the development of Albanian is the purpose of this paper. The results of experiments with the Albanian language corpus revealed that its use of articles and pronouns resembles that of more high-resource languages. According to this study, the total expected frequency as a means for correctly tagging words has been proven effective for populating the Albanian language dictionary.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Annals of Emerging Technologies in Computing
Annals of Emerging Technologies in Computing Computer Science-Computer Science (all)
CiteScore
3.50
自引率
0.00%
发文量
26
期刊最新文献
The Proposal of Countermeasures for DeepFake Voices on Social Media Considering Waveform and Text Embedding Lightweight Model for Occlusion Removal from Face Images A Torpor-based Enhanced Security Model for CSMA/CA Protocol in Wireless Networks Enhancing Robot Navigation Efficiency Using Cellular Automata with Active Cells Wildfire Prediction in the United States Using Time Series Forecasting Models
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1