Corpus Generation to Develop Amharic Morphological Segmenter

International Journal of Advanced Computer Science and Applications (IF 0.7; JCR Q3, Computer Science, Theory & Methods). Pub Date: 2023-01-01. DOI: 10.14569/ijacsa.2023.01409116
Terefe Feyisa, Seble Hailu
Citations: 0

Abstract

A morphological segmenter is an important component of Amharic natural language processing systems. Despite this, Amharic lacks a large morphologically segmented corpus, and a large corpus is often required to develop neural network-based language technologies. This paper presents an alternative method for generating a large morph-segmented corpus for Amharic. First, a relatively small (138,400 words) morphologically annotated Amharic seed corpus is manually prepared. The annotation identifies the prefixes, stem, and suffixes of a given word. Second, a supervised approach is used to train a conditional random field (CRF)-based seed model on the seed corpus. Applying the seed model for prediction (an unsupervised technique applied to a large set of unsegmented raw Amharic words), a large corpus of 3,777,283 segmented words is automatically generated. Third, the newly generated corpus is used to train an Amharic morphological segmenter based on a supervised neural sequence-to-sequence (seq2seq) approach using character embeddings. Using the seq2seq method, an F-score of 98.65% was measured. The results agree with previous efforts for Arabic. The work presented here has profound implications for future studies of Ethiopian language technologies and may one day help narrow the digital divide between resource-rich and under-resourced languages.
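To make the data flow concrete, the following is a minimal sketch of how a prefix/stem/suffix annotation could be turned into per-character labels of the kind a CRF tagger is typically trained on, and how a boundary-level F-score could be computed. The abstract does not specify the label scheme or the exact F-score definition, so the B/I tagging, the helper names (`to_labels`, `from_labels`, `boundary_f1`), and the romanized toy word are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Set, Tuple


def to_labels(morphs: List[str]) -> Tuple[str, List[str]]:
    """Flatten a segmented word into (word, labels).

    Each character gets "B" if it begins a morph and "I" otherwise.
    This B/I scheme is an assumption, not the paper's documented one.
    """
    word = "".join(morphs)
    labels: List[str] = []
    for morph in morphs:
        labels.extend(["B"] + ["I"] * (len(morph) - 1))
    return word, labels


def from_labels(word: str, labels: List[str]) -> List[str]:
    """Rebuild the morph list from a word and its B/I labels."""
    morphs: List[str] = []
    for ch, label in zip(word, labels):
        if label == "B" or not morphs:
            morphs.append(ch)      # start a new morph
        else:
            morphs[-1] += ch       # extend the current morph
    return morphs


def boundaries(morphs: List[str]) -> Set[int]:
    """Character offsets of the internal morph boundaries."""
    cuts: Set[int] = set()
    pos = 0
    for morph in morphs[:-1]:
        pos += len(morph)
        cuts.add(pos)
    return cuts


def boundary_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over internal boundaries (one possible definition)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gb, pb = boundaries(g), boundaries(p)
        tp += len(gb & pb)
        fp += len(pb - gb)
        fn += len(gb - pb)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative romanized example: prefix "ye", stem "bet", suffix "och".
word, labels = to_labels(["ye", "bet", "och"])
restored = from_labels(word, labels)
score = boundary_f1([["ye", "bet", "och"]], [["ye", "betoch"]])
```

In a bootstrapping setup like the one described, the `(word, labels)` pairs from the seed corpus would train the seed tagger, and `from_labels` would convert its predictions on raw words back into segmented entries for the larger corpus.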
Source journal
CiteScore: 2.30
Self-citation rate: 22.20%
Articles published: 519
About the journal: IJACSA is a scholarly computer science journal representing the best in research. Its mission is to provide an outlet for quality research to be publicised and published to a global audience. The journal aims to publish papers selected through rigorous double-blind peer review to ensure originality, timeliness, relevance, and readability. In sync with the Journal's vision "to be a respected publication that publishes peer reviewed research articles, as well as review and survey papers contributed by International community of Authors", we have drawn reviewers and editors from Institutions and Universities across the globe. A double-blind peer review process is conducted to ensure that we retain high standards. At IJACSA, we stand strong because we know that global challenges make way for new innovations, new ways and new talent. International Journal of Advanced Computer Science and Applications publishes carefully refereed research, review and survey papers which offer a significant contribution to the computer science literature, and which are of interest to a wide audience. Coverage extends to all mainstream branches of computer science and related applications.