The development of stemming algorithm for the Uzbek language

Бакаев Илхом Изатович
DOI: 10.25136/2644-5522.2021.1.35847
Journal: Кибернетика и программирование
URL: https://doi.org/10.25136/2644-5522.2021.1.35847

Abstract

The automatic processing of unstructured natural-language texts is one of the topical problems of computer text analysis and synthesis. Within this problem, the author singles out the task of text normalization, which usually involves such processes as tokenization, stemming, and lemmatization. Existing stemming algorithms are for the most part oriented towards synthetic languages with inflectional morphemes. Uzbek is an agglutinative language, characterized by polysemy of affixal and auxiliary morphemes. It differs considerably from, for example, English, which is successfully processed by stemming algorithms. There are virtually no examples of effective stemming implementations for Uzbek; this question is therefore of scientific interest and defines the goal of this work. In the course of this research, the author solved the task of bringing given Uzbek texts, which had been tokenized and cleared of stop words at a preliminary stage, to normal form. The author developed a method for normalizing Uzbek texts based on a stemming algorithm. The stemming algorithm was developed using a hybrid approach that combines an algorithmic method, a lexicon of linguistic rules, and a database of normal word forms of the Uzbek language. The precision of the proposed algorithm depends on the precision of the tokenization algorithm. The article does not explore the question of finding the roots of paired words separated by spaces, as this task is solved at the tokenization stage. The algorithm can be integrated into various automated systems for machine translation, information extraction, data retrieval, etc.
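The hybrid idea described in the abstract (rule-based affix stripping checked against a database of normal forms) can be illustrated with a minimal Python sketch. It is not the author's algorithm: the suffix list, the four-word lexicon, the `min_len` threshold, and the function name `stem` are all hypothetical choices made for illustration, and the abstract does not publish the actual rules or database. Input is assumed to be already tokenized and cleared of stop words.

```python
# Hypothetical, deliberately tiny resources. A real system would load a full
# rule lexicon and a database of normal word forms instead.
SUFFIXES = ("lar", "ning", "ni", "ga", "da", "dan", "imiz", "ingiz", "im", "ing", "i")
LEXICON = {"kitob", "maktab", "bola", "shahar"}


def stem(word, lexicon=LEXICON, suffixes=SUFFIXES, min_len=3):
    """Strip suffixes from the right until the remainder is a known normal form.

    If no lexicon entry is ever reached, the purely rule-based result is
    returned as a best-effort fallback (an assumption of this sketch).
    """
    candidate = word.lower()
    # Try longer suffixes first so that "dan" wins over a shorter overlap.
    ordered = sorted(suffixes, key=len, reverse=True)
    while candidate not in lexicon:
        for suffix in ordered:
            remainder_len = len(candidate) - len(suffix)
            if candidate.endswith(suffix) and remainder_len >= min_len:
                candidate = candidate[:-len(suffix)]
                break
        else:
            # No rule applies any more: stop stripping.
            break
    return candidate


if __name__ == "__main__":
    tokens = ["kitoblarimizdan", "maktabda", "bolalarni", "daftarlar"]
    print([stem(t) for t in tokens])
    # prints ['kitob', 'maktab', 'bola', 'daftar']
```

The lexicon check after each stripped suffix is what stops the loop at "kitob" in "kitob-lar-imiz-dan"; the word "daftarlar" is not in the toy lexicon, so it falls back to plain rule-based stripping.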