The development of stemming algorithm for the Uzbek language

Кибернетика и программирование Pub Date : 1900-01-01 DOI:10.25136/2644-5522.2021.1.35847

Бакаев Илхом Изатович



Abstract

The automatic processing of unstructured natural-language text is one of the pressing problems in computer analysis and synthesis of text. Within this problem, the author singles out the task of text normalization, which usually involves tokenization, stemming, and lemmatization. Most existing stemming algorithms target synthetic languages with inflectional morphemes. Uzbek is an agglutinative language, characterized by the polysemy of its affixal and auxiliary morphemes. Uzbek differs substantially from languages such as English, which stemming algorithms process successfully, and there are virtually no examples of an effective stemming implementation for Uzbek; this question is therefore of scientific interest and defines the goal of this work. In the course of the research, the author solved the task of bringing given Uzbek texts to normal form; the texts had been tokenized and cleared of stop words at a preliminary stage. The author developed a method for normalizing Uzbek texts based on a stemming algorithm. The algorithm was developed with a hybrid approach that combines an algorithmic method, a lexicon of linguistic rules, and a database of normal word forms of the Uzbek language. The precision of the proposed algorithm depends on the precision of the tokenization algorithm. The article does not address finding the roots of paired words separated by spaces, since that task is handled at the tokenization stage. The algorithm can be integrated into automated systems for machine translation, information extraction, data retrieval, and similar applications.
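The abstract does not give the algorithm itself, so the following Python sketch only illustrates the general shape of a hybrid stemmer of the kind described: a lexicon lookup combined with rule-based suffix stripping, applied after tokenization and stop-word removal. The suffix list, lexicon, stop-word set, and all function names here are hypothetical, deliberately tiny stand-ins chosen for illustration; they are not the author's rule lexicon or word-form database.

```python
import re

# Hypothetical miniature resources (assumptions, not taken from the paper).
SUFFIXES = ["lar", "imiz", "ingiz", "ning", "dan", "ni", "ga", "da", "im", "ing", "i"]
LEXICON = {"kitob", "maktab", "bola", "olim"}   # stand-in for the normal-form database
STOP_WORDS = {"va", "bilan", "uchun"}

# Try longer suffixes first so that "ning" wins over "ing", "dan" over "da", etc.
_ORDERED_SUFFIXES = sorted(SUFFIXES, key=len, reverse=True)
MIN_STEM_LENGTH = 2


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def stem(word):
    """Strip suffixes one layer at a time until a lexicon entry is reached
    or no stripping rule applies any more."""
    candidate = word
    while candidate not in LEXICON:
        for suffix in _ORDERED_SUFFIXES:
            remaining = len(candidate) - len(suffix)
            if candidate.endswith(suffix) and remaining >= MIN_STEM_LENGTH:
                candidate = candidate[:-len(suffix)]
                break
        else:
            break  # no rule matched: fall back to the algorithmic result
    return candidate


def normalize(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    return [stem(token) for token in tokenize(text) if token not in STOP_WORDS]


if __name__ == "__main__":
    print(normalize("Bolalar maktabda va kitoblarimizdan"))
```

Checking the lexicon before every stripping step is the design choice that matters for an agglutinative language: affixes stack (kitob + lar + imiz + dan), so stripping has to repeat, but a root such as "olim" that happens to end in a suffix-like string must be left alone once it is recognized. Words missing from the lexicon fall back to pure rule-based stripping, which is where over-stemming errors would be expected.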


