Testing the classifier adapted to recognize the languages of works based on the Latin alphabet

Zafar Usmanov, Abdunabi A. Kosimov
Journal: Analysis and data processing systems
Publication type: Journal Article
Published: 2021-06-18
DOI: 10.17212/2782-2001-2021-2-83-94
URL: https://doi.org/10.17212/2782-2001-2021-2-83-94

Abstract

Using a model collection of 10 texts in five languages written in the Latin script (English, German, Spanish, Italian, and French), the article establishes that the γ-classifier can automatically recognize the language of a work from the frequencies of the 26 common Latin letters. The mathematical model of the γ-classifier is a triad. Its first component is the digital portrait (DP) of a text, that is, the frequency distribution of alphabetic unigrams in the text; the second is a set of formulas for calculating distances between the DPs of texts; and the third is a machine learning algorithm that implements the hypothesis that works written in one language are “homogeneous” and works written in different languages are “heterogeneous”. The algorithm was tuned on a table of pairwise distances between all works in the model collection by determining the optimal value of the real parameter γ, the value that minimizes the error of violating the “homogeneity” hypothesis. Trained on the texts of the model collection, the γ-classifier recognized the languages of the works with 100% accuracy. To test the classifier, six additional random texts were selected, five of them in the same languages as the texts of the model collection. Under the nearest-neighbor method (by distance), each of these five texts confirmed its homogeneity with the corresponding pair of works in the same language. The sixth text, in Romanian, proved heterogeneous with respect to all elements of the collection. At the same time, by minimum distance it was closest to the two Spanish texts and next closest to the two Italian works.
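
The triad described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the distance formulas or the exact tuning procedure, so the sketch makes two loud assumptions: the distance between digital portraits is the sum of absolute frequency differences (L1), and γ acts as a distance threshold below which two texts count as "homogeneous". All function names (`digital_portrait`, `distance`, `tune_gamma`, `classify`) are hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import combinations
import string

ALPHABET = string.ascii_lowercase  # the 26 common Latin letters


def digital_portrait(text):
    """Relative frequency of each of the 26 letters; other characters are ignored."""
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("text contains no Latin letters")
    return [counts[ch] / total for ch in ALPHABET]


def distance(p, q):
    """ASSUMPTION: L1 distance; the article's own formulas are not in the abstract."""
    return sum(abs(a - b) for a, b in zip(p, q))


def tune_gamma(portraits, labels):
    """ASSUMPTION: gamma is a threshold chosen to minimize misjudged pairs.

    A pair is misjudged if it is same-language but farther apart than gamma,
    or different-language but within gamma.
    """
    pairs = [
        (distance(portraits[i], portraits[j]), labels[i] == labels[j])
        for i, j in combinations(range(len(portraits)), 2)
    ]

    def errors(gamma):
        return sum((d <= gamma) != same for d, same in pairs)

    # Only observed distances need to be tried as candidate thresholds.
    return min(sorted({d for d, _ in pairs}), key=errors)


def classify(text, portraits, labels, gamma):
    """Nearest-neighbour label, or None if even the nearest text is beyond gamma."""
    p = digital_portrait(text)
    d, label = min((distance(p, q), lab) for q, lab in zip(portraits, labels))
    return label if d <= gamma else None
```

Under these assumptions, a text whose nearest neighbour lies beyond γ is reported as heterogeneous with respect to the whole collection (`None`), which mirrors the behaviour the abstract describes for the Romanian text; the nearest labels can still be inspected separately to see which languages it is closest to.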