Procedure for checking the uniformity of samples of text documents based on nonparametric criteria

S. I. Safin, V. Tolcheev
Industrial laboratory. Diagnostics of materials. Published 2023-07-26. DOI: 10.26896/1028-6861-2023-89-7-71-77 (https://doi.org/10.26896/1028-6861-2023-89-7-71-77)

Abstract

One of the most important tasks in Text Mining is the formation of sufficiently large, representative, and consistent samples (datasets). Datasets are usually obtained from various information sources. In some cases, because specialized texts in Russian are scarce, the dataset is expanded by adding translated English-language documents. In such situations, it is advisable to evaluate the homogeneity or heterogeneity of the combined arrays. However, this verification is complicated by the fact that the documents are multidimensional vectors, and comparing them correctly is a non-trivial task. Because procedures for checking the uniformity of samples in the multidimensional case are insufficiently developed, the problem of possible differences in the data is, in practice, ignored as insignificant. As a result, classifiers are trained on samples that mix quite diverse texts, and the resulting quality of categorization does not improve (or even deteriorates). It therefore seems relevant to develop a procedure for checking the uniformity of document samples. To do this, we carried out a comprehensive study of the problem of shift in textual data and identified and analyzed the causes of heterogeneity in document arrays. In this study, the datasets consist of bibliographic descriptions of scientific articles (title, abstract, keywords). We develop a procedure for assessing the homogeneity of two samples of approximately the same size that use the same method for calculating term weights. For comparison, centroids are used whose dimension equals the size of the common dictionary of the two datasets (where a term is absent from a dataset, zero is placed in the corresponding position of its centroid).
Representing the samples as «terminological portraits» (centroids) allowed us to reduce the homogeneity check for multidimensional document vectors to the well-studied problem of analyzing two one-dimensional paired (dependent) samples, for which nonparametric criteria were used: the sign test and the Wilcoxon signed-rank test. The proposed procedure for checking the uniformity of samples was tested on three collections of documents obtained from Russian- and English-language sources.
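
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that documents are already tokenized and that term weights are raw term frequencies; the abstract only requires that both samples use the same weighting method. All function names are hypothetical. The Wilcoxon p-value uses a normal approximation without tie or continuity corrections, which is only reasonable when the common dictionary is large.

```python
from collections import Counter
from math import comb, erfc, sqrt


def centroid(docs):
    """Mean weight of each term over a sample (assumed weight: raw term frequency)."""
    totals = Counter()
    for tokens in docs:
        totals.update(tokens)
    return {term: count / len(docs) for term, count in totals.items()}


def paired_centroids(docs_a, docs_b):
    """Align both centroids on the common dictionary; absent terms get zero."""
    c_a, c_b = centroid(docs_a), centroid(docs_b)
    vocab = sorted(set(c_a) | set(c_b))
    x = [c_a.get(term, 0.0) for term in vocab]
    y = [c_b.get(term, 0.0) for term in vocab]
    return vocab, x, y


def sign_test(x, y):
    """Two-sided exact sign test for paired values; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in diffs)
    # Probability of a count at least as extreme under Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p) using average ranks and a normal approximation."""
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for pos in range(i, j):
            ranks[pos] = average_rank
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    return w_plus, erfc(abs(z) / sqrt(2))


if __name__ == "__main__":
    # Toy samples, purely illustrative.
    sample_a = [["text", "mining", "sample"], ["text", "sample"]]
    sample_b = [["text", "translation"], ["text", "mining", "translation"]]
    vocab, x, y = paired_centroids(sample_a, sample_b)
    print(vocab)
    print("sign test p-value:", sign_test(x, y))
    print("Wilcoxon (W+, p):", wilcoxon_signed_rank(x, y))
```

In this sketch each term of the common dictionary contributes one pair of centroid weights, so a small p-value from either test would suggest that the two «terminological portraits» differ systematically. The significance level is left to the reader, since the abstract does not state one.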