Procedure for checking the uniformity of samples of text documents based on nonparametric criteria

Industrial laboratory. Diagnostics of materials. Pub Date: 2023-07-26. DOI: 10.26896/1028-6861-2023-89-7-71-77

S. I. Safin, V. Tolcheev



Abstract

One of the most important tasks in text mining is the formation of sufficiently large, representative, and consistent samples (datasets). Datasets are usually assembled from several information sources. In some cases, because specialized texts in Russian are scarce, the dataset is expanded with translated English-language documents. In such situations it is advisable to assess whether the combined arrays are uniform or heterogeneous. This check is complicated by the fact that documents are multidimensional vectors, and comparing them correctly is a non-trivial task. Because procedures for checking the uniformity of samples are insufficiently developed for the multidimensional case, possible differences in the data are in practice ignored as insignificant. As a result, classifiers are trained on samples that mix quite diverse texts, and the resulting categorization quality does not improve (or even deteriorates). It is therefore relevant to develop a procedure for checking the uniformity of document samples. To this end, we carry out a comprehensive study of the problem of shift in textual data and identify and analyze the causes of heterogeneity in document arrays. In this study, the datasets consist of bibliographic descriptions of scientific articles (title, abstract, keywords). We develop a procedure for assessing the homogeneity of two samples of approximately the same size that use the same method for calculating term weights. The samples are compared through their centroids, whose dimension equals the size of the common dictionary of the two datasets (where a term is absent from a sample, a zero is placed in the corresponding position of its centroid). Representing the samples as «terminological portraits» (centroids) reduces the homogeneity check for multidimensional document vectors to a well-studied problem: the analysis of two one-dimensional paired samples, for which nonparametric tests are used. The study uses the sign test and the Wilcoxon signed-rank test. The proposed procedure was tested on three collections of documents obtained from Russian- and English-language sources.
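The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`centroid`, `paired_centroids`, `sign_test`, `wilcoxon_signed_rank`), the whitespace tokenization, and the use of relative term frequency as the term weight are all assumptions made for illustration, since the abstract does not specify the weighting scheme. The Wilcoxon p-value uses the normal approximation without a tie correction, and the exact sign test is only practical for small dictionaries.

```python
from collections import Counter
from math import comb, erfc, sqrt


def centroid(docs):
    """Mean relative term frequency per term (assumed weighting scheme)."""
    totals = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive tokenization, an assumption
        for term, count in Counter(tokens).items():
            totals[term] += count / len(tokens)
    return {term: weight / len(docs) for term, weight in totals.items()}


def paired_centroids(docs_a, docs_b):
    """Align both centroids on the common dictionary, zero-filling gaps."""
    ca, cb = centroid(docs_a), centroid(docs_b)
    vocab = sorted(set(ca) | set(cb))
    return [ca.get(t, 0.0) for t in vocab], [cb.get(t, 0.0) for t in vocab]


def sign_test(x, y):
    """Two-sided exact sign test p-value; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 1.0
    k = sum(d > 0 for d in diffs)
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilcoxon_signed_rank(x, y):
    """Two-sided p-value via the normal approximation (no tie correction)."""
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    n = len(diffs)
    if n == 0:
        return 1.0
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j < n and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return erfc(abs(w_plus - mean) / sd / sqrt(2))


if __name__ == "__main__":
    russian = ["text mining of documents", "classification of text documents"]
    translated = ["mining of text corpora", "document classification methods"]
    x, y = paired_centroids(russian, translated)
    print("sign test p-value:", sign_test(x, y))
    print("Wilcoxon p-value:", wilcoxon_signed_rank(x, y))
```

A small p-value from either test would suggest that the two terminological portraits differ systematically, that is, that the samples are heterogeneous. In this sketch each dictionary term contributes one pair of centroid components.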


