Training set formation in machine learning problems (review)

Informatsionno-Upravliaiushchie Sistemy (Q3, Mathematics) · Pub Date: 2021-09-13 · DOI: 10.31799/1684-8853-2021-4-61-70

A. Parasich, V. Parasich, I. Parasich


Citations: 2

Abstract

Introduction: Proper training set formation is a key factor in machine learning. Real training sets commonly contain problems and errors that have a critical impact on the training result. A training set needs to be formed in every machine learning problem, so knowing the possible difficulties is helpful. Purpose: To give an overview of the problems that can arise when forming a training set, in order to facilitate their detection and elimination when working with real training sets, and to analyze the impact of these problems on the training results. Results: The article gives an overview of possible errors in training set formation, such as lack of data, imbalance, false patterns, sampling from a limited set of sources, change in the general population over time, and others. We discuss the influence of these errors on the training result, on test set formation, and on the measurement of training algorithm quality. Pseudo-labeling, data augmentation, and hard sample mining are considered the most effective ways to expand a training set. We offer practical recommendations for forming a training or test set, with examples from Kaggle competitions. For the problem of cross-dataset generalization in neural network training, we propose an algorithm called Cross-Dataset Machine, which is simple to implement and yields a gain in cross-dataset generalization. Practical relevance: The material in this article can be used as a practical guide in solving machine learning problems.
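The abstract names pseudo-labeling as one way to expand a training set but does not describe a procedure. The sketch below is only a generic illustration of the idea, not the authors' method: a model fitted on the labeled data predicts labels for unlabeled samples, and only predictions above a confidence threshold are added to the training set. The nearest-centroid classifier, the function names (`fit_centroids`, `predict_proba`, `pseudo_label`), the synthetic data, and the threshold of 0.95 are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def fit_centroids(X, y, n_classes):
    """Toy 'model' (assumption): one mean vector per class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def pseudo_label(X_lab, y_lab, X_unlab, n_classes, threshold=0.95):
    """Add confidently predicted unlabeled samples to the labeled set."""
    centroids = fit_centroids(X_lab, y_lab, n_classes)
    proba = predict_proba(centroids, X_unlab)
    confident = proba.max(axis=1) >= threshold  # keep only confident predictions
    X_new = np.concatenate([X_lab, X_unlab[confident]])
    y_new = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    return X_new, y_new, confident

# Synthetic two-class data (hypothetical): few labeled, many unlabeled samples.
rng = np.random.default_rng(0)
X_lab = np.concatenate([rng.normal(-2, 1, (5, 2)), rng.normal(2, 1, (5, 2))])
y_lab = np.array([0] * 5 + [1] * 5)
X_unlab = np.concatenate([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])

X_aug, y_aug, mask = pseudo_label(X_lab, y_lab, X_unlab, n_classes=2)
print(f"labeled: {len(y_lab)}, after pseudo-labeling: {len(y_aug)}")
```

In practice the threshold trades the amount of added data against label noise: a lower value adds more samples but also more wrong pseudo-labels, which is one of the training set errors the abstract warns about.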




Source journal

Informatsionno-Upravliaiushchie Sistemy Mathematics-Control and Optimization

CiteScore

1.40

Self-citation rate

0.00%

Articles published