“Great in, great out” is the new “garbage in, garbage out”: subsampling from data with no response variable using various approaches, including unsupervised learning
Lubomír Štěpánek, Filip Habarta, I. Malá, L. Marek
{"title":"“Great in, great out” is the new “garbage in, garbage out”: subsampling from data with no response variable using various approaches, including unsupervised learning","authors":"Lubomír Štěpánek, Filip Habarta, I. Malá, L. Marek","doi":"10.1109/ICCMA53594.2021.00028","DOIUrl":null,"url":null,"abstract":"When having more initial data than needed, an appropriate selection of a subpopulation from the given dataset, i. e. subsampling, follows, usually intending to ensure all categorical variables’ levels are nearly equally covered and all numerical variables are all well balanced. If a response variable in the original data is missing, popular propensity scoring cannot be performed.This study addresses the subsampling with the missing response variable about to be collected later, using a COVID-19 dataset (N = 698) with 18 variables of interest, reduced to a subdataset (n = 400). The quality of subpopulation selecting was measured using the minimization of a sum of squares of each variable’s category frequency and averaged over all variables.An exhaustive method selecting all possible combinations of n = 400 observations from initial N = 698 observations was used as a benchmark. Choosing the subsample from a random set of them that minimized the metric returned similar results as a “forward” subselection reducing the original dataset an observation-by-observation per each step by permanently lowering the metric. Finally, k-means clustering (with a random number of clusters) of the original dataset’s observations and choosing a subsample from each cluster, proportionally to its size, also lowered the metric compared to the random subsampling.All four latter approaches showed better results than single random subsampling, considering the metric minimization. However, while the exhaustive sampling is very greedy and time-consuming, the forward one-by-one reducing the original dataset, picking up the subsample minimizing the metric, and subsampling the clusters, are feasible for selecting a well-balanced subdataset, ready for ongoing response variable collecting.","PeriodicalId":131082,"journal":{"name":"2021 International Conference on Computing, Computational Modelling and Applications (ICCMA)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 International Conference on Computing, Computational Modelling and Applications (ICCMA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCMA53594.2021.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
When more initial data are available than needed, an appropriate selection of a subpopulation from the given dataset, i.e. subsampling, follows, usually with the intention of covering all levels of the categorical variables nearly equally and keeping all numerical variables well balanced. If the response variable is missing from the original data, popular propensity scoring cannot be performed.

This study addresses subsampling when the response variable is missing and is to be collected later, using a COVID-19 dataset (N = 698) with 18 variables of interest, reduced to a subdataset (n = 400). The quality of the subpopulation selection was measured as the sum of squares of each variable's category frequencies, averaged over all variables, with lower values indicating a better-balanced subsample.

An exhaustive method selecting all possible combinations of n = 400 observations from the initial N = 698 observations was used as a benchmark. Choosing, from a set of random subsamples, the one that minimized the metric returned results similar to a "forward" subselection that reduces the original dataset observation by observation, lowering the metric at each step. Finally, k-means clustering of the original dataset's observations (with a random number of clusters), followed by choosing a subsample from each cluster proportionally to its size, also lowered the metric compared to random subsampling.

All four approaches showed better results than a single random subsampling in terms of metric minimization. However, while the exhaustive sampling is computationally greedy and time-consuming, the forward one-by-one reduction of the original dataset, the selection of the metric-minimizing subsample from random candidates, and the subsampling of clusters are all feasible ways to select a well-balanced subdataset, ready for the subsequent collection of the response variable.
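The abstract describes the balance metric only in words; a plausible formalization consistent with that description (the symbols Q, S, p, K_j, and n_{j,k} are my notation, not the paper's) is:

```latex
% Hypothetical formalization of the balance metric: S is a candidate
% subsample, p the number of variables (here p = 18), K_j the number of
% categories of variable j (numerical variables are assumed to be binned),
% and n_{j,k}(S) the frequency of category k of variable j within S.
\[
  Q(S) \;=\; \frac{1}{p} \sum_{j=1}^{p} \sum_{k=1}^{K_j} n_{j,k}(S)^{2}
\]
```

Since the category frequencies of each variable sum to the fixed subsample size n, the inner sum is smallest exactly when the categories are equally frequent, so minimizing Q(S) favors balanced subsamples.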
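A minimal Python sketch of the metric and the random-candidate strategy, assuming a pandas DataFrame of categorical columns (numerical variables would need to be binned first, e.g. with pd.cut); the names `balance_metric` and `best_random_subsample` are illustrative, not the authors' implementation:

```python
import numpy as np
import pandas as pd

def balance_metric(subsample: pd.DataFrame) -> float:
    """Sum of squared category frequencies per variable, averaged over variables."""
    return float(np.mean([
        (subsample[col].value_counts() ** 2).sum()
        for col in subsample.columns
    ]))

def best_random_subsample(data: pd.DataFrame, n: int,
                          n_candidates: int = 1000, seed: int = 42) -> pd.DataFrame:
    """Draw random size-n subsamples and keep the one with the lowest metric."""
    rng = np.random.default_rng(seed)
    best, best_score = None, np.inf
    for _ in range(n_candidates):
        candidate = data.iloc[rng.choice(len(data), size=n, replace=False)]
        score = balance_metric(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best
```

Unlike the exhaustive benchmark, which would have to enumerate all binomial(698, 400) combinations, this strategy evaluates only a fixed number of random candidates.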
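The "forward" subselection can be read as a greedy, observation-by-observation reduction; a sketch under that reading, reusing `balance_metric` from above (the greedy removal rule is my interpretation of the abstract, not code from the paper):

```python
def forward_reduce(data: pd.DataFrame, n: int) -> pd.DataFrame:
    """Repeatedly drop the single observation whose removal lowers the metric most.

    Note: this requires O(N^2) metric evaluations overall, far cheaper than
    the exhaustive benchmark but slower than random-candidate search.
    """
    current = data.copy()
    while len(current) > n:
        scores = {
            idx: balance_metric(current.drop(index=idx))
            for idx in current.index
        }
        best_removal = min(scores, key=scores.get)  # removal giving the lowest metric
        current = current.drop(index=best_removal)
    return current
```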
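Finally, the cluster-based variant can be sketched with scikit-learn's KMeans after one-hot encoding the categorical variables; the cluster-count range and the proportional allocation rule below are assumptions, as the abstract only states that the number of clusters was random:

```python
from sklearn.cluster import KMeans

def cluster_subsample(data: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Cluster the observations, then sample from each cluster proportionally."""
    rng = np.random.default_rng(seed)
    k = int(rng.integers(2, 11))        # a random number of clusters (assumed range)
    features = pd.get_dummies(data)     # numeric encoding required by k-means
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
    parts = []
    for cluster in range(k):
        members = data[labels == cluster]
        share = max(1, round(n * len(members) / len(data)))  # proportional allocation
        take = min(share, len(members))
        parts.append(members.iloc[rng.choice(len(members), size=take, replace=False)])
    return pd.concat(parts).iloc[:n]    # trim any rounding surplus to n rows

# Hypothetical usage on a DataFrame named `covid_data`:
# subsample = cluster_subsample(covid_data, n=400)
```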