“Great in, great out” is the new “garbage in, garbage out”: subsampling from data with no response variable using various approaches, including unsupervised learning

Lubomír Štěpánek, Filip Habarta, I. Malá, L. Marek
DOI: 10.1109/ICCMA53594.2021.00028
Published in: 2021 International Conference on Computing, Computational Modelling and Applications (ICCMA), July 2021

Abstract

When more initial data are available than needed, a subpopulation must be selected from the given dataset, i.e. subsampled, usually with the aim of ensuring that all levels of the categorical variables are nearly equally covered and all numerical variables are well balanced. If the response variable is missing from the original data, popular propensity scoring cannot be performed.

This study addresses subsampling when the response variable is missing and is to be collected later, using a COVID-19 dataset (N = 698) with 18 variables of interest, reduced to a subdataset (n = 400). The quality of the subpopulation selection was measured by minimizing the sum of squares of each variable's category frequencies, averaged over all variables.

An exhaustive method evaluating all possible combinations of n = 400 observations drawn from the initial N = 698 observations served as a benchmark. Choosing, from a random set of candidate subsamples, the one that minimized the metric returned results similar to a "forward" subselection that reduces the original dataset observation by observation, lowering the metric at each step. Finally, k-means clustering of the original dataset's observations (with a randomly chosen number of clusters), followed by sampling from each cluster proportionally to its size, also lowered the metric compared with random subsampling.

All four approaches outperformed a single random subsample with respect to metric minimization. However, while exhaustive sampling is extremely time-consuming, the forward observation-by-observation reduction, the selection of the metric-minimizing subsample from random candidates, and the proportional subsampling of clusters are all feasible ways to select a well-balanced subdataset, ready for subsequent collection of the response variable.
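The balance metric and the "best of many random subsamples" approach can be sketched in Python as follows. The exact formula is an assumption (absolute counts rather than relative frequencies, and numerical variables presumed already discretized into categories), as is the helper name `best_of_random`; the paper's own implementation is not reproduced in this abstract.

```python
import numpy as np
import pandas as pd

def balance_metric(df: pd.DataFrame) -> float:
    """Assumed metric: per-variable sum of squared category frequencies,
    averaged over all variables.  For a fixed subsample size, the sum of
    squares is smallest when all category counts are equal, so a lower
    value indicates a better-balanced subsample."""
    per_variable = [
        float(np.sum(df[col].value_counts().to_numpy(dtype=float) ** 2))
        for col in df.columns
    ]
    return float(np.mean(per_variable))

def best_of_random(df: pd.DataFrame, n_target: int,
                   n_candidates: int = 1000, seed: int = 0) -> pd.DataFrame:
    """Draw n_candidates random subsamples of size n_target and keep the
    one that minimizes the balance metric."""
    rng = np.random.default_rng(seed)
    best, best_score = None, float("inf")
    for _ in range(n_candidates):
        candidate = df.sample(n=n_target, random_state=int(rng.integers(2**31)))
        score = balance_metric(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best
```

With absolute counts, a perfectly balanced 4-row column (2 + 2) scores 2² + 2² = 8, while a 3 + 1 split scores 3² + 1² = 10, so minimizing the metric drives the category counts toward equality.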
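The "forward" subselection can be sketched as a greedy loop that drops one observation per step, each time removing the row whose removal most lowers the metric. This is a reconstruction from the abstract's description, not the authors' code, and it assumes the sum-of-squared-category-frequencies metric described above.

```python
import numpy as np
import pandas as pd

def balance_metric(df: pd.DataFrame) -> float:
    """Assumed metric: per-variable sum of squared category frequencies,
    averaged over all variables (lower means better balanced)."""
    per_variable = [
        float(np.sum(df[col].value_counts().to_numpy(dtype=float) ** 2))
        for col in df.columns
    ]
    return float(np.mean(per_variable))

def forward_reduce(df: pd.DataFrame, n_target: int) -> pd.DataFrame:
    """Greedy observation-by-observation reduction: while more than
    n_target rows remain, drop the single row whose removal yields the
    lowest metric value."""
    sub = df
    while len(sub) > n_target:
        # Try removing each remaining row in turn; keep the best removal.
        drop_idx = min(sub.index,
                       key=lambda i: balance_metric(sub.drop(index=i)))
        sub = sub.drop(index=drop_idx)
    return sub
```

Each step evaluates the metric once per remaining row, so reducing N = 698 to n = 400 costs on the order of N² metric evaluations — vastly cheaper than the exhaustive benchmark, which would enumerate all C(698, 400) subsets.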