Two-stage imputation method to handle missing data for categorical response variable

Communications for Statistical Applications and Methods (IF 0.5; JCR Q4, Statistics & Probability); Pub Date: 2023-11-30; DOI: 10.29220/csam.2023.30.6.577
Jong-Min Kim, Kee-Jae Lee, Seung-Joo Lee
Citations: 0

Abstract

Conventional imputation techniques for categorical data, such as mode imputation, often suffer from overestimation. When a variable has too many categories, imputation by multinomial logistic regression may be infeasible because of computational limitations. To address these limitations, we propose a two-stage imputation method. In the first stage, we apply the Boruta variable selection method to the complete dataset to identify the variables that are important for the target categorical variable. In the second stage, we use those important variables as predictors: logistic regression imputes missing data in binary variables, polytomous regression imputes missing data in categorical variables, and predictive mean matching imputes missing data in quantitative variables. Through analysis of asymmetric and non-normal simulated and real data, we demonstrate that the two-stage imputation method outperforms imputation methods that lack variable selection, as measured by accuracy. In an analysis of real survey data, we also show that the proposed two-stage method is more accurate than the current imputation approach.
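The two stages described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation, and it swaps in simpler techniques so that it runs on the standard library alone: stage 1 keeps the Boruta idea of comparing each predictor against shuffled "shadow" copies, but scores importance with empirical mutual information instead of a random forest and uses a simple majority rule instead of a statistical test; stage 2 replaces polytomous regression with a random draw from observed target values that share the same selected predictor values. The function names, the toy data, and the assumption that only the target has missing values are all hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (nats) between two categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())


def select_predictors(rows, target, n_rounds=30, seed=0):
    """Stage 1 (Boruta-like stand-in): keep predictors whose importance beats
    the best shuffled shadow predictor in more than half of the rounds."""
    rng = random.Random(seed)
    # Use complete cases only, mirroring selection on the complete dataset.
    complete = [r for r in rows if all(v is not None for v in r.values())]
    y = [r[target] for r in complete]
    predictors = [k for k in rows[0] if k != target]
    real = {p: mutual_information([r[p] for r in complete], y) for p in predictors}
    hits = Counter()
    for _ in range(n_rounds):
        shadow_best = 0.0
        for p in predictors:
            shadow = [r[p] for r in complete]
            rng.shuffle(shadow)  # shuffling breaks any link with the target
            shadow_best = max(shadow_best, mutual_information(shadow, y))
        for p in predictors:
            if real[p] > shadow_best:
                hits[p] += 1
    return [p for p in predictors if hits[p] > n_rounds / 2]


def impute_target(rows, target, predictors, seed=0):
    """Stage 2 (stand-in for polytomous regression): draw a donor value from
    observed rows with the same values on the selected predictors."""
    rng = random.Random(seed)
    observed = [r for r in rows if r[target] is not None]
    donors_by_cell = defaultdict(list)
    for r in observed:
        donors_by_cell[tuple(r[p] for p in predictors)].append(r[target])
    fallback = [r[target] for r in observed]  # used when a cell has no donors
    imputed = []
    for r in rows:
        r = dict(r)  # copy so the input rows are left untouched
        if r[target] is None:
            key = tuple(r[p] for p in predictors)
            r[target] = rng.choice(donors_by_cell.get(key, fallback))
        imputed.append(r)
    return imputed


# Toy data: y depends only on x1; x2 is pure noise.
data_rng = random.Random(1)
labels = {"a": "low", "b": "mid", "c": "high"}
rows = []
for _ in range(300):
    x1 = data_rng.choice("abc")
    rows.append({"x1": x1, "x2": data_rng.choice("uv"), "y": labels[x1]})
truth = [r["y"] for r in rows]
missing = list(range(0, 300, 5))  # delete every fifth target value
for i in missing:
    rows[i]["y"] = None

selected = select_predictors(rows, "y")
imputed = impute_target(rows, "y", selected)
accuracy = sum(imputed[i]["y"] == truth[i] for i in missing) / len(missing)
print("selected predictors:", selected)
print("accuracy on deleted values:", accuracy)
```

Accuracy here is the share of deliberately deleted target values that the imputation recovers, which is one plausible reading of the accuracy measures mentioned in the abstract. A fuller version would use a random-forest-based Boruta implementation and regression-based imputers, and would also handle missing values in the predictors.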
Source journal
CiteScore: 0.90; self-citation rate: 0.00%; articles published: 49
About the journal: Communications for Statistical Applications and Methods (Commun. Stat. Appl. Methods, CSAM) is an official journal of the Korean Statistical Society and Korean International Statistical Society. It is an international, Open Access journal dedicated to publishing peer-reviewed, high-quality and innovative statistical research. CSAM publishes articles on applied and methodological research in statistics and probability. It features rapid publication and broad coverage of statistical applications and methods. It welcomes papers on novel applications of statistical methodology in areas including medicine (pharmaceutical, biotechnology, medical device), business, management, economics, ecology, education, computing, engineering, operational research, biology, sociology and earth science; papers from other areas are also considered.