Bayesian Latent Class Models for the Multiple Imputation of Categorical Data
D. Vidotto, J. Vermunt, K. Van Deun
DOI: 10.1027/1614-2241/a000146
Published: 2018-06-21
Citations: 7
Abstract
Latent class (LC) analysis has recently been proposed for the multiple imputation (MI) of missing categorical data, using either a standard frequentist approach or a nonparametric Bayesian model called the Dirichlet process mixture of multinomial distributions (DPMM). The main advantage of using a latent class model for multiple imputation is its flexibility: it can capture complex relationships in the data, provided the number of latent classes is large enough. However, the two existing approaches also have certain disadvantages. The frequentist approach is computationally demanding because it requires estimating many LC models: models with different numbers of classes must first be estimated to determine the required number of classes, and the selected model must subsequently be re-estimated on multiple bootstrap samples to account for parameter uncertainty during the imputation stage. Whereas the Bayesian Dirichlet process model performs model selection and handles parameter uncertainty automatically, its disadvantage is that it tends to use too small a number of clusters during Gibbs sampling, leading to an underfitting model that yields invalid imputations. In this paper, we propose an alternative approach that combines the strengths of the two existing approaches: we use a standard Bayesian latent class model as the imputation model. We show how model selection can be performed prior to the imputation step using a single run of the Gibbs sampler and, moreover, how underfitting is prevented by using large values for the hyperparameters of the mixture weights. The results of two simulation studies and one real-data study indicate that, with a proper setting of the prior distributions, the Bayesian latent class model yields valid imputations and outperforms competing methods.
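The general scheme the abstract describes, a Gibbs sampler for a Bayesian latent class model in which missing cells are repeatedly redrawn from their class-conditional distributions, and in which a large Dirichlet hyperparameter on the mixture weights discourages the sampler from collapsing to too few classes, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the coding of missing entries as -1, and all default hyperparameter values are assumptions for the sake of the example.

```python
import numpy as np

def gibbs_lc_impute(X, K=10, alpha_weights=50.0, alpha_cond=1.0,
                    n_iter=500, burn_in=200, n_imputations=5, rng=None):
    """Hypothetical sketch of MI under a Bayesian latent class model.

    X : (n, p) integer array of categorical data, missing cells coded as -1.
    A large alpha_weights keeps the Dirichlet posterior on the mixture
    weights from emptying out classes (the underfitting problem noted
    for the DPMM approach).
    """
    if rng is None:
        rng = np.random.default_rng()
    n, p = X.shape
    n_cat = X.max(axis=0) + 1              # categories per variable, coded 0..C-1
    pi = np.full(K, 1.0 / K)               # mixture weights
    # theta[j][k] = category probabilities of variable j within class k
    theta = [rng.dirichlet(np.ones(c), size=K) for c in n_cat]

    # Start from random fills for the missing cells.
    X_imp = X.copy()
    for j in range(p):
        miss = X[:, j] < 0
        X_imp[miss, j] = rng.integers(0, n_cat[j], size=miss.sum())

    draws = np.linspace(burn_in, n_iter - 1, n_imputations).astype(int)
    imputations = []
    for it in range(n_iter):
        # 1) Sample each unit's class membership given current parameters.
        logp = np.tile(np.log(pi), (n, 1))
        for j in range(p):
            logp += np.log(theta[j][:, X_imp[:, j]]).T
        prob = np.exp(logp - logp.max(axis=1, keepdims=True))
        prob /= prob.sum(axis=1, keepdims=True)
        z = np.array([rng.choice(K, p=prob[i]) for i in range(n)])

        # 2) Sample mixture weights; the large prior keeps classes "alive".
        counts = np.bincount(z, minlength=K)
        pi = rng.dirichlet(alpha_weights + counts)

        # 3) Sample the class-conditional category probabilities.
        for j in range(p):
            for k in range(K):
                cj = np.bincount(X_imp[z == k, j], minlength=n_cat[j])
                theta[j][k] = rng.dirichlet(alpha_cond + cj)

        # 4) Redraw missing cells from their class-conditional distribution.
        for j in range(p):
            miss = X[:, j] < 0
            if miss.any():
                probs = theta[j][z[miss]]
                u = rng.random(miss.sum())
                X_imp[miss, j] = (probs.cumsum(axis=1) > u[:, None]).argmax(axis=1)

        if it in draws:                    # retain spaced post-burn-in draws
            imputations.append(X_imp.copy())
    return imputations
```

In this sketch the observed cells are never overwritten, so each returned completed data set agrees with the input wherever the input was observed; the spaced post-burn-in draws play the role of the multiple imputations that would then be analyzed and pooled in the usual MI way.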