{"title":"Multiple Imputation and Synthetic Data Generation with NPBayesImputeCat","authors":"Jingchen Hu, O. Akande, Quanli Wang","doi":"10.32614/rj-2021-080","DOIUrl":null,"url":null,"abstract":"In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of the multinomial distributions model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package. Introduction and background Multiple imputation for missing data Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987). 
As a brief review of the multiple imputation combining rules for missing data, let q be the completed data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, . . . , m, let q(l) and u(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is equal to q̄m = ∑l=1 q (l)/m, and the estimated variance associated with q̄m is equal to Tm = (1 + 1/m)bm + ūm , where bm = ∑l=1(q (l) − q̄m)/(m − 1) and ūm = ∑l=1 u (l)/m. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tm), where tv is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m)bm]) degrees of freedom. Multiple imputation by chained equations (MICE, Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models . For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), Reiter and Raghunathan (2007). Synthetic data for statistical disclosure control Statistical agencies regularly collect information from surveys and censuses and make such information publicly available for various purposes, including research and policymaking. In numerous countries around the world, statistical agencies are legally obliged to protect respondents’ privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release for privacy protection. 
Popular SDC techniques for tabular data include cell suppression and adding noise, and popular SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications. The multiple imputation methodology has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007; Drechsler, The R Journal Vol. 13/2, December 2021 ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLES 91 2011). Moreover, synthetic data comes in two flavors: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are un-synthesized. Statistical agencies can choose between these two approaches depending on their protection goals, and subsequent analyses also differ. When dealing with fully synthetic data, q̄m estimates Q as in the multiple imputation setting, but the estimated variance associated with q̄m becomes Tf = (1 + 1/m)bm − ūm , where bm and ūm are defined as in previous section on multiple imputation. Inferences for Q are now based on (q̄m − Q) ∼ tv(0, Tf ), where the degrees of freedom is v f = (m − 1)(1 − mūm/((m + 1)bm)). 
For partially synthetic data, q̄m still estimates Q but the estimated variance associated with q̄m is Tp = bm/m + ūm , where bm and ūm are defined as in the multiple imputation setting. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tp), where the degrees of freedom is vp = (m − 1)(1 + ūm/[bm/m]). For synthetic data with R, synthpop provides synthetic data generated by drawing from conditional distributions fitted to the confidential data. The conditional distributions are estimated by models chosen by the user, whose choices include parametric or CART models. For more details and reviews of synthetic data for statistical disclosure control, see Drechsler (2011).","PeriodicalId":20974,"journal":{"name":"R J.","volume":"24 1","pages":"25"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"R J.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.32614/rj-2021-080","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5
Abstract
In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of multinomial distributions model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package.

Introduction and background

Multiple imputation for missing data

Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987). As a brief review of the multiple imputation combining rules for missing data, let $q$ be the completed-data estimator of some estimand of interest $Q$, and let $u$ be the estimator of the variance of $q$. For $l = 1, \dots, m$, let $q^{(l)}$ and $u^{(l)}$ be the values of $q$ and $u$ in the $l$th completed dataset.
The multiple imputation estimate of $Q$ is $\bar{q}_m = \sum_{l=1}^{m} q^{(l)}/m$, and the estimated variance associated with $\bar{q}_m$ is $T_m = (1 + 1/m)b_m + \bar{u}_m$, where $b_m = \sum_{l=1}^{m} (q^{(l)} - \bar{q}_m)^2/(m - 1)$ and $\bar{u}_m = \sum_{l=1}^{m} u^{(l)}/m$. Inferences for $Q$ are based on $(\bar{q}_m - Q) \sim t_v(0, T_m)$, where $t_v$ is a $t$-distribution with $v = (m - 1)\left(1 + \bar{u}_m/[(1 + 1/m)b_m]\right)^2$ degrees of freedom.

Multiple imputation by chained equations (MICE; Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART; Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models. For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), and Reiter and Raghunathan (2007).

Synthetic data for statistical disclosure control

Statistical agencies regularly collect information from surveys and censuses and make such information publicly available for various purposes, including research and policymaking. In numerous countries around the world, statistical agencies are legally obliged to protect respondents’ privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release for privacy protection. Popular SDC techniques for tabular data include cell suppression and adding noise; popular SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications.
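Before turning to how these ideas extend to synthetic data, the multiple imputation combining rules reviewed above can be made concrete with a minimal, language-agnostic sketch (written here in Python; the function name `pool_mi` and the example numbers are our own illustration, not part of NPBayesImputeCat or mice):

```python
# Rubin's (1987) combining rules for m completed datasets.

def pool_mi(q, u):
    """Pool completed-data estimates q[l] and their variance estimates u[l].

    Returns (qbar, T, df): the MI point estimate, its estimated variance
    T_m = (1 + 1/m) b_m + ubar_m, and the t degrees of freedom
    v = (m - 1) * (1 + ubar_m / ((1 + 1/m) * b_m))**2.
    """
    m = len(q)
    qbar = sum(q) / m
    ubar = sum(u) / m                                    # within-imputation variance
    b = sum((ql - qbar) ** 2 for ql in q) / (m - 1)      # between-imputation variance
    T = (1 + 1 / m) * b + ubar
    df = (m - 1) * (1 + ubar / ((1 + 1 / m) * b)) ** 2
    return qbar, T, df

# Worked example with m = 3 completed datasets (numbers are illustrative):
q = [1.0, 1.2, 0.8]     # completed-data point estimates q^(l)
u = [0.04, 0.05, 0.06]  # completed-data variance estimates u^(l)
qbar, T, df = pool_mi(q, u)
# Here b_m = 0.04 and ubar_m = 0.05, so qbar = 1.0,
# T = (4/3)*0.04 + 0.05 ~ 0.1033, and df ~ 7.51.
```

Note that the between-imputation variance $b_m$ enters with the inflation factor $(1 + 1/m)$, which is why even a handful of imputations ($m = 5$ is common) yields valid, if slightly conservative, inferences.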
The multiple imputation methodology has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (different from those in multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007; Drechsler, 2011). Moreover, synthetic data comes in two flavors: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are left unsynthesized. Statistical agencies can choose between these two approaches depending on their protection goals, and the subsequent analyses also differ. When dealing with fully synthetic data, $\bar{q}_m$ estimates $Q$ as in the multiple imputation setting, but the estimated variance associated with $\bar{q}_m$ becomes $T_f = (1 + 1/m)b_m - \bar{u}_m$, where $b_m$ and $\bar{u}_m$ are defined as in the previous section on multiple imputation. Inferences for $Q$ are now based on $(\bar{q}_m - Q) \sim t_{v_f}(0, T_f)$, where the degrees of freedom is $v_f = (m - 1)\left(1 - m\bar{u}_m/((m + 1)b_m)\right)^2$. For partially synthetic data, $\bar{q}_m$ still estimates $Q$, but the estimated variance associated with $\bar{q}_m$ is $T_p = b_m/m + \bar{u}_m$, where $b_m$ and $\bar{u}_m$ are defined as in the multiple imputation setting. Inferences for $Q$ are based on $(\bar{q}_m - Q) \sim t_{v_p}(0, T_p)$, where the degrees of freedom is $v_p = (m - 1)\left(1 + \bar{u}_m/(b_m/m)\right)^2$.
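The fully and partially synthetic variance estimators differ from the multiple imputation one only in how $b_m$ and $\bar{u}_m$ are weighted. A short sketch, in the same style as the multiple imputation illustration (function names are our own):

```python
# Combining rules for fully and partially synthetic data
# (Reiter and Raghunathan, 2007), written out for illustration.

def _components(q, u):
    """Shared pieces: m, point estimate, within- and between-dataset variance."""
    m = len(q)
    qbar = sum(q) / m
    ubar = sum(u) / m
    b = sum((ql - qbar) ** 2 for ql in q) / (m - 1)
    return m, qbar, ubar, b

def pool_fully_synthetic(q, u):
    """T_f = (1 + 1/m) b_m - ubar_m; v_f = (m-1)(1 - m*ubar_m/((m+1) b_m))^2."""
    m, qbar, ubar, b = _components(q, u)
    Tf = (1 + 1 / m) * b - ubar
    # Note: T_f can be negative in small samples; in practice it is
    # typically truncated at zero or replaced by an alternative estimator.
    df = (m - 1) * (1 - m * ubar / ((m + 1) * b)) ** 2
    return qbar, Tf, df

def pool_partially_synthetic(q, u):
    """T_p = b_m/m + ubar_m; v_p = (m-1)(1 + ubar_m/(b_m/m))^2."""
    m, qbar, ubar, b = _components(q, u)
    Tp = b / m + ubar
    df = (m - 1) * (1 + ubar / (b / m)) ** 2
    return qbar, Tp, df
```

With the same toy inputs as before (q = [1.0, 1.2, 0.8], u = [0.04, 0.05, 0.06], so $b_m = 0.04$ and $\bar{u}_m = 0.05$), the fully synthetic variance $T_f = (4/3)(0.04) - 0.05 \approx 0.0033$ is much smaller than the partially synthetic $T_p = 0.04/3 + 0.05 \approx 0.0633$, reflecting the minus rather than plus sign on $\bar{u}_m$.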
For synthetic data generation in R, the synthpop package produces synthetic data by drawing from conditional distributions fitted to the confidential data. The conditional distributions are estimated by models chosen by the user, with choices that include parametric and CART models. For more details and reviews of synthetic data for statistical disclosure control, see Drechsler (2011).
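To make the idea of drawing from fitted conditional distributions concrete, here is a toy sketch of sequential synthesis for two categorical variables: fit the empirical $P(x_1)$ and $P(x_2 \mid x_1)$ on the confidential data, then generate synthetic records one variable at a time. This is a simplified illustration of the general approach, not synthpop's actual implementation, and all names in it are our own:

```python
# Toy sequential synthesis for two categorical variables.
import random
from collections import Counter, defaultdict

def fit_conditionals(records):
    """Empirical P(x1) and P(x2 | x1) from a list of (x1, x2) pairs."""
    marg = Counter(x1 for x1, _ in records)
    cond = defaultdict(Counter)
    for x1, x2 in records:
        cond[x1][x2] += 1
    return marg, cond

def synthesize(records, n, rng):
    """Draw n synthetic (x1, x2) records: x1 from P(x1), then x2 from P(x2 | x1)."""
    marg, cond = fit_conditionals(records)
    cats1, w1 = zip(*marg.items())
    out = []
    for _ in range(n):
        x1 = rng.choices(cats1, weights=w1)[0]
        cats2, w2 = zip(*cond[x1].items())
        x2 = rng.choices(cats2, weights=w2)[0]
        out.append((x1, x2))
    return out

rng = random.Random(42)
confidential = [("A", "yes"), ("A", "yes"), ("A", "no"), ("B", "no"), ("B", "no")]
synthetic = synthesize(confidential, 100, rng)
# Every synthetic record respects the fitted conditionals: since no ("B", "yes")
# pair appears in the confidential data, none can appear in the synthetic data.
```

Real synthesizers like synthpop generalize this idea to many variables by chaining fitted models (parametric or CART) in a user-chosen visit sequence; the structural-zeros behavior visible in the toy example (impossible combinations are never generated when absent from every fitted conditional) is also why NPBayesImputeCat handles structural zeros explicitly.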