This paper introduces pdynmc, an R package that provides users with sufficient flexibility and precise control over estimation and inference in linear dynamic panel data models. The package primarily allows for the inclusion of nonlinear moment conditions and the use of iterated GMM; additionally, visualizations of data structure and estimation results are provided. The current implementation reflects recent developments in the literature, uses sensible argument defaults, and aligns commercial and noncommercial estimation commands. Since understanding the model assumptions is vital for setting up plausible estimation routines, we provide a broad introduction to linear dynamic panel data models directed towards practitioners before concisely describing the functionality available in pdynmc regarding instrument type, covariate type, estimation methodology, and general configuration. We then demonstrate the functionality by revisiting the popular firm-level dataset of Arellano and Bond (1991).
{"title":"pdynmc: A Package for Estimating Linear Dynamic Panel Data Models Based on Nonlinear Moment Conditions","authors":"Markus Fritsch, Adrian Yu Pua Andrew, Joachim Schnurbus","doi":"10.32614/rj-2021-035","DOIUrl":"https://doi.org/10.32614/rj-2021-035","url":null,"abstract":"This paper introduces pdynmc , an R package that provides users sufficient flexibility and precise control over the estimation and inference in linear dynamic panel data models. The package primarily allows for the inclusion of nonlinear moment conditions and the use of iterated GMM; additionally, visualizations for data structure and estimation results are provided. The current implementation reflects recent developments in literature, uses sensible argument defaults, and aligns commercial and noncommercial estimation commands. Since the understanding of the model assumptions is vital for setting up plausible estimation routines, we provide a broad introduction of linear dynamic panel data models directed towards practitioners before concisely describing the functionality available in pdynmc regarding instrument type, covariate type, estimation methodology, and general configuration. We then demonstrate the functionality by revisiting the popular firm-level dataset of Arellano and Bond (1991). ,","PeriodicalId":20974,"journal":{"name":"R J.","volume":"16 1","pages":"218"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87189898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces the package ROCnReg, which allows estimating the pooled ROC curve, the covariate-specific ROC curve, and the covariate-adjusted ROC curve by different methods, from both (semi)parametric and nonparametric perspectives and within both Bayesian and frequentist paradigms. From the estimated ROC curve (pooled, covariate-specific, or covariate-adjusted), several summary measures of discriminatory accuracy, such as the (partial) area under the ROC curve and the Youden index, can be obtained. The package also provides functions to obtain ROC-based optimal threshold values using several criteria, namely the Youden index criterion and the criterion that sets a target value for the false positive fraction. For the Bayesian methods, we provide tools for assessing model fit via posterior predictive checks, while model choice can be carried out via several information criteria. Numerical and graphical outputs are provided for all methods. To our knowledge, this is the only R package implementing Bayesian procedures for ROC curves.
{"title":"ROCnReg: An R Package for Receiver Operating Characteristic Curve Inference With and Without Covariates","authors":"M. Rodríguez-Álvarez, Vanda Inácio","doi":"10.32614/rj-2021-066","DOIUrl":"https://doi.org/10.32614/rj-2021-066","url":null,"abstract":"This paper introduces the package ROCnReg that allows estimating the pooled ROC curve, the covariate-specific ROC curve, and the covariate-adjusted ROC curve by different methods, both from (semi) parametric and nonparametric perspectives and within Bayesian and frequentist paradigms. From the estimated ROC curve (pooled, covariate-specific, or covariate-adjusted), several summary measures of discriminatory accuracy, such as the (partial) area under the ROC curve and the Youden index, can be obtained. The package also provides functions to obtain ROC-based optimal threshold values using several criteria, namely, the Youden index criterion and the criterion that sets a target value for the false positive fraction. For the Bayesian methods, we provide tools for assessing model fit via posterior predictive checks, while the model choice can be carried out via several information criteria. Numerical and graphical outputs are provided for all methods. This is the only package implementing Bayesian procedures for ROC curves.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"9 1","pages":"525"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72975867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power and sample size estimation are critical aspects in study design to demonstrate minimized risk for subjects and to justify the allocation of time, money, and other resources. Researchers often work with response variables which take the form of various distributions. Here, we present an R package, PASSED, that allows flexibility with seven common distributions and multiple options to accommodate sample size or power analysis. The relevant statistical theory, calculations, and examples for each distribution using PASSED are discussed in this paper.
{"title":"PASSED: Calculate Power and Sample Size for Two Sample Tests","authors":"Jinpu Li, R. Knigge, Kaiyi Chen, E. Leary","doi":"10.32614/rj-2021-094","DOIUrl":"https://doi.org/10.32614/rj-2021-094","url":null,"abstract":"Power and sample size estimation are critical aspects in study design to demonstrate minimized risk for subjects and to justify the allocation of time, money, and other resources. Researchers often work with response variables which take the form of various distributions. Here, we present an R package, PASSED, that allows flexibility with seven common distributions and multiple options to accommodate sample size or power analysis. The relevant statistical theory, calculations, and examples for each distribution using PASSED are discussed in this paper.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"45 1","pages":"450"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75943160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R. A critical part of the work of statisticians, data scientists, and analysts is summarizing data sets and regression models in R and publishing or sharing polished summary tables. The gtsummary package was created to streamline these everyday analysis tasks by allowing users to easily create reproducible summaries of data sets, regression models, survey data, and survival data with a simple interface and very little code. The package follows a tidy framework, making it easy to integrate into standard data workflows, and offers many table customization features through function arguments, helper functions, and custom themes.
{"title":"Reproducible Summary Tables with the gtsummary Package","authors":"D. Sjoberg, Karissa A Whiting, Michael Curry, J. Lavery, J. Larmarange","doi":"10.32614/rj-2021-053","DOIUrl":"https://doi.org/10.32614/rj-2021-053","url":null,"abstract":"The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R. A critical part of the work of statisticians, data scientists, and analysts is summarizing data sets and regression models in R and publishing or sharing polished summary tables. The gtsummary package was created to streamline these everyday analysis tasks by allowing users to easily create reproducible summaries of data sets, regression models, survey data, and survival data with a simple interface and very little code. The package follows a tidy framework, making it easy to integrate in standard data workflows, and offers many table customization features through function arguments, helper functions, and custom themes.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"16 1","pages":"570"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75422442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chaos theory has been hailed as a revolution in thought and attracts ever-increasing attention from scientists across diverse disciplines. Chaotic systems are non-linear deterministic dynamic systems that can exhibit erratic, apparently random motion. A relevant field within chaos theory is the detection of chaotic behavior from empirical time-series data. One of the main features of chaos is the well-known sensitivity to initial values. Methods for testing the hypothesis of chaos try to quantify this sensitivity by estimating the so-called Lyapunov exponents. This paper describes the main methods for estimating the Lyapunov exponent from time series data and presents the DChaos package. R users can compute delayed-coordinate embedding vectors from time series data, estimate a best-fitting neural network model from those embedding vectors, and calculate the partial derivatives of the fitted network analytically. They can then obtain the neural network estimator of the Lyapunov exponent from those partial derivatives by two different procedures and four block-subsampling schemes. In sum, the DChaos package allows R users to test the hypothesis of chaos robustly in order to determine whether the data-generating process behind a time series behaves chaotically. The package's functionality is illustrated by examples.
{"title":"DChaos: An R Package for Chaotic Time Series Analysis","authors":"Julio E. Sandubete, L. Escot","doi":"10.32614/rj-2021-036","DOIUrl":"https://doi.org/10.32614/rj-2021-036","url":null,"abstract":"Chaos theory has been hailed as a revolution of thoughts and attracting ever-increasing attention of many scientists from diverse disciplines. Chaotic systems are non-linear deterministic dynamic systems which can behave like an erratic and apparently random motion. A relevant field inside chaos theory is the detection of chaotic behavior from empirical time-series data. One of the main features of chaos is the well-known initial-value sensitivity property. Methods and techniques related to testing the hypothesis of chaos try to quantify the initial-value sensitive property estimating the so-called Lyapunov exponents. This paper describes the main estimation methods of the Lyapunov exponent from time series data. At the same time, we present the DChaos library. R users may compute the delayed-coordinate embedding vector from time series data, estimates the best-fitted neural net model from the delayed-coordinate embedding vectors, calculates analytically the partial derivatives from the chosen neural nets model. They can also obtain the neural net estimator of the Lyapunov exponent from the partial derivatives computed previously by two different procedures and four ways of subsampling by blocks. To sum up, the DChaos package allows the R users to test robustly the hypothesis of chaos in order to know if the data-generating process behind time series behaves chaotically or not. The package’s functionality is illustrated by examples.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"30 1","pages":"232"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88762793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Canonical correlation analysis (CCA) has a long history as an exploratory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. The seedCCA package is a newly developed R package that implements not only standard and seeded CCA but also partial least squares. The package makes it possible to fit CCA to large-p, small-n data. This paper provides a complete guide, and the seeded CCA application results are compared with those of the regularized CCA in the existing CCA package. We believe that the package, together with this paper, will help practitioners in various scientific fields analyze high-dimensional data and will make the statistical methodologies of multivariate analysis more fruitful.

Introduction

Exploratory studies are important for identifying patterns and special structure in data prior to developing a specific model. When the relationship between two sets of variables, a p-dimensional random variable X (X ∈ R^p) and an r-dimensional random variable Y (Y ∈ R^r), is of primary interest, one of the popular exploratory statistical methods is canonical correlation analysis (CCA; Hotelling, 1936). The main goal of CCA is the dimension reduction of two sets of variables by measuring the association between the two sets. For this, pairs of linear combinations of the variables are constructed by maximizing the Pearson correlation. CCA has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. Lee and Yoo (2014) show that CCA can be used as a dimension-reduction tool for high-dimensional data and that it is also connected to the least squares estimator. CCA is therefore not only an exploratory and dimension-reduction method but can also serve as an alternative to least squares estimation.

If max(p, r) is greater than or equal to the sample size n, the usual CCA cannot be applied because the sample covariance matrices cannot be inverted. To overcome this, a regularized CCA was developed by Leurgans et al. (1993), whose idea was first suggested by Vinod (1976). In practice, the CCA package by González et al. (2008) implements a version of the regularized CCA: to make the sample covariance matrices Σ̂x and Σ̂y invertible, they are replaced with Σ̂x^(λ1) = Σ̂x + λ1 Ip and Σ̂y^(λ2) = Σ̂y + λ2 Ir. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score over a two-dimensional grid search. Although González et al. (2008) note that a relatively small grid of reasonable values for λ1 and λ2 can lessen the computational burden, the search is still time-consuming, as observed in later sections. Additionally, a fast regularized CCA and a robust CCA via projection pursuit were recently developed by Cruz-Cano (2012) and Alfons et al. (2016), respectively.
Another version of CCA able to handle max(p, r) > n is the so-called seeded canonical correlation analysis proposed by Im et al. (2014). Since seeded CCA does not require any computationally intensive regularization procedure, its implementation is quite fast even for larger data. Seeded CCA consists of two steps. In the initialization step, the set of variables whose dimension exceeds n is reduced based on iterated projections. In the next step, standard CCA is applied to the two sets of variables obtained in the initialization step to finalize the CCA of the data. A further advantage is that the seeded CCA procedure is closely related to partial least squares (PLS), one of the popular statistical methods for large-p, small-n data, so seeded CCA can yield the PLS estimates.

The recently developed seedCCA package primarily implements seeded CCA. However, the package can fit a collection of statistical methods, namely standard canonical correlation analysis and partial least squares with uni-/multi-dimensional responses, including seeded CCA. The package has been uploaded to CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html). The main purpose of this paper is to introduce and illustrate the seedCCA package. To this end, three real data sets are fitted with standard CCA, seeded CCA, and partial least squares; two of the data sets are available in the package, and one of them was already analyzed in González et al. (2008). Accordingly, the implementation results of seeded CCA and regularized CCA are closely compared.

The paper is organized as follows. Section 2 discusses the collection of the three methods. The implementation of seedCCA is illustrated and compared with CCA in Section 3. We summarize our work in Section 4. Throughout the rest of the paper, we use the following notation. A p-dimensional random variable X is denoted by X ∈ R^p; that is, X ∈ R^p is understood to be a random variable even when this is not explicitly mentioned. For X ∈ R^p and Y ∈ R^r, we define cov(X) = Σx, cov(Y) = Σy, cov(X, Y) = Σxy, and cov(Y, X) = Σyx, and Σx and Σy are assumed to be positive definite.

Collection of the methods implemented in seedCCA

Canonical correlation analysis. Suppose two sets of variables X ∈ R^p and Y ∈ R^r, and consider their linear combinations U = aᵀX and V = bᵀY. Then var(U) = aᵀΣx a, var(V) = bᵀΣy b, and cov(U, V) = aᵀΣxy b, where a ∈ R^(p×1) and b ∈ R^(r×1). The Pearson correlation between U and V is

cor(U, V) = aᵀΣxy b / (√(aᵀΣx a) √(bᵀΣy b)).   (1)

We seek a and b that maximize cor(U, V) as follows:

1. The first pair of canonical variates (U1 = a1ᵀX, V1 = b1ᵀY) is obtained by maximizing (1).
2. The second pair (U2 = a2ᵀX, V2 = b2ᵀY) is constructed by maximizing (1) under the constraints that var(U2) = var(V2) = 1 and that (U1, V1) and (U2, V2) are uncorrelated.
3. At step k, the kth pair (Uk = akᵀX, Vk = bkᵀY) is obtained by maximizing (1) under the constraints that var(Uk) = var(Vk) = 1 and that (Uk, Vk) is uncorrelated with the previous (k − 1) pairs of canonical variates.
4. Steps 1 to 3 are repeated until k reaches q = min(p, r).
5. The first d pairs of (Uk, Vk) are selected to represent the relationship between X and Y.

Under this criterion, the pairs (ai, bi) are constructed as ai = Σx^(−1/2) ψi and bi = Σy^(−1/2) φi for i = 1, ..., q, where (ψ1, ..., ψq) and (φ1, ..., φq) are the q eigenvectors of Σx^(−1/2) Σxy Σy^(−1) Σyx Σx^(−1/2) and Σy^(−1/2) Σyx Σx^(−1) Σxy Σy^(−1/2), respectively, with common ordered eigenvalues ρ*₁² ≥ ··· ≥ ρ*q² ≥ 0. The matrices Mx = (a1, ..., ad) and My = (b1, ..., bd), for d = 1, ..., q, are called the canonical coefficient matrices, and MxᵀX and MyᵀY are called the canonical variates. In the sample, the population quantities are replaced by their usual moment estimators. For more details on standard CCA, readers are referred to Johnson and Wichern (2007).

Seeded canonical correlation analysis. Since standard CCA requires the inversion of Σ̂x and Σ̂y in practice, it is not applicable to high-dimensional data with max(p, r) > n. Im et al. (2014) proposed seeded canonical correlation analysis to overcome this deficiency. Seeded CCA is a two-step procedure consisting of an initialization step and a finalization step. In the initialization step, the original two sets of variables are reduced to m-dimensional pairs without loss of information about the CCA application, where m must be forced to satisfy m << n. In the finalization step, standard CCA is applied to the initially reduced pairs to restore orthogonality. A more detailed discussion of seeded CCA is given in the next subsection.

Let the notation S(M) stand for the subspace spanned by the columns of M ∈ R^(p×r). Lee and Yoo (2014) show the following relation:

S(Mx) = S(Σx^(−1) Σxy) and S(My) = S(Σy^(−1) Σyx).   (2)

The relation in (2) directly indicates that Mx and My form basis matrices of S(Σx^(−1) Σxy) and S(Σy^(−1) Σyx), and hence Mx and My can be recovered from Σx^(−1) Σxy and Σy^(−1) Σyx.
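The eigen-decomposition construction above can be checked directly in base R. A minimal sketch on simulated data (not the seedCCA interface), cross-checked against stats::cancor():

```r
## Standard CCA via the eigen-decompositions described in the text.
set.seed(7)
n <- 100; p <- 4; r <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:r] + matrix(rnorm(n * r), n, r)   # induce an association

Sx <- cov(X); Sy <- cov(Y); Sxy <- cov(X, Y)

## Inverse square root of a covariance matrix via its spectral decomposition.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}
Sx_h <- inv_sqrt(Sx); Sy_h <- inv_sqrt(Sy)

Kx <- Sx_h %*% Sxy %*% solve(Sy) %*% t(Sxy) %*% Sx_h
Ky <- Sy_h %*% t(Sxy) %*% solve(Sx) %*% Sxy %*% Sy_h

d  <- 2
Mx <- Sx_h %*% eigen(Kx, symmetric = TRUE)$vectors[, 1:d]  # a_1, ..., a_d
My <- Sy_h %*% eigen(Ky, symmetric = TRUE)$vectors[, 1:d]  # b_1, ..., b_d

## Canonical correlations are the square roots of the leading eigenvalues.
sqrt(eigen(Kx, symmetric = TRUE)$values[1:d])
cancor(X, Y)$cor[1:d]   # agrees with the base-R reference implementation
```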
{"title":"SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares","authors":"Boyoung Kim, Yunju Im, Keun Yoo Jae","doi":"10.32614/rj-2021-026","DOIUrl":"https://doi.org/10.32614/rj-2021-026","url":null,"abstract":"Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. The so-called seedCCA is a newly developed R package, and it implements not only the standard and seeded CCA but also partial least squares. The package enables us to fit CCA to large-p and small-n data. The paper provides a complete guide. Also, the seeded CCA application results are compared with the regularized CCA in the existing R package. It is believed that the package along with the paper will contribute to highdimensional data analysis in various science field practitioners and that the statistical methodologies in multivariate analysis become more fruitful. Introduction Explanatory studies are important to identify patterns and special structure in data prior to develop a specific model. When a study between two sets of a p-dimensional random variables X (X ∈ Rp) and a r-dimensional random variable Y (Y ∈ Rr), are of primary interest, one of the popular explanatory statistical methods would be canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is the dimension reduction of two sets of variables by measuring an association between the two sets. For this, pairs of linear combinations of variables are constructed by maximizing the Pearson correlation. The CCA has successful application in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. In Lee and Yoo (2014) it is shown that the CCA can be used as a dimension reduction tool for high-dimensional data, but also it is connected to least square estimator. Therefore, the CCA is not only explanatory and dimension reduction method but also can be utilized as alternative of least square estimation. If max(p, r) is bigger than or equal to the sample size, n, usual CCA application is not plausible due to no incapability of inverting sample covariance matrices. To overcome this, a regularized CCA is developed by Leurgans et al. (1993), whose idea was firstly suggested in Vinod (1976). In practice, the CCA package by González et al. (2008) can implement a version of the regularized CCA. To make the sample covariance matrices saying Σ̂x and Σ̂y, invertible, in González et al. (2008), they are replaced with Σ̂ λ1 x = Σ̂x + λ1Ip and Σ̂ λ2 y = Σ̂y + λ1Ir. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score throughout the two-dimensional grid search. Although it is discussed that a relatively small grid of reasonable values for λ1 and λ2 can lesson intensive computing in González et al. (2008), it is still time-consuming as observed in later sections. Additionally, fast regularized CCA and robust CCA via projection-pursuit are recently developed in Cruz-Cano (2012) and Alfons et al. (2016), respectively. 
Another version of CCA to handle max(p, ","PeriodicalId":20974,"journal":{"name":"R J.","volume":"44 1","pages":"7"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91542743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data with or without structural zeros. We describe the Dirichlet process mixture of products of multinomial distributions model used in the package and illustrate various uses of the package with data samples from the American Community Survey (ACS). We also compare the results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package.

Introduction and background

Multiple imputation for missing data

Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987).

As a brief review of the multiple imputation combining rules for missing data, let q be the completed-data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, ..., m, let q^(l) and u^(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is q̄m = (1/m) ∑_{l=1}^m q^(l), and the estimated variance associated with q̄m is Tm = (1 + 1/m) bm + ūm, where bm = ∑_{l=1}^m (q^(l) − q̄m)² / (m − 1) and ūm = (1/m) ∑_{l=1}^m u^(l). Inferences for Q are based on (q̄m − Q) ∼ t_v(0, Tm), where t_v is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m) bm])² degrees of freedom.

Multiple imputation by chained equations (MICE; Buuren and Groothuis-Oudshoorn, 2011) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART; Breiman et al., 1984; Burgette and Reiter, 2010), and then iteratively samples plausible predicted values from the sequence of conditional models. For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), and Reiter and Raghunathan (2007).
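The combining rules above translate directly into code. A minimal sketch using the mice package mentioned in the text (the nhanes example data ships with mice), verified against mice's own pooling:

```r
## Rubin's combining rules applied to m completed datasets from mice.
library(mice)

m   <- 5
imp <- mice(nhanes, m = m, printFlag = FALSE, seed = 123)

## Per-imputation estimates q^(l) and variances u^(l) for the age
## coefficient in a linear model for bmi.
fits <- lapply(1:m, function(l) lm(bmi ~ age, data = complete(imp, l)))
q <- sapply(fits, function(f) coef(f)["age"])
u <- sapply(fits, function(f) vcov(f)["age", "age"])

q_bar <- mean(q)                        # pooled point estimate
b_m   <- sum((q - q_bar)^2) / (m - 1)   # between-imputation variance
u_bar <- mean(u)                        # within-imputation variance
T_m   <- (1 + 1/m) * b_m + u_bar        # total variance
v     <- (m - 1) * (1 + u_bar / ((1 + 1/m) * b_m))^2   # degrees of freedom

c(estimate = q_bar, se = sqrt(T_m), df = v)
## The same pooling is available via pool(with(imp, lm(bmi ~ age))).
```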
Statistical disclosure control

Statistical agencies regularly collect information from surveys and censuses and make it publicly available for various purposes, including research and policymaking. In many countries, statistical agencies are legally obliged to protect respondents' privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release to protect privacy. Common SDC techniques for tabular data include cell suppression and adding noise, and common SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications.

The multiple imputation approach has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (which differ from those of multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007; Drechsler, 2011). Moreover, there are two types of synthetic data: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are left unsynthesized.
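As a sketch of the synthetic-data route using the synthpop package mentioned in the abstract (its bundled SD2011 survey data stands in for confidential microdata); the variable selection is an illustrative assumption:

```r
## Generating synthetic versions of selected variables with synthpop.
library(synthpop)

vars <- c("sex", "age", "edu", "income")
ods  <- SD2011[, vars]                    # original ("confidential") data
syn_out <- syn(ods, m = 3, seed = 2021)   # three synthetic datasets

summary(syn_out)
## Compare marginal distributions of synthetic vs. original data.
compare(syn_out, ods)
```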
{"title":"Multiple Imputation and Synthetic Data Generation with NPBayesImputeCat","authors":"Jingchen Hu, O. Akande, Quanli Wang","doi":"10.32614/rj-2021-080","DOIUrl":"https://doi.org/10.32614/rj-2021-080","url":null,"abstract":"In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of the multinomial distributions model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package. Introduction and background Multiple imputation for missing data Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987). As a brief review of the multiple imputation combining rules for missing data, let q be the completed data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, . . . , m, let q(l) and u(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is equal to q̄m = ∑l=1 q (l)/m, and the estimated variance associated with q̄m is equal to Tm = (1 + 1/m)bm + ūm , where bm = ∑l=1(q (l) − q̄m)/(m − 1) and ūm = ∑l=1 u (l)/m. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tm), where tv is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m)bm]) degrees of freedom. Multiple imputation by chained equations (MICE, Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models . For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). 
For more details and reviews, see Rubin (1996), Harel and Zhou (2007), R","PeriodicalId":20974,"journal":{"name":"R J.","volume":"24 1","pages":"25"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84673342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Welch’s two-sample t-test based on least squares (LS) estimators is generally used to test the equality of two normal means when the variances are unequal. However, this test loses power when the underlying distribution is not normal. In this paper, two different tests are proposed for testing the equality of two long-tailed symmetric (LTS) means under heterogeneous variances. Adaptive modified maximum likelihood (AMML) estimators are used in developing the proposed tests since they are highly efficient under the LTS distribution. An R package called RobustBF is provided to implement these tests. Simulated Type I error rates and powers of the proposed tests are also reported and compared with those of Welch’s t-test based on LS estimators via an extensive Monte Carlo simulation study.
{"title":"RobustBF: An R Package for Robust Solution to the Behrens-Fisher Problem","authors":"Gamze Güven, S. Acitas, Hatice Samkar, B. Şenoğlu","doi":"10.32614/rj-2021-107","DOIUrl":"https://doi.org/10.32614/rj-2021-107","url":null,"abstract":"Welch’s two-sample t-test based on least squares (LS) estimators is generally used to test the equality of two normal means when the variances are not equal. However, this test loses its power when the underlying distribution is not normal. In this paper, two different tests are proposed to test the equality of two long-tailed symmetric (LTS) means under heterogeneous variances. Adaptive modified maximum likelihood (AMML) estimators are used in developing the proposed tests since they are highly efficient under LTS distribution. An R package called RobustBF is given to show the implementation of these tests. Simulated Type I error rates and powers of the proposed tests are also given and compared with Welch’s t-test based on LS estimators via an extensive Monte Carlo simulation study.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"57 1","pages":"642"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81339370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In case–control studies, the odds ratio is commonly used to summarize the association between a binary exposure and a dichotomous outcome. However, exposure misclassification frequently appears in case–control studies due to inaccurate data reporting, which can produce bias in measures of association. In this article, we implement a Bayesian sensitivity analysis of misclassification to provide a full posterior inference on the corrected odds ratio under both non-differential and differential misclassification. We present an R (R Core Team, 2018) package BayesSenMC, which provides user-friendly functions for its implementation. The usage is illustrated by a real data analysis on the association between bipolar disorder and rheumatoid arthritis.
{"title":"BayesSenMC: an R package for Bayesian Sensitivity Analysis of Misclassification","authors":"Jinhui Yang, Lifeng Lin, H. Chu","doi":"10.32614/rj-2021-097","DOIUrl":"https://doi.org/10.32614/rj-2021-097","url":null,"abstract":"In case–control studies, the odds ratio is commonly used to summarize the association between a binary exposure and a dichotomous outcome. However, exposure misclassification frequently appears in case–control studies due to inaccurate data reporting, which can produce bias in measures of association. In this article, we implement a Bayesian sensitivity analysis of misclassification to provide a full posterior inference on the corrected odds ratio under both non-differential and differential misclassification. We present an R (R Core Team, 2018) package BayesSenMC, which provides user-friendly functions for its implementation. The usage is illustrated by a real data analysis on the association between bipolar disorder and rheumatoid arthritis.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"15 1","pages":"123"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83607910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Index numbers are descriptive statistical measures used in economic settings to compare simple and complex magnitudes, usually registered in two time periods. Although the theory has a long history, it still plays an important role in modern societies, where large amounts of economic data are available and need to be analyzed. After a detailed review of classical index numbers in the literature, this paper focuses on describing the R package IndexNumber and its capabilities for calculating them. Two of the four real data sets contained in this library are used to illustrate the determination of index numbers. Graphical tools are also implemented to show the time evolution of the considered magnitudes, simplifying the interpretation of the results.
{"title":"IndexNumber: An R Package for Measuring the Evolution of Magnitudes","authors":"A. Saavedra-Nieves, P. Saavedra-Nieves","doi":"10.32614/rj-2021-038","DOIUrl":"https://doi.org/10.32614/rj-2021-038","url":null,"abstract":"Index numbers are descriptive statistical measures useful in economic settings for comparing simple and complex magnitudes registered, usually in two time periods. Although this theory has a large history, it still plays an important role in modern today’s societies where big amounts of economic data are available and need to be analyzed. After a detailed revision on classical index numbers in literature, this paper is focused on the description of the R package IndexNumber with strong capabilities for calculating them. Two of the four real data sets contained in this library are used for illustrating the determination of the index numbers in this work. Graphical tools are also implemented in order to show the time evolution of considered magnitudes simplifying the interpretation of the results.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"31 1","pages":"253"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90080497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}