This paper introduces pdynmc, an R package that provides users with sufficient flexibility and precise control over estimation and inference in linear dynamic panel data models. The package primarily allows for the inclusion of nonlinear moment conditions and the use of iterated GMM; additionally, visualizations of data structure and estimation results are provided. The current implementation reflects recent developments in the literature, uses sensible argument defaults, and aligns commercial and noncommercial estimation commands. Since understanding the model assumptions is vital for setting up plausible estimation routines, we provide a broad introduction to linear dynamic panel data models directed towards practitioners before concisely describing the functionality available in pdynmc regarding instrument type, covariate type, estimation methodology, and general configuration. We then demonstrate the functionality by revisiting the popular firm-level dataset of Arellano and Bond (1991).
{"title":"pdynmc: A Package for Estimating Linear Dynamic Panel Data Models Based on Nonlinear Moment Conditions","authors":"Markus Fritsch, Adrian Yu Pua Andrew, Joachim Schnurbus","doi":"10.32614/rj-2021-035","DOIUrl":"https://doi.org/10.32614/rj-2021-035","url":null,"abstract":"This paper introduces pdynmc , an R package that provides users sufficient flexibility and precise control over the estimation and inference in linear dynamic panel data models. The package primarily allows for the inclusion of nonlinear moment conditions and the use of iterated GMM; additionally, visualizations for data structure and estimation results are provided. The current implementation reflects recent developments in literature, uses sensible argument defaults, and aligns commercial and noncommercial estimation commands. Since the understanding of the model assumptions is vital for setting up plausible estimation routines, we provide a broad introduction of linear dynamic panel data models directed towards practitioners before concisely describing the functionality available in pdynmc regarding instrument type, covariate type, estimation methodology, and general configuration. We then demonstrate the functionality by revisiting the popular firm-level dataset of Arellano and Bond (1991). ,","PeriodicalId":20974,"journal":{"name":"R J.","volume":"16 1","pages":"218"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87189898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper introduces the package ROCnReg, which allows estimating the pooled ROC curve, the covariate-specific ROC curve, and the covariate-adjusted ROC curve by different methods, from both (semi)parametric and nonparametric perspectives and within both Bayesian and frequentist paradigms. From the estimated ROC curve (pooled, covariate-specific, or covariate-adjusted), several summary measures of discriminatory accuracy, such as the (partial) area under the ROC curve and the Youden index, can be obtained. The package also provides functions to obtain ROC-based optimal threshold values using several criteria, namely the Youden index criterion and the criterion that sets a target value for the false positive fraction. For the Bayesian methods, we provide tools for assessing model fit via posterior predictive checks, while model choice can be carried out via several information criteria. Numerical and graphical outputs are provided for all methods. To our knowledge, this is the only R package implementing Bayesian procedures for ROC curves.
{"title":"ROCnReg: An R Package for Receiver Operating Characteristic Curve Inference With and Without Covariates","authors":"M. Rodríguez-Álvarez, Vanda Inácio","doi":"10.32614/rj-2021-066","DOIUrl":"https://doi.org/10.32614/rj-2021-066","url":null,"abstract":"This paper introduces the package ROCnReg that allows estimating the pooled ROC curve, the covariate-specific ROC curve, and the covariate-adjusted ROC curve by different methods, both from (semi) parametric and nonparametric perspectives and within Bayesian and frequentist paradigms. From the estimated ROC curve (pooled, covariate-specific, or covariate-adjusted), several summary measures of discriminatory accuracy, such as the (partial) area under the ROC curve and the Youden index, can be obtained. The package also provides functions to obtain ROC-based optimal threshold values using several criteria, namely, the Youden index criterion and the criterion that sets a target value for the false positive fraction. For the Bayesian methods, we provide tools for assessing model fit via posterior predictive checks, while the model choice can be carried out via several information criteria. Numerical and graphical outputs are provided for all methods. This is the only package implementing Bayesian procedures for ROC curves.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"9 1","pages":"525"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72975867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power and sample size estimation are critical aspects in study design to demonstrate minimized risk for subjects and to justify the allocation of time, money, and other resources. Researchers often work with response variables which take the form of various distributions. Here, we present an R package, PASSED, that allows flexibility with seven common distributions and multiple options to accommodate sample size or power analysis. The relevant statistical theory, calculations, and examples for each distribution using PASSED are discussed in this paper.
{"title":"PASSED: Calculate Power and Sample Size for Two Sample Tests","authors":"Jinpu Li, R. Knigge, Kaiyi Chen, E. Leary","doi":"10.32614/rj-2021-094","DOIUrl":"https://doi.org/10.32614/rj-2021-094","url":null,"abstract":"Power and sample size estimation are critical aspects in study design to demonstrate minimized risk for subjects and to justify the allocation of time, money, and other resources. Researchers often work with response variables which take the form of various distributions. Here, we present an R package, PASSED, that allows flexibility with seven common distributions and multiple options to accommodate sample size or power analysis. The relevant statistical theory, calculations, and examples for each distribution using PASSED are discussed in this paper.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"45 1","pages":"450"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75943160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R. A critical part of the work of statisticians, data scientists, and analysts is summarizing data sets and regression models in R and publishing or sharing polished summary tables. The gtsummary package was created to streamline these everyday analysis tasks by allowing users to easily create reproducible summaries of data sets, regression models, survey data, and survival data with a simple interface and very little code. The package follows a tidy framework, making it easy to integrate into standard data workflows, and offers many table customization features through function arguments, helper functions, and custom themes.
{"title":"Reproducible Summary Tables with the gtsummary Package","authors":"D. Sjoberg, Karissa A Whiting, Michael Curry, J. Lavery, J. Larmarange","doi":"10.32614/rj-2021-053","DOIUrl":"https://doi.org/10.32614/rj-2021-053","url":null,"abstract":"The gtsummary package provides an elegant and flexible way to create publication-ready summary tables in R. A critical part of the work of statisticians, data scientists, and analysts is summarizing data sets and regression models in R and publishing or sharing polished summary tables. The gtsummary package was created to streamline these everyday analysis tasks by allowing users to easily create reproducible summaries of data sets, regression models, survey data, and survival data with a simple interface and very little code. The package follows a tidy framework, making it easy to integrate in standard data workflows, and offers many table customization features through function arguments, helper functions, and custom themes.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"16 1","pages":"570"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75422442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chaos theory has been hailed as a revolution in thought and attracts ever-increasing attention from scientists across diverse disciplines. Chaotic systems are non-linear deterministic dynamic systems that can exhibit erratic, apparently random motion. A relevant field within chaos theory is the detection of chaotic behavior from empirical time-series data. One of the main features of chaos is the well-known sensitivity to initial values. Methods for testing the hypothesis of chaos try to quantify this sensitivity by estimating the so-called Lyapunov exponents. This paper describes the main methods for estimating the Lyapunov exponent from time series data and presents the DChaos package. R users can compute delayed-coordinate embedding vectors from time series data, estimate a best-fitting neural network model from those embedding vectors, and calculate the partial derivatives of the fitted network analytically. They can then obtain the neural network estimator of the Lyapunov exponent from those partial derivatives by two different procedures and four block-subsampling schemes. In sum, the DChaos package allows R users to test the hypothesis of chaos robustly in order to determine whether the data-generating process behind a time series behaves chaotically. The package's functionality is illustrated by examples.
{"title":"DChaos: An R Package for Chaotic Time Series Analysis","authors":"Julio E. Sandubete, L. Escot","doi":"10.32614/rj-2021-036","DOIUrl":"https://doi.org/10.32614/rj-2021-036","url":null,"abstract":"Chaos theory has been hailed as a revolution of thoughts and attracting ever-increasing attention of many scientists from diverse disciplines. Chaotic systems are non-linear deterministic dynamic systems which can behave like an erratic and apparently random motion. A relevant field inside chaos theory is the detection of chaotic behavior from empirical time-series data. One of the main features of chaos is the well-known initial-value sensitivity property. Methods and techniques related to testing the hypothesis of chaos try to quantify the initial-value sensitive property estimating the so-called Lyapunov exponents. This paper describes the main estimation methods of the Lyapunov exponent from time series data. At the same time, we present the DChaos library. R users may compute the delayed-coordinate embedding vector from time series data, estimates the best-fitted neural net model from the delayed-coordinate embedding vectors, calculates analytically the partial derivatives from the chosen neural nets model. They can also obtain the neural net estimator of the Lyapunov exponent from the partial derivatives computed previously by two different procedures and four ways of subsampling by blocks. To sum up, the DChaos package allows the R users to test robustly the hypothesis of chaos in order to know if the data-generating process behind time series behaves chaotically or not. The package’s functionality is illustrated by examples.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"30 1","pages":"232"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88762793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Canonical correlation analysis (CCA) has a long history as an exploratory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. The seedCCA package is a newly developed R package that implements not only standard and seeded CCA but also partial least squares. The package makes it possible to fit CCA to large-p, small-n data. This paper provides a complete guide, and the seeded CCA application results are compared with those of the regularized CCA in the existing CCA package. We believe that the package, together with this paper, will help practitioners in various scientific fields analyze high-dimensional data and will make the statistical methodologies of multivariate analysis more fruitful.

Introduction

Exploratory studies are important for identifying patterns and special structure in data prior to developing a specific model. When the relationship between two sets of variables, a p-dimensional random variable X (X ∈ R^p) and an r-dimensional random variable Y (Y ∈ R^r), is of primary interest, one of the popular exploratory statistical methods is canonical correlation analysis (CCA; Hotelling, 1936). The main goal of CCA is the dimension reduction of two sets of variables by measuring the association between the two sets. For this, pairs of linear combinations of the variables are constructed by maximizing the Pearson correlation. CCA has been successfully applied in many scientific fields such as chemometrics, pattern recognition, and genomic sequence analysis. Lee and Yoo (2014) show that CCA can be used as a dimension-reduction tool for high-dimensional data and that it is also connected to the least squares estimator. CCA is therefore not only an exploratory and dimension-reduction method but can also serve as an alternative to least squares estimation.

If max(p, r) is greater than or equal to the sample size n, the usual CCA cannot be applied because the sample covariance matrices cannot be inverted. To overcome this, a regularized CCA was developed by Leurgans et al. (1993), whose idea was first suggested by Vinod (1976). In practice, the CCA package by González et al. (2008) implements a version of the regularized CCA: to make the sample covariance matrices Σ̂x and Σ̂y invertible, they are replaced with Σ̂x^(λ1) = Σ̂x + λ1 Ip and Σ̂y^(λ2) = Σ̂y + λ2 Ir. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score over a two-dimensional grid search. Although González et al. (2008) note that a relatively small grid of reasonable values for λ1 and λ2 can lessen the computational burden, the search is still time-consuming, as observed in later sections. Additionally, a fast regularized CCA and a robust CCA via projection pursuit were recently developed by Cruz-Cano (2012) and Alfons et al. (2016), respectively.
Another version of CCA able to handle max(p, r) > n is the so-called seeded canonical correlation analysis proposed by Im et al. (2014). Since seeded CCA does not require any computationally intensive regularization procedure, its implementation is quite fast even for larger data. Seeded CCA consists of two steps. In the initialization step, the set of variables whose dimension exceeds n is reduced based on iterated projections. In the next step, standard CCA is applied to the two sets of variables obtained in the initialization step to finalize the CCA of the data. A further advantage is that the seeded CCA procedure is closely related to partial least squares (PLS), one of the popular statistical methods for large-p, small-n data, so seeded CCA can yield the PLS estimates.

The recently developed seedCCA package primarily implements seeded CCA. However, the package can fit a collection of statistical methods, namely standard canonical correlation analysis and partial least squares with uni-/multi-dimensional responses, including seeded CCA. The package has been uploaded to CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html). The main purpose of this paper is to introduce and illustrate the seedCCA package. To this end, three real data sets are fitted with standard CCA, seeded CCA, and partial least squares; two of the data sets are available in the package, and one of them was already analyzed in González et al. (2008). Accordingly, the implementation results of seeded CCA and regularized CCA are closely compared.

The paper is organized as follows. Section 2 discusses the collection of the three methods. The implementation of seedCCA is illustrated and compared with CCA in Section 3. We summarize our work in Section 4. Throughout the rest of the paper, we use the following notation. A p-dimensional random variable X is denoted by X ∈ R^p; that is, X ∈ R^p is understood to be a random variable even when this is not explicitly mentioned. For X ∈ R^p and Y ∈ R^r, we define cov(X) = Σx, cov(Y) = Σy, cov(X, Y) = Σxy, and cov(Y, X) = Σyx, and Σx and Σy are assumed to be positive definite.

Collection of the methods implemented in seedCCA

Canonical correlation analysis. Suppose two sets of variables X ∈ R^p and Y ∈ R^r, and consider their linear combinations U = aᵀX and V = bᵀY. Then var(U) = aᵀΣx a, var(V) = bᵀΣy b, and cov(U, V) = aᵀΣxy b, where a ∈ R^(p×1) and b ∈ R^(r×1). The Pearson correlation between U and V is

cor(U, V) = aᵀΣxy b / (√(aᵀΣx a) √(bᵀΣy b)).   (1)

We seek a and b that maximize cor(U, V) as follows:

1. The first pair of canonical variates (U1 = a1ᵀX, V1 = b1ᵀY) is obtained by maximizing (1).
2. The second pair (U2 = a2ᵀX, V2 = b2ᵀY) is constructed by maximizing (1) under the constraints that var(U2) = var(V2) = 1 and that (U1, V1) and (U2, V2) are uncorrelated.
3. At step k, the kth pair (Uk = akᵀX, Vk = bkᵀY) is obtained by maximizing (1) under the constraints that var(Uk) = var(Vk) = 1 and that (Uk, Vk) is uncorrelated with the previous (k − 1) pairs of canonical variates.
4. Steps 1 to 3 are repeated until k reaches q = min(p, r).
5. The first d pairs of (Uk, Vk) are selected to represent the relationship between X and Y.

Under this criterion, the pairs (ai, bi) are constructed as ai = Σx^(−1/2) ψi and bi = Σy^(−1/2) φi for i = 1, ..., q, where (ψ1, ..., ψq) and (φ1, ..., φq) are the q eigenvectors of Σx^(−1/2) Σxy Σy^(−1) Σyx Σx^(−1/2) and Σy^(−1/2) Σyx Σx^(−1) Σxy Σy^(−1/2), respectively, with common ordered eigenvalues ρ*₁² ≥ ··· ≥ ρ*q² ≥ 0. The matrices Mx = (a1, ..., ad) and My = (b1, ..., bd), for d = 1, ..., q, are called the canonical coefficient matrices, and MxᵀX and MyᵀY are called the canonical variates. In the sample, the population quantities are replaced by their usual moment estimators. For more details on standard CCA, readers are referred to Johnson and Wichern (2007).

Seeded canonical correlation analysis. Since standard CCA requires the inversion of Σ̂x and Σ̂y in practice, it is not applicable to high-dimensional data with max(p, r) > n. Im et al. (2014) proposed seeded canonical correlation analysis to overcome this deficiency. Seeded CCA is a two-step procedure consisting of an initialization step and a finalization step. In the initialization step, the original two sets of variables are reduced to m-dimensional pairs without loss of information about the CCA application, where m must be forced to satisfy m << n. In the finalization step, standard CCA is applied to the initially reduced pairs to restore orthogonality. A more detailed discussion of seeded CCA is given in the next subsection.

Let the notation S(M) stand for the subspace spanned by the columns of M ∈ R^(p×r). Lee and Yoo (2014) show the following relation:

S(Mx) = S(Σx^(−1) Σxy) and S(My) = S(Σy^(−1) Σyx).   (2)

The relation in (2) directly indicates that Mx and My form basis matrices of S(Σx^(−1) Σxy) and S(Σy^(−1) Σyx), and hence Mx and My can be recovered from Σx^(−1) Σxy and Σy^(−1) Σyx.
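The eigen-decomposition construction above can be checked directly in base R. A minimal sketch on simulated data (not the seedCCA interface), cross-checked against stats::cancor():

```r
## Standard CCA via the eigen-decompositions described in the text.
set.seed(7)
n <- 100; p <- 4; r <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1:r] + matrix(rnorm(n * r), n, r)   # induce an association

Sx <- cov(X); Sy <- cov(Y); Sxy <- cov(X, Y)

## Inverse square root of a covariance matrix via its spectral decomposition.
inv_sqrt <- function(S) {
  e <- eigen(S, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
}
Sx_h <- inv_sqrt(Sx); Sy_h <- inv_sqrt(Sy)

Kx <- Sx_h %*% Sxy %*% solve(Sy) %*% t(Sxy) %*% Sx_h
Ky <- Sy_h %*% t(Sxy) %*% solve(Sx) %*% Sxy %*% Sy_h

d  <- 2
Mx <- Sx_h %*% eigen(Kx, symmetric = TRUE)$vectors[, 1:d]  # a_1, ..., a_d
My <- Sy_h %*% eigen(Ky, symmetric = TRUE)$vectors[, 1:d]  # b_1, ..., b_d

## Canonical correlations are the square roots of the leading eigenvalues.
sqrt(eigen(Kx, symmetric = TRUE)$values[1:d])
cancor(X, Y)$cor[1:d]   # agrees with the base-R reference implementation
```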
{"title":"SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares","authors":"Boyoung Kim, Yunju Im, Keun Yoo Jae","doi":"10.32614/rj-2021-026","DOIUrl":"https://doi.org/10.32614/rj-2021-026","url":null,"abstract":"Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. The so-called seedCCA is a newly developed R package, and it implements not only the standard and seeded CCA but also partial least squares. The package enables us to fit CCA to large-p and small-n data. The paper provides a complete guide. Also, the seeded CCA application results are compared with the regularized CCA in the existing R package. It is believed that the package along with the paper will contribute to highdimensional data analysis in various science field practitioners and that the statistical methodologies in multivariate analysis become more fruitful. Introduction Explanatory studies are important to identify patterns and special structure in data prior to develop a specific model. When a study between two sets of a p-dimensional random variables X (X ∈ Rp) and a r-dimensional random variable Y (Y ∈ Rr), are of primary interest, one of the popular explanatory statistical methods would be canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is the dimension reduction of two sets of variables by measuring an association between the two sets. For this, pairs of linear combinations of variables are constructed by maximizing the Pearson correlation. The CCA has successful application in many science fields such as chemomtrics, pattern recognition, genomic sequence analysis and so on. In Lee and Yoo (2014) it is shown that the CCA can be used as a dimension reduction tool for high-dimensional data, but also it is connected to least square estimator. Therefore, the CCA is not only explanatory and dimension reduction method but also can be utilized as alternative of least square estimation. If max(p, r) is bigger than or equal to the sample size, n, usual CCA application is not plausible due to no incapability of inverting sample covariance matrices. To overcome this, a regularized CCA is developed by Leurgans et al. (1993), whose idea was firstly suggested in Vinod (1976). In practice, the CCA package by González et al. (2008) can implement a version of the regularized CCA. To make the sample covariance matrices saying Σ̂x and Σ̂y, invertible, in González et al. (2008), they are replaced with Σ̂ λ1 x = Σ̂x + λ1Ip and Σ̂ λ2 y = Σ̂y + λ1Ir. The optimal values of λ1 and λ2 are chosen by maximizing a cross-validation score throughout the two-dimensional grid search. Although it is discussed that a relatively small grid of reasonable values for λ1 and λ2 can lesson intensive computing in González et al. (2008), it is still time-consuming as observed in later sections. Additionally, fast regularized CCA and robust CCA via projection-pursuit are recently developed in Cruz-Cano (2012) and Alfons et al. (2016), respectively. 
Another version of CCA to handle max(p, ","PeriodicalId":20974,"journal":{"name":"R J.","volume":"44 1","pages":"7"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91542743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents' privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data with or without structural zeros. We describe the Dirichlet process mixture of products of multinomial distributions model used in the package and illustrate various uses of the package with data samples from the American Community Survey (ACS). We also compare the results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package.

Introduction and background

Multiple imputation for missing data

Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987).

As a brief review of the multiple imputation combining rules for missing data, let q be the completed-data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, ..., m, let q^(l) and u^(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is q̄m = (1/m) ∑_{l=1}^m q^(l), and the estimated variance associated with q̄m is Tm = (1 + 1/m) bm + ūm, where bm = ∑_{l=1}^m (q^(l) − q̄m)² / (m − 1) and ūm = (1/m) ∑_{l=1}^m u^(l). Inferences for Q are based on (q̄m − Q) ∼ t_v(0, Tm), where t_v is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m) bm])² degrees of freedom.

Multiple imputation by chained equations (MICE; Buuren and Groothuis-Oudshoorn, 2011) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART; Breiman et al., 1984; Burgette and Reiter, 2010), and then iteratively samples plausible predicted values from the sequence of conditional models. For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). For more details and reviews, see Rubin (1996), Harel and Zhou (2007), and Reiter and Raghunathan (2007).
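The combining rules above translate directly into code. A minimal sketch using the mice package mentioned in the text (the nhanes example data ships with mice), verified against mice's own pooling:

```r
## Rubin's combining rules applied to m completed datasets from mice.
library(mice)

m   <- 5
imp <- mice(nhanes, m = m, printFlag = FALSE, seed = 123)

## Per-imputation estimates q^(l) and variances u^(l) for the age
## coefficient in a linear model for bmi.
fits <- lapply(1:m, function(l) lm(bmi ~ age, data = complete(imp, l)))
q <- sapply(fits, function(f) coef(f)["age"])
u <- sapply(fits, function(f) vcov(f)["age", "age"])

q_bar <- mean(q)                        # pooled point estimate
b_m   <- sum((q - q_bar)^2) / (m - 1)   # between-imputation variance
u_bar <- mean(u)                        # within-imputation variance
T_m   <- (1 + 1/m) * b_m + u_bar        # total variance
v     <- (m - 1) * (1 + u_bar / ((1 + 1/m) * b_m))^2   # degrees of freedom

c(estimate = q_bar, se = sqrt(T_m), df = v)
## The same pooling is available via pool(with(imp, lm(bmi ~ age))).
```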
Statistical disclosure control

Statistical agencies regularly collect information from surveys and censuses and make it publicly available for various purposes, including research and policymaking. In many countries, statistical agencies are legally obliged to protect respondents' privacy when making this information available to the public. Statistical disclosure control (SDC) is the collection of techniques applied to confidential data before public release to protect privacy. Common SDC techniques for tabular data include cell suppression and adding noise, and common SDC techniques for respondent-level data (also known as microdata) include swapping, adding noise, and aggregation. Hundepool et al. (2012) provide a comprehensive review of SDC techniques and applications.

The multiple imputation approach has been generalized to SDC. One approach to facilitating microdata release is to provide synthetic data. First proposed by Little (1993) and Rubin (1993), the synthetic data approach estimates predictive models based on the original, confidential data, simulates synthetic values with draws from the predictive models, and produces multiple synthetic datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each synthetic dataset and use appropriate combining rules (which differ from those of multiple imputation) to obtain valid point estimates and variance estimates (Reiter and Raghunathan, 2007; Drechsler, 2011). Moreover, there are two types of synthetic data: fully synthetic data (Rubin, 1993), where every variable is deemed sensitive and therefore synthesized, and partially synthetic data (Little, 1993), where only a subset of variables is deemed sensitive and synthesized, while the remaining variables are left unsynthesized.
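As a sketch of the synthetic-data route using the synthpop package mentioned in the abstract (its bundled SD2011 survey data stands in for confidential microdata); the variable selection is an illustrative assumption:

```r
## Generating synthetic versions of selected variables with synthpop.
library(synthpop)

vars <- c("sex", "age", "edu", "income")
ods  <- SD2011[, vars]                    # original ("confidential") data
syn_out <- syn(ods, m = 3, seed = 2021)   # three synthetic datasets

summary(syn_out)
## Compare marginal distributions of synthetic vs. original data.
compare(syn_out, ods)
```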
{"title":"Multiple Imputation and Synthetic Data Generation with NPBayesImputeCat","authors":"Jingchen Hu, O. Akande, Quanli Wang","doi":"10.32614/rj-2021-080","DOIUrl":"https://doi.org/10.32614/rj-2021-080","url":null,"abstract":"In many contexts, missing data and disclosure control are ubiquitous and challenging issues. In particular, at statistical agencies, the respondent-level data they collect from surveys and censuses can suffer from high rates of missingness. Furthermore, agencies are obliged to protect respondents’ privacy when publishing the collected data for public use. The NPBayesImputeCat R package, introduced in this paper, provides routines to i) create multiple imputations for missing data and ii) create synthetic data for statistical disclosure control, for multivariate categorical data, with or without structural zeros. We describe the Dirichlet process mixture of products of the multinomial distributions model used in the package and illustrate various uses of the package using data samples from the American Community Survey (ACS). We also compare results of the missing data imputation to the mice R package and those of the synthetic data generation to the synthpop R package. Introduction and background Multiple imputation for missing data Missing data problems arise in many statistical analyses. To impute missing values, multiple imputation, first proposed by Rubin (1987), has been widely adopted. This approach estimates predictive models based on the observed data, fills in missing values with draws from the predictive models, and produces multiple imputed and completed datasets. Data analysts then apply standard statistical analyses (e.g., regression analysis) on each imputed dataset and use appropriate combining rules to obtain valid point estimates and variance estimates (Rubin, 1987). As a brief review of the multiple imputation combining rules for missing data, let q be the completed data estimator of some estimand of interest Q, and let u be the estimator of the variance of q. For l = 1, . . . , m, let q(l) and u(l) be the values of q and u in the lth completed dataset. The multiple imputation estimate of Q is equal to q̄m = ∑l=1 q (l)/m, and the estimated variance associated with q̄m is equal to Tm = (1 + 1/m)bm + ūm , where bm = ∑l=1(q (l) − q̄m)/(m − 1) and ūm = ∑l=1 u (l)/m. Inferences for Q are based on (q̄m − Q) ∼ tv(0, Tm), where tv is a t-distribution with v = (m − 1)(1 + ūm/[(1 + 1/m)bm]) degrees of freedom. Multiple imputation by chained equations (MICE, Buuren and Groothuis-Oudshoorn (2011)) remains the most popular method for generating multiple completed datasets after multiple imputation. Under MICE, one specifies univariate conditional models separately for each variable, usually using generalized linear models (GLMs) or classification and regression trees (CART Breiman et al. (1984); Burgette and Reiter (2010)), and then iteratively samples plausible predicted values from the sequence of conditional models . For implementing MICE in R, most analysts use the mice package. For an in-depth review of the MICE algorithm, see Buuren and Groothuis-Oudshoorn (2011). 
For more details and reviews, see Rubin (1996), Harel and Zhou (2007), R","PeriodicalId":20974,"journal":{"name":"R J.","volume":"24 1","pages":"25"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84673342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Welch’s two-sample t-test based on least squares (LS) estimators is generally used to test the equality of two normal means when the variances are unequal. However, this test loses power when the underlying distribution is not normal. In this paper, two different tests are proposed for testing the equality of two long-tailed symmetric (LTS) means under heterogeneous variances. Adaptive modified maximum likelihood (AMML) estimators are used in developing the proposed tests since they are highly efficient under the LTS distribution. An R package called RobustBF is provided to implement these tests. Simulated Type I error rates and powers of the proposed tests are also reported and compared with those of Welch’s t-test based on LS estimators via an extensive Monte Carlo simulation study.
{"title":"RobustBF: An R Package for Robust Solution to the Behrens-Fisher Problem","authors":"Gamze Güven, S. Acitas, Hatice Samkar, B. Şenoğlu","doi":"10.32614/rj-2021-107","DOIUrl":"https://doi.org/10.32614/rj-2021-107","url":null,"abstract":"Welch’s two-sample t-test based on least squares (LS) estimators is generally used to test the equality of two normal means when the variances are not equal. However, this test loses its power when the underlying distribution is not normal. In this paper, two different tests are proposed to test the equality of two long-tailed symmetric (LTS) means under heterogeneous variances. Adaptive modified maximum likelihood (AMML) estimators are used in developing the proposed tests since they are highly efficient under LTS distribution. An R package called RobustBF is given to show the implementation of these tests. Simulated Type I error rates and powers of the proposed tests are also given and compared with Welch’s t-test based on LS estimators via an extensive Monte Carlo simulation study.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"57 1","pages":"642"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81339370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In case–control studies, the odds ratio is commonly used to summarize the association between a binary exposure and a dichotomous outcome. However, exposure misclassification frequently appears in case–control studies due to inaccurate data reporting, which can produce bias in measures of association. In this article, we implement a Bayesian sensitivity analysis of misclassification to provide a full posterior inference on the corrected odds ratio under both non-differential and differential misclassification. We present an R (R Core Team, 2018) package BayesSenMC, which provides user-friendly functions for its implementation. The usage is illustrated by a real data analysis on the association between bipolar disorder and rheumatoid arthritis.
{"title":"BayesSenMC: an R package for Bayesian Sensitivity Analysis of Misclassification","authors":"Jinhui Yang, Lifeng Lin, H. Chu","doi":"10.32614/rj-2021-097","DOIUrl":"https://doi.org/10.32614/rj-2021-097","url":null,"abstract":"In case–control studies, the odds ratio is commonly used to summarize the association between a binary exposure and a dichotomous outcome. However, exposure misclassification frequently appears in case–control studies due to inaccurate data reporting, which can produce bias in measures of association. In this article, we implement a Bayesian sensitivity analysis of misclassification to provide a full posterior inference on the corrected odds ratio under both non-differential and differential misclassification. We present an R (R Core Team, 2018) package BayesSenMC, which provides user-friendly functions for its implementation. The usage is illustrated by a real data analysis on the association between bipolar disorder and rheumatoid arthritis.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"15 1","pages":"123"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83607910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Index numbers are descriptive statistical measures used in economic settings to compare simple and complex magnitudes, usually registered in two time periods. Although the theory has a long history, it still plays an important role in modern societies, where large amounts of economic data are available and need to be analyzed. After a detailed review of classical index numbers in the literature, this paper focuses on describing the R package IndexNumber and its capabilities for calculating them. Two of the four real data sets contained in this library are used to illustrate the determination of index numbers. Graphical tools are also implemented to show the time evolution of the considered magnitudes, simplifying the interpretation of the results.
{"title":"IndexNumber: An R Package for Measuring the Evolution of Magnitudes","authors":"A. Saavedra-Nieves, P. Saavedra-Nieves","doi":"10.32614/rj-2021-038","DOIUrl":"https://doi.org/10.32614/rj-2021-038","url":null,"abstract":"Index numbers are descriptive statistical measures useful in economic settings for comparing simple and complex magnitudes registered, usually in two time periods. Although this theory has a large history, it still plays an important role in modern today’s societies where big amounts of economic data are available and need to be analyzed. After a detailed revision on classical index numbers in literature, this paper is focused on the description of the R package IndexNumber with strong capabilities for calculating them. Two of the four real data sets contained in this library are used for illustrating the determination of the index numbers in this work. Graphical tools are also implemented in order to show the time evolution of considered magnitudes simplifying the interpretation of the results.","PeriodicalId":20974,"journal":{"name":"R J.","volume":"31 1","pages":"253"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90080497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}