The R package microsynth has been developed for implementation of the synthetic control methodology for comparative case studies involving micro- or meso-level data. The methodology implemented within microsynth is designed to assess the efficacy of a treatment or intervention within a well-defined geographic region that is itself a composite of several smaller regions (where data are available at the more granular level for comparison regions as well). The effect of the intervention on one or more time-varying outcomes is evaluated by determining a synthetic control region that resembles the treatment region across pre-intervention values of the outcome(s) and time-invariant covariates and that is a weighted composite of many untreated comparison regions. The microsynth procedure includes functionality that enables its user to (1) calculate weights for synthetic control, (2) tabulate results for statistical inferences, and (3) create time series plots of outcomes for treatment and synthetic control. In this article, microsynth is described in detail and its application is illustrated using data from a drug market intervention in Seattle, WA.
{"title":"microsynth: Synthetic Control Methods for Disaggregated and Micro-Level Data in R","authors":"Michael W Robbins, Steven Davenport","doi":"10.18637/JSS.V097.I02","DOIUrl":"https://doi.org/10.18637/JSS.V097.I02","url":null,"abstract":"The R package microsynth has been developed for implementation of the synthetic control methodology for comparative case studies involving micro- or meso-level data. The methodology implemented within microsynth is designed to assess the efficacy of a treatment or intervention within a well-defined geographic region that is itself a composite of several smaller regions (where data are available at the more granular level for comparison regions as well). The effect of the intervention on one or more time-varying outcomes is evaluated by determining a synthetic control region that resembles the treatment region across pre-intervention values of the outcome(s) and time-invariant covariates and that is a weighted composite of many untreated comparison regions. The microsynth procedure includes functionality that enables its user to (1) calculate weights for synthetic control, (2) tabulate results for statistical inferences, and (3) create time series plots of outcomes for treatment and synthetic control. In this article, microsynth is described in detail and its application is illustrated using data from a drug market intervention in Seattle, WA.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79200084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
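The weighting idea in the microsynth abstract above (a control built as a weighted composite of untreated units that matches the treated unit before the intervention) can be illustrated with a deliberately small toy. This is a sketch in Python, not microsynth's R interface or its weighting algorithm: it assumes only two comparison units, a single outcome, and a plain grid search. The function names and the example numbers are hypothetical.

```python
def synthetic_series(weights, donors):
    """Weighted combination of donor series, one value per time period."""
    periods = len(donors[0])
    return [sum(w * d[t] for w, d in zip(weights, donors)) for t in range(periods)]


def fit_two_donor_weight(treated_pre, donor_a_pre, donor_b_pre, steps=100):
    """Grid-search w in [0, 1] so that w*A + (1-w)*B best matches the
    treated unit's pre-intervention outcomes (squared-error criterion)."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        synth = synthetic_series([w, 1.0 - w], [donor_a_pre, donor_b_pre])
        loss = sum((y - s) ** 2 for y, s in zip(treated_pre, synth))
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w


# Hypothetical data: three pre-intervention periods, one post-intervention period.
donor_a = [10.0, 12.0, 14.0, 16.0]
donor_b = [20.0, 18.0, 22.0, 24.0]
treated = [17.0, 16.2, 19.6, 18.0]
pre = 3

w = fit_two_donor_weight(treated[:pre], donor_a[:pre], donor_b[:pre])
synth = synthetic_series([w, 1.0 - w], [donor_a, donor_b])
effect = treated[pre] - synth[pre]  # observed minus synthetic, post-intervention
```

With many comparison units and several outcomes and covariates, a grid search is no longer feasible, which is where a dedicated weighting procedure such as the one the package provides becomes necessary.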
S. Brilleman, R. Wolfe, M. Moreno-Betancur, M. Crowther
The simsurv R package allows users to simulate survival (i.e., time-to-event) data from standard parametric distributions (exponential, Weibull, and Gompertz), two-component mixture distributions, or a user-defined hazard function. Baseline covariates can be included under a proportional hazards assumption. Clustered event times, for example individuals within a family, are also easily accommodated. Time-dependent effects (i.e., nonproportional hazards) can be included by interacting covariates with linear time or a user-defined function of time. Under a user-defined hazard function, event times can be generated for a variety of complex models such as flexible (spline-based) baseline hazards, models with time-varying covariates, or joint longitudinal-survival models.
{"title":"Simulating Survival Data Using the simsurv R Package","authors":"S. Brilleman, R. Wolfe, M. Moreno-Betancur, M. Crowther","doi":"10.18637/JSS.V097.I03","DOIUrl":"https://doi.org/10.18637/JSS.V097.I03","url":null,"abstract":"The simsurv R package allows users to simulate survival (i.e., time-to-event) data from standard parametric distributions (exponential, Weibull, and Gompertz), two-component mixture distributions, or a user-defined hazard function. Baseline covariates can be included under a proportional hazards assumption. Clustered event times, for example individuals within a family, are also easily accommodated. Time-dependent effects (i.e., nonproportional hazards) can be included by interacting covariates with linear time or a user-defined function of time. Under a user-defined hazard function, event times can be generated for a variety of complex models such as flexible (spline-based) baseline hazards, models with time-varying covariates, or joint longitudinal-survival models.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76268703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
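Simulating event times from a Weibull baseline under proportional hazards, as the simsurv abstract above describes, is commonly done by inverting the survival function. The following is a minimal Python sketch of that inverse-transform step, not simsurv's R interface. It assumes the baseline hazard h0(t) = lam * gamma * t^(gamma - 1); the function names, the parameter names `lam`, `gamma`, and `beta`, and the binary covariate are illustrative assumptions.

```python
import math
import random


def weibull_ph_time(u, lam, gamma, linpred):
    """Solve S(t) = exp(-lam * t**gamma * exp(linpred)) = u for t."""
    return (-math.log(u) / (lam * math.exp(linpred))) ** (1.0 / gamma)


def simulate(n, lam, gamma, beta, seed=0):
    """Draw n (covariate, event time) pairs with one binary covariate."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)      # hypothetical binary baseline covariate
        u = 1.0 - rng.random()     # uniform on (0, 1], so log(u) is defined
        rows.append((x, weibull_ph_time(u, lam, gamma, beta * x)))
    return rows


rows = simulate(100, lam=0.1, gamma=1.5, beta=-0.5)
```

A larger linear predictor raises the hazard and therefore shortens the simulated time for the same uniform draw. Censoring, clustering, and time-dependent effects are not covered by this sketch.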
We introduce the new package dmbc that implements a Bayesian algorithm for clustering a set of binary dissimilarity matrices within a model-based framework. Specifically, we consider the case when S matrices are available, each describing the dissimilarities among the same n objects, possibly expressed by S subjects (judges), or measured under different experimental conditions, or with reference to different characteristics of the objects themselves. In particular, we focus on binary dissimilarities, taking values 0 or 1 depending on whether or not two objects are deemed dissimilar. We are interested in analyzing such data using multidimensional scaling (MDS). Unlike standard MDS algorithms, our goal is to cluster the dissimilarity matrices and, simultaneously, to extract an MDS configuration specific for each cluster. To this end, we develop a fully Bayesian three-way MDS approach, where the elements of each dissimilarity matrix are modeled as a mixture of Bernoulli random vectors. The parameter estimates and the MDS configurations are derived using a hybrid Metropolis-Gibbs Markov Chain Monte Carlo algorithm. We also propose a BIC-like criterion for jointly selecting the optimal number of clusters and latent space dimensions. We illustrate our approach using both synthetic data and a publicly available data set taken from the literature. For the sake of efficiency, the core computations in the package are implemented in C/C++. The package also allows the simulation of multiple chains through the support of the parallel package.
{"title":"A Bayesian Approach for Model-Based Clustering of Several Binary Dissimilarity Matrices: The dmbc Package in R","authors":"S. Venturini, R. Piccarreta","doi":"10.18637/jss.v100.i16","DOIUrl":"https://doi.org/10.18637/jss.v100.i16","url":null,"abstract":"We introduce the new package dmbc that implements a Bayesian algorithm for clustering a set of binary dissimilarity matrices within a model-based framework. Specifically, we consider the case when S matrices are available, each describing the dissimilarities among the same n objects, possibly expressed by S subjects (judges), or measured under different experimental conditions, or with reference to different characteristics of the objects themselves. In particular, we focus on binary dissimilarities, taking values 0 or 1 depending on whether or not two objects are deemed dissimilar. We are interested in analyzing such data using multidimensional scaling (MDS). Unlike standard MDS algorithms, our goal is to cluster the dissimilarity matrices and, simultaneously, to extract an MDS configuration specific for each cluster. To this end, we develop a fully Bayesian three-way MDS approach, where the elements of each dissimilarity matrix are modeled as a mixture of Bernoulli random vectors. The parameter estimates and the MDS configurations are derived using a hybrid Metropolis-Gibbs Markov Chain Monte Carlo algorithm. We also propose a BIC-like criterion for jointly selecting the optimal number of clusters and latent space dimensions. We illustrate our approach using both synthetic data and a publicly available data set taken from the literature. For the sake of efficiency, the core computations in the package are implemented in C/C++. The package also allows the simulation of multiple chains through the support of the parallel package.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76029407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. T. Ho, Kim P. Huynh, David T. Jacho-Chávez, Diego Rojas-Baez
Stata (StataCorp 2019) is one of the most widely used software packages for data analysis, statistics, and model fitting among economists, public policy researchers, and epidemiologists, among others. Stata’s release of version 16 in June 2019 includes an up-to-date methodological library and user-friendly versions of various cutting-edge techniques. In this release, Stata has implemented several changes and additions (see https://www.stata.com/new-in-stata/), including lasso, multiple data sets in memory, meta-analysis, choice models, Python integration, Bayes-multiple chains, panel-data extended regression models, sample-size analysis for confidence intervals, panel-data mixed logit, nonlinear dynamic stochastic general equilibrium (DSGE) models, and numerical integration. This review covers the most salient innovations in Stata 16, which is the first release to include an implementation of machine-learning tools. The three innovations we consider in this review are: (1) multiple data sets in memory, (2) lasso for causal inference, and (3) Python integration. The following three sections describe each of these innovations in turn. The last section presents our final thoughts and conclusions.
{"title":"Data Science in Stata 16: Frames, Lasso, and Python Integration","authors":"A. T. Ho, Kim P. Huynh, David T. Jacho-Chávez, Diego Rojas-Baez","doi":"10.18637/jss.v098.s01","DOIUrl":"https://doi.org/10.18637/jss.v098.s01","url":null,"abstract":"Stata (StataCorp 2019) is one of the most widely used software packages for data analysis, statistics, and model fitting among economists, public policy researchers, and epidemiologists, among others. Stata’s release of version 16 in June 2019 includes an up-to-date methodological library and user-friendly versions of various cutting-edge techniques. In this release, Stata has implemented several changes and additions (see https://www.stata.com/new-in-stata/), including lasso, multiple data sets in memory, meta-analysis, choice models, Python integration, Bayes-multiple chains, panel-data extended regression models, sample-size analysis for confidence intervals, panel-data mixed logit, nonlinear dynamic stochastic general equilibrium (DSGE) models, and numerical integration. This review covers the most salient innovations in Stata 16, which is the first release to include an implementation of machine-learning tools. The three innovations we consider in this review are: (1) multiple data sets in memory, (2) lasso for causal inference, and (3) Python integration. The following three sections describe each of these innovations in turn. The last section presents our final thoughts and conclusions.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80646023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nicolás M Ballarini, Marius Thomas, G. Rosenkranz, B. Bornkamp
The investigation of subgroups is an integral part of randomized clinical trials. Exploration of treatment effect heterogeneity is typically performed by covariate-adjusted analyses including treatment-by-covariate interactions. Several statistical techniques, such as model averaging and bagging, were proposed recently to address the problem of selection bias in treatment effect estimates for subgroups. In this paper, we describe the subtee R package for subgroup treatment effect estimation. The package can be used for all commonly encountered types of outcomes in clinical trials (continuous, binary, survival, count). We also provide additional functions to build the subgroup variables to be used and to plot the results using forest plots. The functions are demonstrated using data from a clinical trial investigating a treatment for prostate cancer with a survival endpoint.
{"title":"subtee: An R Package for Subgroup Treatment Effect Estimation in Clinical Trials","authors":"Nicolás M Ballarini, Marius Thomas, G. Rosenkranz, B. Bornkamp","doi":"10.18637/jss.v099.i14","DOIUrl":"https://doi.org/10.18637/jss.v099.i14","url":null,"abstract":"The investigation of subgroups is an integral part of randomized clinical trials. Exploration of treatment effect heterogeneity is typically performed by covariate-adjusted analyses including treatment-by-covariate interactions. Several statistical techniques, such as model averaging and bagging, were proposed recently to address the problem of selection bias in treatment effect estimates for subgroups. In this paper, we describe the subtee R package for subgroup treatment effect estimation. The package can be used for all commonly encountered types of outcomes in clinical trials (continuous, binary, survival, count). We also provide additional functions to build the subgroup variables to be used and to plot the results using forest plots. The functions are demonstrated using data from a clinical trial investigating a treatment for prostate cancer with a survival endpoint.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82936817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IncDTW: An R Package for Incremental Calculation of Dynamic Time Warping","authors":"Maximilian Leodolter, C. Plant, Norbert Brändle","doi":"10.18637/jss.v099.i09","DOIUrl":"https://doi.org/10.18637/jss.v099.i09","url":null,"abstract":"","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78058745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This introduction to the R package BNPmix is currently in press in the Journal of Statistical Software. BNPmix is an R package for Bayesian nonparametric multivariate density estimation, clustering, and regression, using Pitman-Yor mixture models, a flexible and robust generalization of the popular class of Dirichlet process mixture models. A variety of model specifications and state-of-the-art posterior samplers are implemented. In order to achieve computational efficiency, all sampling methods are written in C++ and seamlessly integrated into R by means of the Rcpp and RcppArmadillo packages. BNPmix exploits the ggplot2 capabilities and implements a series of generic functions to plot and print summaries of posterior densities and induced clustering of the data.
{"title":"BNPmix: An R Package for Bayesian Nonparametric Modeling via Pitman-Yor Mixtures","authors":"R. Corradin, A. Canale, Bernardo Nipoti","doi":"10.18637/jss.v100.i15","DOIUrl":"https://doi.org/10.18637/jss.v100.i15","url":null,"abstract":"This introduction to the R package BNPmix is currently in press in the Journal of Statistical Software. BNPmix is an R package for Bayesian nonparametric multivariate density estimation, clustering, and regression, using Pitman-Yor mixture models, a flexible and robust generalization of the popular class of Dirichlet process mixture models. A variety of model specifications and state-of-the-art posterior samplers are implemented. In order to achieve computational efficiency, all sampling methods are written in C++ and seamlessly integrated into R by means of the Rcpp and RcppArmadillo packages. BNPmix exploits the ggplot2 capabilities and implements a series of generic functions to plot and print summaries of posterior densities and induced clustering of the data.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83762513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"dynamichazard: Dynamic Hazard Models Using State Space Models","authors":"Benjamin Christoffersen","doi":"10.18637/jss.v099.i07","DOIUrl":"https://doi.org/10.18637/jss.v099.i07","url":null,"abstract":"","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85183686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The spatial scan statistic is commonly used to detect spatial disease clusters in epidemiological studies. Among the various types of scan statistics, the flexible scan statistic proposed by Tango and Takahashi (2005) is one of the most promising methods to detect arbitrarily-shaped clusters. In this paper, we introduce a new R package, rflexscan (Otani and Takahashi 2021), that provides efficient and easy-to-use methods for analyses of spatial count data using the flexible spatial scan statistic. The package is designed for any of the following interrelated purposes: to evaluate whether reported spatial disease clusters are statistically significant, to test whether a disease is randomly distributed over space, and to perform geographical surveillance of disease to detect areas of significantly high rates. The functionality of the package is demonstrated through an application to a public-domain small-area cancer incidence dataset in New York State, USA, between 2005 and 2009.
{"title":"Flexible Scan Statistics for Detecting Spatial Disease Clusters: The rflexscan R Package","authors":"Takahiro Otani, Kunihiko Takahashi","doi":"10.18637/jss.v099.i13","DOIUrl":"https://doi.org/10.18637/jss.v099.i13","url":null,"abstract":"The spatial scan statistic is commonly used to detect spatial disease clusters in epidemiological studies. Among the various types of scan statistics, the flexible scan statistic proposed by Tango and Takahashi (2005) is one of the most promising methods to detect arbitrarily-shaped clusters. In this paper, we introduce a new R package, rflexscan (Otani and Takahashi 2021), that provides efficient and easy-to-use methods for analyses of spatial count data using the flexible spatial scan statistic. The package is designed for any of the following interrelated purposes: to evaluate whether reported spatial disease clusters are statistically significant, to test whether a disease is randomly distributed over space, and to perform geographical surveillance of disease to detect areas of significantly high rates. The functionality of the package is demonstrated through an application to a public-domain small-area cancer incidence dataset in New York State, USA, between 2005 and 2009.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81783097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
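The rflexscan abstract above does not state the test statistic itself. As a hedged illustration, the Python sketch below computes the Poisson log-likelihood ratio commonly used in spatial scan statistics for a single candidate zone, assuming expected counts have been scaled so that they sum to the total number of observed cases. The function name and the numbers are hypothetical, and this is not the package's R interface. The search over connected candidate zones and the Monte Carlo significance test are not shown.

```python
import math


def poisson_llr(cases_in, expected_in, cases_total):
    """Log-likelihood ratio for one candidate zone under a Poisson model.

    Assumes expected counts over the whole study area sum to cases_total.
    Returns 0.0 when the zone has no excess of cases.
    """
    c, e, total = cases_in, expected_in, cases_total
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    if c == total:
        outside = 0.0  # no cases outside the zone, so that term vanishes
    else:
        outside = (total - c) * math.log((total - c) / (total - e))
    return inside + outside


# Hypothetical candidate zones: (observed cases, expected cases), 100 cases overall.
zones = {"zone_1": (20, 10.0), "zone_2": (12, 10.0), "zone_3": (8, 10.0)}
scores = {name: poisson_llr(c, e, 100) for name, (c, e) in zones.items()}
most_likely = max(scores, key=scores.get)
```

The zone with the largest ratio is the most likely cluster. Its significance would then be judged against the distribution of the maximum ratio under randomly generated data.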
NScluster is an R package used for simulation and parameter estimation for Neyman-Scott cluster point process models and their extensions. For parameter estimation, NScluster uses the maximum Palm likelihood estimation procedure. As some estimation procedures proposed herein require heavy calculation, NScluster can use parallel computation via OpenMP and achieve significant speedup in some cases. In this paper, we discuss results obtained using a laptop PC and a shared memory supercomputer. In addition, we examine the performance characteristics of parallel computation via OpenMP.
{"title":"NScluster: An R Package for Maximum Palm Likelihood Estimation for Cluster Point Process Models Using OpenMP","authors":"U. Tanaka, Masami Saga, Junji Nakano","doi":"10.18637/jss.v098.i06","DOIUrl":"https://doi.org/10.18637/jss.v098.i06","url":null,"abstract":"NScluster is an R package used for simulation and parameter estimation for Neyman-Scott cluster point process models and their extensions. For parameter estimation, NScluster uses the maximum Palm likelihood estimation procedure. As some estimation procedures proposed herein require heavy calculation, NScluster can use parallel computation via OpenMP and achieve significant speedup in some cases. In this paper, we discuss results obtained using a laptop PC and a shared memory supercomputer. In addition, we examine the performance characteristics of parallel computation via OpenMP.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85885022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
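To make the simulation side of the NScluster abstract above concrete, the following Python sketch generates one realization of a Thomas process, a standard member of the Neyman-Scott family chosen here purely for illustration. It is not NScluster's R interface. The function and parameter names (`mu` for the mean number of parents, `nu` for the mean offspring per parent, `sigma` for the displacement scale) are assumptions, and edge handling is simplified: offspring falling outside the unit square are dropped.

```python
import math
import random


def poisson_draw(rng, mean):
    """Knuth's multiplication method; adequate for small to moderate means."""
    limit, k, p = math.exp(-mean), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def simulate_thomas(mu, nu, sigma, seed=0):
    """One realization on the unit square.

    Parents: Poisson(mu) count, uniformly located (parents are not returned).
    Offspring: Poisson(nu) per parent, displaced by isotropic Gaussian noise.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(poisson_draw(rng, mu)):
        px, py = rng.random(), rng.random()
        for _ in range(poisson_draw(rng, nu)):
            x, y = rng.gauss(px, sigma), rng.gauss(py, sigma)
            if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0:  # simplified edge handling
                points.append((x, y))
    return points


points = simulate_thomas(mu=20, nu=5, sigma=0.02, seed=1)
```

Because each parent's offspring are generated independently, the outer loop is the natural unit to distribute across threads. This sketch runs serially.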