Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models make it possible to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers and other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. Various other procedures for mediation analysis are included in robmed as well. Moreover, robmed introduces a new formula interface that allows mediation models to be specified with a single formula, and it provides various plots for diagnostics and visual representation of the results.
Robust Mediation Analysis: The R Package robmed. A. Alfons, N. Ateş, P. Groenen. Journal of Statistical Software, published 2022-02-24. DOI: https://doi.org/10.18637/jss.v103.i13
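To make the product-of-coefficients idea from the robmed abstract concrete, the following Python sketch estimates an indirect effect a*b from two ordinary least-squares regressions and attaches a percentile bootstrap interval. It is only an illustration of the classical OLS bootstrap that the abstract describes as outlier-sensitive; it does not reproduce robmed's robust procedure or its interface. All function names, the simulated data, and the true coefficients (a = 2, b = 3) are assumptions made for this example.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def slope_simple(x, y):
    """OLS slope of y on x (intercept included implicitly by centering)."""
    mx, my = _mean(x), _mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slopes_two(x, m, y):
    """OLS coefficients (b, c) in y ~ b*m + c*x + intercept."""
    mx, mm, my = _mean(x), _mean(m), _mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    smm = sum((mi - mm) ** 2 for mi in m)
    sxm = sum((xi - mx) * (mi - mm) for xi, mi in zip(x, m))
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    smy = sum((mi - mm) * (yi - my) for mi, yi in zip(m, y))
    det = smm * sxx - sxm ** 2  # zero only if m and x are collinear
    b = (smy * sxx - sxy * sxm) / det
    c = (sxy * smm - smy * sxm) / det
    return b, c


def indirect_effect(x, m, y):
    """Product of coefficients: a from m ~ x, b from y ~ m + x."""
    a = slope_simple(x, m)
    b, _ = slopes_two(x, m, y)
    return a * b


def bootstrap_ci(x, m, y, reps=1000, alpha=0.05, seed=None):
    """Percentile bootstrap interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lo = estimates[int(alpha / 2 * reps)]
    hi = estimates[int((1 - alpha / 2) * reps) - 1]
    return lo, hi


def simulate_mediation(n, seed=None):
    """Hypothetical data: m = 2x + e1, y = 3m + x + e2 (true a*b = 6)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    m = [2 * xi + rng.gauss(0, 1) for xi in x]
    y = [3 * mi + xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]
    return x, m, y


if __name__ == "__main__":
    x, m, y = simulate_mediation(200, seed=1)
    print("indirect effect:", indirect_effect(x, m, y))
    print("95% bootstrap CI:", bootstrap_ci(x, m, y, reps=500, seed=2))
```

The indirect effect is judged significant when the interval excludes zero. Replacing a few observations with gross outliers is a quick way to see how strongly both the estimate and the interval can move under OLS.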
This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for a contiguity constraint arises, for instance, when one wants to partition a map into different domains of similar physical conditions, identify discontinuities in time series, group regional administrative units with respect to their performance, and so on. To increase computational efficiency, we programmed the core functions in plain C. The result is a new R function, constr.hclust, which is distributed in package adespatial. The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966; 1967), with the particularity of allowing only clusters that are contiguous in geographic space or along time to fuse at any given step. Contiguity can be defined with respect to space or time. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering. Information on temporal contiguity can be implicitly provided as the rank positions of observations in the time series. The implementation was modeled on that found in the hierarchical clustering function hclust of the standard R package stats (R Core Team 2022). We transcribed that function from Fortran to C and added the functionality to apply constraints when running the function. The implementation is efficient. It is limited mainly by input/output access, as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We provide R code for plotting results for different numbers of clusters.
Hierarchical Clustering with Contiguity Constraint in R. G. Guénard, P. Legendre. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v103.i07
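The temporal contiguity constraint described in the abstract on constrained hierarchical clustering can be illustrated with a small Python sketch of chronological clustering: at every step only segments that are adjacent in time may fuse. This is not the constr.hclust code or its interface. The function name is hypothetical, and the sketch assumes Ward's criterion (smallest increase in within-segment sum of squares) computed directly on a univariate series rather than through the Lance and Williams update on a dissimilarity matrix.

```python
def chronological_clustering(values, n_clusters):
    """Agglomerate a time-ordered series into n_clusters contiguous segments.

    Each segment is stored as [start, end, total, total_of_squares], where
    end is exclusive. Only neighbouring segments are candidates for a merge,
    which is the temporal contiguity constraint.
    """
    if not 1 <= n_clusters <= len(values):
        raise ValueError("n_clusters must be between 1 and len(values)")

    segments = [[i, i + 1, v, v * v] for i, v in enumerate(values)]

    def within_ss(seg):
        # Sum of squared deviations from the segment mean.
        size = seg[1] - seg[0]
        return seg[3] - seg[2] ** 2 / size

    def fuse(a, b):
        return [a[0], b[1], a[2] + b[2], a[3] + b[3]]

    while len(segments) > n_clusters:
        best_index, best_cost = None, None
        for i in range(len(segments) - 1):
            a, b = segments[i], segments[i + 1]
            # Ward cost: increase in within-segment sum of squares.
            cost = within_ss(fuse(a, b)) - within_ss(a) - within_ss(b)
            if best_cost is None or cost < best_cost:
                best_index, best_cost = i, cost
        merged = fuse(segments[best_index], segments[best_index + 1])
        segments[best_index:best_index + 2] = [merged]

    return [(seg[0], seg[1]) for seg in segments]


if __name__ == "__main__":
    series = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 9.0, 9.1]
    print(chronological_clustering(series, 3))
```

Each returned pair is a half-open index range, so the breakpoints between ranges mark the discontinuities in the series. A spatial constraint would replace the "neighbouring segments" rule with the edges of a connection network among sites.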
Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.
stringi: Fast and Portable Character String Processing in R. M. Gagolewski. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v103.i02
Learning Base R (2nd Edition). James E. Helmreich. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v103.b01
bbl: Boltzmann Bayes Learner for High-Dimensional Inference with Discrete Predictors in R. J. Woo, Jinhua Wang. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v101.i05
Linear transformation models, including the proportional hazards model and proportional odds model, under right censoring were discussed by Chen, Jin, and Ying (2002). The asymptotic variance of the estimator they proposed has a closed form and can be obtained easily by plug-in rules, which improves the computational efficiency. We develop an R package TransModel based on Chen’s approach. The detailed usage of the package is discussed, and the function is applied to the Veterans’ Administration lung cancer data.
TransModel: An R Package for Linear Transformation Model with Censored Data. Jie Zhou, Jiajia Zhang, Wenbin Lu. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v101.i09
Francesco Pantalone, R. Benedetti, Federica Pierismoni
The basic idea underpinning the theory of spatially balanced sampling is that units closer to each other provide less information about a target of inference than units farther apart. It is therefore desirable to select a sample that is well spread over the population of interest, that is, a spatially balanced sample. This situation arises in environmental, geological, biological, and agricultural surveys, among many others, where the main feature of the population is usually that it is geo-referenced. Since traditional sampling designs generally do not exploit spatial features, and since it is desirable to take information about spatial dependence into account, several sampling designs have been developed to achieve this objective. In this paper, we present the R package Spbsampling, which provides functions to perform three specific sampling designs that pursue this purpose. In particular, these sampling designs achieve spatially balanced samples using a summary index of the distance matrix. In this sense, the applicability of the package is much wider, as a distance matrix can be defined for units according to variables other than geographical coordinates.
Spbsampling: An R Package for Spatially Balanced Sampling. Francesco Pantalone, R. Benedetti, Federica Pierismoni. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v103.c02
Christophe Dutang, Vincent Goulet, Nicholas Langevin
Actuaries model insurance claim amounts using heavy-tailed probability distributions. They routinely need to evaluate quantities related to these distributions, such as quantiles in the far right tail, moments, or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise intractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution for the R package actuar. The Feller-Pareto defines a large family of heavy-tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution.
Feller-Pareto and Related Distributions: Numerical Implementation and Actuarial Applications. Christophe Dutang, Vincent Goulet, Nicholas Langevin. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v103.i06
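The quantities named in the Feller-Pareto abstract (tail quantiles, limited moments, simulation) can be sketched in Python for one simple member of the Pareto family. The example assumes a Pareto type II (Lomax) distribution with location zero and survival function S(x) = (scale / (x + scale))^shape. It is not the actuar implementation, and the function names and parameterization are chosen for this illustration only.

```python
import random


def pareto2_quantile(p, shape, scale):
    """Inverse of F(x) = 1 - (scale / (x + scale)) ** shape, for 0 <= p < 1."""
    return scale * ((1.0 - p) ** (-1.0 / shape) - 1.0)


def pareto2_limited_mean(limit, shape, scale):
    """Limited expected value E[min(X, limit)], valid for shape != 1.

    Obtained by integrating the survival function from 0 to limit.
    """
    ratio = scale / (limit + scale)
    return scale / (shape - 1.0) * (1.0 - ratio ** (shape - 1.0))


def pareto2_sample(n, shape, scale, seed=None):
    """Simulate n claim amounts by inverse-transform sampling."""
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so the quantile is always finite.
    return [pareto2_quantile(rng.random(), shape, scale) for _ in range(n)]


if __name__ == "__main__":
    shape, scale = 2.0, 1.0
    print("99.9% quantile:", pareto2_quantile(0.999, shape, scale))
    print("E[min(X, 1)]:", pareto2_limited_mean(1.0, shape, scale))
    claims = pareto2_sample(10000, shape, scale, seed=42)
    capped = [min(c, 1.0) for c in claims]
    print("simulated E[min(X, 1)]:", sum(capped) / len(capped))
```

Comparing the closed-form limited mean with its simulated counterpart is a simple consistency check. The closed form matters most in the far tail, where simulation needs very large samples to be accurate.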
Steffen Grønneberg, Njål Foldnes, Katerina M. Marcoulides
In factor analysis and structural equation modeling, non-normal data simulation is traditionally performed by specifying univariate skewness and kurtosis together with the target covariance matrix. However, this leaves little control over the univariate distributions and the multivariate copula of the simulated vector. In this paper we explain how a more flexible simulation method called vine-to-anything (VITA) may be obtained from copula-based techniques, as implemented in a new R package, covsim. VITA is based on the concept of a regular vine, where bivariate copulas are coupled together into a full multivariate copula. We illustrate how to simulate continuous and ordinal data for covariance modeling, and how to use the new package discnorm to test for underlying normality in ordinal data. An introduction to copula and vine simulation is provided in the appendix.
covsim: An R Package for Simulating Non-Normal Data for Structural Equation Models Using Copulas. Steffen Grønneberg, Njål Foldnes, Katerina M. Marcoulides. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v102.i03
The poolr package provides an implementation of a variety of methods for pooling (i.e., combining) p values, including Fisher’s method, Stouffer’s method, the inverse chi-square method, the binomial test, the Bonferroni method, and Tippett’s method. More importantly, the methods can be adjusted to account for dependence among the tests from which the p values have been derived, assuming multivariate normality among the test statistics. All methods can be adjusted based on an estimate of the effective number of tests or by using an empirically derived null distribution based on pseudo replicates that mimics a proper permutation test. For the Fisher, Stouffer, and inverse chi-square methods, the test statistics can also be directly generalized to account for dependence, leading to Brown’s method, Strube’s method, and the generalized inverse chi-square method. In this paper, we describe the various methods, discuss their implementation in the package, illustrate their use based on several examples, and compare the poolr package with several other packages that can be used to combine p values.
The poolr Package for Combining Independent and Dependent p Values. Ozan Cinar, W. Viechtbauer. Journal of Statistical Software, published 2022-01-01. DOI: https://doi.org/10.18637/jss.v101.i01
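Four of the pooling methods listed in the poolr abstract have simple closed forms when the p values are independent, and the Python sketch below implements them with the standard library only. It is not the poolr interface, the function names are chosen for this example, and none of the dependence adjustments discussed in the abstract (effective number of tests, empirical null distributions, Brown's or Strube's generalizations) are covered.

```python
import math
from statistics import NormalDist


def fisher(pvals):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k df under H0."""
    k = len(pvals)
    half = -sum(math.log(p) for p in pvals)  # half of the test statistic
    # Chi-square survival function for even df 2k has the closed form
    # exp(-half) * sum_{i=0}^{k-1} half**i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def stouffer(pvals):
    """Stouffer's method: average the z scores and rescale by sqrt(k).

    Requires 0 < p < 1 for every p value, because the normal quantile
    function is undefined at 0 and 1.
    """
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)


def bonferroni(pvals):
    """Bonferroni method: smallest p value times the number of tests."""
    return min(1.0, len(pvals) * min(pvals))


def tippett(pvals):
    """Tippett's method: exact null distribution of the minimum p value."""
    return 1.0 - (1.0 - min(pvals)) ** len(pvals)


if __name__ == "__main__":
    pvals = [0.02, 0.10, 0.30, 0.04]
    for method in (fisher, stouffer, bonferroni, tippett):
        print(method.__name__, method(pvals))
```

With a single p value, Fisher's, Stouffer's, and Tippett's methods all return that p value unchanged, which is a convenient sanity check. When the underlying tests are correlated, these unadjusted versions can be anti-conservative, which is the case the dependence adjustments are designed for.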