This paper introduces the logitr R package for fast maximum likelihood estimation of multinomial logit and mixed logit models with unobserved heterogeneity across individuals, which is modeled by allowing parameters to vary randomly over individuals according to a chosen distribution. The package is faster than other similar packages such as mlogit, gmnl, mixl, and apollo, and it supports utility models specified with "preference space" or "willingness to pay (WTP) space" parameterizations, allowing for the direct estimation of marginal WTP. The typical procedure of computing WTP post-estimation using a preference space model can lead to unreasonable distributions of WTP across the population in mixed logit models. The paper provides a discussion of some of the implications of each utility parameterization for WTP estimates. It also highlights some of the design features that enable logitr's performant estimation speed and includes a benchmarking exercise with similar packages. Finally, the paper highlights additional features that are designed specifically for WTP space models, including a consistent user interface for specifying models in either space and a parallelized multi-start optimization loop, which is particularly useful for searching the solution space for different local minima when estimating models with non-convex log-likelihood functions.
{"title":"logitr: Fast Estimation of Multinomial and Mixed Logit Models with Preference Space and Willingness-to-Pay Space Utility Parameterizations","authors":"J. Helveston","doi":"10.18637/jss.v105.i10","DOIUrl":"https://doi.org/10.18637/jss.v105.i10","url":null,"abstract":"This paper introduces the logitr R package for fast maximum likelihood estimation of multinomial logit and mixed logit models with unobserved heterogeneity across individuals, which is modeled by allowing parameters to vary randomly over individuals according to a chosen distribution. The package is faster than other similar packages such as mlogit, gmnl, mixl, and apollo, and it supports utility models specified with\"preference space\"or\"willingness to pay (WTP) space\"parameterizations, allowing for the direct estimation of marginal WTP. The typical procedure of computing WTP post-estimation using a preference space model can lead to unreasonable distributions of WTP across the population in mixed logit models. The paper provides a discussion of some of the implications of each utility parameterization for WTP estimates. It also highlights some of the design features that enable logitr's performant estimation speed and includes a benchmarking exercise with similar packages. Finally, the paper highlights additional features that are designed specifically for WTP space models, including a consistent user interface for specifying models in either space and a parallelized multi-start optimization loop, which is particularly useful for searching the solution space for different local minima when estimating models with non-convex log-likelihood functions.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79549247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
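The abstract's contrast between the two parameterizations can be sketched in a few lines of R. The example below uses the `yogurt` data shipped with logitr; argument names (`outcome`, `obsID`, `pars`, `scalePar`, `numMultiStarts`) follow recent versions of the package but should be checked against its documentation.

```r
# Sketch: preference-space vs. WTP-space estimation with logitr
library(logitr)

# Preference-space multinomial logit
mnl_pref <- logitr(
  data    = yogurt,
  outcome = "choice",
  obsID   = "obsID",
  pars    = c("price", "feat", "brand")
)

# WTP-space equivalent: "price" becomes the scale parameter, and a
# multi-start loop guards against local optima in the non-convex
# log-likelihood
mnl_wtp <- logitr(
  data           = yogurt,
  outcome        = "choice",
  obsID          = "obsID",
  pars           = c("feat", "brand"),
  scalePar       = "price",
  numMultiStarts = 10
)

summary(mnl_wtp)
wtp(mnl_pref, scalePar = "price")  # post-estimation WTP for comparison
```

The `wtp()` call illustrates the "typical procedure" the abstract warns about: computing WTP from a preference-space fit, which can be compared against the directly estimated WTP-space coefficients.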
The bayesTFR package for R provides a set of functions to produce probabilistic projections of the total fertility rate (TFR) for all countries, and is widely used, including as part of the basis for the UN's official population projections for all countries. Liu and Raftery (2020) extended the theoretical model by adding a layer that accounts for past TFR estimation uncertainty. A major update of bayesTFR implements this extension. Moreover, a new feature for producing annual TFR estimates and projections extends the existing functionality of estimating and projecting for five-year time periods. An additional autoregressive component has been developed to account for the larger autocorrelation in the annual version of the model. This article summarizes the updated model, describes the basic steps for generating probabilistic estimates and projections under different settings, compares performance, and provides instructions on how to summarize, visualize, and diagnose the model results.
{"title":"Probabilistic Estimation and Projection of the Annual Total Fertility Rate Accounting for Past Uncertainty: A Major Update of the bayesTFR R Package","authors":"Peiran Liu, H. Ševčíková, A. Raftery","doi":"10.18637/jss.v106.i08","DOIUrl":"https://doi.org/10.18637/jss.v106.i08","url":null,"abstract":"The bayesTFR package for R provides a set of functions to produce probabilistic projections of the total fertility rates (TFR) for all countries, and is widely used, including as part of the basis for the UN's official population projections for all countries. Liu and Raftery (2020) extended the theoretical model by adding a layer that accounts for the past TFR estimation uncertainty. A major update of bayesTFR implements the new extension. Moreover, a new feature of producing annual TFR estimation and projections extends the existing functionality of estimating and projecting for five-year time periods. An additional autoregressive component has been developed in order to account for the larger autocorrelation in the annual version of the model. This article summarizes the updated model, describes the basic steps to generate probabilistic estimation and projections under different settings, compares performance, and provides instructions on how to summarize, visualize and diagnose the model results.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83223922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
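A minimal sketch of the workflow described above is given below. The directory name is arbitrary, and the iteration counts are deliberately tiny so the sketch runs quickly; the `annual` and `uncertainty` arguments correspond to the two new features the abstract describes, but their exact defaults should be checked in the package documentation.

```r
# Sketch: annual TFR estimation with past-uncertainty accounting in bayesTFR
library(bayesTFR)

sim.dir <- file.path(tempdir(), "tfr-sim")

# Phase II MCMC on annual data, propagating past estimation uncertainty
m2 <- run.tfr.mcmc(output.dir = sim.dir, iter = 10, nr.chains = 1,
                   annual = TRUE, uncertainty = TRUE)

# Phase III MCMC and probabilistic projections to 2100
m3   <- run.tfr3.mcmc(sim.dir = sim.dir, iter = 10, nr.chains = 1)
pred <- tfr.predict(sim.dir = sim.dir, end.year = 2100,
                    burnin = 5, burnin3 = 5)

# Visualize one country's projected trajectories
tfr.trajectories.plot(pred, country = "Nigeria")
</code>
```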
Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models allow researchers to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers and other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. Various other procedures for mediation analysis are included in package robmed as well. Moreover, robmed introduces a new formula interface that allows users to specify mediation models with a single formula, and it provides various plots for diagnostics and visual representation of the results.
{"title":"Robust Mediation Analysis: The R Package robmed","authors":"A. Alfons, N. Ateş, P. Groenen","doi":"10.18637/jss.v103.i13","DOIUrl":"https://doi.org/10.18637/jss.v103.i13","url":null,"abstract":"Mediation analysis is one of the most widely used statistical techniques in the social, behavioral, and medical sciences. Mediation models allow to study how an independent variable affects a dependent variable indirectly through one or more intervening variables, which are called mediators. The analysis is often carried out via a series of linear regressions, in which case the indirect effects can be computed as products of coefficients from those regressions. Statistical significance of the indirect effects is typically assessed via a bootstrap test based on ordinary least-squares estimates. However, this test is sensitive to outliers or other deviations from normality assumptions, which poses a serious threat to empirical testing of theory about mediation mechanisms. The R package robmed implements a robust procedure for mediation analysis based on the fast-and-robust bootstrap methodology for robust regression estimators, which yields reliable results even when the data deviate from the usual normality assumptions. Various other procedures for mediation analysis are included in package robmed as well. Moreover, robmed introduces a new formula interface that allows to specify mediation models with a single formula, and provides various plots for diagnostics or visual representation of the results.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74051967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
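The single-formula interface mentioned above can be sketched with the `BSG2014` example data that ships with robmed; the `m()` wrapper inside the formula marks the mediator.

```r
# Sketch: robust bootstrap test for an indirect effect with robmed
library(robmed)

set.seed(20220224)  # bootstrap results are random; fix the seed

# TaskConflict mediates the effect of ValueDiversity on TeamCommitment;
# the robust fast-and-robust bootstrap is the default behavior
fit <- test_mediation(TeamCommitment ~ m(TaskConflict) + ValueDiversity,
                      data = BSG2014)
summary(fit)

# Diagnostic plot of the bootstrap distribution of the indirect effect
density_plot(fit)
```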
{"title":"bbl: Boltzmann Bayes Learner for High-Dimensional Inference with Discrete Predictors in R","authors":"J. Woo, Jinhua Wang","doi":"10.18637/jss.v101.i05","DOIUrl":"https://doi.org/10.18637/jss.v101.i05","url":null,"abstract":"","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74074917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for a contiguity constraint arises, for instance, when one wants to partition a map into different domains of similar physical conditions, identify discontinuities in time series, group regional administrative units with respect to their performance, and so on. To increase computational efficiency, we programmed the core functions in plain C. The result is a new R function, constr.hclust, which is distributed in package adespatial. The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966, 1967), with the particularity of allowing only clusters that are contiguous in geographic space or along time to fuse at any given step. Contiguity can be defined with respect to space or time. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering; information on temporal contiguity can be provided implicitly as the rank positions of observations in the time series. The implementation mirrors that of the hierarchical clustering function hclust in the standard R package stats (R Core Team 2022): we transcribed that function from Fortran to C and added the functionality to apply constraints when running it. The implementation is efficient; it is limited mainly by input/output access, as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We provide R code for plotting the results for chosen numbers of clusters.
{"title":"Hierarchical Clustering with Contiguity Constraint in R","authors":"G. Guénard, P. Legendre","doi":"10.18637/jss.v103.i07","DOIUrl":"https://doi.org/10.18637/jss.v103.i07","url":null,"abstract":"This article presents a new implementation of hierarchical clustering for the R language that allows one to apply spatial or temporal contiguity constraints during the clustering process. The need for contiguity constraint arises, for instance, when one wants to partition a map into different domains of similar physical conditions, identify discontinuities in time series, group regional administrative units with respect to their performance, and so on. To increase computation efficiency, we programmed the core functions in plain C . The result is a new R function, constr.hclust , which is distributed in package adespatial . The program implements the general agglomerative hierarchical clustering algorithm described by Lance and Williams (1966; 1967), with the particularity of allowing only clusters that are contiguous in geographic space or along time to fuse at any given step. Contiguity can be defined with respect to space or time. Information about spatial contiguity is provided by a connection network among sites, with edges describing the links between connected sites. Clustering with a temporal contiguity constraint is also known as chronological clustering. Information on temporal contiguity can be implicitly provided as the rank positions of observations in the time series. The implementation was mirrored on that found in the hierarchical clustering function hclust of the standard R package stats ( R Core Team 2022). We transcribed that function from Fortran to C and added the functionality to apply constraints when running the function. The implementation is efficient. It is limited mainly by input/output access as massive amounts of memory are potentially needed to store copies of the dissimilarity matrix and update its elements when analyzing large problems. We provided R computer code for plotting results for numbers of clusters.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87740464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
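A toy sketch of the constrained clustering described above follows. The coordinates and the edge list are invented illustration values (in practice the links would come from a neighborhood graph); argument names such as `links`, `coords`, and `chron` follow the adespatial documentation but should be verified there.

```r
# Sketch: spatially constrained clustering with adespatial::constr.hclust
library(adespatial)

coords <- cbind(x = c(0, 1, 2, 0, 1, 2),
                y = c(0, 0, 0, 1, 1, 1))
dat <- data.frame(v = c(1.0, 1.2, 5.0, 0.9, 5.2, 5.1))
d   <- dist(dat)

# Edges of the connection network: only linked sites may fuse
links <- matrix(c(1, 2,  2, 3,  4, 5,  5, 6,  1, 4,  2, 5,  3, 6),
                ncol = 2, byrow = TRUE)

res <- constr.hclust(d, method = "ward.D2", links = links, coords = coords)
plot(res, k = 2)  # map of the two-cluster solution

# Chronological clustering: only temporally adjacent observations merge
res_t <- constr.hclust(d, method = "ward.D2", chron = TRUE)
```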
Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.
{"title":"stringi: Fast and Portable Character String Processing in R","authors":"M. Gagolewski","doi":"10.18637/jss.v103.i02","DOIUrl":"https://doi.org/10.18637/jss.v103.i02","url":null,"abstract":"Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and bioinformatics. This paper discusses and demonstrates how and why stringi, a mature R package for fast and portable handling of string data based on ICU (International Components for Unicode), should be included in each statistician’s or data scientist’s repertoire to complement their numerical computing and data wrangling skills.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90189254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
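A few of the ICU-backed operations the abstract enumerates (pattern searching, locale-aware collation, transliteration) can be demonstrated in a handful of standard stringi calls:

```r
# Sketch: pattern searching, collation, and transliteration with stringi
library(stringi)

# Pattern searching with ICU regular expressions
stri_detect_regex(c("actuary", "statistician"), "stat")
# FALSE for "actuary", TRUE for "statistician"

# Locale-aware collation: Danish treats "aa" as the letter å,
# which sorts after z
stri_sort(c("aalborg", "zeta"), locale = "da_DK")

# Transliteration / normalization to plain ASCII
stri_trans_general("Gagolewski Ünïcode", "Latin-ASCII")
```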
{"title":"Learning Base R (2nd Edition)","authors":"James E. Helmreich","doi":"10.18637/jss.v103.b01","DOIUrl":"https://doi.org/10.18637/jss.v103.b01","url":null,"abstract":"","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90818707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear transformation models, including the proportional hazards model and the proportional odds model, under right censoring were discussed by Chen, Jin, and Ying (2002). The asymptotic variance of the estimator they proposed has a closed form and can be obtained easily by plug-in rules, which improves computational efficiency. We develop an R package, TransModel, based on Chen’s approach. The detailed usage of the package is discussed, and the function is applied to the Veterans’ Administration lung cancer data.
{"title":"TransModel: An R Package for Linear Transformation Model with Censored Data","authors":"Jie Zhou, Jiajia Zhang, Wenbin Lu","doi":"10.18637/jss.v101.i09","DOIUrl":"https://doi.org/10.18637/jss.v101.i09","url":null,"abstract":"Linear transformation models, including the proportional hazards model and proportional odds model, under right censoring were discussed by Chen, Jin, and Ying (2002). The asymptotic variance of the estimator they proposed has a closed form and can be obtained easily by plug-in rules, which improves the computational efficiency. We develop an R package TransModel based on Chen’s approach. The detailed usage of the package is discussed, and the function is applied to the Veterans’ Administration lung cancer data.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74541724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
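The fitting step can be sketched as below with the `veteran` lung cancer data from the survival package. The `r` argument is assumed here to index the transformation (with `r = 0` giving the proportional hazards model and `r = 1` the proportional odds model); check the TransModel documentation before relying on it.

```r
# Sketch: linear transformation models with censored data via TransModel
library(TransModel)
library(survival)  # provides Surv() and the veteran data

data(veteran)

# Proportional hazards (r = 0) and proportional odds (r = 1) fits
fit_ph <- TransModel(Surv(time, status) ~ karno + diagtime,
                     data = veteran, r = 0)
fit_po <- TransModel(Surv(time, status) ~ karno + diagtime,
                     data = veteran, r = 1)

summary(fit_ph)  # plug-in standard errors, no resampling needed
```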
Christophe Dutang, Vincent Goulet, Nicholas Langevin
Actuaries model insurance claim amounts using heavy-tailed probability distributions. They routinely need to evaluate quantities related to these distributions, such as quantiles in the far right tail, moments, or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise intractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution in the R package actuar. The Feller-Pareto distribution defines a large family of heavy-tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution.
{"title":"Feller-Pareto and Related Distributions: Numerical Implementation and Actuarial Applications","authors":"Christophe Dutang, Vincent Goulet, Nicholas Langevin","doi":"10.18637/jss.v103.i06","DOIUrl":"https://doi.org/10.18637/jss.v103.i06","url":null,"abstract":"Actuaries model insurance claim amounts using heavy tailed probability distributions. They routinely need to evaluate quantities related to these distributions such as quantiles in the far right tail, moments or limited moments. Furthermore, actuaries often resort to simulation to solve otherwise untractable risk evaluation problems. The paper discusses our implementation of support functions for the Feller-Pareto distribution for the R package actuar . The Feller-Pareto defines a large family of heavy tailed distributions encompassing the transformed beta family and many variants of the Pareto distribution.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90748725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
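The support functions described above follow actuar's usual d/p/q/r plus moment-function naming. The sketch below uses illustrative parameter values; the `min`/`shape1`/`shape2`/`shape3`/`scale` parameterization is the package's, but should be double-checked in its documentation.

```r
# Sketch: Feller-Pareto support functions in actuar
library(actuar)

# Density, a far-right-tail quantile, and simulated claim amounts
dfpareto(100,   min = 0, shape1 = 2, shape2 = 3, shape3 = 1, scale = 50)
qfpareto(0.995, min = 0, shape1 = 2, shape2 = 3, shape3 = 1, scale = 50)
x <- rfpareto(1000, min = 0, shape1 = 2, shape2 = 3, shape3 = 1, scale = 50)

# First raw moment, and the first limited moment at a policy limit of 500
# (i.e., the expected claim payment under that limit)
mfpareto(1, min = 0, shape1 = 2, shape2 = 3, shape3 = 1, scale = 50)
levfpareto(500, min = 0, shape1 = 2, shape2 = 3, shape3 = 1, scale = 50,
           order = 1)
```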
Francesco Pantalone, R. Benedetti, Federica Piersimoni
The basic idea underpinning the theory of spatially balanced sampling is that units closer to each other provide less information about a target of inference than units farther apart. It is therefore desirable to select a sample well spread over the population of interest, known as a spatially balanced sample. This situation arises in, among many others, environmental, geological, biological, and agricultural surveys, where the population is usually geo-referenced. Since traditional sampling designs generally do not exploit spatial features, and since it is desirable to take information about spatial dependence into account, several sampling designs have been developed to achieve this objective. In this paper, we present the R package Spbsampling, which provides functions to perform three specific sampling designs that pursue this purpose. In particular, these sampling designs achieve spatially balanced samples using a summary index of the distance matrix. In this sense, the applicability of the package is much wider, as a distance matrix can be defined for units according to variables other than geographical coordinates.
{"title":"Spbsampling: An R Package for Spatially Balanced Sampling","authors":"Francesco Pantalone, R. Benedetti, Federica Pierismoni","doi":"10.18637/jss.v103.c02","DOIUrl":"https://doi.org/10.18637/jss.v103.c02","url":null,"abstract":"The basic idea underpinning the theory of spatially balanced sampling is that units closer to each other provide less information about a target of inference than units farther apart. Therefore, it should be desirable to select a sample well spread over the population of interest, or a spatially balanced sample . This situation is easily understood in, among many others, environmental, geological, biological, and agricultural surveys, where usually the main feature of the population is to be geo-referenced. Since traditional sampling designs generally do not exploit the spatial features and since it is desirable to take into account the information regarding spatial dependence, several sampling designs have been developed in order to achieve this objective. In this paper, we present the R package Spbsampling , which provides functions in order to perform three specific sampling designs that pursue the aforementioned purpose. In particular, these sampling designs achieve spatially balanced samples using a summary index of the distance matrix. In this sense, the applicability of the package is much wider, as a distance matrix can be defined for units according to variables different than geographical coordinates.","PeriodicalId":17237,"journal":{"name":"Journal of Statistical Software","volume":null,"pages":null},"PeriodicalIF":5.8,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80110451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
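A rough sketch of drawing one spatially balanced sample follows. The `stprod()` standardization and the `pwd()` (product-within-distance) design are functions of the package, but the exact argument names and return structure assumed here should be verified against the Spbsampling documentation.

```r
# Sketch: spatially balanced sampling with Spbsampling
library(Spbsampling)

set.seed(1)
coords <- cbind(runif(50), runif(50))   # toy geo-referenced population
dis    <- as.matrix(dist(coords))       # distance matrix between units

# Standardize the distance matrix (constraint vector of zeros assumed)
dis_std <- stprod(mat = dis, con = rep(0, nrow(dis)))$mat

# Draw one sample of 10 units with the product-within-distance design;
# $s is assumed to hold the indices of the selected units
samp <- pwd(dis = dis_std, n = 10)$s
samp
```

As the abstract notes, nothing here is specific to geography: `dis` could equally be built from any variables for which a distance between units is meaningful.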