
Journal of Statistical Software: Latest Articles

BayesSUR: An R Package for High-Dimensional Multivariate Bayesian Variable and Covariance Selection in Linear Regression
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-04-28 DOI: 10.18637/jss.v100.i11
Zhi Zhao, Marco Banterle, L. Bottolo, S. Richardson, A. Lewin, M. Zucknick
In molecular biology, advances in high-throughput technologies have made it possible to study complex multivariate phenotypes and their simultaneous associations with high-dimensional genomic and other omics data, a problem that can be studied with high-dimensional multi-response regression, where the response variables are potentially highly correlated. For this purpose, we recently introduced several multivariate Bayesian variable and covariance selection models, e.g., Bayesian estimation methods for sparse seemingly unrelated regression for variable and covariance selection. Several variable selection priors have been implemented in this context, in particular the hotspot detection prior for latent variable inclusion indicators, which results in sparse variable selection for associations between predictors and multiple phenotypes. Here, we also propose an alternative, which uses a Markov random field (MRF) prior for incorporating prior knowledge about the dependence structure of the inclusion indicators. Inference of Bayesian seemingly unrelated regression (SUR) by Markov chain Monte Carlo methods is made computationally feasible by factorisation of the covariance matrix amongst the response variables. In this paper we present BayesSUR, an R package, which allows the user to easily specify and run a range of different Bayesian SUR models, which have been implemented in C++ for computational efficiency. The R package allows the specification of the models in a modular way, where the user chooses the priors for variable selection and for covariance selection separately. We demonstrate the performance of sparse SUR models with the hotspot prior and spike-and-slab MRF prior on synthetic and real data sets representing eQTL or mQTL studies and in vitro anti-cancer drug screening studies as examples for typical applications.
Citations: 8
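The Markov random field prior mentioned in the BayesSUR abstract can be illustrated with a small sketch. It assumes the commonly used form log p(gamma) = d * sum(gamma) + e * gamma' G gamma + const for a vector of binary inclusion indicators gamma, where G is a symmetric adjacency matrix encoding prior dependence and d, e are hyperparameters. The function name, the hyperparameter values and the toy graph are hypothetical, and this is not the BayesSUR API.

```python
def mrf_log_prior(gamma, G, d=-2.0, e=0.5):
    """Unnormalised log MRF prior: d * sum(gamma) + e * gamma' G gamma."""
    n = len(gamma)
    sparsity = d * sum(gamma)  # d < 0 penalises every included indicator
    smooth = e * sum(gamma[i] * G[i][j] * gamma[j]
                     for i in range(n) for j in range(n))  # e > 0 rewards linked pairs
    return sparsity + smooth

# Toy graph (assumed): indicators 0 and 1 are linked a priori, 2 is isolated.
G = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
linked = mrf_log_prior([1, 1, 0], G)    # two included indicators that share an edge
unlinked = mrf_log_prior([1, 0, 1], G)  # two included indicators with no edge
```

With these assumed values, two selected indicators that share an edge in G receive a higher prior weight than two unconnected ones, which is how structural knowledge can enter the variable selection.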
Statistical Network Analysis with Bergm
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-04-06 DOI: 10.18637/jss.v104.i01
A. Caimo, Lampros Bouranis, Robert W. Krause, N. Friel
Recent advances in computational methods for intractable models have made network data increasingly amenable to statistical analysis. Exponential random graph models (ERGMs) emerged as one of the main families of models capable of capturing the complex dependence structure of network data in a wide range of applied contexts. The Bergm package for R has become a popular package to carry out Bayesian parameter inference, missing data imputation, model selection and goodness-of-fit diagnostics for ERGMs. Over the last few years, the package has been considerably improved in terms of efficiency by adopting some of the state-of-the-art Bayesian computational methods for doubly-intractable distributions. Recently, version 5 of the package has been made available on CRAN, having undergone a substantial makeover that has made it more accessible and easier to use for practitioners. New functions include data augmentation procedures based on the approximate exchange algorithm for dealing with missing data, and adjusted pseudo-likelihood and pseudo-posterior procedures, which allow for fast approximate inference of the ERGM parameter posterior and model evidence for networks with several thousand nodes.
Citations: 3
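To make the intractability mentioned in the Bergm abstract concrete, the following sketch evaluates an ERGM likelihood by brute force on a tiny undirected network. It assumes the standard form P(y | theta) = exp(theta . s(y)) / z(theta) with two illustrative statistics (edge count and triangle count). The function names are hypothetical and unrelated to the Bergm API.

```python
from itertools import combinations, product
from math import exp

def ergm_stats(edges, n):
    """Sufficient statistics s(y): (number of edges, number of triangles)."""
    edge_set = set(edges)  # edges are tuples (a, b) with a < b
    triangles = sum(
        1 for a, b, c in combinations(range(n), 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )
    return len(edge_set), triangles

def ergm_probability(edges, n, theta):
    """Return (P(y | theta), z(theta)), summing z over every graph on n nodes."""
    dyads = list(combinations(range(n), 2))

    def weight(es):
        s = ergm_stats(es, n)
        return exp(theta[0] * s[0] + theta[1] * s[1])

    z = 0.0
    for mask in product([0, 1], repeat=len(dyads)):
        z += weight([d for d, keep in zip(dyads, mask) if keep])
    return weight(edges) / z, z

triangle = [(0, 1), (0, 2), (1, 2)]
p_flat, z_flat = ergm_probability(triangle, n=3, theta=(0.0, 0.0))
p_clustered, _ = ergm_probability(triangle, n=3, theta=(-1.0, 2.0))
```

The sum defining z(theta) runs over 2^(n(n-1)/2) graphs, so it is only feasible for a handful of nodes. That growth is why approximate schemes such as the exchange algorithm and pseudo-likelihood adjustments named in the abstract are needed for realistic networks.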
deepregression: a Flexible Neural Network Framework for Semi-Structured Deep Distributional Regression
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-04-06 DOI: 10.18637/jss.v105.i02
D. Rügamer, Ruolin Shen, Christina Bukas, Lisa Barros de Andrade e Sousa, Dominik Thalmeier, N. Klein, Chris Kolb, Florian Pfisterer, Philipp Kopper, B. Bischl, C. Müller
In this paper we describe the implementation of semi-structured deep distributional regression, a flexible framework to learn conditional distributions based on the combination of additive regression models and deep networks. Our implementation encompasses (1) a modular neural network building system based on the deep learning library TensorFlow for the fusion of various statistical and deep learning approaches, (2) an orthogonalization cell to allow for an interpretable combination of different subnetworks, as well as (3) pre-processing steps necessary to set up such models. The software package allows users to define models in a user-friendly manner via a formula interface that is inspired by classical statistical model frameworks such as mgcv. The package's modular design and functionality provide a unique resource for both scalable estimation of complex statistical models and the combination of approaches from deep learning and statistics. This allows for state-of-the-art predictive performance while simultaneously retaining the indispensable interpretability of classical statistical models.
Citations: 15
mosum: A Package for Moving Sums in Change-Point Analysis
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-03-19 DOI: 10.18637/JSS.V097.I08
Alexander Meier, C. Kirch, Haeran Cho
Time series data, i.e., temporally ordered data, is routinely collected and analysed in many fields of natural science, economy, technology and medicine, where it is of importance to verify the assumption of stochastic stationarity prior to modeling the data. Nonstationarities in the data are often attributed to structural changes with segments between adjacent change-points being approximately stationary. A particularly important, and thus widely studied, problem in statistics and signal processing is to detect changes in the mean at unknown time points. In this paper, we present the R package mosum, which implements elegant and mathematically well-justified procedures for the multiple mean change problem using the moving sum statistics.
Citations: 32
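The moving sum idea behind the mosum abstract can be sketched in a few lines. This is a minimal illustration, not the mosum package: it assumes a known noise variance of 1 instead of a local variance estimate, a single bandwidth G, and a simulated series with one mean shift. The function name and the simulation settings are hypothetical.

```python
import random

def mosum_statistic(x, G):
    """Scaled contrast |sum(x[k:k+G]) - sum(x[k-G:k])| / sqrt(2G) for k = G..n-G."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)  # prefix[k] = x[0] + ... + x[k-1]
    stat = [0.0] * (n + 1)             # positions without a full window stay at 0
    for k in range(G, n - G + 1):
        right = prefix[k + G] - prefix[k]
        left = prefix[k] - prefix[k - G]
        stat[k] = abs(right - left) / (2 * G) ** 0.5
    return stat

random.seed(1)
# Simulated data (assumed): mean 0 for 100 points, then mean 5, unit variance.
x = ([random.gauss(0.0, 1.0) for _ in range(100)]
     + [random.gauss(5.0, 1.0) for _ in range(100)])
stat = mosum_statistic(x, G=30)
k_hat = max(range(len(stat)), key=stat.__getitem__)  # location of the largest contrast
```

The statistic peaks where the windows to the left and right of k differ most in their sums, so its maximiser serves as a change-point estimate. Handling several change-points additionally requires a threshold and a rule for picking local maxima, which this sketch leaves out.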
svars: An R Package for Data-Driven Identification in Multivariate Time Series Analysis
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-03-19 DOI: 10.18637/JSS.V097.I05
Alexander Lange, B. Dalheimer, H. Herwartz, Simone Maxand
Structural vector autoregressive (SVAR) models are frequently applied to trace the contemporaneous linkages among (macroeconomic) variables back to an interplay of orthogonal structural shocks. Under Gaussianity the structural parameters are unidentified without additional (often external and not data-based) information. In contrast, the often reasonable assumption of heteroskedastic and/or non-Gaussian model disturbances offers the possibility to identify unique structural shocks. We describe the R package svars which implements statistical identification techniques that can be either heteroskedasticity-based or independence-based. Moreover, it includes a rich variety of analysis tools that are well known in the SVAR literature. Next to a comprehensive review of the theoretical background, we provide a detailed description of the associated R functions. Furthermore, a macroeconomic application serves as a step-by-step guide on how to apply these functions to the identification and interpretation of structural VAR models.
Citations: 16
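The statement in the svars abstract that the structural parameters are unidentified under Gaussianity can be checked numerically. The sketch assumes reduced-form errors u = B * eps with orthogonal unit-variance shocks, so that Cov(u) = B B'. The impact matrix B and the rotation angle are hypothetical values chosen for illustration.

```python
from math import cos, sin, pi

def matmul(A, B):
    """Plain matrix product for small nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

B = [[1.0, 0.0],
     [0.5, 2.0]]                      # hypothetical structural impact matrix
angle = pi / 6
Q = [[cos(angle), -sin(angle)],
     [sin(angle), cos(angle)]]        # orthogonal rotation, Q Q' = I
B_rot = matmul(B, Q)                  # an alternative impact matrix

sigma = matmul(B, transpose(B))
sigma_rot = matmul(B_rot, transpose(B_rot))
max_gap = max(abs(sigma[i][j] - sigma_rot[i][j])
              for i in range(2) for j in range(2))
```

Both B and B_rot imply the same error covariance up to rounding, and a Gaussian distribution is fully described by its mean and covariance, so the likelihood cannot separate them. Heteroskedasticity or non-Gaussian shocks add information beyond a single covariance matrix, which is what the identification techniques in the abstract exploit.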
FamEvent: An R Package for Generating and Modeling Time-to-Event Data in Family Designs
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-03-01 Epub Date: 2021-03-19 DOI: 10.18637/jss.v097.i07
Yun-Hee Choi, Laurent Briollais, Wenqing He, Karen Kopciuk

FamEvent is a comprehensive R package for simulating and modelling age-at-disease onset in families carrying a rare gene mutation. The package can simulate complex family data for variable time-to-event outcomes under three common family study designs (population, high-risk clinic and multi-stage) with various levels of missing genetic information among family members. Residual familial correlation can be induced through the inclusion of a frailty term or a second gene. Disease-gene carrier probabilities are evaluated assuming Mendelian transmission or empirically from the data. When genetic information on the disease gene is missing, an Expectation-Maximization algorithm is employed to calculate the carrier probabilities. Penetrance model functions with ascertainment correction adapted to the sampling design provide age-specific cumulative disease risks by sex, mutation status, and other covariates for simulated data as well as real data analysis. Robust standard errors and 95% confidence intervals are available for these estimates. Plots of pedigrees and penetrance functions based on the fitted model provide graphical displays to evaluate and summarize the models.

Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8427460/pdf/nihms-1735562.pdf
Citations: 0
intRinsic: An R Package for Model-Based Estimation of the Intrinsic Dimension of a Dataset
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-02-23 DOI: 10.18637/jss.v106.i09
Francesco Denti
This article illustrates intRinsic, an R package that implements novel state-of-the-art likelihood-based estimators of the intrinsic dimension of a dataset, an essential quantity for most dimensionality reduction techniques. In order to make these novel estimators easily accessible, the package contains a small number of high-level functions that rely on a broader set of efficient, low-level routines. Generally speaking, intRinsic encompasses models that fall into two categories: homogeneous and heterogeneous intrinsic dimension estimators. The first category contains the two nearest neighbors estimator, a method derived from the distributional properties of the ratios of the distances between each data point and its first two closest neighbors. The functions dedicated to this method carry out inference under both the frequentist and Bayesian frameworks. In the second category, we find the heterogeneous intrinsic dimension algorithm, a Bayesian mixture model for which an efficient Gibbs sampler is implemented. After presenting the theoretical background, we demonstrate the performance of the models on simulated datasets. This way, we can facilitate the exposition by immediately assessing the validity of the results. Then, we employ the package to study the intrinsic dimension of the Alon dataset, obtained from a famous microarray experiment. Finally, we show how the estimation of homogeneous and heterogeneous intrinsic dimensions allows us to gain valuable insights into the topological structure of a dataset.
Citations: 4
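The two nearest neighbors estimator described in the intRinsic abstract can be sketched as follows. The sketch assumes the usual model in which the ratio mu_i = r2_i / r1_i of the distances to the second and first nearest neighbor follows a Pareto(1, d) distribution, which gives the maximum likelihood estimate d_hat = n / sum(log mu_i). The function name and the simulated data are hypothetical, and the brute-force neighbor search is chosen for clarity, not speed.

```python
import math
import random

def twonn_mle(points):
    """TWO-NN maximum likelihood estimate of the intrinsic dimension."""
    log_mu_sum = 0.0
    for i, p in enumerate(points):
        r1 = r2 = math.inf            # distances to the two closest neighbors
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1
            elif dist < r2:
                r2 = dist
        log_mu_sum += math.log(r2 / r1)
    return len(points) / log_mu_sum

random.seed(0)
points = []
for _ in range(500):
    u, v = random.random(), random.random()
    # A 2-dimensional plane embedded linearly in 5 dimensions (assumed toy data).
    points.append((u, v, u + v, u - v, 2.0 * u))
d_hat = twonn_mle(points)
```

Although each observation has five coordinates, the points lie on a two-dimensional plane, so the estimate should land close to 2. Only distance ratios enter the estimator, which is why it depends on the local density only weakly.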
sensobol: An R Package to Compute Variance-Based Sensitivity Indices
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-01-22 DOI: 10.18637/jss.v102.i05
A. Puy, S. L. Piano, Andrea Saltelli, S. Levin
The R package "sensobol" provides several functions to conduct variance-based uncertainty and sensitivity analysis, from the estimation of sensitivity indices to the visual representation of the results. It implements several state-of-the-art first- and total-order estimators and allows the computation of up to third-order effects, as well as of the approximation error, in a swift and user-friendly way. Its flexibility also makes it appropriate for models with either a scalar or a multivariate output. We illustrate its functionality by conducting a variance-based sensitivity analysis of three classic models: the Sobol' (1998) G function, the logistic population growth model of Verhulst (1845), and the spruce budworm and forest model of Ludwig, Jones and Holling (1976).
Citations: 30
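As a rough illustration of the variance-based indices discussed in the sensobol abstract, the sketch below estimates first-order and total-order Sobol' indices for the G function with plain Monte Carlo sampling. It assumes the Saltelli et al. (2010) first-order estimator and the Jansen (1999) total-order estimator, independent uniform inputs, and hypothetical coefficients a = (0, 1, 9). The function names are illustrative and are not the sensobol API, which also offers quasi-random sampling designs.

```python
import random

def sobol_g(x, a):
    """Sobol' G function: product of (|4 x_i - 2| + a_i) / (1 + a_i)."""
    y = 1.0
    for xi, ai in zip(x, a):
        y *= (abs(4.0 * xi - 2.0) + ai) / (1.0 + ai)
    return y

def sobol_indices(model, k, n, rng):
    """First-order (Saltelli et al. 2010) and total-order (Jansen 1999) estimates."""
    A = [[rng.random() for _ in range(k)] for _ in range(n)]
    B = [[rng.random() for _ in range(k)] for _ in range(n)]
    yA = [model(row) for row in A]
    yB = [model(row) for row in B]
    mean = sum(yA + yB) / (2 * n)
    var = sum((y - mean) ** 2 for y in yA + yB) / (2 * n - 1)
    first, total = [], []
    for i in range(k):
        # AB_i: matrix A with column i replaced by column i of B
        yABi = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(A, B)]
        first.append(sum(b * (ab - a_) for a_, b, ab in zip(yA, yB, yABi)) / n / var)
        total.append(sum((a_ - ab) ** 2 for a_, ab in zip(yA, yABi)) / (2 * n) / var)
    return first, total

a = [0.0, 1.0, 9.0]  # smaller a_i means a more influential input
first, total = sobol_indices(lambda x: sobol_g(x, a), k=3, n=50000, rng=random.Random(42))
```

For the G function the partial variances are available in closed form, V_i = 1 / (3 (1 + a_i)^2), with total variance prod(1 + V_i) - 1. With the coefficients assumed here this gives a first-order index of about 0.74 for the first input, a convenient check on the Monte Carlo estimate.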
Nonparametric Machine Learning and Efficient Computation with Bayesian Additive Regression Trees: The BART R Package
IF 5.8 Zone 2 (Computer Science) Q1 Mathematics Pub Date: 2021-01-14 DOI: 10.18637/JSS.V097.I01
R. Sparapani, Charles Spanbauer, R. McCulloch
In this article, we introduce the BART R package, whose name is an acronym for Bayesian additive regression trees. BART is a Bayesian nonparametric, machine learning, ensemble predictive modeling method for continuous, binary, categorical and time-to-event outcomes. Furthermore, BART is a tree-based, black-box method which fits the outcome to an arbitrary random function, f, of the covariates. The BART technique is relatively computationally efficient as compared to its competitors, but large sample sizes can be demanding. Therefore, the BART package includes efficient state-of-the-art implementations for continuous, binary, categorical and time-to-event outcomes that can take advantage of modern off-the-shelf hardware and software multi-threading technology. The BART package is written in C++ for both programmer and execution efficiency. The BART package takes advantage of multi-threading via forking as provided by the parallel package and OpenMP when available and supported by the platform. The ensemble of binary trees produced by a BART fit can be stored and re-used later via the R predict function. In addition to being an R package, the installed BART routines can be called directly from C++. The BART package provides the tools for your BART toolbox.
Citations: 72
The R Package forestinventory: Design-Based Global and Small Area Estimations for Multiphase Forest Inventories
IF 5.8 Zone 2 Computer Science Q1 Mathematics Pub Date: 2021-01-14 DOI: 10.18637/JSS.V097.I04
Andreas Hill, Alexander Massey, D. Mandallaz
Forest inventories provide reliable evidence-based information to assess the state and development of forests over time. They typically consist of a random sample of plot locations in the forest that are assessed individually by field crews. Due to the high costs of these terrestrial campaigns, remote sensing information, which is available in high quantity and at low cost, is frequently incorporated in the estimation process in order to reduce inventory costs or improve estimation precision. With respect to this objective, the application of multiphase forest inventory methods (e.g., double- and triple-sampling regression estimators) has proved to be efficient. While these methods have been successfully applied in practice, open-source software implementing them has been rare if not non-existent. The R package forestinventory provides a comprehensive set of global and small area regression estimators for multiphase forest inventories under simple and cluster sampling. The implemented methods have been demonstrated in various scientific studies ranging from small to large scale forest inventories, and can be used for post-stratification, regression and regression within strata. This article gives an extensive review of the mathematical theory of this family of design-based estimators, puts them into a common framework of forest inventory scenarios and demonstrates their application in the R environment.
Citations: 5