
Computational Statistics: Latest Articles

Imbalanced data sampling design based on grid boundary domain for big data
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-03-08 | DOI: 10.1007/s00180-024-01471-8

Abstract

The data distribution is often associated with an a priori known probability, and the occurrence probability of the events of interest is small, so a large amount of imbalanced data appears in sociology, economics, engineering, and various other fields. The existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling ignores effective information. We propose a new sampling design algorithm called the neighbor grid of boundary mixed-sampling (NGBM), which focuses on the boundary information. This paper obtains the classification boundary information through grid boundary domain identification, thereby determining the importance of the samples. Based on this premise, the synthetic minority oversampling technique is applied to the boundary grid, and random under-sampling is applied to the other grids. With the help of this mixed sampling strategy, more important classification boundary information, especially for identifying positive samples, is extracted. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and to illustrate its advantages on imbalanced data, as well as in practical applications.
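A minimal sketch of the grid-based mixed-sampling idea, assuming a tabular feature matrix, axis-aligned equal-width grid cells, a cell counted as "boundary" when it contains both classes, plain duplication of minority points in boundary cells in place of SMOTE, and random under-sampling elsewhere. The function name, the cell counts, and the rates are illustrative, not the authors' NGBM settings.

```python
import numpy as np

def grid_boundary_mixed_sample(X, y, n_bins=10, minority_label=1,
                               over_factor=2, under_rate=0.3, rng=None):
    """Illustrative grid-boundary mixed sampling (not the exact NGBM algorithm).

    Boundary cells (containing both classes) keep all points and get their
    minority points duplicated; non-boundary cells are randomly under-sampled.
    """
    rng = np.random.default_rng(rng)
    # Assign each observation to a grid cell via equal-width binning per feature.
    edges = [np.linspace(X[:, j].min(), X[:, j].max(), n_bins + 1) for j in range(X.shape[1])]
    cell_ids = np.stack([np.clip(np.digitize(X[:, j], edges[j][1:-1]), 0, n_bins - 1)
                         for j in range(X.shape[1])], axis=1)
    keys = [tuple(row) for row in cell_ids]

    keep_idx = []
    for cell in set(keys):
        idx = np.array([i for i, k in enumerate(keys) if k == cell])
        labels = y[idx]
        if (labels == minority_label).any() and (labels != minority_label).any():
            # Boundary cell: keep everything and oversample the minority points.
            keep_idx.extend(idx.tolist())
            minority = idx[labels == minority_label]
            keep_idx.extend(np.repeat(minority, over_factor - 1).tolist())
        else:
            # Interior cell: random under-sampling.
            n_keep = max(1, int(np.ceil(under_rate * len(idx))))
            keep_idx.extend(rng.choice(idx, size=n_keep, replace=False).tolist())

    keep_idx = np.array(keep_idx)
    return X[keep_idx], y[keep_idx]
```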

Citations: 0
Sparse estimation of linear model via Bayesian method*
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-03-04 | DOI: 10.1007/s00180-024-01474-5

Abstract

This paper considers the sparse estimation problem of regression coefficients in the linear model. Since global–local shrinkage priors do not allow the regression coefficients to be estimated as exactly zero, we propose three threshold rules and compare their contraction properties, and we pair those rules with the popular horseshoe prior and the horseshoe+ prior, which are normally regarded as global–local shrinkage priors. The hierarchical prior expressions for the horseshoe prior and the horseshoe+ prior are obtained, and the full conditional posterior distributions of all parameters needed for algorithm implementation are also given. Simulation studies indicate that the horseshoe/horseshoe+ prior with the threshold rules are both superior to the spike-slab models. Finally, a real data analysis demonstrates the effectiveness of variable selection of the proposed method.
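A hedged illustration of turning continuous horseshoe-type shrinkage into exact zeros with a threshold rule. The rule shown (zero out a coefficient when its average posterior shrinkage factor exceeds 1/2) is one common convention in the horseshoe literature, not necessarily one of the three rules proposed in the paper, and the posterior draws below are simulated placeholders rather than real MCMC output.

```python
import numpy as np

def threshold_by_shrinkage(beta_draws, kappa_draws, cutoff=0.5):
    """Zero out coefficient j when its average posterior shrinkage factor
    kappa_j = 1 / (1 + tau^2 * lambda_j^2) exceeds `cutoff`.
    beta_draws, kappa_draws: arrays of shape (n_draws, p) from an MCMC run."""
    kappa_mean = kappa_draws.mean(axis=0)
    beta_hat = beta_draws.mean(axis=0)
    beta_hat[kappa_mean > cutoff] = 0.0   # heavily shrunk coefficients are declared noise
    return beta_hat, kappa_mean

# Toy usage with simulated draws standing in for real MCMC output.
rng = np.random.default_rng(0)
n_draws, p = 1000, 5
lam2 = rng.standard_cauchy((n_draws, p)) ** 2   # squared half-Cauchy local scales
tau2 = 0.1                                      # fixed global scale (placeholder)
kappa = 1.0 / (1.0 + tau2 * lam2)
beta = rng.normal([2.0, 0.0, -1.5, 0.0, 0.0], 0.1, size=(n_draws, p))
print(threshold_by_shrinkage(beta, kappa))
```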

Citations: 0
Degree selection methods for curve estimation via Bernstein polynomials
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-03-02 | DOI: 10.1007/s00180-024-01473-6

Abstract

Bernstein polynomial (BP) bases can uniformly approximate any continuous function based on observed noisy samples. However, a persistent challenge is the data-driven selection of a suitable degree for the BPs. In the absence of noise, asymptotic theory suggests that a larger degree leads to better approximation. In the presence of noise, however, a larger degree reduces bias but also results in larger variances due to high-dimensional parameter estimation. Thus, a balance in the classic bias-variance trade-off is essential. The main objective of this work is to determine the minimum possible degree of the approximating BPs using probabilistic methods that are robust to various shapes of an unknown continuous function. Beyond offering theoretical guidance, the paper includes numerical illustrations to address the issue of determining a suitable degree for BPs in approximating arbitrary continuous functions.
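A small sketch of fitting Bernstein-polynomial regressions of increasing degree to noisy samples on [0, 1] and picking the degree with an information-type criterion. The AIC-style rule used here is only a stand-in for the probabilistic degree-selection methods studied in the paper; the toy curve and sample sizes are illustrative.

```python
import numpy as np
from scipy.special import comb

def bernstein_basis(x, degree):
    """Matrix of Bernstein basis polynomials B_{k,degree}(x) for x in [0, 1]."""
    k = np.arange(degree + 1)
    return comb(degree, k) * x[:, None] ** k * (1 - x[:, None]) ** (degree - k)

def select_degree(x, y, max_degree=30):
    """Least-squares BP fit for each degree; return the degree minimizing AIC."""
    best = None
    for m in range(1, max_degree + 1):
        B = bernstein_basis(x, m)
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        rss = np.sum((y - B @ coef) ** 2)
        aic = len(y) * np.log(rss / len(y)) + 2 * (m + 1)
        if best is None or aic < best[0]:
            best = (aic, m, coef)
    return best[1], best[2]

# Toy data: noisy samples of a smooth curve on [0, 1].
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)
degree, coef = select_degree(x, y)
print("selected degree:", degree)
```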

Citations: 0
Automatic piecewise linear regression
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-03-01 | DOI: 10.1007/s00180-024-01475-4
Mathias von Ottenbreit, Riccardo De Bin

Regression modelling often presents a trade-off between predictiveness and interpretability. Highly predictive and popular tree-based algorithms such as Random Forest and boosted trees predict very well the outcome of new observations, but the effect of the predictors on the result is hard to interpret. Highly interpretable algorithms like linear effect-based boosting and MARS, on the other hand, are typically less predictive. Here we propose a novel regression algorithm, automatic piecewise linear regression (APLR), that combines the predictiveness of a boosting algorithm with the interpretability of a MARS model. In addition, as a boosting algorithm, it automatically handles variable selection, and, as a MARS-based approach, it takes into account non-linear relationships and possible interaction terms. We show on simulated and real data examples how APLR’s performance is comparable to that of the top-performing approaches in terms of prediction, while offering an easy way to interpret the results. APLR has been implemented in C++ and wrapped in a Python package as a Scikit-learn compatible estimator.
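The abstract states that APLR is implemented in C++ and wrapped as a scikit-learn-compatible Python estimator. The snippet below shows how such an estimator would typically be used; the package and class names (`aplr`, `APLRRegressor`) and the default constructor are assumptions following the scikit-learn convention, not a verified description of the actual distribution.

```python
# Hypothetical usage of a scikit-learn-compatible APLR estimator.
# Package/class names and parameters are assumptions, not verified API.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from aplr import APLRRegressor  # assumed import path

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] + np.maximum(X[:, 1], 0) + X[:, 2] * X[:, 3] + rng.normal(0, 0.3, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = APLRRegressor()            # defaults; boosting-related knobs omitted
model.fit(X_train, y_train)        # scikit-learn-style fit
pred = model.predict(X_test)
print("test MSE:", mean_squared_error(y_test, pred))
```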

Citations: 0
Variational Bayesian Lasso for spline regression
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-24 | DOI: 10.1007/s00180-024-01470-9
Larissa C. Alves, Ronaldo Dias, Helio S. Migon

This work presents a new scalable automatic Bayesian Lasso methodology with variational inference for non-parametric spline regression that can capture the non-linear relationship between a response variable and predictor variables. Under the non-parametric point of view, the regression curve is assumed to lie in an infinite-dimensional space. Regression splines use a finite approximation of this infinite space, representing the regression function by a linear combination of basis functions. The crucial point of the approach is determining the appropriate number of basis functions or, equivalently, the number of knots, avoiding over-fitting/under-fitting. A decision-theoretic approach was devised for knot selection. Comprehensive simulation studies were conducted in challenging scenarios to compare alternative criteria for knot selection, thereby ensuring the efficacy of the proposed algorithms. Additionally, the performance of the proposed method was assessed using real-world datasets. The novel procedure demonstrated good performance in capturing the underlying data structure by selecting the appropriate number of knots/bases.
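A rough sketch of the knot-selection problem the paper addresses: fit cubic regression splines with increasing numbers of interior knots and compare the candidate fits with a simple criterion. The truncated-power basis and the GCV-style score below are illustrative stand-ins for the variational-Bayes, decision-theoretic selection developed in the paper.

```python
import numpy as np

def spline_design(x, knots):
    """Cubic truncated-power-basis design matrix: 1, x, x^2, x^3, (x - knot)_+^3."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def gcv_score(x, y, knots):
    """Ordinary least-squares fit and a GCV-style score (illustrative criterion)."""
    B = spline_design(x, knots)
    H = B @ np.linalg.pinv(B.T @ B) @ B.T          # hat matrix
    resid = y - H @ y
    n, df = len(y), np.trace(H)
    return np.sum(resid**2) / n / (1 - df / n) ** 2

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 300))
y = np.sin(6 * x) + 0.4 * x**2 + rng.normal(0, 0.2, x.size)

# Compare candidate numbers of equally spaced interior knots.
for n_knots in (2, 5, 10, 20, 40):
    knots = np.linspace(0, 1, n_knots + 2)[1:-1]
    print(n_knots, "knots -> GCV", round(gcv_score(x, y, knots), 4))
```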

Citations: 0
Bayesian estimation of the number of species from Poisson-Lindley stochastic abundance model using non-informative priors
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01464-7
Anurag Pathak, Manoj Kumar, Sanjay Kumar Singh, Umesh Singh, Sandeep Kumar

In this article, we propose a Poisson-Lindley distribution as a stochastic abundance model in which sampling follows an independent Poisson process. Jeffery's and Bernardo's reference priors are obtained, and Bayes estimators of the number of species are proposed for this model. The proposed Bayes estimators are compared with the corresponding profile and conditional maximum likelihood estimators in terms of the square root of their risks under the squared error loss function (SELF). Jeffery's and Bernardo's reference priors are considered and compared with the Bayesian approach based on biological data.
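A small numerical sketch around the Poisson-Lindley abundance model: the likelihood for the total number of species N when only species with positive counts are observed, and a crude profile-likelihood grid search. The pmf P(X = x) = θ²(x + θ + 2)/(θ + 1)^(x+3) is the standard Sankaran form; the grid-search estimator and the toy counts below are only an illustration, not the Bayesian reference-prior procedure of the paper.

```python
import numpy as np
from scipy.special import gammaln

def pl_logpmf(x, theta):
    """Poisson-Lindley log-pmf: log[theta^2 (x + theta + 2) / (theta + 1)^(x + 3)]."""
    x = np.asarray(x, dtype=float)
    return 2 * np.log(theta) + np.log(x + theta + 2) - (x + 3) * np.log(theta + 1)

def log_lik(N, theta, counts):
    """Log-likelihood for N species when len(counts) are observed with the given
    abundances; each unobserved species contributes the zero-count probability."""
    n = len(counts)
    if N < n:
        return -np.inf
    log_p0 = pl_logpmf(0, theta)
    log_binom = gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)
    return log_binom + (N - n) * log_p0 + pl_logpmf(counts, theta).sum()

# Toy observed abundances (positive counts only) and a crude grid search.
counts = np.array([1, 1, 1, 2, 2, 3, 5, 8, 1, 1, 2, 4])
grid_N = np.arange(len(counts), 200)
grid_theta = np.linspace(0.1, 5.0, 200)
ll = np.array([[log_lik(N, t, counts) for t in grid_theta] for N in grid_N])
iN, it = np.unravel_index(np.argmax(ll), ll.shape)
print("profile ML estimate of N:", grid_N[iN], " theta:", round(grid_theta[it], 3))
```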

Citations: 0
Generation of normal distributions revisited
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-23 | DOI: 10.1007/s00180-024-01468-3
Takayuki Umeda

Normally distributed random numbers are commonly used in scientific computing in various fields. It is important to generate a set of random numbers as close to a normal distribution as possible for reducing initial fluctuations. Two types of samples from a uniform distribution are examined as source samples for inverse transform sampling methods. Three types of inverse transform sampling methods with new approximations of inverse cumulative distribution functions are also discussed for converting uniformly distributed source samples to normally distributed samples.
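A small illustration of the two ingredients the abstract mentions: uniform source samples (pseudo-random draws versus equally spaced midpoints) pushed through an inverse normal CDF. Here `scipy.stats.norm.ppf` is used as a stand-in for the paper's own inverse-CDF approximations, and the sample size is arbitrary.

```python
import numpy as np
from scipy.stats import norm

N = 10_000
rng = np.random.default_rng(4)

# Source samples from a uniform distribution: pseudo-random vs. equally spaced.
u_random = rng.random(N)
u_strata = (np.arange(N) + 0.5) / N          # midpoints of N equal strata on (0, 1)

# Inverse transform sampling: apply the inverse normal CDF (probit).
z_random = norm.ppf(u_random)
z_strata = norm.ppf(u_strata)

# The equally spaced source reproduces the target moments with far less fluctuation.
for name, z in (("random", z_random), ("equally spaced", z_strata)):
    print(f"{name:15s} mean {z.mean():+.4f}  std {z.std():.4f}")
```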

Citations: 0
Bayesian regression models in gretl: the BayTool package
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-21 | DOI: 10.1007/s00180-024-01466-5
Luca Pedini

This article presents the gretl package BayTool which integrates the software functionalities, mostly concerned with frequentist approaches, with Bayesian estimation methods of commonly used econometric models. Computational efficiency is achieved by pairing an extensive use of Gibbs sampling for posterior simulation with the possibility of splitting single-threaded experiments into multiple cores or machines by means of parallelization. From the user’s perspective, the package requires only basic knowledge of gretl scripting to fully access its functionality, while providing a point-and-click solution in the form of a graphical interface for a less experienced audience. These features, in particular, make BayTool stand out as an excellent teaching device without sacrificing more advanced or complex applications.
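BayTool itself is a gretl (hansl) package, and its internal API is not reproduced here. As a generic illustration of the Gibbs-sampling machinery the abstract refers to, the Python sketch below alternates draws from the conditional posteriors of a normal linear model (normal prior on the coefficients, inverse-gamma prior on the error variance). All prior settings and the toy data are placeholders.

```python
import numpy as np

def gibbs_linear_regression(X, y, n_iter=2000, b0=None, B0_inv=None, a0=2.0, d0=2.0, seed=0):
    """Gibbs sampler for y = X beta + e, e ~ N(0, sigma^2 I),
    with beta ~ N(b0, B0) and sigma^2 ~ Inv-Gamma(a0, d0). Placeholder priors."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b0 = np.zeros(p) if b0 is None else b0
    B0_inv = np.eye(p) * 1e-2 if B0_inv is None else B0_inv   # vague prior precision
    sigma2, draws = 1.0, []
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # beta | sigma^2, y ~ Normal(mean, cov)
        cov = np.linalg.inv(B0_inv + XtX / sigma2)
        mean = cov @ (B0_inv @ b0 + Xty / sigma2)
        beta = rng.multivariate_normal(mean, cov)
        # sigma^2 | beta, y ~ Inv-Gamma(a0 + n/2, d0 + RSS/2)
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (d0 + rss / 2))
        draws.append(np.append(beta, sigma2))
    return np.array(draws)

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.7, 200)
post = gibbs_linear_regression(X, y)
print("posterior means:", post[500:].mean(axis=0).round(3))   # discard burn-in
```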

Citations: 0
Bayesian sequential probability ratio test for vaccine efficacy trials
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-20 | DOI: 10.1007/s00180-024-01458-5
Erina Paul, Santosh Sutradhar, Jonathan Hartzel, Devan V. Mehrotra

Designing vaccine efficacy (VE) trials often requires recruiting large numbers of participants when the diseases of interest have a low incidence. When developing novel vaccines, such as for COVID-19 disease, the plausible range of VE is quite large at the design stage. Thus, the number of events needed to demonstrate efficacy above a pre-defined regulatory threshold can be difficult to predict and the time needed to accrue the necessary events can often be long. Therefore, it is advantageous to evaluate the efficacy at earlier interim analysis in the trial to potentially allow the trials to stop early for overwhelming VE or futility. In such cases, incorporating interim analyses through the use of the sequential probability ratio test (SPRT) can be helpful to allow for multiple analyses while controlling for both type-I and type-II errors. In this article, we propose a Bayesian SPRT for designing a vaccine trial for comparing a test vaccine with a control assuming two Poisson incidence rates. We provide guidance on how to choose the prior distribution and how to optimize the number of events for interim analyses to maximize the efficiency of the design. Through simulations, we demonstrate how the proposed Bayesian SPRT performs better when compared with the corresponding frequentist SPRT. An R repository to implement the proposed method is placed at: https://github.com/Merck/bayesiansprt.
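A hedged sketch of the kind of interim monitoring the abstract describes. With 1:1 randomization and equal follow-up, the split of cases between the vaccine and placebo arms is binomial with success probability p = (1 - VE)/(2 - VE); placing a Beta prior on p gives a conjugate posterior that can be checked at each interim look. The prior, the efficacy threshold, and the stopping cutoffs below are illustrative placeholders, not the paper's SPRT boundaries, and the case splits are hypothetical.

```python
import numpy as np
from scipy.stats import beta

def ve_to_p(ve):
    """With 1:1 randomization, vaccine-arm cases are Binomial(total, p),
    where p = (1 - VE) / (2 - VE)."""
    return (1 - ve) / (2 - ve)

def interim_decision(vaccine_cases, placebo_cases, ve_null=0.3,
                     a=1.0, b=1.0, success_prob=0.99, futility_prob=0.05):
    """Posterior probability that VE exceeds `ve_null` under a Beta(a, b) prior on p,
    with simple success/futility cutoffs. All thresholds are illustrative."""
    post = beta(a + vaccine_cases, b + placebo_cases)
    prob_ve_above_null = post.cdf(ve_to_p(ve_null))   # VE > ve_null  <=>  p < p(ve_null)
    if prob_ve_above_null > success_prob:
        return prob_ve_above_null, "stop for efficacy"
    if prob_ve_above_null < futility_prob:
        return prob_ve_above_null, "stop for futility"
    return prob_ve_above_null, "continue"

# Interim looks after 30, 60, and 100 accrued cases (hypothetical splits).
for v, p in [(8, 22), (13, 47), (18, 82)]:
    prob, decision = interim_decision(v, p)
    print(f"vaccine {v:3d} / placebo {p:3d}: P(VE > 30%) = {prob:.4f} -> {decision}")
```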

Citations: 0
Overlapping coefficient in network-based semi-supervised clustering
IF 1.3 | Mathematics (Tier 4) | Q3 STATISTICS & PROBABILITY | Pub Date: 2024-02-19 | DOI: 10.1007/s00180-024-01457-6
Claudio Conversano, Luca Frigau, Giulia Contu

Network-based Semi-Supervised Clustering (NeSSC) is a semi-supervised approach for clustering in the presence of an outcome variable. It uses a classification or regression model on resampled versions of the original data to produce a proximity matrix that indicates the magnitude of the similarity between pairs of observations measured with respect to the outcome. This matrix is transformed into a complex network on which a community detection algorithm is applied to search for underlying community structures, i.e., a partition of the instances into highly homogeneous clusters to be evaluated in terms of the outcome. In this paper, we focus on the case where the outcome variable used in NeSSC is numeric and propose an alternative selection criterion for the optimal partition based on a measure of overlapping between density curves, as well as a penalization criterion which takes into account the number of clusters in a candidate partition. Next, we consider the performance of the proposed method on some artificial datasets and on 20 different real datasets, and we compare NeSSC with three other popular methods of semi-supervised clustering with a numeric outcome. Results show that NeSSC with the overlapping criterion works particularly well when a reduced number of clusters are scattered and localized.
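A small sketch of an overlapping coefficient between two density curves, the quantity the proposed selection criterion builds on: estimate each cluster's outcome density with a Gaussian KDE and numerically integrate the pointwise minimum, so that lower overlap indicates better separation of the clusters with respect to the outcome. The bandwidth, grid, and toy samples are illustrative, not the paper's exact construction.

```python
import numpy as np
from scipy.stats import gaussian_kde

def overlap_coefficient(sample_a, sample_b, grid_size=512):
    """Overlapping coefficient of two density curves:
    integral of min(f_a(x), f_b(x)) dx, estimated with Gaussian KDEs."""
    lo = min(sample_a.min(), sample_b.min())
    hi = max(sample_a.max(), sample_b.max())
    grid = np.linspace(lo, hi, grid_size)
    f_a, f_b = gaussian_kde(sample_a)(grid), gaussian_kde(sample_b)(grid)
    return np.trapz(np.minimum(f_a, f_b), grid)

# Outcome values of two candidate clusters: well separated vs. heavily overlapping.
rng = np.random.default_rng(6)
well_separated = overlap_coefficient(rng.normal(0, 1, 300), rng.normal(4, 1, 300))
overlapping = overlap_coefficient(rng.normal(0, 1, 300), rng.normal(0.5, 1, 300))
print(f"separated clusters: {well_separated:.3f}, overlapping clusters: {overlapping:.3f}")
```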

Citations: 0