
Computational Statistics & Data Analysis: latest articles

Bayesian taut splines for estimating the number of modes
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-04-15, DOI: 10.1016/j.csda.2024.107961
José E. Chacón, Javier Fernández Serrano

The number of modes in a probability density function is representative of the complexity of a model and can also be viewed as the number of subpopulations. Despite its relevance, there has been limited research in this area. A novel approach to estimating the number of modes in the univariate setting is presented, focusing on prediction accuracy and inspired by some overlooked aspects of the problem: the need for structure in the solutions, the subjective and uncertain nature of modes, and the convenience of a holistic view that blends local and global density properties. The technique combines flexible kernel estimators and parsimonious compositional splines in the Bayesian inference paradigm, providing soft solutions and incorporating expert judgment. The procedure includes feature exploration, model selection, and mode testing, illustrated in a sports analytics case study showcasing multiple companion visualisation tools. A thorough simulation study also demonstrates that traditional modality-driven approaches paradoxically struggle to provide accurate results. In this context, the new method emerges as a top-tier alternative, offering innovative solutions for analysts.
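
To make the quantity being estimated concrete, the sketch below counts the local maxima of a plain Gaussian kernel density estimate on a grid. It is only a naive plug-in baseline and not the Bayesian taut-spline procedure of the paper; the function names, the grid size, and the toy sample are assumptions made for illustration.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

def count_modes(data, bandwidth, grid_size=512):
    """Count local maxima of the KDE on an evenly spaced grid (naive plug-in count)."""
    lo = min(data) - 3.0 * bandwidth
    hi = max(data) + 3.0 * bandwidth
    step = (hi - lo) / (grid_size - 1)
    f = gaussian_kde(data, bandwidth)
    values = [f(lo + i * step) for i in range(grid_size)]
    # A grid point is a mode if the density rises into it and does not rise after it.
    return sum(
        1
        for i in range(1, grid_size - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    )

# Hypothetical toy sample with two well-separated groups.
sample = [-2.0, -1.8, -1.5, 2.9, 3.0, 3.4, 3.5]
modes_small_bandwidth = count_modes(sample, bandwidth=0.5)
modes_large_bandwidth = count_modes(sample, bandwidth=5.0)
```

On this toy sample the small bandwidth separates the two groups and the large bandwidth merges them into one, which illustrates how strongly a plug-in mode count depends on the amount of smoothing.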

Citations: 0
Bayesian imaging inverse problem with SA-Roundtrip prior via HMC-pCN sampler
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-04-10, DOI: 10.1016/j.csda.2024.107930
Jiayu Qian, Yuanyuan Liu, Jingya Yang, Qingping Zhou

Bayesian inference with deep generative priors has received considerable interest for solving imaging inverse problems in many scientific and engineering fields. The prior distribution is learned from available prior measurements, so its selection is an important representation-learning task. The SA-Roundtrip, a novel deep generative prior, is introduced to enable controlled sampling generation and identify the data's intrinsic dimension. This prior incorporates a self-attention structure within a bidirectional generative adversarial network. Subsequently, Bayesian inference is applied to the posterior distribution in the low-dimensional latent space using the Hamiltonian Monte Carlo with preconditioned Crank-Nicolson (HMC-pCN) algorithm, which is proven to be ergodic under specific conditions. Experiments conducted on computed tomography (CT) reconstruction with the MNIST and TomoPhantom datasets reveal that the proposed method outperforms state-of-the-art comparison methods, consistently yielding a robust and superior point estimator along with precise uncertainty quantification.
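
As background for the sampler named above, here is a minimal sketch of the plain preconditioned Crank-Nicolson (pCN) random-walk step for a standard Gaussian prior on the latent vector. It is not the HMC-pCN algorithm of the paper, which adds Hamiltonian dynamics; the toy likelihood, the step size `beta`, and all function names are assumptions for illustration.

```python
import math
import random

def pcn_step(u, neg_log_lik, beta, rng):
    """One pCN move for a standard Gaussian prior N(0, I) on the latent vector u."""
    xi = [rng.gauss(0.0, 1.0) for _ in u]
    shrink = math.sqrt(1.0 - beta * beta)
    proposal = [shrink * ui + beta * x for ui, x in zip(u, xi)]
    # The proposal is reversible with respect to the prior,
    # so only the likelihood enters the acceptance ratio.
    log_alpha = min(0.0, neg_log_lik(u) - neg_log_lik(proposal))
    if rng.random() < math.exp(log_alpha):
        return proposal, True
    return u, False

def run_chain(neg_log_lik, dim, n_steps, beta=0.5, seed=0):
    """Run a pCN chain from the origin and return all visited states."""
    rng = random.Random(seed)
    u = [0.0] * dim
    samples = []
    for _ in range(n_steps):
        u, _accepted = pcn_step(u, neg_log_lik, beta, rng)
        samples.append(list(u))
    return samples

# Hypothetical toy problem: observe y = u + noise with noise ~ N(0, 1) and y = 1.
def toy_neg_log_lik(u):
    return 0.5 * (1.0 - u[0]) ** 2

samples = run_chain(toy_neg_log_lik, dim=1, n_steps=40000)
posterior_mean = sum(s[0] for s in samples) / len(samples)
```

For this conjugate toy problem the exact posterior is N(0.5, 0.5), so the chain average can be checked against 0.5.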

Citations: 0
Sufficient dimension reduction for a novel class of zero-inflated graphical models
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-04-08, DOI: 10.1016/j.csda.2024.107959
Eric Koplin, Liliana Forzani, Diego Tomassi, Ruth M. Pfeiffer

Graphical models allow modeling of complex dependencies among components of a random vector. In many applications of graphical models, however, such as microbiome data, the data have an excess number of zero values. New pairwise graphical models with distributions in an exponential family are presented that accommodate excess numbers of zeros in the random vector components. First, these multivariate distributions are characterized in terms of univariate conditional distributions. Then predictors that arise from such a pairwise graphical model with excess zeros are modeled as functions of an outcome, and the corresponding first-order sufficient dimension reduction (SDR) is derived. That is, linear combinations of the predictors that contain all the information for the regression of the outcome as a function of the predictors are obtained. To incorporate variable selection, the SDR is estimated using a pseudo-likelihood with a hierarchical penalty that prioritizes sparse interactions only for variables associated with the outcome. These methods yield consistent estimators of the reduction and can be applied to continuous or categorical outcomes. The new methods are then illustrated by studying normal, Poisson and truncated Poisson graphical models with excess zeros in simulations and by analyzing microbiome data from the American Gut Project. The models provided robust variable selection, and the predictive performance of the Poisson zero-inflated pairwise graphical model was equal to or better than that of other available methods for the analysis of microbiome data.
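
To illustrate what "excess zeros" means for a single count variable, the sketch below evaluates the probability mass function of a univariate zero-inflated Poisson distribution. This is the textbook univariate construction, not the pairwise graphical model of the paper; the function name and parameter names are assumptions.

```python
import math

def zip_pmf(k, lam, pi_zero):
    """P(X = k) for a zero-inflated Poisson with extra point mass pi_zero at zero."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from the point mass or from the Poisson component.
        return pi_zero + (1.0 - pi_zero) * poisson
    return (1.0 - pi_zero) * poisson

# Hypothetical parameters: Poisson rate 2 with 30% structural zeros.
prob_zero = zip_pmf(0, lam=2.0, pi_zero=0.3)
total_mass = sum(zip_pmf(k, lam=2.0, pi_zero=0.3) for k in range(60))
```

With these parameters the probability of a zero is well above the plain Poisson value exp(-2), which is the feature the zero-inflated models are built to capture.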

Citations: 0
A mixture of logistic skew-normal multinomial models
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-04-03, DOI: 10.1016/j.csda.2024.107946
Wangshu Tu, Ryan Browne, Sanjeena Subedi

The logistic normal multinomial distribution is gaining interest in modelling microbiome data. It utilizes a hierarchical structure such that the observed counts conditional on the compositions are assumed to be multinomial random variables and the log-ratio transformed compositions are assumed to be from a Gaussian distribution. While the multinomial distribution accounts for the compositional nature of the data, and a Gaussian prior offers flexibility in the structure of covariance matrices, the log-ratio transformed compositions of the microbiome data can be highly skewed, especially at a lower taxonomic level. Thus, a Gaussian distribution may not be an ideal prior for the log-ratio transformed compositions. A novel mixture of logistic skew-normal multinomial (LSNM) distributions is proposed in which a multivariate skew-normal distribution is utilized as a prior for the log-ratio transformed compositions. A variational Gaussian approximation in conjunction with the EM algorithm is utilized for parameter estimation.
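
The hierarchical structure described above can be sketched as a two-stage sampler for the baseline logistic normal multinomial model: Gaussian log-ratios first, multinomial counts second. The sketch assumes the additive log-ratio transformation with the last category as reference and, for brevity, a diagonal covariance; it does not implement the skew-normal prior or the mixture proposed in the paper, and all names are illustrative.

```python
import math
import random

def inverse_alr(y):
    """Map K real log-ratios to a (K+1)-part composition; the last part is the reference."""
    expo = [math.exp(v) for v in y]
    total = 1.0 + sum(expo)
    return [e / total for e in expo] + [1.0 / total]

def sample_lnm(mu, sigma, n_trials, rng):
    """Draw one observation: Gaussian log-ratios, then multinomial counts."""
    # Assumption for brevity: independent coordinates (diagonal covariance).
    y = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    composition = inverse_alr(y)
    counts = [0] * len(composition)
    categories = range(len(composition))
    for category in rng.choices(categories, weights=composition, k=n_trials):
        counts[category] += 1
    return composition, counts

# Hypothetical example with three taxa and 100 sequencing reads.
rng = random.Random(1)
composition, counts = sample_lnm(mu=[0.0, 1.0], sigma=[0.5, 0.5], n_trials=100, rng=rng)
```

Replacing the Gaussian draw with a skew-normal draw is where the model in the abstract departs from this baseline.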

Citations: 0
Semiparametric accelerated failure time models under unspecified random effect distributions
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-27, DOI: 10.1016/j.csda.2024.107958
Byungtae Seo, Il Do Ha

Accelerated failure time (AFT) models with random effects, a useful alternative to frailty models, have been widely used for analyzing clustered (or correlated) time-to-event data. In the AFT model, the distribution of the unobserved random effect is conventionally assumed to be parametric, often modeled as a normal distribution. Although it is known that a misspecified random-effect distribution has little effect on regression parameter estimates, in some cases the impact caused by such misspecification is not negligible. Particularly when our focus extends to quantities associated with random effects, the problem could become worse. In this paper, we propose a semi-parametric maximum likelihood approach in which the random-effect distribution under the AFT models is left unspecified. We provide a feasible algorithm to estimate the random-effect distribution as well as model parameters. Through comprehensive simulation studies, our results demonstrate the effectiveness of this proposed method across a range of random-effect distribution types (discrete or continuous) and under conditions of heavy censoring. The efficacy of the approach is further illustrated through simulation studies and real-world data examples.
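
For reference, a random-effect AFT model for clustered data is commonly written as below. The notation (cluster index i, within-cluster index j, random effect v_i with distribution G, censoring time C_ij) is an assumption chosen for illustration and is not taken from the paper.

```latex
\log T_{ij} = x_{ij}^{\top}\beta + v_i + \varepsilon_{ij},
\qquad i = 1,\dots,q,\quad j = 1,\dots,n_i,
\qquad v_i \sim G,
\\[4pt]
y_{ij} = \min(T_{ij}, C_{ij}),
\qquad \delta_{ij} = \mathbf{1}\{T_{ij} \le C_{ij}\}.
```

The conventional parametric choice is G = N(0, \sigma_v^2); the approach in the abstract leaves G unspecified and estimates it together with the regression parameters.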

Citations: 0
Variational Bayesian approach for analyzing interval-censored data under the proportional hazards model
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-26, DOI: 10.1016/j.csda.2024.107957
Wenting Liu, Huiqiong Li, Niansheng Tang, Jun Lyu

Interval-censored failure time data frequently occur in medical follow-up studies, among others, and include right-censored data as a special case. Their analysis is much more difficult than the analysis of right-censored data because of their much more complicated structure and the lack of a partial likelihood. This article presents a variational Bayesian (VB) approach for analyzing such data under a proportional hazards model. The VB approach obtains a direct approximation of the posterior density. Compared to the Markov chain Monte Carlo (MCMC)-based sampling approaches, the VB approach achieves enhanced computational efficiency without sacrificing estimation accuracy. An extensive simulation study is conducted to compare the performance of the proposed methods with two main Bayesian methods currently available in the literature and the classic proportional hazards model; it indicates that the proposed methods work well in practical situations.
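
The reason right-censored data are a special case can be seen from the standard likelihood for interval-censored data under a proportional hazards model. The notation below (observation interval (L_i, R_i], baseline cumulative hazard \Lambda_0) is an assumption chosen for illustration, not the paper's own.

```latex
S(t \mid x) = \exp\!\bigl\{-\Lambda_0(t)\, e^{x^{\top}\beta}\bigr\},
\qquad
L(\beta, \Lambda_0) = \prod_{i=1}^{n}
\bigl[\, S(L_i \mid x_i) - S(R_i \mid x_i) \,\bigr].
```

Setting R_i = \infty gives S(R_i \mid x_i) = 0, so the contribution reduces to S(L_i \mid x_i), which is the usual right-censored term. Because \Lambda_0 does not cancel out of this product, no partial likelihood is available.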

Citations: 0
Variable selection in Bayesian multiple instance regression using shotgun stochastic search
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-24, DOI: 10.1016/j.csda.2024.107954
Seongoh Park, Joungyoun Kim, Xinlei Wang, Johan Lim

In multiple instance learning (MIL), a bag represents a sample that has a set of instances, each of which is described by a vector of explanatory variables, but the entire bag only has one label/response. Though many methods for MIL have been developed to date, few have paid attention to interpretability of models and results. The proposed Bayesian regression model stands on two levels of hierarchy, which transparently show how explanatory variables explain and instances contribute to bag responses. Moreover, two selection problems are simultaneously addressed: instance selection, which identifies the instances in each bag responsible for the bag response, and variable selection, which searches for the important covariates. To explore a joint discrete space of indicator variables created for selection of both explanatory variables and instances, the shotgun stochastic search algorithm is modified to fit in the MIL context. Also, the proposed model offers a natural and rigorous way to quantify uncertainty in coefficient estimation and outcome prediction, which many modern MIL applications call for. The simulation study shows the proposed regression model can select variables and instances with high performance (AUC greater than 0.86), thus predicting responses well. The proposed method is applied to the musk data for prediction of binding strengths (labels) between molecules (bags) with different conformations (instances) and target receptors. It outperforms all existing methods, and can identify variables relevant in modeling responses.

Citations: 0
Multi-task learning regression via convex clustering
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-24, DOI: 10.1016/j.csda.2024.107956
Akira Okazaki, Shuichi Kawano

Multi-task learning (MTL) is a methodology that aims to improve the general performance of estimation and prediction by sharing common information among related tasks. In MTL, several assumptions about the relationships among tasks, and methods to incorporate them, have been proposed. One natural assumption in practice is that tasks fall into clusters according to their characteristics. Under this assumption, the group fused regularization approach clusters the tasks by shrinking the differences among them. This enables the transfer of common information within the same cluster. However, this approach also transfers information between different clusters, which worsens the estimation and prediction. To overcome this problem, an MTL method is proposed with a centroid parameter representing a cluster center of the task. Because this model separates the parameters for regression from the parameters for clustering, estimation and prediction accuracy for regression coefficient vectors are improved. The effectiveness of the proposed method is shown through Monte Carlo simulations and applications to real data.
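
To make the two formulations easier to compare, the group fused regularization problem for T regression tasks can be written in the generic form below. The symbols (task-wise data (X_m, y_m) with n_m observations, coefficient vectors w_m, pairwise weights r_{m,m'}, tuning parameter \lambda) are assumptions for illustration.

```latex
\min_{w_1,\dots,w_T}\;
\sum_{m=1}^{T} \frac{1}{2 n_m} \lVert y_m - X_m w_m \rVert_2^2
\;+\; \lambda \sum_{m < m'} r_{m,m'} \lVert w_m - w_{m'} \rVert_2 .
```

One plausible way to separate regression and clustering parameters, stated here as an assumption rather than as the paper's exact objective, is to attach a centroid u_m to each task and apply the fusion penalty to the centroids only:

```latex
\min_{\{w_m\},\{u_m\}}\;
\sum_{m=1}^{T} \frac{1}{2 n_m} \lVert y_m - X_m w_m \rVert_2^2
\;+\; \frac{\lambda_1}{2} \sum_{m=1}^{T} \lVert w_m - u_m \rVert_2^2
\;+\; \lambda_2 \sum_{m < m'} r_{m,m'} \lVert u_m - u_{m'} \rVert_2 .
```

In a form like this, the coefficients w_m are pulled toward their own cluster center instead of directly toward every other task.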

Citations: 0
A new algorithm for inference in HMM's with lower span complexity
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-20, DOI: 10.1016/j.csda.2024.107955
Diogo Pereira, Cláudia Nunes, Rui Rodrigues

The maximum likelihood problem for Hidden Markov Models is usually numerically solved by the Baum-Welch algorithm, which uses the Expectation-Maximization algorithm to find the estimates of the parameters. This algorithm has a recursion depth equal to the data sample size and cannot be computed in parallel, which limits the use of modern GPUs to speed up computation time. A new algorithm is proposed that provides the same estimates as the Baum-Welch algorithm, requiring about the same number of iterations, but is designed in such a way that it can be parallelized. As a consequence, it leads to a significant reduction in the computation time. This reduction is illustrated by means of numerical examples, where we consider simulated data as well as real datasets.
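
The recursion-depth issue can be illustrated on the forward pass alone. The sketch below computes the HMM likelihood twice: once with the usual sequential forward recursion, whose depth equals the number of observations, and once as a product of per-observation matrices combined pairwise in a tree, whose depth grows only logarithmically because matrix multiplication is associative. This is a generic illustration of why parallelization is possible, not the algorithm proposed in the paper; the function names and the toy model are assumptions, and no rescaling is done, so long sequences would underflow.

```python
def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def step_matrix(trans, emit, obs):
    """M[i][j] = P(next state j | state i) * P(obs | state j)."""
    n = len(trans)
    return [[trans[i][j] * emit[j][obs] for j in range(n)] for i in range(n)]

def likelihood_sequential(init, trans, emit, observations):
    """Classic forward recursion: one dependent step per observation."""
    n = len(init)
    alpha = [init[i] * emit[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs]
            for j in range(n)
        ]
    return sum(alpha)

def tree_product(mats):
    """Ordered pairwise reduction; every level could be evaluated in parallel."""
    while len(mats) > 1:
        reduced = [mat_mul(mats[i], mats[i + 1]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2 == 1:
            reduced.append(mats[-1])  # odd matrix out keeps its position at the end
        mats = reduced
    return mats[0]

def likelihood_tree(init, trans, emit, observations):
    """Same likelihood, computed from a tree-shaped matrix product."""
    first = [[init[i] * emit[i][observations[0]] for i in range(len(init))]]
    mats = [step_matrix(trans, emit, obs) for obs in observations[1:]]
    if not mats:
        return sum(first[0])
    return sum(mat_mul(first, tree_product(mats))[0])

# Hypothetical two-state model with binary observations.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0, 0, 1, 1, 0]
seq_value = likelihood_sequential(init, trans, emit, obs)
tree_value = likelihood_tree(init, trans, emit, obs)
```

Both functions return the same likelihood up to floating-point rounding; only the dependency structure of the computation differs.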

Citations: 0
A semiparametric model for the cause-specific hazard under risk proportionality
IF 1.8, Zone 3 (Mathematics), Q1 Mathematics, Pub Date: 2024-03-19, DOI: 10.1016/j.csda.2024.107953
Simon M.S. Lo, Ralf A. Wilke, Takeshi Emura

Semiparametric Cox proportional hazards models enjoy great popularity in empirical survival analysis. A semiparametric model for cause-specific hazards under a proportionality restriction across risks is considered, which has desired practical properties such as estimation by partial likelihood and an analytical solution for the copula-graphic estimator. The cause-specific and marginal hazards are shown to share functional form restrictions in this case. The model for the cause-specific hazard can be used for inference about parametric restrictions on the marginal hazard without the risk of misspecifying the latter and without knowing the risk dependence. After the class of parametric marginal hazards has been determined, it can be estimated in conjunction with the degree of risk dependence. Finite sample properties are investigated with simulations. An application to employment duration demonstrates the practicality of the approach.
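
For readers unfamiliar with "estimation by partial likelihood", the following Python sketch evaluates the textbook Cox partial log-likelihood for a single cause, treating failures from other causes as censored. It is an illustration of the general technique only, not the model proposed in the paper: the proportionality restriction across risks and the copula-graphic estimator are not implemented. The function name, the scalar covariate, and the toy data are assumptions made for the example, and tied event times are assumed absent.

```python
import math


def cause_specific_partial_loglik(beta, times, causes, x, cause):
    """Cox partial log-likelihood for one cause with a single covariate.

    times  : observed durations
    causes : failure cause per subject (0 is used here to mean censored)
    x      : scalar covariate per subject
    cause  : the cause of interest; all other causes are treated as censoring
    Assumes no tied event times.
    """
    total = 0.0
    for i, (t_i, c_i) in enumerate(zip(times, causes)):
        if c_i != cause:
            continue  # only failures from the cause of interest contribute
        # Risk set: subjects still under observation just before t_i.
        risk = sum(math.exp(beta * x[k]) for k, t_k in enumerate(times) if t_k >= t_i)
        total += beta * x[i] - math.log(risk)
    return total


# Hypothetical toy data: four subjects, two competing causes, one censored.
times = [1.0, 2.0, 3.0, 4.0]
causes = [1, 2, 1, 0]
x = [0.5, -1.0, 0.0, 1.5]

print(cause_specific_partial_loglik(0.0, times, causes, x, cause=1))
print(cause_specific_partial_loglik(0.3, times, causes, x, cause=1))
```

At beta = 0 every subject has the same weight, so each event contributes minus the log of its risk-set size, which is a quick sanity check on the implementation.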

Citations: 0