
Computational Statistics & Data Analysis: Latest Articles

Bayesian optimization sequential surrogate (BOSS) algorithm: Fast Bayesian inference for a broad class of Bayesian hierarchical models
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-23 | DOI: 10.1016/j.csda.2025.108253
Dayi Li, Ziang Zhang
Approximate Bayesian inference based on Laplace approximation and quadrature has become increasingly popular for its efficiency in fitting latent Gaussian models (LGM). However, many useful models can only be fitted as LGMs if some conditioning parameters are fixed. Such models are termed conditional LGMs, with examples including change-point detection, non-linear regression, and many others. Existing methods for fitting conditional LGMs rely on grid search or sampling-based approaches to explore the posterior density of the conditioning parameters; both require a large number of evaluations of the unnormalized posterior density of the conditioning parameters. Since each evaluation requires fitting a separate LGM, these methods become computationally prohibitive beyond simple scenarios. In this work, the Bayesian Optimization Sequential Surrogate (BOSS) algorithm is introduced, which combines Bayesian optimization with approximate Bayesian inference methods to significantly reduce the computational resources required for fitting conditional LGMs. With orders of magnitude fewer evaluations than those required by the existing methods, BOSS efficiently generates sequential design points that capture the majority of the posterior mass of the conditioning parameters and subsequently yields an accurate surrogate posterior distribution that can be easily normalized. The efficiency, accuracy, and practical utility of BOSS are demonstrated through extensive simulation studies and real-world applications in epidemiology, environmental sciences, and astrophysics.
Computational Statistics & Data Analysis, Volume 213, Article 108253.
Citations: 0
GMM estimation of fixed effects partially linear additive SAR model with space-time correlated disturbances
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108252
Bogui Li, Jianbao Chen
In order to study the ubiquitous space-time panel data in the real world, a fixed effects partially linear additive spatial autoregressive (SAR) model with space-time correlated disturbances is proposed. Compared to the linear panel model with space-time correlated disturbances, it can simultaneously capture substantial spatial dependence of the response, linearity and nonlinearity between the response and regressors, and spatial and serial correlations of disturbances, while avoiding the “curse of dimensionality” of nonparametric regression. By using B-splines to fit additive components and constructing linear and quadratic moment conditions which incorporate information in disturbances, the generalized method of moments (GMM) estimators of unknown parameters and additive components are obtained. Under certain regularity assumptions, it is proved that the GMM estimators are consistent and asymptotically normal. Furthermore, the asymptotically efficient best GMM estimators under normality are derived. Monte Carlo simulation and empirical analysis illustrate that the developed estimation method has good finite sample performance and application prospects.
Computational Statistics & Data Analysis, Volume 213, Article 108252.
Citations: 0
Inferring the dynamics of quasi-reaction systems via nonlinear local mean-field approximations
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108251
Matteo Framba, Veronica Vinciotti, Ernst C. Wit
Parameter estimation of kinetic rates in stochastic quasi-reaction systems can be challenging, particularly when the time gap between consecutive measurements is large. Local linear approximation approaches account for the stochasticity in the system but fail to capture the intrinsically nonlinear nature of the mean dynamics of the process. Moreover, the mean dynamics of a quasi-reaction system can be described by a system of ODEs, which have an explicit solution only for simple unitary systems. An approximate analytical solution is derived for generic quasi-reaction systems via a first-order Taylor approximation of the hazard rate. This allows a nonlinear forward prediction of the future dynamics given the current state of the system. Predictions and corresponding observations are embedded in a nonlinear least-squares approach for parameter estimation. The performance of the algorithm is compared to existing methods via a simulation study. Besides the generality of the approach in the specification of the quasi-reaction system and the gains in computational efficiency, the results show an improvement in the kinetic rate estimation, particularly for data observed at large time intervals. Additionally, the availability of an explicit solution makes the method robust to stiffness, which is often present in biological systems. Application to Rhesus Macaque data illustrates the use of the method in the study of cell differentiation.
Computational Statistics & Data Analysis, Volume 213, Article 108251.
Citations: 0
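The abstract's remark that local linear approximations degrade when measurements are far apart can be illustrated on the simplest possible case. The sketch below is not the authors' method: it assumes a single first-order decay reaction A -> B with a hypothetical rate `k`, whose mean dynamics dx/dt = -k x have the explicit solution x0 * exp(-k * dt), and compares that with a one-step linear (Euler) prediction. All names and values are illustrative.

```python
import math


def exact_mean(x0, k, dt):
    """Exact mean of a first-order decay A -> B after a time gap dt."""
    return x0 * math.exp(-k * dt)


def linear_mean(x0, k, dt):
    """One-step local linear (Euler) prediction: x0 + dt * (-k * x0)."""
    return x0 * (1.0 - k * dt)


x0, k = 100.0, 0.8        # hypothetical initial count and kinetic rate
gaps = [0.05, 0.5, 2.0]   # small, moderate and large time gaps
errors = [abs(linear_mean(x0, k, dt) - exact_mean(x0, k, dt)) for dt in gaps]

for dt, err in zip(gaps, errors):
    print(f"dt={dt}: |linear - exact| = {err:.3f}")
```

The error grows with the gap, and for dt = 2.0 the linear prediction is negative, which is impossible for a molecule count. A nonlinear forward prediction avoids this in the toy case.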
Sample-specific cooperative learning integrating heterogeneous radiomics and pathomics data
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-21 | DOI: 10.1016/j.csda.2025.108250
Shih-Ting Huang, Graham A. Colditz, Shu Jiang
Multi-omics analysis offers unparalleled insights into the interlinked molecular interactions that govern the underlying biological processes. In the era of big data, driven by the emergence of high-throughput technologies, it is possible to gain a more comprehensive and detailed understanding of complex systems. Nevertheless, the challenges lie in developing methods to effectively integrate and analyze this wealth of data. This challenge is even more apparent when the type of -omics data (e.g., pathomics) lacks pixel-to-pixel or region-to-region correspondence across the population. A novel sample-specific cooperative learning framework is introduced, designed to adaptively manage diverse multi-omics data types, even when there is no direct correspondence between regions. The proposed framework is defined for both continuous and categorical outcomes, with theoretical guarantees based on finite samples. Model performance is demonstrated and compared with existing methods using real-world datasets involving proteomics and metabolomics, and radiomics and pathomics.
Computational Statistics & Data Analysis, Volume 213, Article 108250.
Citations: 0
Boosting interaction tree stumps for modeling interactions
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108247
Michael Lau, Tamara Schikowski, Holger Schwender
Incorporating interaction effects is essential for accurately modeling complex underlying relationships in many applications. Often, not only strong predictive performance is desired, but also the interpretability of the resulting model. This need is evident in areas such as epidemiology, in which uncovering the interplay of biological mechanisms is critical for understanding complex diseases. Classical linear models, frequently used for constructing genetic risk scores, fail to capture interaction effects autonomously, while modern machine learning methods such as gradient boosting often produce black-box models that lack interpretability. Existing linear interaction models are largely limited to considering two-way interactions. To address these limitations, a novel statistical learning method, BITS (Boosting Interaction Tree Stumps), is introduced to construct linear models while autonomously detecting and incorporating interaction effects. BITS uses gradient boosting on interaction tree stumps, i.e., decision trees with a single split, where in BITS this split can occur on an interaction term. A branch-and-bound approach is employed in BITS to discard weakly predictive terms. For high-dimensional data, a hybrid search strategy combining greedy and exhaustive approaches is proposed. Regularization techniques are integrated to prevent overfitting and the inclusion of spurious interaction effects. Simulation studies and real data applications demonstrate that BITS produces interpretable models with strong predictive performance. Moreover, in the simulation study, BITS primarily identifies truly influential terms.
Computational Statistics & Data Analysis, Volume 213, Article 108247.
Citations: 0
On Jeffreys's cardioid distribution
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108248
Arthur Pewsey
The cardioid distribution, despite being one of the fundamental models for circular data, has received limited attention both methodologically and in terms of its implementation in R. To redress these shortcomings, published results on the model are summarized, corrected and extended, and the scope and limitations of the existing support for the model in R identified. A thorough investigation into the performance of trigonometric moment and maximum likelihood based approaches to point and interval estimation of the model's location and concentration parameters is presented, and goodness-of-fit techniques outlined. A suite of reliable R functions is provided for the model's practical application. The application of the proposed inferential methods and R functions is illustrated by an analysis of palaeocurrent cross-bed azimuths.
Computational Statistics & Data Analysis, Volume 213, Article 108248.
Citations: 0
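As a rough illustration of the trigonometric moment approach mentioned in the abstract, the sketch below works with the standard cardioid density f(theta) = (1 + 2*rho*cos(theta - mu)) / (2*pi), with 0 <= rho <= 1/2, whose first trigonometric moment is rho * exp(i*mu). It is written in Python rather than R and is not the paper's suite of functions: `sample_cardioid` and `moment_estimates` are hypothetical names, and the rejection sampler is just one simple way to generate test data.

```python
import math
import random


def sample_cardioid(n, mu, rho, rng):
    """Rejection sampler for the cardioid density on [0, 2*pi)."""
    draws = []
    while len(draws) < n:
        theta = rng.uniform(0.0, 2.0 * math.pi)
        # Accept with probability f(theta) / max f; the maximum is at theta = mu.
        if rng.random() * (1.0 + 2.0 * rho) <= 1.0 + 2.0 * rho * math.cos(theta - mu):
            draws.append(theta)
    return draws


def moment_estimates(thetas):
    """Method-of-moments estimates: mean direction and mean resultant length."""
    n = len(thetas)
    c_bar = sum(math.cos(t) for t in thetas) / n
    s_bar = sum(math.sin(t) for t in thetas) / n
    mu_hat = math.atan2(s_bar, c_bar) % (2.0 * math.pi)
    rho_hat = min(math.hypot(c_bar, s_bar), 0.5)  # keep rho inside its valid range
    return mu_hat, rho_hat


rng = random.Random(1)
thetas = sample_cardioid(20000, mu=1.0, rho=0.3, rng=rng)
mu_hat, rho_hat = moment_estimates(thetas)
print(f"mu_hat={mu_hat:.3f}, rho_hat={rho_hat:.3f}")
```

With a sample this large, both estimates should land close to the values used for simulation (mu = 1.0, rho = 0.3). Truncating `rho_hat` at 0.5 is a convenience assumed here, not something taken from the paper.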
Modeling continuous distributions in hybrid Bayesian networks using mixtures of polynomials with tails
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-11 | DOI: 10.1016/j.csda.2025.108246
J.C. Luengo, D. Ramos-López, R. Rumí
A new approach to modeling continuous distributions in hybrid Bayesian networks (BNs) is presented. It is based on Mixtures of Polynomials (MoPs) with tails, named tMoPs. This proposal is a variation of the usual MoP model, now including tails and several other improvements in the learning process. The adequate modeling of tails in variable distributions is relevant theoretically and for many real applications, in which rare phenomena may have a great impact. The proposed approach has been designed to exploit the flexibility of the tMoP model to fit different continuous data distributions. This is especially relevant in those distributions with zones of density close to zero, in which polynomial fitting may be difficult. In these situations, tMoPs allow a polynomial fit in parts with higher density and the use of tails in areas with lower density. This permits a better global fit, without loss of overall accuracy and yielding a relatively simple density function. Learning algorithms for tMoP conditional probability distributions with up to two parents of any type are developed. These tMoPs may be integrated into hybrid Bayesian networks to represent conditional probability distributions, thus making it possible to perform probabilistic reasoning, such as causal inference, sensitivity analysis, and other decision-making operations. The suitability of tMoPs is evaluated in several ways, using a large set of real datasets with data of different natures. The experiments include: the analysis of goodness-of-fit with several continuous and pseudo-continuous variables, the optimization of certain parameters and the effect of variable selection and graph structure when using tMoPs in BNs, and finally the evaluation of the predictive ability of hybrid BNs based on tMoPs in classification and regression. Results show the good behavior of our proposal, with the tMoP hybrid Bayesian networks being equally accurate or outperforming other techniques in most scenarios, in addition to providing a more informative and convenient probabilistic model.
Computational Statistics & Data Analysis, Volume 212, Article 108246.
Citations: 0
Kernel density estimation for compositional data with zeros via hypersphere mapping
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2025-07-11 | DOI: 10.1016/j.csda.2025.108249
Changwon Yoon, Hyunbin Choi, Jeongyoun Ahn
Compositional data—measurements of relative proportions among components—arise frequently in fields ranging from chemometrics to bioinformatics. While density estimation of such data provides crucial insights into their underlying patterns and enables comparative analyses across groups, existing nonparametric approaches are limited, particularly in handling zero components that commonly occur in real-world datasets. We propose a novel kernel density estimation (KDE) method for compositional data that naturally accommodates zero components by exploiting the geometric correspondence between simplices and hyperspheres. This connection to spherical KDE allows us to establish theoretical guarantees, including consistency of the estimator. Through extensive simulations and real data analyses, we demonstrate our method's advantages over existing approaches, particularly in scenarios involving zero components.
Computational Statistics & Data Analysis, Volume 212, Article 108249.
Citations: 0
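The simplex-to-hypersphere correspondence mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not state which mapping the authors use; the element-wise square-root transform below is a common choice and is an assumption here, as are the function names `simplex_to_sphere` and `sphere_to_simplex`. The point of the sketch is only that such a map is defined for zero components, unlike log-ratio transforms.

```python
import numpy as np


def simplex_to_sphere(x):
    """Map compositions (non-negative rows summing to 1) onto the unit sphere.

    Assumption: element-wise square root, so that sum(sqrt(x_i)^2) = sum(x_i) = 1.
    Zero components map to zero coordinates without any special handling.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x < 0) or not np.allclose(x.sum(axis=-1), 1.0):
        raise ValueError("each composition must be non-negative and sum to 1")
    return np.sqrt(x)


def sphere_to_simplex(s):
    """Inverse map on the non-negative orthant: square each coordinate."""
    return np.asarray(s, dtype=float) ** 2


# Two hypothetical 3-part compositions; the first has a zero component.
compositions = np.array([[0.5, 0.5, 0.0],
                         [0.2, 0.3, 0.5]])
points = simplex_to_sphere(compositions)
norms = np.linalg.norm(points, axis=1)  # Euclidean norm of each mapped point
```

Once the data sit on the sphere, any spherical kernel density estimator could in principle be applied to `points`; which kernel and bandwidth the paper uses is not stated in the abstract.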
Testing the equality of high dimensional distributions
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-07-09 DOI: 10.1016/j.csda.2025.108245
Reza Modarres
The Euclidean distance is not a suitable distance for high dimensional settings due to the distance concentration phenomenon. A novel statistic that is inspired by the interpoint distances, but avoids their computation, is proposed for comparing and visualizing high dimensional datasets. The new statistic is based on a high dimensional dissimilarity index that takes advantage of the concentration phenomenon. A simultaneous display of observation means and standard deviations is discussed; it aids visualization and the detection of suspect outliers, and enhances separability among the competing classes in the transformed space. The finite sample convergence of the dissimilarity indices is studied, nine statistics are compared under several distributions, and three applications are presented.
Computational Statistics & Data Analysis, Volume 212, Article 108245. Published 2025-07-09.
Citations: 0
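The distance concentration phenomenon that motivates the abstract above can be seen in a small simulation. This is only an illustration of the phenomenon, not the proposed statistic (which, according to the abstract, avoids computing interpoint distances). The function name `relative_contrast`, the standard normal data, and the sample sizes are all assumptions chosen for the sketch.

```python
import numpy as np


def relative_contrast(n_points, dim, rng):
    """Return (max - min) / min over all pairwise Euclidean distances.

    A value near zero means all points are almost equidistant, which is
    the distance concentration phenomenon.
    """
    x = rng.standard_normal((n_points, dim))
    # Squared distances from the Gram matrix: |a|^2 + |b|^2 - 2 a.b
    sq_norms = np.sum(x ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    upper = np.triu_indices(n_points, k=1)      # each pair counted once
    d = np.sqrt(np.maximum(d2[upper], 0.0))     # clip tiny negative round-off
    return (d.max() - d.min()) / d.min()


rng = np.random.default_rng(0)
contrasts = {dim: relative_contrast(200, dim, rng) for dim in (2, 100, 10000)}
for dim, value in contrasts.items():
    print(f"dim={dim:>5}: relative contrast = {value:.3f}")
```

With independent standard normal coordinates, the relative contrast shrinks as the dimension grows, so nearest and farthest neighbours become hard to tell apart by Euclidean distance alone.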
High-dimensional response growth curve modeling for longitudinal neuroimaging analysis
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2025-07-07 DOI: 10.1016/j.csda.2025.108239
Lu Wang, Xiang Lyu, Lexin Li
There is increasing interest in modeling high-dimensional longitudinal outcomes in applications such as developmental neuroimaging research. The growth curve model offers a useful tool to capture both the mean growth pattern across individuals and the dynamic changes of outcomes over time within each individual. However, when the number of outcomes is large, it becomes challenging and often infeasible to tackle the large covariance matrix of the random effects involved in the model. A high-dimensional response growth curve model, with three novel components, is proposed: a low-rank factor model structure that substantially reduces the number of parameters in the large covariance matrix, a re-parameterization formulation coupled with a sparsity penalty that selects important fixed and random effect terms, and a computational trick that turns the inversion of a large matrix into the inversion of a stack of small matrices and thus considerably speeds up the computation. An efficient expectation-maximization-type estimation algorithm is developed, and the competitive performance of the proposed method is demonstrated through both simulations and a longitudinal study of brain structural connectivity in association with human immunodeficiency virus.
Computational Statistics & Data Analysis, Volume 212, Article 108239. Published 2025-07-07.
Citations: 0
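The abstract above does not spell out its computational trick, so the following is not the authors' method. It is a sketch of one standard identity, the Woodbury identity, that gives a comparable saving when a covariance has a low-rank factor structure: a p x p inverse is replaced by an r x r solve with r much smaller than p. The function name `woodbury_inverse`, the diagonal-plus-low-rank form, and the sizes are assumptions for illustration.

```python
import numpy as np


def woodbury_inverse(diag, factor):
    """Invert D + F F^T, where D = diag(diag) is p x p and F is p x r.

    Woodbury identity:
        (D + F F^T)^{-1} = D^{-1} - D^{-1} F (I_r + F^T D^{-1} F)^{-1} F^T D^{-1}
    Only an r x r linear system is solved.
    """
    d_inv = 1.0 / diag
    scaled = factor * d_inv[:, None]                        # D^{-1} F, shape (p, r)
    small = np.eye(factor.shape[1]) + factor.T @ scaled     # r x r matrix
    return np.diag(d_inv) - scaled @ np.linalg.solve(small, scaled.T)


rng = np.random.default_rng(0)
p, r = 300, 3                                   # hypothetical sizes
diag = rng.uniform(0.5, 2.0, size=p)            # positive diagonal part
factor = rng.standard_normal((p, r))            # low-rank loadings
sigma = np.diag(diag) + factor @ factor.T       # full p x p covariance
sigma_inv = woodbury_inverse(diag, factor)
max_error = np.max(np.abs(sigma_inv @ sigma - np.eye(p)))
```

For the sizes used here, the direct inverse costs on the order of p^3 operations, while the sketch solves only a 3 x 3 system plus matrix products.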