Pub Date: 2025-08-10 | DOI: 10.1016/j.csda.2025.108266
Mingyue Du, Ricong Zeng
The estimation of the semiparametric probit model is discussed for the situation where one observes interval-censored failure time data arising from case-cohort studies. The probit model has recently attracted attention for regression analysis of failure time data, partly due to the popularity of the normal distribution and its similarity to linear models. Although some methods have been developed in the literature for its estimation, no established approach seems to exist for case-cohort interval-censored data. To address this, a pseudo-maximum likelihood method is proposed, and an EM algorithm is developed for its implementation. The resulting estimators of the regression parameters are shown to be consistent and asymptotically normal. To assess the empirical performance of the proposed method, a simulation study is conducted and indicates that it works well in practical situations. In addition, the method is applied to a set of real data arising from an AIDS clinical trial that motivated this study.
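As a hedged illustration of the model class (not the authors' estimator): under a semiparametric probit specification, P(T ≤ t | x) = Φ(h(t) + x'β) with h a monotone nuisance function, so an interval-censored observation (L, R] contributes Φ(h(R) + x'β) − Φ(h(L) + x'β) to the likelihood. A minimal sketch with illustrative h values:

```python
# Minimal sketch (not the paper's implementation): log-likelihood of a probit
# model under interval censoring, with h(.) evaluated at the interval endpoints.
import math

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_censored_loglik(beta, data, h):
    """data: list of (x, L, R); L=None means left-censored, R=None right-censored.
    h: dict mapping observation times to values of the monotone h(.)."""
    ll = 0.0
    for x, L, R in data:
        eta = sum(b * xi for b, xi in zip(beta, x))
        FL = 0.0 if L is None else Phi(h[L] + eta)
        FR = 1.0 if R is None else Phi(h[R] + eta)
        ll += math.log(FR - FL)
    return ll

data = [((1.0,), None, 1.0),   # event in (0, 1]
        ((0.0,), 1.0, 2.0),    # event in (1, 2]
        ((1.0,), 2.0, None)]   # right-censored beyond 2
h = {1.0: -0.5, 2.0: 0.5}      # illustrative values of h(.) at the endpoints
print(round(interval_censored_loglik([0.3], data, h), 4))
```

In the actual pseudo-maximum likelihood approach, h would be estimated jointly with β via the EM algorithm; here both are fixed purely for illustration.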
"Estimation of semiparametric probit model based on case-cohort interval-censored failure time data." Computational Statistics & Data Analysis, vol. 213, Article 108266.
Pub Date: 2025-08-06 | DOI: 10.1016/j.csda.2025.108255
Uche Mbaka , James Owen Ramsay , Michelle Carey
Functional data analysis frequently involves estimating a smooth covariance function based on observed data. This estimation is essential for understanding interactions among functions and constitutes a fundamental aspect of numerous advanced methodologies, including functional principal component analysis. Two approaches for estimating smooth covariance functions in the presence of measurement errors are introduced. The first method employs a low-rank approximation of the covariance matrix, while the second ensures positive definiteness via a Cholesky decomposition. Both approaches employ penalized regression to produce smooth covariance estimates and have been validated through comprehensive simulation studies. The practical application of these methods is demonstrated through the examination of average weekly milk yields in dairy cows as well as egg-laying patterns of Mediterranean fruit flies.
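A toy sketch of the low-rank idea (pure Python, not the paper's penalized estimator): the sample covariance of noise-corrupted curves has a diagonal inflated by the measurement-error variance, and a truncated eigendecomposition, here rank 1 via power iteration, yields a positive semi-definite approximation that discards most of that noise.

```python
# Illustration only: rank-1 covariance approximation for noisy functional data.
import math, random
random.seed(1)

m, n, sigma = 8, 400, 0.3            # grid points, curves, noise sd
grid = [j / (m - 1) for j in range(m)]
shape = [math.sin(math.pi * t) for t in grid]   # one smooth mode of variation
X = [[random.gauss(0, 1) * s + random.gauss(0, sigma) for s in shape]
     for _ in range(n)]

mean = [sum(x[j] for x in X) / n for j in range(m)]
C = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in X) / (n - 1)
      for k in range(m)] for j in range(m)]     # raw sample covariance

# power iteration for the leading eigenpair of C
v = [1.0] * m
for _ in range(200):
    w = [sum(C[j][k] * v[k] for k in range(m)) for j in range(m)]
    nrm = math.sqrt(sum(wi * wi for wi in w))
    v = [wi / nrm for wi in w]
lam = sum(v[j] * sum(C[j][k] * v[k] for k in range(m)) for j in range(m))
C1 = [[lam * v[j] * v[k] for k in range(m)] for j in range(m)]  # rank-1, PSD

# the rank-1 reconstruction has a smaller trace: the white-noise part is dropped
print(round(sum(C[j][j] for j in range(m)), 3),
      round(sum(C1[j][j] for j in range(m)), 3))
```

The paper's methods additionally smooth the estimate via penalized regression; this sketch only shows why separating the low-rank signal from the noisy diagonal is useful.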
"Estimating a smooth covariance for functional data." Computational Statistics & Data Analysis, vol. 213, Article 108255.
Pub Date: 2025-08-05 | DOI: 10.1016/j.csda.2025.108256
Hyungwoo Kim , Seung Jun Shin
The receiver operating characteristic (ROC) curve is a popular tool for evaluating a binary classifier under the imbalanced scenarios frequently encountered in practice. A practical approach to constructing a linear binary classifier is presented by simultaneously optimizing the area under the ROC curve (AUC) and selecting informative variables in high dimensions. In particular, the smoothly clipped absolute deviation (SCAD) penalty is employed, and its oracle property is established, which enables the development of a consistent BIC-type information criterion that greatly facilitates the tuning procedure. Both simulated and real data analyses demonstrate the promising performance of the proposed method in terms of AUC optimization and variable selection.
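Two ingredients of such a method can be sketched directly (illustration only, not the paper's algorithm): the empirical AUC of a linear score x'β is the fraction of (case, control) pairs ranked correctly, and the SCAD penalty of Fan and Li applies a folded-quadratic shrinkage to each coefficient.

```python
# Empirical AUC of a linear score, plus the SCAD penalty function.
def auc(beta, pos, neg):
    score = lambda x: sum(b * xi for b, xi in zip(beta, x))
    pairs = [(score(p), score(q)) for p in pos for q in neg]
    return sum(1.0 if sp > sq else 0.5 if sp == sq else 0.0
               for sp, sq in pairs) / len(pairs)

def scad(b, lam, a=3.7):
    """SCAD penalty for one coefficient: linear near 0, quadratic blend,
    then constant, so large coefficients are not over-shrunk."""
    b = abs(b)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

pos = [(2.0, 0.1), (1.5, -0.2)]   # cases
neg = [(0.5, 0.3), (0.2, -0.1)]   # controls
print(auc((1.0, 0.0), pos, neg))  # x1 alone separates the classes perfectly
print(round(scad(0.05, 0.1), 4), round(scad(5.0, 0.1), 4))
```

Maximizing the (smoothed) AUC while penalizing β with SCAD is what drives the variable selection; the smoothing and optimization details are the paper's contribution and are not reproduced here.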
"Variable selection in AUC-optimizing classification." Computational Statistics & Data Analysis, vol. 213, Article 108256.
Pub Date: 2025-07-29 | DOI: 10.1016/j.csda.2025.108254
Quan Vu , Francis K.C. Hui , Samuel Muller , A.H. Welsh
When fitting generalized linear mixed models, choosing the random effects distribution is an important decision. As random effects are unobserved, misspecification of their distribution is a real possibility. Thus, the consequences of random effects misspecification for point prediction and prediction inference of random effects in generalized linear mixed models need to be investigated. A combination of theory, simulation, and a real application is used to explore the effect of using the common normality assumption for the random effects distribution when the correct specification is a mixture of normal distributions, focusing on the impacts on point prediction, mean squared prediction errors, and prediction intervals. Results show that the level of shrinkage for the predicted random effects can differ greatly under the two random effects distributions, and so is susceptible to misspecification. Also, the unconditional mean squared prediction errors for the random effects are almost always larger under the misspecified normal random effects distribution, while results for the mean squared prediction errors conditional on the random effects are more complicated but remain generally larger under the misspecified distribution (especially when the true random effect is close to the mean of one of the component distributions in the true mixture distribution). Results for prediction intervals indicate that the overall coverage probability is, in contrast, not greatly impacted by misspecification. It is concluded that misspecifying the random effects distribution can affect prediction of random effects, and greater caution is recommended when adopting the normality assumption in generalized linear mixed models.
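The shrinkage phenomenon can be made concrete in a much simpler setting than the GLMMs studied here (a linear random-intercept model, shown purely for intuition): the normality-based predictor shrinks the cluster mean toward zero by σ_b² / (σ_b² + σ_e²/n) regardless of whether the random effects really are normal, so when the truth is a well-separated normal mixture the same factor is applied everywhere.

```python
# Toy illustration: normality-based shrinkage prediction when the true random
# intercepts follow a two-component normal mixture.
import random, math
random.seed(7)

sigma_e, n_per, n_clusters = 1.0, 5, 4000

def draw_b():  # truth: equal-weight mixture of N(-2, 0.5^2) and N(2, 0.5^2)
    return random.gauss(-2.0, 0.5) if random.random() < 0.5 else random.gauss(2.0, 0.5)

sigma_b2 = 0.5**2 + 2.0**2           # mixture variance: within + between components
shrink = sigma_b2 / (sigma_b2 + sigma_e**2 / n_per)

mspe = 0.0
for _ in range(n_clusters):
    b = draw_b()
    ybar = b + random.gauss(0, sigma_e / math.sqrt(n_per))  # cluster mean deviation
    mspe += (shrink * ybar - b) ** 2  # normal-theory predictor vs true effect
print(round(shrink, 3), round(mspe / n_clusters, 3))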
"Random effects misspecification and its consequences for prediction in generalized linear mixed models." Computational Statistics & Data Analysis, vol. 213, Article 108254.
Pub Date: 2025-07-23 | DOI: 10.1016/j.csda.2025.108253
Dayi Li , Ziang Zhang
Approximate Bayesian inference based on Laplace approximation and quadrature has become increasingly popular for its efficiency in fitting latent Gaussian models (LGM). However, many useful models can only be fitted as LGMs if some conditioning parameters are fixed. Such models are termed conditional LGMs, with examples including change-point detection, non-linear regression, and many others. Existing methods for fitting conditional LGMs rely on grid search or sampling-based approaches to explore the posterior density of the conditioning parameters; both require a large number of evaluations of the unnormalized posterior density of the conditioning parameters. Since each evaluation requires fitting a separate LGM, these methods become computationally prohibitive beyond simple scenarios. In this work, the Bayesian Optimization Sequential Surrogate (BOSS) algorithm is introduced, which combines Bayesian optimization with approximate Bayesian inference methods to significantly reduce the computational resources required for fitting conditional LGMs. With orders of magnitude fewer evaluations than those required by the existing methods, BOSS efficiently generates sequential design points that capture the majority of the posterior mass of the conditioning parameters and subsequently yields an accurate surrogate posterior distribution that can be easily normalized. The efficiency, accuracy, and practical utility of BOSS are demonstrated through extensive simulation studies and real-world applications in epidemiology, environmental sciences, and astrophysics.
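The surrogate idea can be illustrated in a drastically simplified form (fixed design points and linear interpolation instead of Bayesian-optimization-chosen points and a Gaussian-process surrogate; everything below is an assumption for illustration): evaluate an expensive unnormalized log-posterior at a handful of points, interpolate it, then normalize the surrogate cheaply on a fine grid.

```python
# Simplified surrogate-posterior sketch: few expensive evaluations, cheap
# interpolation, then normalization of the surrogate density.
import math

def expensive_log_post(theta):   # stand-in for "fit one LGM per evaluation"
    return -0.5 * (theta - 1.0) ** 2 / 0.25    # unnormalized N(1, 0.5^2)

design = [-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0]  # only 7 expensive calls
vals = [expensive_log_post(t) for t in design]

def surrogate(theta):            # piecewise-linear in the log density
    for (t0, v0), (t1, v1) in zip(zip(design, vals), zip(design[1:], vals[1:])):
        if t0 <= theta <= t1:
            w = (theta - t0) / (t1 - t0)
            return (1 - w) * v0 + w * v1
    return float("-inf")         # outside the explored region

grid = [-1.0 + 4.0 * i / 800 for i in range(801)]
dens = [math.exp(surrogate(t)) for t in grid]
h = grid[1] - grid[0]
Z = sum(dens) * h                # cheap Riemann normalization of the surrogate
post_mean = sum(t * d for t, d in zip(grid, dens)) * h / Z
print(round(post_mean, 3))
```

BOSS replaces the fixed design with sequentially chosen Bayesian-optimization points, so the expensive evaluations concentrate where the posterior mass actually is.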
"Bayesian optimization sequential surrogate (BOSS) algorithm: Fast Bayesian inference for a broad class of Bayesian hierarchical models." Computational Statistics & Data Analysis, vol. 213, Article 108253.
Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108252
Bogui Li , Jianbao Chen
To study the space-time panel data that are ubiquitous in the real world, a fixed effects partially linear additive spatial autoregressive (SAR) model with space-time correlated disturbances is proposed. Compared to the linear panel model with space-time correlated disturbances, it can simultaneously capture substantial spatial dependence of the response, linear and nonlinear relationships between the response and regressors, and spatial and serial correlations of disturbances, while avoiding the “curse of dimensionality” of nonparametric regression. By using B-splines to fit the additive components and constructing linear and quadratic moment conditions that incorporate information in the disturbances, the generalized method of moments (GMM) estimators of the unknown parameters and additive components are obtained. Under certain regularity assumptions, it is proved that the GMM estimators are consistent and asymptotically normal. Furthermore, the asymptotically efficient best GMM estimators under normality are derived. Monte Carlo simulation and empirical analysis illustrate that the developed estimation method has good finite sample performance and application prospects.
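The B-spline approximation of an additive component can be sketched with the standard Cox-de Boor recursion (illustration only; the paper's GMM machinery is not reproduced): the component g(t) is modeled as a linear combination Σᵢ θᵢ Bᵢ(t) of basis functions, which sum to one on the interior of the knot range.

```python
# Cox-de Boor recursion for B-spline basis functions.
def bspline_basis(i, p, t, knots):
    """Value of the i-th degree-p B-spline at t for a given knot vector."""
    if p == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + p] != knots[i]:
        left = ((t - knots[i]) / (knots[i + p] - knots[i])
                * bspline_basis(i, p - 1, t, knots))
    right = 0.0
    if knots[i + p + 1] != knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * bspline_basis(i + 1, p - 1, t, knots))
    return left + right

# clamped cubic basis on [0, 1] with one interior knot at 0.5
knots = [0, 0, 0, 0, 0.5, 1, 1, 1, 1]
nbasis = len(knots) - 4                # 5 cubic basis functions
vals = [bspline_basis(i, 3, 0.3, knots) for i in range(nbasis)]
print([round(v, 4) for v in vals], round(sum(vals), 4))  # partition of unity
```

In the model, each additive component contributes one such basis expansion, and the spline coefficients enter the GMM moment conditions alongside the parametric part.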
"GMM estimation of fixed effects partially linear additive SAR model with space-time correlated disturbances." Computational Statistics & Data Analysis, vol. 213, Article 108252.
Pub Date: 2025-07-22 | DOI: 10.1016/j.csda.2025.108251
Matteo Framba , Veronica Vinciotti , Ernst C. Wit
Parameter estimation of kinetic rates in stochastic quasi-reaction systems can be challenging, particularly when the time gap between consecutive measurements is large. Local linear approximation approaches account for the stochasticity in the system but fail to capture the intrinsically nonlinear nature of the mean dynamics of the process. Moreover, the mean dynamics of a quasi-reaction system can be described by a system of ODEs, which have an explicit solution only for simple unitary systems. An approximate analytical solution is derived for generic quasi-reaction systems via a first-order Taylor approximation of the hazard rate. This allows a nonlinear forward prediction of the future dynamics given the current state of the system. Predictions and corresponding observations are embedded in a nonlinear least-squares approach for parameter estimation. The performance of the algorithm is compared to existing methods via a simulation study. Besides the generality of the approach in the specification of the quasi-reaction system and the gains in computational efficiency, the results show an improvement in the kinetic rate estimation, particularly for data observed at large time intervals. Additionally, the availability of an explicit solution makes the method robust to stiffness, which is often present in biological systems. Application to Rhesus Macaque data illustrates the use of the method in the study of cell differentiation.
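The forward-prediction idea can be sketched on a single illustrative reaction (a dimerization 2Y → ∅ with hazard k·y(y−1)/2; a choice made here for illustration, not one of the paper's examples): linearizing the hazard at the current state turns the nonlinear mean ODE dy/dt = −2h(y) into a linear ODE dy/dt = a + b·y, whose explicit exponential solution remains usable across a large time gap, unlike a single local-linear (Euler) step.

```python
# Compare one-step local-linear prediction, the explicit Taylor-linearized
# solution, and a fine-step reference across a large time gap.
import math

k, y0, dt = 0.01, 100.0, 5.0           # rate, current state, large time gap

def hazard(y):      return k * y * (y - 1) / 2.0
def hazard_grad(y): return k * (2 * y - 1) / 2.0

# (a) local linear approximation: one Euler step across the whole gap
y_euler = y0 - 2.0 * hazard(y0) * dt

# (b) first-order Taylor expansion of the hazard at y0 gives dy/dt = a + b*y,
#     which has the explicit solution y(t) = (y0 + a/b) * exp(b*t) - a/b
b = -2.0 * hazard_grad(y0)
a = -2.0 * hazard(y0) - b * y0
y_taylor = (y0 + a / b) * math.exp(b * dt) - a / b

# (c) reference: fine-step integration of the exact nonlinear mean ODE
y_ref, steps = y0, 50000
for _ in range(steps):
    y_ref += (dt / steps) * (-2.0 * hazard(y_ref))

print(round(y_euler, 1), round(y_taylor, 1), round(y_ref, 1))
```

The one-step Euler prediction even goes negative here, while the linearized explicit solution stays on the right scale; embedding such predictions in nonlinear least squares is what the paper does for generic quasi-reaction systems.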
"Inferring the dynamics of quasi-reaction systems via nonlinear local mean-field approximations." Computational Statistics & Data Analysis, vol. 213, Article 108251.
Pub Date: 2025-07-21 | DOI: 10.1016/j.csda.2025.108250
Shih-Ting Huang , Graham A. Colditz , Shu Jiang
Multi-omics analysis offers unparalleled insights into the interlinked molecular interactions that govern the underlying biological processes. In the era of big data, driven by the emergence of high-throughput technologies, it is possible to gain a more comprehensive and detailed understanding of complex systems. Nevertheless, the challenges lie in developing methods to effectively integrate and analyze this wealth of data. This challenge is even more apparent when the type of -omics data (e.g., pathomics) lacks pixel-to-pixel or region-to-region correspondence across the population. A novel sample-specific cooperative learning framework is introduced, designed to adaptively manage diverse multi-omics data types, even when there is no direct correspondence between regions. The proposed framework is defined for both continuous and categorical outcomes, with theoretical guarantees based on finite samples. Model performance is demonstrated and compared with existing methods using real-world datasets involving proteomics and metabolomics, and radiomics and pathomics.
"Sample-specific cooperative learning integrating heterogeneous radiomics and pathomics data." Computational Statistics & Data Analysis, vol. 213, Article 108250.
Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108247
Michael Lau , Tamara Schikowski , Holger Schwender
Incorporating interaction effects is essential for accurately modeling complex underlying relationships in many applications. Often, not only strong predictive performance is desired, but also the interpretability of the resulting model. This need is evident in areas such as epidemiology, in which uncovering the interplay of biological mechanisms is critical for understanding complex diseases. Classical linear models, frequently used for constructing genetic risk scores, fail to capture interaction effects autonomously, while modern machine learning methods such as gradient boosting often produce black-box models that lack interpretability. Existing linear interaction models are largely limited to considering two-way interactions. To address these limitations, a novel statistical learning method, BITS (Boosting Interaction Tree Stumps), is introduced to construct linear models while autonomously detecting and incorporating interaction effects. BITS uses gradient boosting on interaction tree stumps, i.e., decision trees with a single split, where in BITS this split can possibly occur on an interaction term. A branch-and-bound approach is employed in BITS to discard weakly predictive terms. For high-dimensional data, a hybrid search strategy combining greedy and exhaustive approaches is proposed. Regularization techniques are integrated to prevent overfitting and the inclusion of spurious interaction effects. Simulation studies and real data applications demonstrate that BITS produces interpretable models with strong predictive performance. Moreover, in the simulation study, BITS primarily identifies truly influential terms.
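A highly simplified sketch of the stump-boosting idea (not the BITS implementation; no branch-and-bound search and no regularization): L2 boosting where each base learner is a single split on either a raw feature or a pairwise product, so an interaction can enter the model without being pre-specified.

```python
# Toy L2 boosting with interaction tree stumps on a pure-interaction signal.
import random
random.seed(3)

n = 500
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n)]
y = [2.0 * x[0] * x[1] + random.gauss(0, 0.1) for x in X]   # pure interaction

def terms(x):   # candidate split variables: main effects and pairwise products
    vals = {j: x[j] for j in range(3)}
    vals.update({(j, k): x[j] * x[k] for j in range(3) for k in range(j + 1, 3)})
    return vals

def fit_stump(X, r):
    """Best single split (term, threshold, leaf means) for residuals r."""
    best = None
    for key in terms(X[0]):
        z = [terms(x)[key] for x in X]
        for thr in (-0.5, 0.0, 0.5):
            lo = [ri for zi, ri in zip(z, r) if zi <= thr]
            hi = [ri for zi, ri in zip(z, r) if zi > thr]
            if not lo or not hi:
                continue
            ml, mh = sum(lo) / len(lo), sum(hi) / len(hi)
            sse = sum((ri - ml) ** 2 for ri in lo) + sum((ri - mh) ** 2 for ri in hi)
            if best is None or sse < best[0]:
                best = (sse, key, thr, ml, mh)
    return best

r = list(y)
picked = []
for _ in range(20):                    # 20 boosting rounds, learning rate 0.5
    _, key, thr, ml, mh = fit_stump(X, r)
    picked.append(key)
    for i, x in enumerate(X):
        r[i] -= 0.5 * (ml if terms(x)[key] <= thr else mh)
print(picked[0])                       # the x0*x1 interaction is selected first
```

Neither main effect correlates with y on its own here, so only a learner that can split on the product term detects the signal, which is the motivation for interaction tree stumps.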
"Boosting interaction tree stumps for modeling interactions." Computational Statistics & Data Analysis, vol. 213, Article 108247.
Pub Date: 2025-07-16 | DOI: 10.1016/j.csda.2025.108248
Arthur Pewsey
The cardioid distribution, despite being one of the fundamental models for circular data, has received limited attention both methodologically and in terms of its implementation in R. To redress these shortcomings, published results on the model are summarized, corrected and extended, and the scope and limitations of the existing support for the model in R are identified. A thorough investigation into the performance of trigonometric-moment- and maximum-likelihood-based approaches to point and interval estimation of the model's location and concentration parameters is presented, and goodness-of-fit techniques are outlined. A suite of reliable R functions is provided for the model's practical application. The application of the proposed inferential methods and R functions is illustrated by an analysis of palaeocurrent cross-bed azimuths.
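The trigonometric moment estimators are simple enough to sketch (shown here in Python rather than the paper's R functions): for the cardioid density f(θ) = (1 + 2ρ cos(θ − μ)) / (2π) with |ρ| < 1/2, the first trigonometric moments are E[cos θ] = ρ cos μ and E[sin θ] = ρ sin μ, so μ and ρ are recovered from the sample means of cos θ and sin θ. Sampling below is by rejection from a uniform envelope.

```python
# Trigonometric moment estimation for the cardioid distribution.
import math, random
random.seed(11)

mu, rho = 1.0, 0.3

def rcardioid():
    while True:                        # rejection sampling; density max is 1 + 2*rho
        th = random.uniform(-math.pi, math.pi)
        if random.uniform(0, 1 + 2 * rho) <= 1 + 2 * rho * math.cos(th - mu):
            return th

sample = [rcardioid() for _ in range(20000)]
cbar = sum(math.cos(t) for t in sample) / len(sample)
sbar = sum(math.sin(t) for t in sample) / len(sample)
mu_hat = math.atan2(sbar, cbar)        # E cos = rho*cos(mu), E sin = rho*sin(mu)
rho_hat = math.hypot(cbar, sbar)
print(round(mu_hat, 2), round(rho_hat, 2))
```

Maximum likelihood estimation, interval estimation and the boundary behavior as ρ approaches 1/2 are where the paper's corrections and R functions come in; the moment estimators above are only the starting point.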
"On Jeffreys's cardioid distribution." Computational Statistics & Data Analysis, vol. 213, Article 108248.