Pub Date : 2025-11-17 DOI: 10.1016/j.csda.2025.108304
Yue Huan, Guoqiang Wang, Hai Xiang Lin
Data assimilation (DA) combines numerical model simulations with observed data to obtain the best possible description of a dynamical system and its uncertainty. Incorrect modeling assumptions can lead to filter divergence, making model identification an important issue in DA. Variations in dynamic model structure can change the parameter dimension, which complicates the resampling step in particle filters (PFs). To meet this challenge, this paper proposes the Sequential Hierarchical Bayesian Model (SHBM), which integrates the evolution model and the observation model from the DA scheme with a hierarchical parameter model. A two-step resampling method is also proposed to estimate the SHBM: the first step uses the resampling scheme of the bootstrap filter to resample new particles based on their weights, which may produce duplicate particles; the second step uses Reversible Jump Markov Chain Monte Carlo (RJMCMC) to draw new particles from the target distribution. This approach preserves particle diversity: the first step aims to avoid particle degeneracy, and the second aims to prevent sample impoverishment. Results on the advection equation and Lorenz 96 examples demonstrate the effectiveness of the proposed method.
Title: Sequential hierarchical Bayesian model and particle filter estimation with two-step RJMCMC resampling. Journal: Computational Statistics & Data Analysis, Volume 216, Article 108304.
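The two-step idea in the abstract above (resample by weight, then move the duplicated particles with an MCMC kernel) can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `resample_then_move` is hypothetical, the state is a single float, and a fixed-dimension random-walk Metropolis move is used as a simple stand-in for the RJMCMC step, which in the paper also handles changes in parameter dimension.

```python
import math
import random


def resample_then_move(particles, weights, log_target, step=0.5, rng=None):
    """Illustrative two-step resampling for scalar particles.

    Step 1 draws particles in proportion to their weights (as in the
    bootstrap filter). Step 2 applies one random-walk Metropolis move per
    particle; this is an assumed stand-in for the RJMCMC step.
    """
    rng = rng or random.Random(0)
    n = len(particles)

    # Step 1: multinomial resampling; high-weight particles get duplicated.
    resampled = rng.choices(particles, weights=weights, k=n)

    # Step 2: Metropolis move targeting exp(log_target); accepted proposals
    # replace duplicates with fresh values.
    moved = []
    for x in resampled:
        proposal = x + rng.gauss(0.0, step)
        log_accept = log_target(proposal) - log_target(x)
        if log_accept >= 0 or rng.random() < math.exp(log_accept):
            moved.append(proposal)
        else:
            moved.append(x)
    return resampled, moved


# Hypothetical example: standard normal target (up to a constant).
particles = [-1.0, 0.0, 1.0, 2.0]
weights = [0.1, 0.6, 0.2, 0.1]
resampled, moved = resample_then_move(
    particles, weights, lambda x: -0.5 * x * x
)
```

Because the Metropolis kernel leaves the target distribution invariant, the move step does not bias the particle approximation; it only spreads out copies created by step 1.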
Pub Date : 2025-11-15 DOI: 10.1016/j.csda.2025.108305
Baohao Wei, Dongsheng Tu, Chunlin Wang
The receiver operating characteristic (ROC) curve and its summary measures, such as the area under the curve (AUC) and the Youden index, are frequently used to evaluate the performance of a binary classifier based on a continuous biomarker and to identify a suitable cut-off point for classification. In clinical applications, the biomarker used for classification may be semi-continuous in the sense that the observations contain excess zero values and the distribution of the positive values is skewed. In this paper, the distribution of a semi-continuous biomarker is modeled using a mixture of a discrete mass at zero and a continuous skewed positive component. In addition, the distributions of the continuous component in subjects with true negative and positive outcomes are linked by a semi-parametric density ratio model to gain efficiency. Under this framework, unified estimation and inference procedures are proposed for the ROC curve, its important summary measures, and the associated cut-off point. The asymptotic properties of the proposed semi-parametric estimators are established and used to construct their corresponding confidence intervals. Simulation results demonstrate the desirable performance of these estimators and confidence intervals in various settings. The proposed semi-parametric approach is also applied to assess the semi-continuous BRCA1 biomarker as a valid prognostic biomarker for predicting cancer progression at 4 years and to identify a cut-off point that classifies patients with advanced ovarian cancer into two groups with good and bad prognoses.
Title: A semi-parametric approach to receiver operating characteristic analysis with semi-continuous biomarker. Journal: Computational Statistics & Data Analysis, Volume 216, Article 108305.
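For orientation, the summary measures named in the abstract above can be computed nonparametrically as follows. This is a plain empirical sketch, not the semi-parametric density ratio estimator proposed in the paper; the function names and the toy data are hypothetical. Ties are counted as one half in the AUC, which matters for semi-continuous biomarkers because both groups contain many zeros.

```python
def empirical_auc(neg, pos):
    """Empirical AUC: P(pos > neg) + 0.5 * P(pos == neg)."""
    total = 0.0
    for y in pos:
        for x in neg:
            if y > x:
                total += 1.0
            elif y == x:
                total += 0.5  # ties, e.g. both values are zero
    return total / (len(pos) * len(neg))


def youden_index(neg, pos):
    """Maximise sensitivity + specificity - 1 over observed cut-offs.

    A subject is classified as positive when its value exceeds the cut-off.
    Returns (index, cut_off); the smallest maximising cut-off is kept.
    """
    best_j, best_c = -1.0, None
    for c in sorted(set(neg) | set(pos)):
        sensitivity = sum(y > c for y in pos) / len(pos)
        specificity = sum(x <= c for x in neg) / len(neg)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_j, best_c = j, c
    return best_j, best_c


# Hypothetical zero-inflated biomarker values.
neg = [0, 0, 0, 1, 2]
pos = [0, 2, 3, 4, 5]
auc = empirical_auc(neg, pos)
j, cut_off = youden_index(neg, pos)
```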
Pub Date : 2025-11-10 DOI: 10.1016/j.csda.2025.108303
Yuan Ke, Rongmao Zhang, Wenyang Zhang, Changliang Zou
Data with multiple responses is very common in economics, engineering, finance, and social science. Analyzing each response variable separately may not be a good strategy as this approach can overlook important information and lead to suboptimal results. In some cases, it may not even provide an answer to the question of interest. Multi-response linear models serve as an important tool for joint analysis. While the methodology and theory of classic multi-response linear models are well-established, they may not be applicable to high-dimensional cases. In this paper, we propose a powerful hypothesis test for the coefficient matrix of a high-dimensional multi-response linear model. We establish asymptotic results and conduct comprehensive simulation studies to demonstrate that the proposed hypothesis test is more powerful than alternative methods. Furthermore, we apply the hypothesis test to two real datasets, illustrating its usefulness in addressing practical problems.
Title: Hypothesis test in high dimensional multi-response linear models. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108303.
Pub Date : 2025-11-07 DOI: 10.1016/j.csda.2025.108302
Clement Twumasi
A comprehensive analytical and computational framework is developed for the linear birth-death process (LBDP) with catastrophic extinction (BDC process), a continuous-time Markov model that incorporates sudden extinction events into the classical LBDP. Despite its conceptual simplicity, the underlying BDC process poses substantial challenges in deriving exact transition probabilities and performing reliable parameter estimation, particularly under discrete-time observations. While previous work established foundational properties using spectral methods and probability generating functions (PGFs), explicit analytical expressions for transition probabilities and theoretical moments have remained unavailable, limiting practical applications in extinction-prone systems. This limitation is addressed by reparameterising the PGF through functional restructuring, yielding exact closed-form expressions for the transition probability function and the theoretical moments of the discretely observed BDC process, with results validated through comprehensive numerical experiments for the first time. Three parameter estimation approaches tailored to the BDC process are introduced and evaluated: maximum likelihood estimation (MLE), generalised method of moments (GMM), and an embedded Galton-Watson (GW) approach, with trade-offs between computational efficiency and estimation accuracy examined across diverse simulation scenarios. To improve scalability, a Monte Carlo simulation framework based on a hybrid tau-leaping algorithm is formulated, specifically adapted to extinction-driven dynamics, offering a computationally efficient alternative to the exact stochastic simulation algorithm (SSA). 
The proposed methodologies offer a tractable and scalable foundation for incorporating the BDC process into applied stochastic models, particularly in ecological, epidemiological, and biological systems where populations are susceptible to sudden collapse due to catastrophic events such as host mortality or immune response.
Title: Modelling catastrophic extinction in stochastic birth-death process: Analytical insights, estimation, and efficient simulation. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108302.
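The exact stochastic simulation algorithm (SSA) that the abstract above uses as its baseline can be sketched for a birth-death process with catastrophe. This is an illustration only: the function name `simulate_bdc` is hypothetical, and the sketch assumes that birth, death, and catastrophe all occur at per-capita rates (total rate proportional to the current population size), which may differ from the paper's exact parameterisation. The hybrid tau-leaping scheme proposed in the paper is not reproduced here.

```python
import random


def simulate_bdc(n0, birth, death, catastrophe, t_end, rng=None):
    """Gillespie-style simulation of a birth-death-catastrophe path.

    Assumed rates at population size n: birth * n, death * n and
    catastrophe * n, with at least one rate strictly positive.
    Returns a list of (time, population) pairs.
    """
    rng = rng or random.Random(1)
    per_capita = birth + death + catastrophe
    t, n = 0.0, n0
    path = [(t, n)]
    while n > 0:
        # Waiting time to the next event is exponential with the total rate.
        t += rng.expovariate(per_capita * n)
        if t > t_end:
            break
        # Pick the event type in proportion to its rate.
        u = rng.random() * per_capita
        if u < birth:
            n += 1
        elif u < birth + death:
            n -= 1
        else:
            n = 0  # catastrophe: the whole population is removed at once
        path.append((t, n))
    return path


# Hypothetical parameter values.
path = simulate_bdc(n0=5, birth=1.0, death=0.8, catastrophe=0.1, t_end=10.0)
only_catastrophe = simulate_bdc(5, 0.0, 0.0, 1.0, 1e9)
```

Each event costs one iteration, so the running time grows with the population size; this is the cost that approximate schemes such as tau-leaping aim to reduce.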
Pub Date : 2025-10-16 DOI: 10.1016/j.csda.2025.108292
Zejing Zheng, Shengbing Zheng, Junlong Zhao
Discrete responses are frequently encountered in applications, particularly in classification problems. However, the high cost of collecting responses or labels often leads to a scarcity of samples, which significantly diminishes the accuracy of statistical inferences, particularly in high-dimensional settings. To address this limitation, transfer learning can be utilized for high-dimensional data with discrete responses by incorporating relevant source data into the target study of interest. Within the framework of generalized linear models, the case where responses are bounded is considered first, and an importance-weighted transfer learning method, referred to as IWTL-DR, is proposed. This method selects data at the individual level, thereby utilizing the source data more efficiently. Subsequently, this approach is extended to scenarios involving unbounded responses. Theoretical properties of the IWTL-DR method are established and compared with existing techniques. Extensive simulations and analyses of real data show the advantages of the proposed approach.
Title: Transfer learning for high dimensional data with discrete responses. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108292.
Pub Date : 2025-10-15 DOI: 10.1016/j.csda.2025.108291
Lei Qin, Xiaomei Zhang, Yingqiu Zhu, Yang Chen, Ben-Chang Shia
As time series with matrix structures become more and more common in finance, economics, and management, modeling matrix-valued time series is an emerging research topic. Spatial effects induced by different locations play an important role in the analysis of time series. Although the matrix autoregressive (MAR) model provides a promising solution for modeling matrix-valued time series, it only models the dynamic effects in the temporal dimension, without capturing the spatial effects. In this paper, we propose a bilateral matrix spatiotemporal autoregressive model (BMSAR), which fully considers the pure spatial effects, pure dynamic effects, and time-delay spatial effects while maintaining and utilizing the matrix structure. To address the endogeneity problem, BMSAR is estimated iteratively using the least squares method and the Yule-Walker equation. The simulation results show that, compared with the MAR model, the BMSAR model effectively reflects the impact of spatial structure on the sequence observations. The estimator for BMSAR proposed in this paper is consistent, and it performs well when the sample size is relatively large. The proposed model and algorithm are also verified using trade and macroeconomic indicator datasets of the seven G7 countries, and the prediction accuracy is significantly improved compared with existing models.
Title: Bilateral matrix spatiotemporal autoregressive model. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108291.
Pub Date : 2025-10-10 DOI: 10.1016/j.csda.2025.108276
Claire Donnat, Olga Klopp, Nicolas Verzelen
A novel method is introduced for denoising partially observed signals over networks using graph total variation (TV) regularization, a technique adapted from signal processing to handle binary data. This approach extends existing results derived for Gaussian data to the discrete, binary case — a method hereafter referred to as “one-bit TV denoising.” The framework considers a network represented as a set of nodes with binary observations, where edges encode pairwise relationships between nodes. A key theoretical contribution is the establishment of consistency guarantees of graph TV denoising for the recovery of underlying node-level probabilities. The method is well suited for settings with missing data, enabling robust inference from incomplete observations. Extensive numerical experiments and real-world applications further highlight its effectiveness, underscoring its potential in various practical scenarios that require denoising and prediction on networks with binary-valued data. Finally, applications to two real-world epidemic scenarios demonstrate that one-bit total variation denoising significantly enhances the accuracy of network-based nowcasting and forecasting.
Title: Denoising over networks with applications to partially observed epidemics. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108276.
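To make the graph total variation penalty in the abstract above concrete, the sketch below evaluates a penalised loss for binary observations on a partially observed graph. The exact objective used in the paper is not given in the abstract, so this form (Bernoulli negative log-likelihood on observed nodes plus a weight `lam` times the sum of absolute differences across edges) is an assumption, and the function names are hypothetical.

```python
import math


def graph_tv(values, edges):
    """Graph total variation: sum of |v_i - v_j| over the edges (i, j)."""
    return sum(abs(values[i] - values[j]) for i, j in edges)


def penalized_loss(probs, observations, edges, lam):
    """Assumed one-bit TV objective evaluated at node probabilities `probs`.

    `observations` maps observed node indices to 0/1 outcomes; unobserved
    nodes contribute only through the TV penalty. Probabilities must lie
    strictly between 0 and 1.
    """
    nll = 0.0
    for node, y in observations.items():
        p = probs[node]
        nll -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return nll + lam * graph_tv(probs, edges)


# Hypothetical path graph 0 - 1 - 2 with node 1 unobserved.
edges = [(0, 1), (1, 2)]
tv = graph_tv([0.2, 0.2, 0.8], edges)
loss = penalized_loss([0.5, 0.5, 0.5], {0: 1, 2: 0}, edges, lam=1.0)
```

The penalty is zero for a signal that is constant across connected nodes, so larger values of `lam` favour piecewise-constant probability estimates and let unobserved nodes borrow information from their neighbours.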
Pub Date : 2025-10-03 DOI: 10.1016/j.csda.2025.108288
Jia-Han Shih, Yi-Hau Chen
A regression association measure is proposed for capturing the predictability of a multivariate outcome $Y = (Y_1, \ldots, Y_d)$ from a multivariate covariate $X = (X_1, \ldots, X_p)$. Motivated by existing measures, the conventional Kendall’s tau is first generalized to measure multivariate association between two random vectors. Then the predictability of $Y$ from $X$ is measured by the generalized multivariate Kendall’s tau between $Y$ and $Y'$, where $Y$ and $Y'$ share the same conditional distribution and are conditionally independent given $X$. The proposed regression association measure can be expressed as the proportion of the variance of a function of $Y$ that can be explained by $X$, indicating that the measure has a direct interpretation in terms of predictability. Based on the proposed measure, a conditional regression association measure is further proposed, which can be utilized to perform variable selection. Since the proposed measures are based on $Y$ and $Y'$, a simple nonparametric estimation method based on nearest neighbors is available. An R package, MRAM, has been developed for implementation. Simulation studies are carried out to assess the performance of the proposed methods and real data examples are analyzed for illustration.
Title: Measuring multivariate regression association via spatial sign. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108288.
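As a reference point for the generalisation described in the abstract above, the conventional (univariate) Kendall's tau can be computed from concordant and discordant pairs. The sketch below implements only this classical tau-a statistic; it is not the multivariate measure of the paper or the API of the MRAM package, and the function name is hypothetical.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Classical Kendall's tau-a for two equal-length sequences."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # the pair is ordered the same way in x and y
        elif sign < 0:
            discordant += 1  # the pair is ordered in opposite ways
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Small example: one discordant pair out of six.
tau = kendall_tau([1, 2, 3, 4], [1, 3, 2, 4])
```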
Pub Date : 2025-09-29 DOI: 10.1016/j.csda.2025.108280
Hwiyoung Lee, Zhenyao Ye, Chixiang Chen, Peter Kochunov, L. Elliot Hong, Shuo Chen
Association analysis of multivariate omics outcomes is challenging due to the high dimensionality and inter-correlation among outcome variables. In practice, the classic multi-univariate analysis approaches are commonly employed, utilizing linear regression models for each individual outcome followed by adjustments for multiplicity through control of the false discovery rate (FDR) or family-wise error rate (FWER). While straightforward, these multi-univariate methods overlook dependencies between outcome variables. This oversight leads to less accurate statistical inferences, characterized by lower power and an increased false discovery rate, ultimately resulting in reduced replicability across studies. Recently, advanced frequentist and Bayesian methods have been developed to account for these dependencies. However, these methods often pose significant computational challenges for researchers in the field. To bridge this gap, a computationally efficient autoregressive multivariate regression model is proposed that explicitly accounts for the dependence structure among outcome variables. Through extensive simulations, it is demonstrated that the approach provides more accurate multivariate inferences than traditional methods and remains robust even under model misspecification. Additionally, the proposed method is applied to investigate whether the associations between serum lipidomics outcomes and Alzheimer’s disease differ between ε4 allele carriers and non-carriers of the apolipoprotein E (APOE) gene.
Title: Fast autoregressive model for multivariate dependent outcomes with application to lipidomics analysis for Alzheimer’s disease and APOE-ε4. Journal: Computational Statistics & Data Analysis, Volume 215, Article 108280.
Pub Date : 2025-09-27 DOI: 10.1016/j.csda.2025.108289
Gitte Kremling, Gerhard Dikta
A consistent goodness-of-fit test for distributional regression is introduced. The test statistic is based on a process that traces the difference between a nonparametric and a semi-parametric estimate of the marginal distribution function of Y. As its asymptotic null distribution is not distribution-free, a parametric bootstrap method is used to determine critical values. Empirical results suggest that, in certain scenarios, the test outperforms existing specification tests by achieving a higher power and thereby offering greater sensitivity to deviations from the assumed parametric distribution family. Notably, the proposed test does not involve any hyperparameters and can easily be applied to individual datasets using the gofreg package in R.
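The idea described in this abstract can be illustrated with a small sketch. It is not the gofreg implementation and does not use its API. It assumes, purely for illustration, a Gaussian linear model y | x ~ N(a + b x, sigma^2) as the parametric family and a Kolmogorov-Smirnov-type supremum of the difference process as the test statistic; the abstract does not specify either choice. All function names are hypothetical.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fit_gaussian_line(x, y):
    """Maximum-likelihood fit of y | x ~ N(a + b*x, sigma^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    sigma = math.sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / n)
    return a, b, sigma


def sup_statistic(x, y, a, b, sigma):
    """sqrt(n) * sup_t |ECDF_Y(t) - (1/n) * sum_i F(t | x_i, theta_hat)|."""
    n = len(y)
    mu = [a + b * xi for xi in x]
    worst = 0.0
    for k, t in enumerate(sorted(y)):
        # Semi-parametric estimate: average fitted conditional CDF at t.
        semi = sum(normal_cdf((t - m) / sigma) for m in mu) / n
        # Compare with the ECDF just after and just before its jump at t.
        worst = max(worst, abs((k + 1) / n - semi), abs(k / n - semi))
    return math.sqrt(n) * worst


def bootstrap_gof(x, y, n_boot=200, seed=0):
    """Parametric bootstrap p-value for the fitted Gaussian linear model."""
    rng = random.Random(seed)
    a, b, sigma = fit_gaussian_line(x, y)
    observed = sup_statistic(x, y, a, b, sigma)
    exceed = 0
    for _ in range(n_boot):
        # Simulate responses from the fitted model, keeping the covariates fixed.
        y_star = [rng.gauss(a + b * xi, sigma) for xi in x]
        # Refit on the bootstrap sample so estimation error is reproduced.
        a_s, b_s, s_s = fit_gaussian_line(x, y_star)
        if sup_statistic(x, y_star, a_s, b_s, s_s) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_boot + 1)


if __name__ == "__main__":
    data_rng = random.Random(1)
    x = [data_rng.uniform(-2.0, 2.0) for _ in range(60)]
    y = [1.0 + 2.0 * xi + data_rng.gauss(0.0, 1.0) for xi in x]
    stat, pval = bootstrap_gof(x, y, n_boot=100)
    print(f"statistic = {stat:.3f}, bootstrap p-value = {pval:.3f}")
```

Refitting the parameters on every bootstrap sample matters here: the null distribution of the statistic depends on the estimated parameters, which is why critical values are simulated rather than read from a fixed table.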
Title: Bootstrap-based goodness-of-fit test for parametric families of conditional distributions
Authors: Gitte Kremling, Gerhard Dikta
Journal: Computational Statistics & Data Analysis, vol. 215, Article 108289 (published 2025-09-27). DOI: 10.1016/j.csda.2025.108289