Pub Date: 2025-01-10 | DOI: 10.1016/j.csda.2025.108126
Nadia L. Kudraszow, Alejandra V. Vahnovan, Julieta Ferrario, M. Victoria Fasano
Generalized Canonical Correlation Analysis (GCCA) is a powerful tool for analyzing and understanding linear relationships between multiple sets of variables. However, its classical estimators are highly sensitive to outliers, which can significantly affect the results of the analysis. A functional version of GCCA is proposed, based on scatter matrices, leading to robust and Fisher consistent estimators for appropriate choices of the scatter matrix. In cases where scatter matrices are ill-conditioned, a modification based on an estimation of the precision matrix is introduced. A procedure to identify influential observations is also developed. A simulation study evaluates the finite-sample performance of the proposed methods under clean and contaminated samples. The advantages of the influential data detection approach are demonstrated through an application to a real dataset.
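The scatter-matrix construction can be sketched for the two-set case: the canonical correlations are the singular values of the whitened cross-block of any scatter estimate, so robustness comes from swapping the sample covariance for a robust scatter. A minimal sketch, using the ordinary sample covariance as the (non-robust) scatter choice; the paper's specific robust scatters and the generalized multi-set extension are not reproduced here:

```python
import numpy as np

def canonical_correlations(S, p):
    """Canonical correlations between the first p variables and the rest,
    computed from any scatter matrix S: they are the singular values of
    Sxx^{-1/2} Sxy Syy^{-1/2}. Robustness comes from the choice of S."""
    Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]
    def inv_sqrt(A):  # symmetric inverse square root (A positive definite)
        w, V = np.linalg.eigh(A)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy),
                         compute_uv=False)

# Example with the sample covariance as the scatter choice.
gen = np.random.default_rng(0)
X = gen.standard_normal((500, 2))
Y = X @ np.diag([1.0, 0.5]) + 0.1 * gen.standard_normal((500, 2))
rho = canonical_correlations(np.cov(np.column_stack([X, Y]), rowvar=False), p=2)
```

Substituting a robust scatter estimate for `np.cov` in the last line is the whole point of the scatter-matrix formulation: the eigen-decomposition step is unchanged.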
Title: Robust generalized canonical correlation analysis based on scatter matrices. Computational Statistics & Data Analysis, vol. 206, Article 108126.
Pub Date: 2025-01-08 | DOI: 10.1016/j.csda.2024.108123
Bo Chen, Feifei Chen, Junxin Wang, Tao Qiu
Testing for departures from symmetry is a critical issue in statistics. Over the last two decades, substantial effort has been invested in developing tests for central symmetry in multivariate and high-dimensional contexts. Traditional tests, which rely on Euclidean distance, face significant challenges with high-dimensional data. These tests struggle to capture overall central symmetry and are often limited to verifying whether the distribution's center aligns with the coordinate origin, a problem exacerbated by the "curse of dimensionality." Furthermore, they tend to be computationally intensive, often making them impractical for large datasets. To overcome these limitations, we propose a nonparametric test based on the random projected energy distance, extending the energy distance test through random projections. This method effectively reduces data dimensions by projecting high-dimensional data onto lower-dimensional spaces, with the randomness ensuring maximum preservation of information. Theoretically, as the number of random projections approaches infinity, the risk of power loss from inadequate directions is mitigated. Leveraging U-statistic theory, our test's asymptotic null distribution is standard normal, which holds regardless of the data dimensionality relative to sample size, thus eliminating the need for re-sampling to determine critical values. For computational efficiency with large datasets, we adopt a divide-and-average strategy, partitioning the data into smaller blocks of size m. Within each block, the estimates of the random projected energy distance are normally distributed. By averaging these estimates across all blocks, we derive a test statistic that is asymptotically standard normal. This method significantly reduces computational complexity from quadratic to linear in sample size, enhancing the feasibility of our test for extensive data analysis. Through extensive numerical studies, we demonstrate the robust empirical performance of our test in terms of size and power, affirming its practical utility in statistical applications for high-dimensional data.
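The core idea is easy to make concrete: under central symmetry about the origin, a projection u'X has the same distribution as its reflection -u'X, so the energy distance between the projected sample and its reflection should be near zero. A rough sketch of this idea using plain Monte-Carlo averaging over directions, without the paper's U-statistic normalization or divide-and-average blocking:

```python
import numpy as np

def projected_energy_stat(X, n_proj=50, rng=None):
    """Average, over random unit directions u, of the energy distance
    between the projected sample u'X and its reflection -u'X.
    Under central symmetry about the origin this is close to zero."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    stats = []
    for _ in range(n_proj):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        a = X @ u                                      # 1-D projection
        diff = np.abs(a[:, None] - a[None, :]).mean()  # mean |a_i - a_j|
        summ = np.abs(a[:, None] + a[None, :]).mean()  # mean |a_i - (-a_j)|
        stats.append(2.0 * (summ - diff))              # energy distance V-statistic
    return float(np.mean(stats))

gen = np.random.default_rng(1)
sym = gen.standard_normal((300, 5))            # centrally symmetric sample
asym = np.abs(gen.standard_normal((300, 5)))   # clearly asymmetric sample
s_sym = projected_energy_stat(sym, rng=0)
s_asym = projected_energy_stat(asym, rng=0)
```

The V-statistic form of the energy distance is nonnegative by construction, so the statistic separates symmetric from asymmetric samples without re-sampling for critical values in this sketch.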
Title: An efficient and distribution-free symmetry test for high-dimensional data based on energy statistics and random projections. Computational Statistics & Data Analysis, vol. 206, Article 108123.
Pub Date: 2025-01-07 | DOI: 10.1016/j.csda.2025.108125
Alfonso Landeros, Seyoon Ko, Jack Z. Chang, Tong Tong Wu, Kenneth Lange
Modern biomedical datasets are often high-dimensional at multiple levels of biological organization. Practitioners must therefore grapple with data to estimate sparse or low-rank structures so as to adhere to the principle of parsimony. Further complicating matters is the presence of groups in data, each of which may have distinct associations with explanatory variables or be characterized by fundamentally different covariates. These themes in data analysis are explored in the context of classification. Vertex Discriminant Analysis (VDA) offers flexible linear and nonlinear models for classification that generalize the advantages of support vector machines to data with multiple classes. The proximal distance principle, which leverages projection and proximal operators in the design of practical algorithms, handily facilitates variable selection in VDA via nonconvex distance-to-set penalties directly controlling the number of active variables. Two flavors of sparse VDA are developed to address data in which instances may be homogeneous or heterogeneous with respect to predictors characterizing classes. Empirical studies illustrate how VDA is adapted to class-specific variable selection on simulated and real datasets, with an emphasis on applications to cancer classification via gene expression patterns.
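The distance-to-set penalty at the heart of the proximal distance principle is concrete: for the set of k-sparse vectors, the projection keeps the k largest-magnitude coefficients, and the penalty is the distance from the current iterate to that projection. A sketch of the penalty alone, not the full VDA solver:

```python
import numpy as np

def project_sparse(beta, k):
    """Project beta onto the set of at-most-k-sparse vectors:
    keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-k:]
    out[keep] = beta[keep]
    return out

def dist_to_sparse(beta, k):
    """Distance-to-set penalty: Euclidean distance from beta to the
    k-sparse set, driven to zero by the proximal distance algorithm."""
    return float(np.linalg.norm(beta - project_sparse(beta, k)))

beta = np.array([3.0, -0.5, 2.0, 0.1, -1.0])
```

Because the penalty directly measures distance to the constraint set, annealing its weight upward forces the iterates toward exactly k active variables, which is the "direct control of the number of active variables" the abstract describes.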
Title: Sparse vertex discriminant analysis: Variable selection for biomedical classification applications. Computational Statistics & Data Analysis, vol. 206, Article 108125.
Finite mixture regression models are commonly used to account for heterogeneity in populations and in situations where the assumptions required for standard regression models may not hold. To expand the range of applicable component distributions beyond the Gaussian, other distributions, such as the exponential power distribution and the skew-normal distribution, are explored. To enable simultaneous model estimation, order selection, and variable selection, a penalized likelihood estimation approach that imposes penalties on both the mixing proportions and the regression coefficients, which we call the double-penalized likelihood method, is proposed in this paper. Four double-penalized likelihood functions and their performance are studied. The consistency of the estimators, order selection, and variable selection is investigated. A modified expectation-maximization algorithm is proposed to implement the double-penalized likelihood method. Numerical simulations demonstrate the effectiveness of the proposed method and algorithm. Finally, results of real data analyses illustrate the application of the approach. Overall, the study contributes to the development of mixture regression models and provides a useful tool for model and variable selection.
Title: Component selection and variable selection for mixture regression models. Authors: Xuefei Qi, Xingbai Xu, Zhenghui Feng, Heng Peng. Pub Date: 2025-01-06 | DOI: 10.1016/j.csda.2024.108124. Computational Statistics & Data Analysis, vol. 206, Article 108124.
Pub Date: 2025-01-02 | DOI: 10.1016/j.csda.2024.108107
Andrew Welbaum, Wanli Qiao
Misalignment often occurs in functional data and can severely impact clustering results. A clustering algorithm for misaligned functional data is developed by adapting the original mean shift algorithm from Euclidean space. The mean shift algorithm is applied to the quotient space of orbits of the square root velocity functions induced by the misaligned functional data, which is equipped with the elastic distance. Convergence properties of this algorithm are studied. The efficacy of the algorithm is demonstrated through simulations and various real data applications.
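The square root velocity function (SRVF) transform underlying the elastic distance can be computed directly from sampled curves via finite differences; the alignment over warping functions, which defines the actual elastic distance between orbits, is omitted in this sketch:

```python
import numpy as np

def srvf(f, t):
    """Square root velocity function q(t) = f'(t) / sqrt(|f'(t)|),
    with the derivative estimated by finite differences. The sign/sqrt
    form avoids dividing by zero where f' vanishes."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

t = np.linspace(0.0, 1.0, 1001)
q_lin = srvf(t, t)            # f(t) = t has q identically 1
q_quad = srvf(t ** 2, t)      # f(t) = t^2 has q(t) = sqrt(2t)
```

The key property motivating the transform is that reparameterizing f by a warping acts on q by an isometry, so the elastic distance between orbits of q is well defined; the mean shift iteration in the paper operates in that quotient space.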
Title: Mean shift-based clustering for misaligned functional data. Computational Statistics & Data Analysis, vol. 206, Article 108107.
Pub Date: 2025-01-02 | DOI: 10.1016/j.csda.2024.108122
Chunshan Liu, Daniel R. Kowal, James Doss-Gollin, Marina Vannucci
Functional data analysis, which models data as realizations of random functions over a continuum, has emerged as a useful tool for time series data. Often, the goal is to infer the dynamic connections (or time-varying conditional dependencies) among multiple functions or time series. For this task, a dynamic and Bayesian functional graphical model is introduced. The proposed modeling approach prioritizes the careful definition of an appropriate graph to identify both time-invariant and time-varying connectivity patterns. A novel block-structured sparsity prior is paired with a finite basis expansion, which together yield effective shrinkage and graph selection with efficient computations via a Gibbs sampling algorithm. Crucially, the model includes (one or more) graph changepoints, which are learned jointly with all model parameters and incorporate graph dynamics. Simulation studies demonstrate excellent graph selection capabilities, with significant improvements over competing methods. The proposed approach is applied to the study of dynamic connectivity patterns of sea surface temperatures in the Pacific Ocean and reveals meaningful edges.
Title: Bayesian functional graphical models with change-point detection. Computational Statistics & Data Analysis, vol. 206, Article 108122.
Pub Date: 2024-12-31 | DOI: 10.1016/j.csda.2024.108112
Jian Xiao, Shaoting Li, Jun Chen, Wensheng Zhu
Omics-wide association analysis is an important tool in medical and human health research. Unobserved confounders can adversely affect association analysis, so adjusting for latent confounders is crucial. However, existing latent confounder-adjusted analysis methods lack effective false discovery rate (FDR) control and rely on specific model assumptions. Motivated by this, the paper first proposes a novel latent confounding single index model for omics data. It is model-free in that the response and covariates may be connected by any unknown monotonic link function, and the model's random errors can follow any unknown distribution. Utilizing the proposed model, the paper further employs a data splitting approach to develop a model-free and latent confounder-adjusted feature selection method with FDR control. The theoretical results demonstrate asymptotic FDR control properties of the new method, and the numerical results show it can control FDR in no-confounding, sparse-confounding, and dense-confounding scenarios. The analysis of actual gene expression data demonstrates that it can detect co-expression genes interacting with the target genes in the presence of latent confounding. Such findings can help to elucidate the connections between pediatric small round blue cell cancers and gene networks.
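The data-splitting device for FDR control can be illustrated generically: fit on two independent halves, form mirror statistics whose signs are symmetric for null features, and choose the smallest threshold whose estimated false discovery proportion is below the target. This is a generic sketch in the spirit of the paper, not its exact confounder-adjusted procedure; the estimates b1 and b2 are assumed to come from the two halves of a hypothetical fit:

```python
import numpy as np

def mirror_fdr_select(b1, b2, q=0.1):
    """Data-splitting selection: b1 and b2 are coefficient estimates from
    two independent halves of the data. Null features get symmetric-signed
    mirror statistics, so the negative tail estimates the false discovery
    proportion (FDP); pick the smallest threshold with estimated FDP <= q."""
    m = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))
    for t in np.sort(np.abs(m)):
        fdp = (m <= -t).sum() / max((m >= t).sum(), 1)
        if fdp <= q:
            return np.flatnonzero(m >= t)
    return np.array([], dtype=int)

# Hypothetical estimates: features 0 and 1 are strong in both halves.
b1 = np.array([5.0, 4.0, 0.1, -0.2, 0.05])
b2 = np.array([4.8, 4.2, -0.1, 0.1, 0.02])
selected = mirror_fdr_select(b1, b2, q=0.1)
```

The symmetry of null mirror statistics is what makes the negative-tail count a conservative estimate of false discoveries, and it holds without distributional assumptions on the errors, matching the model-free flavor of the abstract.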
Title: Model-free latent confounder-adjusted feature selection with FDR control. Computational Statistics & Data Analysis, vol. 205, Article 108112.
Pub Date: 2024-12-30 | DOI: 10.1016/j.csda.2024.108113
Fang Lu, Hao Pan, Jing Yang
To address various forms of spatial dependence and the heterogeneous effects of some regressors, this paper concentrates on generalized method of moments (GMM) estimation and variable selection for higher-order spatial autoregressive (SAR) models with semi-varying coefficients and a diverging number of parameters. With the varying coefficient functions approximated by basis functions, a GMM estimation procedure is first proposed; then a novel and convenient smooth-threshold GMM procedure is constructed for variable selection based on smooth-threshold estimating equations. Under some regularity conditions, the asymptotic properties of the proposed estimation and variable selection methods are established. In particular, the asymptotic normality of the parametric estimator is derived in a novel way based on fundamental operations on block matrices. Compared to existing estimation methods for semiparametric SAR models, the proposed series-based GMM procedure simultaneously enjoys lower computing cost, higher estimation accuracy, and wider applicability, especially in the case of heteroscedasticity. Extensive numerical simulations confirm the theory and demonstrate the finite-sample advantages of the proposed method. Two real data analyses further illustrate the application.
Title: GMM estimation and variable selection of semiparametric model with increasing dimension and high-order spatial dependence. Computational Statistics & Data Analysis, vol. 205, Article 108113.
Pub Date: 2024-12-19 | DOI: 10.1016/j.csda.2024.108111
Yanhui Li, Luqing Zhao, Jinjuan Wang
Identifying associations between microbial taxa and sample features has long been a worthwhile issue in microbiome analysis, and various regression-based methods have been proposed. These methods can roughly be divided into two types: one accounts for the sparsity characteristic of microbiome data, and the other incorporates a phylogenetic tree to exploit evolutionary information. However, none of these methods fully exploit both sparsity and the phylogenetic tree in the regression analysis with theoretical guarantees. To fill this gap, a phylogenetic tree-assisted regression model with a Lasso-type penalty is proposed to detect feature-related microbial compositions. Specifically, based on the rational assumption that the smaller the phylogenetic distance between two microbial species, the closer their coefficients in the regression model, the phylogenetic tree is accommodated in the regression model by constructing a Laplacian-type penalty in the loss function. Both the linear regression model for continuous outcomes and the generalized linear regression model for categorical outcomes are analyzed in this framework. Additionally, debiasing algorithms are proposed to give more precise coefficient estimators. Extensive numerical simulations and real data analyses demonstrate the higher efficiency of the proposed method.
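A Laplacian-type penalty of this kind can be written down concretely: turn phylogenetic distances into affinities, form the graph Laplacian, and penalize beta' L beta, which equals a weighted sum of squared coefficient differences. The exponential affinity below is an illustrative choice, not necessarily the paper's weighting:

```python
import numpy as np

def tree_laplacian_penalty(beta, D, scale=1.0):
    """Laplacian-type penalty built from a matrix D of pairwise phylogenetic
    distances: W = exp(-D/scale) puts large weights on close taxa, and
    beta' L beta = 0.5 * sum_ij W_ij (beta_i - beta_j)^2 shrinks their
    coefficients toward each other."""
    W = np.exp(-np.asarray(D) / scale)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W
    return float(beta @ L @ beta)

# Toy tree distances: taxa 0 and 1 are close, taxon 2 is distant.
D = np.array([[0.0, 0.5, 4.0],
              [0.5, 0.0, 4.0],
              [4.0, 4.0, 0.0]])
```

The penalty vanishes on constant coefficient vectors and grows fastest when phylogenetically close taxa get very different coefficients, which is exactly the smoothness assumption stated in the abstract.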
Title: A debiasing phylogenetic tree-assisted regression model for microbiome data. Computational Statistics & Data Analysis, vol. 205, Article 108111.
Pub Date: 2024-12-17 | DOI: 10.1016/j.csda.2024.108109
Kai Xu, Qing Cheng, Daojiang He
For the mutual independence testing problem, the use of summed nonparametric dependence measures, including Hoeffding's D, Blum-Kiefer-Rosenblatt's R, and Bergsma-Dassios-Yanagimoto's τ*, is considered. The asymptotic normality of this class of test statistics under the null hypothesis is established when (i) both the dimension and the sample size go to infinity simultaneously, and (ii) the dimension tends to infinity but the sample size is fixed. The new result for asymptotic regime (ii) is applicable to HDLSS (High Dimension, Low Sample Size) data. Further, the asymptotic Pitman efficiencies of the family of considered tests are investigated with respect to two important sum-of-squares tests under asymptotic regime (i): the distance covariance based test and the product-moment covariance based test. Formulae for asymptotic relative efficiencies are found. An interesting finding reveals that even if the population follows a normally distributed structure, the two state-of-the-art tests suffer from power loss if some components of the underlying data have different scales. Simulations are conducted to confirm the asymptotic results. A real data analysis is performed to illustrate the considered methods.
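The shared aggregation scheme, summing a pairwise dependence measure over all coordinate pairs, can be sketched with distance covariance standing in for D, R, or τ* (distance covariance is one of the comparator tests in the abstract; the nonparametric measures themselves are not implemented here):

```python
import numpy as np

def dcov_sq(x, y):
    """Sample (V-statistic) squared distance covariance of two 1-D samples."""
    def centered(v):
        M = np.abs(v[:, None] - v[None, :])
        return M - M.mean(axis=0) - M.mean(axis=1)[:, None] + M.mean()
    return float((centered(x) * centered(y)).mean())

def summed_dependence(X):
    """Sum a pairwise dependence measure over all coordinate pairs; the
    same aggregation is used with D, R, or tau* in place of dcov_sq."""
    p = X.shape[1]
    return sum(dcov_sq(X[:, j], X[:, k])
               for j in range(p) for k in range(j + 1, p))

gen = np.random.default_rng(2)
z = gen.standard_normal(200)
indep = gen.standard_normal((200, 3))                      # independent columns
dep = np.column_stack([z, z + 0.1 * gen.standard_normal(200),
                       gen.standard_normal(200)])          # first two dependent
```

Under mutual independence every pairwise term is small, so the sum concentrates; the paper's contribution is the normal limit of such sums when the dimension grows with the sample size fixed or growing.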
Title: On summed nonparametric dependence measures in high dimensions, fixed or large samples. Computational Statistics & Data Analysis, vol. 205, Article 108109.