Imputing missing values in multivariate spatial–temporal data is important in many fields. Existing low-rank tensor learning methods are popular for this task but are sensitive to high levels of skewness. The aim of this paper is to develop an alternative method that is robust and achieves high imputation accuracy for multivariate spatial–temporal data. Because quantile regression is robust to noise and outliers, we propose an imputed quantile vector autoregressive (IQVAR) model. IQVAR can simultaneously impute missing values and estimate the parameters of a quantile vector autoregressive model. The objective function combines the check loss with a nuclear norm penalty. We develop an ADMM (alternating direction method of multipliers) algorithm to solve the resulting optimization problem. Simulation studies and real data analysis are conducted to verify the efficiency of IQVAR. Compared with other approaches, IQVAR is more robust and accurate.
{"title":"Imputed quantile vector autoregressive model for multivariate spatial–temporal data","authors":"Liang Jinwen, Tian Maozai","doi":"10.1002/sam.11658","DOIUrl":"https://doi.org/10.1002/sam.11658","url":null,"PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"40 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139590190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
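The two ingredients named in the IQVAR abstract, the check loss and the nuclear norm, can be illustrated with a minimal NumPy sketch. The function names are illustrative assumptions, and the singular value thresholding step is shown only because it is the standard proximal operator for a nuclear norm penalty in ADMM-type schemes; the abstract does not spell out the paper's actual update rules.

```python
import numpy as np

def check_loss(residuals, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1[u < 0]), summed over entries."""
    u = np.asarray(residuals, dtype=float)
    return float(np.sum(u * (tau - (u < 0))))

def nuclear_norm(matrix):
    """Nuclear norm: the sum of the singular values of the matrix."""
    return float(np.linalg.svd(matrix, compute_uv=False).sum())

def singular_value_threshold(matrix, threshold):
    """Proximal operator of threshold * nuclear norm:
    soft-threshold the singular values and rebuild the matrix."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u * np.maximum(s - threshold, 0.0)) @ vt

# Under-prediction (positive residual) is weighted by tau,
# over-prediction (negative residual) by 1 - tau.
loss_high_quantile = check_loss([2.0, -2.0], tau=0.9)
shrunk = singular_value_threshold(np.diag([3.0, 1.0]), threshold=2.0)
```

With `tau=0.5` the check loss reduces to half the absolute error, which is why median-type fits are less affected by skewed or outlying values than squared-error fits.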
As massive data sets become easier to access, functional analyses have gained more interest. However, such data sets often exhibit substantial heterogeneity, noise, and high dimensionality. When analyses are generalized from vectors to functions, classical methods might not work directly. This paper considers noisy information reduction in functional analyses from two perspectives: functional clustering, which groups similar observations and thus reduces the sample size, and functional variable selection, which reduces the dimensionality. Complicated data structures and relations can be modeled readily by a Bayesian hierarchical model because of its flexibility. Hence, this paper proposes a nonparametric Bayesian functional clustering and peak point selection method via weighted Dirichlet process mixture (WDPM) modeling, which clusters automatically and provides accurate estimates, together with a conditional Laplace prior, which is a conjugate variable selection prior. The proposed method, named WDPM-VS for short, can simultaneously perform the following tasks: (1) cluster automatically without specifying the number of clusters or the cluster centers beforehand; (2) cluster heterogeneously behaved functions; (3) select vibrational peak points; and (4) reduce noisy information from the two perspectives of sample size and dimensionality. The method substantially outperforms the comparison methods in terms of root mean squared error. Based on the proposed method, we are able to identify biological factors that can explain racial disparities in breast cancer.
{"title":"Nonparametric Bayesian functional clustering with applications to racial disparities in breast cancer","authors":"Wenyu Gao, Inyoung Kim, Wonil Nam, Xiang Ren, Wei Zhou, Masoud Agah","doi":"10.1002/sam.11657","DOIUrl":"https://doi.org/10.1002/sam.11657","url":null,"PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"85 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139581738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
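The claim that a Dirichlet process mixture clusters without a preset number of clusters can be seen in a small sketch of the Chinese restaurant process, the partition prior induced by an ordinary (unweighted) Dirichlet process. This is not the weighted variant (WDPM) proposed in the paper, whose weighting scheme the abstract does not describe; the function name and the `concentration` parameter are illustrative assumptions.

```python
import numpy as np

def chinese_restaurant_process(n_items, concentration, seed=0):
    """Sample a random partition of n_items (n_items >= 1).

    Item i joins an existing cluster with probability proportional to the
    cluster's current size, or opens a new cluster with probability
    proportional to `concentration`.
    """
    rng = np.random.default_rng(seed)
    assignments = [0]  # the first item always opens cluster 0
    counts = [1]
    for _ in range(1, n_items):
        weights = np.array(counts + [concentration], dtype=float)
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts

assignments, counts = chinese_restaurant_process(n_items=200, concentration=1.0)
n_clusters = len(counts)  # determined by the draw, not fixed in advance
```

Larger values of `concentration` tend to produce more clusters; in a full mixture model the assignment probabilities would also be multiplied by each cluster's likelihood for the observed function.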
Laila A. Al-Essa, Shakaiba Shafiq, Deniz Ozonur, Farrukh Jamal
In this article, a novel bounded interval model called the unit-Perks model is developed by suitably transforming the positive random variable of the Perks distribution. Numerous statistical features of the bounded interval Perks model are explored based on an expansion of the density function. Eight distinct estimation approaches are used to estimate the parameters of the unit-Perks model. A thorough simulation analysis is included to evaluate the precision of the estimators obtained from the eight approaches. Two real bounded interval data sets are used to investigate the practical applicability of the unit-Perks model. A comparison is also made to determine which estimation method works better for the given model. In this comparison, the maximum likelihood approach outperformed the other seven estimation approaches. The unit-Perks model is then used to introduce a quantile regression model, named the quantile unit-Perks distribution, which is also applied to a real data set. Quantile residuals are used for the residual analysis of the fitted regression model. On the basis of mathematical, computational, and pictorial evidence, it is concluded that the presented model exhibits greater modeling capabilities.
{"title":"Study of a bounded interval perks distribution with quantile regression analysis","authors":"Laila A. Al-Essa, Shakaiba Shafiq, Deniz Ozonur, Farrukh Jamal","doi":"10.1002/sam.11656","DOIUrl":"https://doi.org/10.1002/sam.11656","url":null,"PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"19 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139581983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ensemble methods, such as Bagging, Boosting, or Random Forests, often enhance the prediction performance of single learners on both classification and regression tasks. In the context of regression, we propose a gradient boosting-based algorithm incorporating a diversity term with the aim of constructing different learners that enrich the ensemble while achieving a trade-off of some individual optimality for global enhancement. Verifying the hypotheses of Biau and Cadre's theorem (2021, Advances in contemporary statistics and econometrics—Festschrift in honour of Christine Thomas-Agnan, Springer), we present a convergence result ensuring that the associated optimization strategy reaches the global optimum. In the experiments, we consider a variety of different base learners with increasing complexity: stumps, regression trees, Purely Random Forests, and Breiman's Random Forests. Finally, we consider simulated and benchmark datasets and a real-world electricity demand dataset to show, by means of numerical experiments, the suitability of our procedure by examining the behavior not only of the final or the aggregated predictor but also of the whole generated sequence.
{"title":"Boosting diversity in regression ensembles","authors":"Mathias Bourel, Jairo Cugliari, Yannig Goude, Jean-Michel Poggi","doi":"10.1002/sam.11654","DOIUrl":"https://doi.org/10.1002/sam.11654","url":null,"PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"33 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139063502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
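For orientation, plain gradient boosting for squared-error regression with stumps as base learners can be sketched as follows. This is a baseline illustration only: the diversity term that distinguishes the proposed algorithm is not specified in the abstract and is deliberately left out, and the function names, the toy data, and the `learning_rate` parameter are assumptions. The sketch records the training error of the whole generated sequence, in the spirit of the abstract's emphasis on examining every intermediate predictor.

```python
import numpy as np

def fit_stump(x, residuals):
    """Least-squares regression stump on a 1-D feature.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in np.unique(x)[:-1]:  # keeps the right side non-empty
        left = x <= threshold
        left_mean = residuals[left].mean()
        right_mean = residuals[~left].mean()
        sse = (((residuals[left] - left_mean) ** 2).sum()
               + ((residuals[~left] - right_mean) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda z: np.where(z <= threshold, left_mean, right_mean)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """L2 boosting: each round fits a stump to the current residuals,
    which are the negative gradient of the squared-error loss."""
    prediction = np.full(y.shape, y.mean(), dtype=float)
    train_mse = [float(np.mean((y - prediction) ** 2))]
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * stump(x)
        train_mse.append(float(np.mean((y - prediction) ** 2)))
    return prediction, train_mse

x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x)
prediction, train_mse = boost(x, y)
```

Because each stump is a least-squares fit to the residuals, the training error cannot increase from one round to the next for a learning rate in (0, 1]; a diversity penalty would trade part of that per-round decrease for learners that differ more from one another.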
Outliers are common in longitudinal data analysis, and the multivariate contaminated normal (MCN) distribution is often used in model-based clustering to detect outliers and provide robust parameter estimates in each subgroup. In this paper, we propose a method, the mixture of MCN (MCNM), based on the joint mean-covariance model and specifically designed to analyze longitudinal data characterized by mild outliers. Our model can automatically detect outliers in longitudinal data and provide robust parameter estimates in each subgroup. We use an iterative expectation-conditional maximization (ECM) algorithm with Aitken acceleration to estimate the model parameters, achieving both faster and more stable convergence. Our proposed method simultaneously clusters the population, identifies progression patterns of the mean and covariance structures for different subgroups over time, and detects outliers. To demonstrate the effectiveness of our method, we conduct simulation studies under various cases involving different proportions and degrees of contamination. Additionally, we apply our method to real data on the number of people infected with AIDS in 49 countries or regions from 2001 to 2021. The results show that our proposed method effectively clusters the data based on various mean progression trajectories. In summary, our proposed MCNM, based on the joint mean-covariance model and MCD of covariance matrices, provides a robust method for clustering longitudinal data with mild outliers. It effectively detects outliers and identifies progression patterns in different groups over time, making it valuable for various applications in longitudinal data analysis.
{"title":"Multivariate contaminated normal mixture regression modeling of longitudinal data based on joint mean-covariance model","authors":"Niu Xiaoyu, Tian Yuzhu, Tang Manlai, Tian Maozai","doi":"10.1002/sam.11653","DOIUrl":"https://doi.org/10.1002/sam.11653","url":null,"PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"6 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2023-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139031070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
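How a contaminated normal component flags outliers can be shown in a short sketch, assuming the usual two-part form of the density: a "good" normal with weight `good_weight` plus a "bad" normal with the same mean and a covariance inflated by `inflation > 1`. The parameter names, the 0.5 cutoff, and the toy values are assumptions for illustration; the paper's joint mean-covariance parameterization and its ECM updates are not reproduced here.

```python
import numpy as np

def mvn_pdf(points, mean, cov):
    """Multivariate normal density evaluated at each row of `points`."""
    points = np.atleast_2d(points)
    dim = mean.shape[0]
    diff = points - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return np.exp(-0.5 * (dim * np.log(2.0 * np.pi) + logdet + maha))

def contaminated_normal_pdf(points, mean, cov, good_weight, inflation):
    """good_weight * N(mean, cov) + (1 - good_weight) * N(mean, inflation * cov)."""
    good = good_weight * mvn_pdf(points, mean, cov)
    bad = (1.0 - good_weight) * mvn_pdf(points, mean, inflation * cov)
    return good + bad

def prob_good(points, mean, cov, good_weight, inflation):
    """Posterior probability that each point comes from the 'good' part."""
    good = good_weight * mvn_pdf(points, mean, cov)
    return good / contaminated_normal_pdf(points, mean, cov, good_weight, inflation)

mean = np.zeros(2)
cov = np.eye(2)
points = np.array([[0.0, 0.0], [6.0, 6.0]])
probs = prob_good(points, mean, cov, good_weight=0.9, inflation=10.0)
is_outlier = probs < 0.5  # flag points more likely drawn from the inflated part
```

A point at the mean is assigned to the good part with high probability, while a point far from the mean is better explained by the inflated covariance and is flagged, so it receives less weight when the mean and covariance are re-estimated.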
Competing procedures, involving data smoothing, weighting, imputation, outlier removal, etc., may be available to prepare data for parametric model estimation. Often, however, little is known about the best choice of preparatory procedure for the planned estimation and the observed data. A machine learning-based decision rule, an “oracle,” can be constructed in such cases to decide the best procedure from a set