Double truncation method for controlling local false discovery rate in case of spiky null
Pub Date: 2024-06-05 | DOI: 10.1007/s00180-024-01510-4
Shinjune Kim, Youngjae Oh, Johan Lim, DoHwan Park, Erin M. Green, Mark L. Ramos, Jaesik Jeong
Many multiple test procedures that control the false discovery rate have been developed to identify cases (e.g., genes) showing a statistically significant difference between two groups. However, a common issue in some practical data sets is the presence of highly spiky null distributions. Existing methods struggle to control the type I error in such cases because of inflated false positives, yet this problem has not been addressed in the previous literature. Our team recently encountered this issue while analyzing SET4 gene deletion data and proposed modeling the null distribution using a scale mixture normal distribution. However, the use of this approach is limited by strong assumptions on the spiky peak. In this paper, we present a novel multiple test procedure that can be applied to any type of spiky peak data, including situations with no spiky peak or with one or two spiky peaks. Our approach truncates the central statistics around 0, which primarily contribute to the null spike, as well as the two tails that may be contaminated by alternative distributions. We refer to this method as the "double truncation method." After applying double truncation, we estimate the null density using the doubly truncated maximum likelihood estimator. Using simulated data, we demonstrate numerically that the proposed method effectively controls the false discovery rate at the desired level. Furthermore, we apply our method to two real data sets, namely the SET protein data and the peony data.
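As a rough illustration of the idea (not the authors' implementation), the sketch below fits a normal null to z-statistics after discarding both a central band around 0 and the extreme tails, by maximizing the doubly truncated likelihood. The truncation points a and b are arbitrary illustrative choices, not values from the paper.

```python
# A minimal sketch of doubly truncated null estimation: drop the central band
# around 0 (the spiky null) and the extreme tails (possible signals), then fit
# N(mu, sigma^2) by maximum likelihood on the retained region [-b,-a] U [a,b].
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def fit_truncated_null(z, a=0.5, b=2.5):
    """MLE of (mu, sigma) for a normal observed only on [-b, -a] U [a, b]."""
    kept = z[(np.abs(z) >= a) & (np.abs(z) <= b)]

    def negloglik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        # probability mass of the retained region under N(mu, sigma^2)
        p_keep = (norm.cdf(b, mu, sigma) - norm.cdf(a, mu, sigma) +
                  norm.cdf(-a, mu, sigma) - norm.cdf(-b, mu, sigma))
        return -(norm.logpdf(kept, mu, sigma).sum() - kept.size * np.log(p_keep))

    res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

# toy example: a spike near 0 mixed with regular nulls and a few signals
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 0.05, 300),    # spike at 0
                    rng.normal(0, 1.0, 1500),    # regular nulls
                    rng.normal(3.5, 1.0, 100)])  # non-nulls
print(fit_truncated_null(z))
```

With a fitted null in hand, local false discovery rates could then be computed in the usual empirical-Bayes way; that step is omitted from the sketch.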
{"title":"Double truncation method for controlling local false discovery rate in case of spiky null","authors":"Shinjune Kim, Youngjae Oh, Johan Lim, DoHwan Park, Erin M. Green, Mark L. Ramos, Jaesik Jeong","doi":"10.1007/s00180-024-01510-4","DOIUrl":"https://doi.org/10.1007/s00180-024-01510-4","url":null,"abstract":"<p>Many multiple test procedures, which control the false discovery rate, have been developed to identify some cases (e.g. genes) showing statistically significant difference between two different groups. However, a common issue encountered in some practical data sets is the presence of highly spiky null distributions. Existing methods struggle to control type I error in such cases due to the “inflated false positives,\" but this problem has not been addressed in previous literature. Our team recently encountered this issue while analyzing SET4 gene deletion data and proposed modeling the null distribution using a scale mixture normal distribution. However, the use of this approach is limited due to strong assumptions on the spiky peak. In this paper, we present a novel multiple test procedure that can be applied to any type of spiky peak data, including situations with no spiky peak or with one or two spiky peaks. Our approach involves truncating the central statistics around 0, which primarily contribute to the null spike, as well as the two tails that may be contaminated by alternative distributions. We refer to this method as the “double truncation method.\" After applying double truncation, we estimate the null density using the doubly truncated maximum likelihood estimator. We demonstrate numerically that our proposed method effectively controls the false discovery rate at the desired level using simulated data. Furthermore, we apply our method to two real data sets, namely the SET protein data and peony data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"25 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Asymptotic properties of kernel density and hazard rate function estimators with censored widely orthant dependent data
Pub Date: 2024-06-03 | DOI: 10.1007/s00180-024-01509-x
Yi Wu, Wei Wang, Wei Yu, Xuejun Wang
Kernel estimators of the density and hazard rate functions are very important in nonparametric statistics. This paper investigates uniformly strong representations and rates of uniform strong consistency for kernel-smoothed density and hazard rate estimation with censored widely orthant dependent data, based on the Kaplan–Meier estimator. Under some mild conditions, the rates of the remainder term and of strong consistency are shown to be $O\big(\sqrt{\log(ng(n))/(nb_n^2)}\big)$ a.s. and $O\big(\sqrt{\log(ng(n))/(nb_n^2)}\big) + O\big(b_n^2\big)$ a.s., respectively, where $g(n)$ are the dominating coefficients of the widely orthant dependent random variables. Some numerical simulations and a real data analysis are also presented to confirm the theoretical results based on finite-sample performance.
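For readers unfamiliar with the construction, the following sketch (under simplifying assumptions, and not tied to the paper's dependence setting) builds a censored-data kernel density estimate by smoothing the Kaplan–Meier jumps with a Gaussian kernel; the bandwidth is an ad hoc choice.

```python
# Kernel density estimation for right-censored data: place the Kaplan-Meier
# probability jumps at the observed event times and smooth them with a kernel.
import numpy as np

def km_jumps(times, events):
    """Kaplan-Meier jumps (probability mass) at the ordered observed times."""
    order = np.argsort(times)
    t, d = times[order], events[order]
    n = len(t)
    at_risk = n - np.arange(n)                  # n, n-1, ..., 1
    surv = np.cumprod(1.0 - d / at_risk)        # S(t_i)
    prev = np.concatenate(([1.0], surv[:-1]))   # S(t_{i-1})
    return t, prev - surv                       # jump is 0 at censored times

def kde_censored(x_grid, times, events, bandwidth=0.5):
    t, w = km_jumps(np.asarray(times, float), np.asarray(events, int))
    u = (x_grid[:, None] - t[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return (kernel * w[None, :]).sum(axis=1) / bandwidth

# toy example: exponential lifetimes with independent censoring
rng = np.random.default_rng(1)
life = rng.exponential(1.0, 200)
cens = rng.exponential(1.5, 200)
times, events = np.minimum(life, cens), (life <= cens).astype(int)
grid = np.linspace(0.01, 4, 50)
print(kde_censored(grid, times, events)[:5])
```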
{"title":"Asymptotic properties of kernel density and hazard rate function estimators with censored widely orthant dependent data","authors":"Yi Wu, Wei Wang, Wei Yu, Xuejun Wang","doi":"10.1007/s00180-024-01509-x","DOIUrl":"https://doi.org/10.1007/s00180-024-01509-x","url":null,"abstract":"<p>Kernel estimators of density function and hazard rate function are very important in nonparametric statistics. The paper aims to investigate the uniformly strong representations and the rates of uniformly strong consistency for kernel smoothing density and hazard rate function estimation with censored widely orthant dependent data based on the Kaplan–Meier estimator. Under some mild conditions, the rates of the remainder term and strong consistency are shown to be <span>(Obig (sqrt{log (ng(n))/big (nb_{n}^{2}big )}big )~a.s.)</span> and <span>(Obig (sqrt{log (ng(n))/big (nb_{n}^{2}big )}big )+Obig (b_{n}^{2}big )~a.s.)</span>, respectively, where <i>g</i>(<i>n</i>) are the dominating coefficients of widely orthant dependent random variables. Some numerical simulations and a real data analysis are also presented to confirm the theoretical results based on finite sample performances.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"128 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141256196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Expectile regression averaging method for probabilistic forecasting of electricity prices
Pub Date: 2024-05-29 | DOI: 10.1007/s00180-024-01508-y
Joanna Janczura
In this paper we propose a new method for probabilistic forecasting of electricity prices. It is based on averaging point forecasts from different models combined with expectile regression. We show that deriving the predicted distribution in terms of expectiles can, in some cases, be advantageous compared with the commonly used quantiles. We apply the proposed method to day-ahead electricity prices from the German market and compare its accuracy with the Quantile Regression Averaging method and with quantile- as well as expectile-based historical simulation. The obtained results indicate that using expectile regression improves the accuracy of probabilistic forecasts of electricity prices, but a variance-stabilizing transformation should be applied prior to modelling.
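The expectile building block can be made concrete with a short asymmetric least-squares fit. The sketch below combines two hypothetical point forecasts into a 90% expectile and illustrates only the regression step, not the paper's full averaging and evaluation pipeline.

```python
# Linear expectile regression via asymmetric least squares (iteratively
# reweighted least squares): minimize sum_i |tau - 1(y_i < X_i b)| * (y_i - X_i b)^2.
import numpy as np

def expectile_regression(X, y, tau=0.9, n_iter=50):
    X = np.column_stack([np.ones(len(y)), X])    # add intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS start (tau = 0.5)
    for _ in range(n_iter):
        resid = y - X @ beta
        w = np.where(resid >= 0, tau, 1 - tau)   # asymmetric weights
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ y)
    return beta

# toy example: two hypothetical model forecasts combined into a 90% expectile
rng = np.random.default_rng(2)
f1, f2 = rng.normal(50, 5, 500), rng.normal(52, 5, 500)   # point forecasts
price = 0.6 * f1 + 0.4 * f2 + rng.normal(0, 3, 500)       # observed prices
print(expectile_regression(np.column_stack([f1, f2]), price, tau=0.9))
```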
{"title":"Expectile regression averaging method for probabilistic forecasting of electricity prices","authors":"Joanna Janczura","doi":"10.1007/s00180-024-01508-y","DOIUrl":"https://doi.org/10.1007/s00180-024-01508-y","url":null,"abstract":"<p>In this paper we propose a new method for probabilistic forecasting of electricity prices. It is based on averaging point forecasts from different models combined with expectile regression. We show that deriving the predicted distribution in terms of expectiles, might be in some cases advantageous to the commonly used quantiles. We apply the proposed method to the day-ahead electricity prices from the German market and compare its accuracy with the Quantile Regression Averaging method and quantile- as well as expectile-based historical simulation. The obtained results indicate that using the expectile regression improves the accuracy of the probabilistic forecasts of electricity prices, but a variance stabilizing transformation should be applied prior to modelling.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"28 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141165757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Projection predictive variable selection for discrete response families with finite support
Pub Date: 2024-05-29 | DOI: 10.1007/s00180-024-01506-0
Frank Weber, Änne Glass, Aki Vehtari
The projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach achieving an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback–Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The cost of the slightly better performance of the augmented-data projection is a substantial increase in runtime. Thus, if the augmented-data projection’s runtime is too high, we recommend the latent projection in the early phase of the model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection.
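To convey the flavour of an augmented-data projection, the hedged sketch below replicates each observation once per response category, weights the copies by a hypothetical reference model's predictive probabilities, and fits a multinomial-logit submodel by weighted maximum likelihood. This illustrates the general principle only; it is not the authors' implementation.

```python
# Sketch of the augmented-data projection idea for a finite-support response:
# fit the candidate submodel by weighted ML on data augmented with every
# response category, the weights being the reference model's predictive probs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_data_projection(X_sub, ref_probs, categories):
    n, k = ref_probs.shape
    X_aug = np.repeat(X_sub, k, axis=0)   # each row copied once per category
    y_aug = np.tile(categories, n)        # all categories for each observation
    w_aug = ref_probs.ravel()             # reference predictive weights
    sub = LogisticRegression(max_iter=1000)
    sub.fit(X_aug, y_aug, sample_weight=w_aug)
    return sub

# toy example: 3 categories, hypothetical reference probabilities, 2-predictor submodel
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
logits = np.column_stack([X[:, 0], X[:, 1], -X[:, 0] - X[:, 1]])
ref_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
sub_model = augmented_data_projection(X[:, :2], ref_probs, categories=np.array([0, 1, 2]))
print(sub_model.coef_)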
{"title":"Projection predictive variable selection for discrete response families with finite support","authors":"Frank Weber, Änne Glass, Aki Vehtari","doi":"10.1007/s00180-024-01506-0","DOIUrl":"https://doi.org/10.1007/s00180-024-01506-0","url":null,"abstract":"<p>The projection predictive variable selection is a decision-theoretically justified Bayesian variable selection approach achieving an outstanding trade-off between predictive performance and sparsity. Its projection problem is not easy to solve in general because it is based on the Kullback–Leibler divergence from a restricted posterior predictive distribution of the so-called reference model to the parameter-conditional predictive distribution of a candidate model. Previous work showed how this projection problem can be solved for response families employed in generalized linear models and how an approximate latent-space approach can be used for many other response families. Here, we present an exact projection method for all response families with discrete and finite support, called the augmented-data projection. A simulation study for an ordinal response family shows that the proposed method performs better than or similarly to the previously proposed approximate latent-space projection. The cost of the slightly better performance of the augmented-data projection is a substantial increase in runtime. Thus, if the augmented-data projection’s runtime is too high, we recommend the latent projection in the early phase of the model-building workflow and the augmented-data projection for final results. The ordinal response family from our simulation study is supported by both projection methods, but we also include a real-world cancer subtyping example with a nominal response family, a case that is not supported by the latent projection.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"42 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141165753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient regression analyses with zero-augmented models based on ranking
Pub Date: 2024-05-14 | DOI: 10.1007/s00180-024-01503-3
Deborah Kanda, Jingjing Yin, Xinyan Zhang, Hani Samawi
Several zero-augmented models exist for estimation involving outcomes with large numbers of zeros. Two such models for handling count endpoints are the zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme to estimation with zero-inflated and hurdle regression models. We provide theoretical derivations showing the superiority of ERSS over simple random sampling (SRS) under these zero-augmented models. A simulation study is also conducted to compare the efficiency of ERSS with SRS, and lastly, we illustrate applications with real data sets.
{"title":"Efficient regression analyses with zero-augmented models based on ranking","authors":"Deborah Kanda, Jingjing Yin, Xinyan Zhang, Hani Samawi","doi":"10.1007/s00180-024-01503-3","DOIUrl":"https://doi.org/10.1007/s00180-024-01503-3","url":null,"abstract":"<p>Several zero-augmented models exist for estimation involving outcomes with large numbers of zero. Two of such models for handling count endpoints are zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme in estimation using zero-inflated and hurdle regression models. We provide theoretical derivations showing superiority of ERSS compared to simple random sampling (SRS) using these zero-augmented models. A simulation study is also conducted to compare the efficiency of ERSS to SRS and lastly, we illustrate applications with real data sets.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"5 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140935059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exact and approximate computation of the scatter halfspace depth
Pub Date: 2024-05-09 | DOI: 10.1007/s00180-024-01500-6
Xiaohui Liu, Yuzi Liu, Petra Laketa, Stanislav Nagy, Yuting Chen
The scatter halfspace depth (sHD) is an extension of the location halfspace (also called Tukey) depth that is applicable in the nonparametric analysis of scatter. Using sHD, it is possible to define minimax optimal robust scatter estimators for multivariate data. The problem of exact computation of sHD for data of dimension $d \ge 2$ has, however, not been addressed in the literature. We develop an exact algorithm for the computation of sHD in any dimension $d$ and implement it efficiently for any dimension $d \ge 1$. Since the exact computation of sHD is slow, especially for higher dimensions, we also propose two fast approximate algorithms. All our programs are freely available in the R package scatterdepth.
{"title":"Exact and approximate computation of the scatter halfspace depth","authors":"Xiaohui Liu, Yuzi Liu, Petra Laketa, Stanislav Nagy, Yuting Chen","doi":"10.1007/s00180-024-01500-6","DOIUrl":"https://doi.org/10.1007/s00180-024-01500-6","url":null,"abstract":"<p>The scatter halfspace depth (<b>sHD</b>) is an extension of the location halfspace (also called Tukey) depth that is applicable in the nonparametric analysis of scatter. Using <b>sHD</b>, it is possible to define minimax optimal robust scatter estimators for multivariate data. The problem of exact computation of <b>sHD</b> for data of dimension <span>(d ge 2)</span> has, however, not been addressed in the literature. We develop an exact algorithm for the computation of <b>sHD</b> in any dimension <i>d</i> and implement it efficiently for any dimension <span>(d ge 1)</span>. Since the exact computation of <b>sHD</b> is slow especially for higher dimensions, we also propose two fast approximate algorithms. All our programs are freely available in the <span>R</span> package <span>scatterdepth</span>.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"43 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140942041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures
Pub Date: 2024-05-08 | DOI: 10.1007/s00180-024-01501-5
M. Corneli, E. Erosheva, X. Qian, M. Lorenzi
We consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives, such as the Bayesian information criterion or the integrated classification likelihood criterion, in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real-world dataset from the Alzheimer's Disease Neuroimaging Initiative database.
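As a point of reference for the asymptotic alternatives mentioned above (and only as a baseline, not the exact finite-sample criterion proposed in the paper), a minimal sketch treats each trajectory observed on a common time grid as a feature vector, fits Gaussian mixtures for several numbers of clusters, and picks the one minimizing BIC.

```python
# Baseline sketch: cluster trajectories with a Gaussian mixture and select the
# number of clusters by BIC (the asymptotic criterion the paper compares against).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
t = np.linspace(0, 1, 10)                                    # common time grid
group_a = 1.0 + 2.0 * t + rng.normal(0, 0.3, (60, t.size))   # two simulated clusters
group_b = 3.0 - 1.0 * t + rng.normal(0, 0.3, (40, t.size))
trajectories = np.vstack([group_a, group_b])

bics = {k: GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
            .fit(trajectories).bic(trajectories) for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(bics, best_k)
```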
{"title":"A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures","authors":"M. Corneli, E. Erosheva, X. Qian, M. Lorenzi","doi":"10.1007/s00180-024-01501-5","DOIUrl":"https://doi.org/10.1007/s00180-024-01501-5","url":null,"abstract":"<p>We consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives such as Bayesian information criterion or integrated classification likelihood criterion in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real world dataset from the Alzheimer’s disease neuroimaging initative database.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"38 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140935153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixture models for simultaneous classification and reduction of three-way data
Pub Date: 2024-05-06 | DOI: 10.1007/s00180-024-01478-1
Roberto Rocci, Maurizio Vichi, Monia Ranalli
Finite mixtures of Gaussians are often used to classify two-way (units and variables) or three-way (units, variables and occasions) data. However, two issues arise: model complexity and capturing the true cluster structure. Indeed, a large number of variables and/or occasions implies a large number of model parameters, while the existence of noise variables (and/or occasions) could mask the true cluster structure. The approach adopted in the present paper is to reduce the number of model parameters by identifying a sub-space containing the information needed to classify the observations. This should also help in identifying noise variables and/or occasions. Maximum likelihood estimation of the model is carried out through an EM-like algorithm. The effectiveness of the proposal is assessed through a simulation study and an application to real data.
{"title":"Mixture models for simultaneous classification and reduction of three-way data","authors":"Roberto Rocci, Maurizio Vichi, Monia Ranalli","doi":"10.1007/s00180-024-01478-1","DOIUrl":"https://doi.org/10.1007/s00180-024-01478-1","url":null,"abstract":"<p>Finite mixture of Gaussians are often used to classify two- (units and variables) or three- (units, variables and occasions) way data. However, two issues arise: model complexity and capturing the true cluster structure. Indeed, a large number of variables and/or occasions implies a large number of model parameters; while the existence of noise variables (and/or occasions) could mask the true cluster structure. The approach adopted in the present paper is to reduce the number of model parameters by identifying a sub-space containing the information needed to classify the observations. This should also help in identifying noise variables and/or occasions. The maximum likelihood model estimation is carried out through an EM-like algorithm. The effectiveness of the proposal is assessed through a simulation study and an application to real data.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"62 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Bayesian cumulative probit linear mixed models for longitudinal ordinal data
Pub Date: 2024-05-04 | DOI: 10.1007/s00180-024-01499-w
Kuo-Jung Lee, Ray-Bing Chen, Keunbaik Lee
Longitudinal studies have been conducted in various fields, including medicine, economics and the social sciences. In this paper, we focus on longitudinal ordinal data. Since longitudinal data are collected over time, repeated outcomes within each subject may be serially correlated. To address both the within-subject serial correlation and the subject-specific variance, we propose a Bayesian cumulative probit random effects model for the analysis of longitudinal ordinal data. The hypersphere decomposition approach is employed to overcome the positive-definiteness constraint and the high dimensionality of the correlation matrix. Additionally, we present a hybrid Gibbs/Metropolis-Hastings algorithm to efficiently generate cutoff points from truncated normal distributions, thereby expediting the convergence of the Markov chain Monte Carlo (MCMC) algorithm. The performance and robustness of the proposed methodology under misspecified correlation matrices are demonstrated through simulation studies under complete data, missing completely at random (MCAR), and missing at random (MAR) scenarios. We apply the proposed approach to analyze two sets of actual ordinal data: an arthritis dataset and a lung cancer dataset. To facilitate the implementation of our method, we have developed BayesRGMM, an open-source R package available on CRAN, accompanied by comprehensive documentation and source code accessible at https://github.com/kuojunglee/BayesRGMM/.
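The truncated-normal sampling that Gibbs-type samplers for cumulative probit models rely on can be illustrated with the latent-variable step: each latent value is drawn from a normal truncated to the interval of its observed category. The sketch below shows only this building block, not the paper's hybrid Gibbs/Metropolis-Hastings sampler or the BayesRGMM interface; the cutoffs are illustrative.

```python
# One conditional step of a cumulative probit Gibbs sampler: draw latent
# z_i ~ N(eta_i, 1) truncated to (cutoffs[y_i], cutoffs[y_i + 1]).
import numpy as np
from scipy.stats import truncnorm

def sample_latents(eta, y, cutoffs, rng):
    lower = cutoffs[y] - eta        # truncnorm bounds are standardized: (bound - loc)/scale
    upper = cutoffs[y + 1] - eta
    return truncnorm.rvs(lower, upper, loc=eta, scale=1.0, random_state=rng)

# toy example with 3 ordinal categories and illustrative cutoffs (-inf, 0, 1.5, +inf)
rng = np.random.default_rng(5)
eta = rng.normal(0, 1, 8)                                    # linear predictors x_i' beta
cutoffs = np.array([-np.inf, 0.0, 1.5, np.inf])
y = np.digitize(eta + rng.normal(0, 1, 8), cutoffs[1:-1])    # observed categories 0, 1, 2
print(sample_latents(eta, y, cutoffs, rng))
```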
{"title":"Robust Bayesian cumulative probit linear mixed models for longitudinal ordinal data","authors":"Kuo-Jung Lee, Ray-Bing Chen, Keunbaik Lee","doi":"10.1007/s00180-024-01499-w","DOIUrl":"https://doi.org/10.1007/s00180-024-01499-w","url":null,"abstract":"<p>Longitudinal studies have been conducted in various fields, including medicine, economics and the social sciences. In this paper, we focus on longitudinal ordinal data. Since the longitudinal data are collected over time, repeated outcomes within each subject may be serially correlated. To address both the within-subjects serial correlation and the specific variance between subjects, we propose a Bayesian cumulative probit random effects model for the analysis of longitudinal ordinal data. The hypersphere decomposition approach is employed to overcome the positive definiteness constraint and high-dimensionality of the correlation matrix. Additionally, we present a hybrid Gibbs/Metropolis-Hastings algorithm to efficiently generate cutoff points from truncated normal distributions, thereby expediting the convergence of the Markov Chain Monte Carlo (MCMC) algorithm. The performance and robustness of our proposed methodology under misspecified correlation matrices are demonstrated through simulation studies under complete data, missing completely at random (MCAR), and missing at random (MAR). We apply the proposed approach to analyze two sets of actual ordinal data: the arthritis dataset and the lung cancer dataset. To facilitate the implementation of our method, we have developed <span>BayesRGMM</span>, an open-source R package available on CRAN, accompanied by comprehensive documentation and source code accessible at https://github.com/kuojunglee/BayesRGMM/.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"62 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140885010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R-estimation in linear models: algorithms, complexity, challenges
Pub Date: 2024-05-03 | DOI: 10.1007/s00180-024-01495-0
Jaromír Antoch, Michal Černý, Ryozo Miura
The main objective of this paper is to discuss selected computational aspects of robust estimation in the linear model, with emphasis on R-estimators. We focus on numerical algorithms and computational efficiency rather than on statistical properties. In addition, we formulate some algorithmic properties that a "good" method for R-estimators is expected to satisfy and show how to satisfy them using the currently available algorithms. We illustrate both good and bad properties of the existing algorithms and propose two-stage methods to minimize the effect of the bad properties. Finally, we motivate a challenge for new approaches based on interior-point methods in optimization.
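To make the objective concrete, the sketch below minimizes Jaeckel's rank dispersion of the residuals with Wilcoxon scores using a general-purpose optimizer; this brute-force sketch ignores the dedicated algorithms the paper discusses, and the data are simulated.

```python
# An R-estimator of the slopes in a linear model: minimize Jaeckel's dispersion
# D(beta) = sum_i a(R_i) * e_i with Wilcoxon scores a(i) = sqrt(12) * (i/(n+1) - 1/2),
# where R_i is the rank of residual e_i = y_i - x_i' beta.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import rankdata

def jaeckel_dispersion(beta, X, y):
    e = y - X @ beta
    n = len(e)
    scores = np.sqrt(12.0) * (rankdata(e) / (n + 1) - 0.5)
    return np.sum(scores * e)

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.standard_t(df=2, size=100)  # heavy-tailed noise
Xc = X - X.mean(axis=0)   # the dispersion is shift-invariant, so only slopes are identified
res = minimize(jaeckel_dispersion, x0=np.zeros(2), args=(Xc, y), method="Nelder-Mead")
print(res.x)              # rank-based slope estimates, robust to the heavy tails
```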
{"title":"R-estimation in linear models: algorithms, complexity, challenges","authors":"Jaromír Antoch, Michal Černý, Ryozo Miura","doi":"10.1007/s00180-024-01495-0","DOIUrl":"https://doi.org/10.1007/s00180-024-01495-0","url":null,"abstract":"<p>The main objective of this paper is to discuss selected computational aspects of robust estimation in the linear model with the emphasis on <i>R</i>-estimators. We focus on numerical algorithms and computational efficiency rather than on statistical properties. In addition, we formulate some algorithmic properties that a “good” method for <i>R</i>-estimators is expected to satisfy and show how to satisfy them using the currently available algorithms. We illustrate both good and bad properties of the existing algorithms. We propose two-stage methods to minimize the effect of the bad properties. Finally we justify a challenge for new approaches based on interior-point methods in optimization.</p>","PeriodicalId":55223,"journal":{"name":"Computational Statistics","volume":"176 1","pages":""},"PeriodicalIF":1.3,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140889694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}