Computational Statistics & Data Analysis: Latest Articles

Three-way data clustering based on the mean-mixture of matrix-variate normal distributions
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-25 | DOI: 10.1016/j.csda.2024.108016
Mehrdad Naderi, Mostafa Tamandi, Elham Mirfarah, Wan-Lun Wang, Tsung-I Lin

With the steady growth of computer technologies, the application of statistical techniques to analyze extensive datasets has garnered substantial attention. The analysis of three-way (matrix-variate) data has emerged as a burgeoning field that has inspired statisticians in recent years to develop novel analytical methods. This paper introduces a unified finite mixture model that relies on the mean-mixture of matrix-variate normal distributions. The strength of our proposed model lies in its capability to capture and cluster a wide range of three-way data that exhibit heterogeneous, asymmetric and leptokurtic features. A computationally feasible ECME algorithm is developed to compute the maximum likelihood (ML) estimates. Numerous simulation studies are conducted to investigate the asymptotic properties of the ML estimators, validate the effectiveness of the Bayesian information criterion in selecting the appropriate model, and assess the classification ability in the presence of contaminated noise. The utility of the proposed methodology is demonstrated by analyzing a real-life data example.
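
For orientation, the building block that such mixtures extend is the standard matrix-variate normal density. The display below is a sketch of that textbook density for an n x p observation, a generic G-component mixture of it, and one common form of the Bayesian information criterion. The symbols (X, M, U, V, the weights \pi_g, the parameter count k and the sample size N) are notation assumed here, not taken from the paper, whose mean-mixture construction adds further structure to accommodate asymmetry and heavy tails.

```latex
% Matrix-variate normal density: M is the n x p mean,
% U the n x n row covariance, V the p x p column covariance.
\[
f(X \mid M, U, V) = (2\pi)^{-np/2}\, |U|^{-p/2}\, |V|^{-n/2}
  \exp\!\left\{ -\tfrac{1}{2}\, \operatorname{tr}\!\left[ V^{-1} (X - M)^{\top} U^{-1} (X - M) \right] \right\}
\]

% Generic finite mixture with G components.
\[
h(X) = \sum_{g=1}^{G} \pi_g\, f(X \mid M_g, U_g, V_g),
\qquad \pi_g > 0,\quad \sum_{g=1}^{G} \pi_g = 1
\]

% BIC with maximized log-likelihood \hat{\ell}, k free parameters, N observations.
\[
\mathrm{BIC} = -2\, \hat{\ell} + k \log N
\]
```

Under this sign convention smaller BIC values are preferred; some authors work with the negated form, in which case larger values are preferred.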

Citations: 0
Tests for high-dimensional generalized linear models under general covariance structure
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-17 | DOI: 10.1016/j.csda.2024.108026
Weichao Yang, Xu Guo, Lixing Zhu

This study investigates the testing of regression coefficients within high-dimensional generalized linear models featuring general covariance structures. The derived asymptotic properties reveal that distinct covariance structures can lead to varying limiting null distributions, including the normal distribution, for a widely employed quadratic-norm based test statistic. This circumstance renders it infeasible to determine critical values through a limiting null distribution. In response to this challenge, we propose a multiplier bootstrap test procedure for practical implementation. Additionally, we introduce a modified version of this procedure, incorporating projection when dealing with nuisance parameters. We then proceed to examine the asymptotic level and power of the proposed tests and assess their finite-sample performance through simulations. Finally, we present a real data analysis to illustrate the practical application of the proposed tests.
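
To make the idea of a multiplier bootstrap concrete, here is a minimal generic sketch; it is not the authors' procedure. It assumes a hypothetical list `scores` of per-observation score vectors that have mean zero under the null, uses the squared Euclidean norm of their scaled sum as the statistic, and perturbs each observation with an independent standard normal multiplier. The function name, the Gaussian multipliers, and the absence of any projection step for nuisance parameters are all assumptions made for illustration.

```python
import random


def multiplier_bootstrap_pvalue(scores, n_boot=500, seed=0):
    """Approximate a p-value for a quadratic-norm statistic (illustrative only).

    scores: list of n score vectors (lists of length p), assumed to have
            mean zero under the null hypothesis.
    """
    rng = random.Random(seed)
    n = len(scores)
    p = len(scores[0])

    def quad_norm(weights):
        # Squared Euclidean norm of n^{-1/2} * sum_i w_i * s_i.
        total = [0.0] * p
        for w, s in zip(weights, scores):
            for j in range(p):
                total[j] += w * s[j]
        return sum(t * t for t in total) / n

    # Observed statistic: all multipliers equal to one.
    observed = quad_norm([1.0] * n)

    # Bootstrap replicates: independent N(0, 1) multipliers per observation.
    exceed = 0
    for _ in range(n_boot):
        weights = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if quad_norm(weights) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_boot)
```

The critical value is thus read off the simulated distribution of the perturbed statistic rather than from a limiting null distribution, which is the practical appeal when that limit depends on an unknown covariance structure.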

Citations: 0
Modelling non-stationarity in asymptotically independent extremes
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-14 | DOI: 10.1016/j.csda.2024.108025
C.J.R. Murphy-Barltrop, J.L. Wadsworth

In many practical applications, evaluating the joint impact of combinations of environmental variables is important for risk management and structural design analysis. When such variables are considered simultaneously, non-stationarity can exist within both the marginal distributions and dependence structure, resulting in complex data structures. In the context of extremes, few methods have been proposed for modelling trends in extremal dependence, even though capturing this feature is important for quantifying joint impact. Moreover, most proposed techniques are only applicable to data structures exhibiting asymptotic dependence. Motivated by observed dependence trends of data from the UK Climate Projections, a novel semi-parametric modelling framework for bivariate extremal dependence structures is proposed. This framework can capture a wide variety of dependence trends for data exhibiting asymptotic independence. When applied to the climate projection dataset, the model detects significant dependence trends in observations and, in combination with models for marginal non-stationarity, can be used to produce estimates of bivariate risk measures at future time points.

Citations: 0
Multivariate ordinal regression for multiple repeated measurements
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-02 | DOI: 10.1016/j.csda.2024.108013
Laura Vana-Gür

A multivariate ordinal regression model which allows the joint modeling of three-dimensional panel data containing both repeated and multiple measurements for a collection of subjects is proposed. This is achieved by a multivariate autoregressive structure on the errors of the latent variables underlying the ordinal responses, which accounts for the correlations at a single point in time and the persistence over time. The error distribution is assumed to be normal or Student-t distributed. The estimation is performed using composite likelihood methods. Through several simulation exercises, the quality of the estimates in different settings as well as in comparison with a Bayesian approach is investigated. The simulation study confirms that the estimation procedure is able to recover the model parameters well and is competitive in terms of computation time. Finally, the framework is illustrated using a data set containing bankruptcy and credit rating information for US exchange-listed companies.
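
The latent-variable view of an ordinal response can be illustrated with a single-outcome cumulative probit sketch: a latent normal variable is cut at ordered thresholds, and the category probabilities are differences of normal CDF values. The function names and the threshold values are illustrative assumptions; the sketch ignores the multivariate autoregressive error structure and the composite likelihood estimation described in the abstract.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ordinal_probit_probs(linear_predictor, thresholds):
    """Category probabilities for one ordinal response.

    thresholds: increasing list of K - 1 cut points on the latent scale.
    Returns a list of K probabilities, where
    P(Y = k) = Phi(theta_k - eta) - Phi(theta_{k-1} - eta).
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    cdf = [normal_cdf(c - linear_predictor) for c in cuts]
    return [cdf[k + 1] - cdf[k] for k in range(len(cuts) - 1)]
```

For example, `ordinal_probit_probs(0.0, [-1.0, 1.0])` returns three probabilities that sum to one, with the middle category receiving the mass of a standard normal between -1 and 1.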

Citations: 0
Optimizing designs in clinical trials with an application in treatment of Epidermolysis bullosa simplex, a rare genetic skin disease
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-02 | DOI: 10.1016/j.csda.2024.108015
Joakim Nyberg, Andrew C. Hooker, Georg Zimmermann, Johan Verbeeck, Martin Geroldinger, Konstantin Emil Thiel, Geert Molenberghs, Martin Laimer, Verena Wally

Epidermolysis bullosa simplex (EBS) is a rare skin disease, which makes optimal design techniques especially important for maximizing the potential information in a future study, that is, for making efficient use of the limited number of available subjects and observations. A generalized linear mixed effects model (GLMM) built on an EBS trial was used to optimize the design. The model assumed a full treatment effect in the follow-up period. In addition to this model, two further models were considered, assuming either no treatment effect or a linearly declining treatment effect in the follow-up period. The information gain and loss were assessed when changing the number of EBS blister counts, altering the duration of the treatment, and changing the study period. In addition, the EBS blister assessment times were optimized. The optimization used the derived Fisher information matrix for the GLMM with EBS blister counts, and the information gain and loss were quantified by D-optimal efficiency. The optimization results indicated that using optimal assessment times raises the information to about 110-120%, varying slightly between the assumed treatment models. The results also showed that the information was sensitive to moving the assessment times by ± one week, whereas moving them by up to ± two days did not decrease the information as long as three assessments (out of four assessments in the trial period) were within the treatment period and not in the follow-up period. Increasing the number of assessments to six or five per trial period increased the information to 130% and 115%, respectively, while decreasing the number of assessments to two or three decreased the information to 50% and 80%, respectively. Increasing the length of the trial period had a minor impact on the information, while increasing the treatment period by two and four weeks had a larger impact, 120% and 130%, respectively.
To conclude, general applications of optimal design methodology, a derivation of the Fisher information matrix for GLMMs with count data, and examples of how optimal design could be used when designing trials for the treatment of EBS are presented. The methodology is also of interest for study designs where maximizing the information is essential. Therefore, general applied research guidance for using optimal design is also provided.
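
As a rough illustration of how such percentages can arise, D-efficiency is commonly defined as the ratio of Fisher information determinants raised to the power 1/p, where p is the number of model parameters. The sketch below assumes that definition; the function names and the 2 x 2 example matrices are hypothetical and do not come from the EBS model.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def d_efficiency(fim_candidate, fim_reference):
    """D-efficiency of a candidate design relative to a reference design.

    Both arguments are p x p Fisher information matrices (lists of lists).
    A value of 1.2 reads as 120% of the reference information.
    """
    p = len(fim_reference)
    ratio = determinant(fim_candidate) / determinant(fim_reference)
    return ratio ** (1.0 / p)
```

With a reference matrix `[[4, 0], [0, 1]]` and a candidate `[[9, 0], [0, 1]]`, `d_efficiency` returns 1.5, that is, the candidate carries 150% of the reference information on a per-parameter scale. The 1/p exponent is what makes the measure comparable to a change in the number of subjects: doubling every entry of the information matrix gives an efficiency of 2 regardless of p.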

Citations: 0
Bootstrap-based statistical inference for linear mixed effects under misspecifications
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-07-01 | DOI: 10.1016/j.csda.2024.108014
Katarzyna Reluga, María-José Lombardía, Stefan Sperlich

Linear mixed effects are considered excellent predictors of cluster-level parameters in various domains. However, previous research has demonstrated that their performance is affected by departures from model assumptions. Given the common occurrence of these departures in empirical studies, there is a need for inferential methods that are robust to misspecifications while remaining accessible and appealing to practitioners. Statistical tools have been developed for cluster-wise and simultaneous inference for mixed effects under distributional misspecifications, employing a user-friendly semiparametric random effect bootstrap. The merits and limitations of this approach are discussed in the general context of model misspecification. Theoretical analysis demonstrates the asymptotic consistency of the methods under general regularity conditions. Simulations show that the proposed intervals are robust to departures from modelling assumptions, including asymmetry and long tails in the distributions of errors and random effects, outperforming competitors in terms of empirical coverage probability. Finally, the methodology is applied to construct confidence intervals for household income across counties in the Spanish region of Galicia.

Citations: 0
Bayesian modal regression based on mixture distributions
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-06-27 | DOI: 10.1016/j.csda.2024.108012
Qingyang Liu, Xianzheng Huang, Ray Bai

Compared to mean regression and quantile regression, the literature on modal regression is very sparse. A unifying framework for Bayesian modal regression is proposed, based on a family of unimodal distributions indexed by the mode, along with other parameters that allow for flexible shapes and tail behaviors. Sufficient conditions for posterior propriety under an improper prior on the mode parameter are derived. Following prior elicitation, regression analysis of simulated data and datasets from several real-life applications are conducted. Besides drawing inference for covariate effects that are easy to interpret, prediction and model selection under the proposed Bayesian modal regression framework are also considered. Evidence from these analyses suggests that the proposed inference procedures are very robust to outliers, enabling one to discover interesting covariate effects missed by mean or median regression, and to construct much tighter prediction intervals than those from mean or median regression. Computer programs for implementing the proposed Bayesian modal regression are available at https://github.com/rh8liuqy/Bayesian_modal_regression.

Citations: 0
A nonparametrically corrected likelihood for Bayesian spectral analysis of multivariate time series
IF 1.5 | Zone 3 (Mathematics) | Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS | Pub Date: 2024-06-25 | DOI: 10.1016/j.csda.2024.108010
Yixuan Liu, Claudia Kirch, Jeong Eun Lee, Renate Meyer

A novel approach to Bayesian nonparametric spectral analysis of stationary multivariate time series is presented. Starting with a parametric vector-autoregressive model, the parametric likelihood is nonparametrically adjusted in the frequency domain to account for potential deviations from parametric assumptions. A proof of mutual contiguity of the nonparametrically corrected likelihood, the multivariate Whittle likelihood approximation and the exact likelihood for Gaussian time series is given. A multivariate extension of the nonparametric Bernstein-Dirichlet process prior for univariate spectral densities to the space of Hermitian positive definite spectral density matrices is specified directly on the correction matrices. An infinite series representation of this prior is then used to develop a Markov chain Monte Carlo algorithm to sample from the posterior distribution. The code is made publicly available for ease of use and reproducibility. With this novel approach, a generalisation of the multivariate Whittle-likelihood-based method of Meier et al. (2020) as well as an extension of the nonparametrically corrected likelihood for univariate stationary time series of Kirch et al. (2019) to the multivariate case is presented. It is demonstrated that the nonparametrically corrected likelihood combines the efficiency of a parametric model with the robustness of a nonparametric one. Its numerical accuracy is illustrated in a comprehensive simulation study. Its practical advantages are illustrated by a spectral analysis of two environmental time series data sets: a bivariate time series of the Southern Oscillation Index and fish recruitment and a multivariate time series of windspeed data at six locations in California.
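
For readers unfamiliar with the Whittle approximation mentioned above, its usual multivariate form is sketched below. The notation is assumed here and not taken from the paper: Z_1, ..., Z_n is a mean-zero d-variate series, f is the spectral density matrix, I_n is the periodogram matrix, and the sum runs over the positive Fourier frequencies. Conventions for the factor 2\pi differ between authors, so this should be read as one common normalization, not as the paper's exact expression.

```latex
% Whittle log-likelihood, up to an additive constant not depending on f.
\[
\ell_W(f) = -\sum_{j=1}^{N} \left\{ \log\det f(\lambda_j)
  + \operatorname{tr}\!\left[ f(\lambda_j)^{-1} I_n(\lambda_j) \right] \right\} + \text{const},
\qquad \lambda_j = \frac{2\pi j}{n},\quad N = \left\lfloor \frac{n-1}{2} \right\rfloor
\]

% Periodogram matrix; the star denotes the conjugate transpose.
\[
I_n(\lambda) = \frac{1}{2\pi n}
  \left( \sum_{t=1}^{n} Z_t\, e^{-\mathrm{i} t \lambda} \right)
  \left( \sum_{t=1}^{n} Z_t\, e^{-\mathrm{i} t \lambda} \right)^{*}
\]
```

The approximation treats the Fourier coefficients at distinct frequencies as independent complex Gaussian vectors, which is why the likelihood factorizes into one term per frequency.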

本文提出了一种对静态多变量时间序列进行贝叶斯非参数谱分析的新方法。从参数向量自回归模型开始,在频域对参数似然进行非参数调整,以考虑参数假设的潜在偏差。给出了非参数修正似然、多变量惠特尔似然近似和高斯时间序列精确似然的相互连续性证明。将用于单变量谱密度的非参数伯恩斯坦-德里赫特过程先验的多变量扩展到赫米特正定谱密度矩阵空间,并直接在校正矩阵上指定。然后使用该先验的无穷级数表示来开发马尔科夫链蒙特卡罗算法,以便从后验分布中采样。为了便于使用和复制,我们公开了代码。通过这种新方法,介绍了 Meier 等人(2020 年)基于惠特尔似然法的多变量方法的一般化,以及 Kirch 等人(2019 年)单变量静态时间序列非参数校正似然法在多变量情况下的扩展。研究表明,非参数校正似然结合了参数模型的效率和非参数模型的稳健性。综合模拟研究说明了其数值精确性。通过对两个环境时间序列数据集(南方涛动指数和鱼类繁殖的双变量时间序列以及加利福尼亚州六个地点风速数据的多变量时间序列)进行频谱分析,说明了该模型的实际优势。
An embedded diachronic sense change model with a case study from ancient Greek
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-06-21 DOI: 10.1016/j.csda.2024.108011
Schyan Zafar, Geoff K. Nicholls

Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as “kosmos” (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed.

A double Pólya-Gamma data augmentation scheme for a hierarchical Negative Binomial - Binomial data model
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-06-20 DOI: 10.1016/j.csda.2024.108009
Xuan Ma, Jenný Brynjarsdóttir, Thomas LaFramboise

A double Pólya-Gamma data augmentation scheme is developed for posterior sampling from a Bayesian hierarchical model of total and categorical count data. The scheme applies to a Negative Binomial - Binomial (NBB) hierarchical regression model with logit links and normal priors on regression coefficients. The approach is shown to be very efficient and in most cases out-performs the Stan program. The hierarchical modeling framework and the Pólya-Gamma data augmentation scheme are applied to human mitochondrial DNA data.
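As background on the building block named in the abstract, the sketch below draws approximate Pólya-Gamma variates PG(b, c) by truncating the standard infinite sum-of-gammas representation. It does not reproduce the paper's double augmentation scheme or its NBB model, and it is not an exact sampler; the function names and the truncation parameter `n_terms` are illustrative assumptions.

```python
import math
import random


def sample_polya_gamma(b, c, rng, n_terms=200):
    """Approximate draw from PG(b, c) using the series
    omega = (1 / (2*pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4*pi^2)),
    with g_k ~ Gamma(shape=b, scale=1), truncated after n_terms terms.
    Truncation biases draws slightly downward."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g = rng.gammavariate(b, 1.0)
        total += g / ((k - 0.5) ** 2 + c * c / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


def polya_gamma_mean(b, c):
    """Exact mean of PG(b, c): b / (2c) * tanh(c / 2), with limit b / 4 at c = 0."""
    if c == 0:
        return b / 4.0
    return b / (2.0 * c) * math.tanh(c / 2.0)


# Usage: compare the Monte Carlo mean with the closed-form mean.
rng = random.Random(0)
draws = [sample_polya_gamma(1.0, 1.0, rng) for _ in range(2000)]
print(sum(draws) / len(draws), polya_gamma_mean(1.0, 1.0))
```

In a Gibbs sampler for a logit-link model, conditioning on such latent variables makes the likelihood Gaussian in the regression coefficients, so normal priors yield normal full conditionals. A production sampler would normally use an exact Pólya-Gamma generator rather than this truncation.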
