
Computational Statistics & Data Analysis: Latest Articles

Efficient Bayesian functional principal component analysis of irregularly-observed multivariate curves
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-12 DOI: 10.1016/j.csda.2024.108094
Tui H. Nolan, Sylvia Richardson, Hélène Ruffieux
The analysis of multivariate functional curves has the potential to yield important scientific discoveries in domains such as healthcare, medicine, economics and social sciences. However, it is common for real-world settings to present longitudinal data that are both irregularly and sparsely observed, which introduces important challenges for the current functional data methodology. A Bayesian hierarchical framework for multivariate functional principal component analysis is proposed, which accommodates the intricacies of such irregular observation settings by flexibly pooling information across subjects and correlated curves. The model represents common latent dynamics via shared functional principal component scores, thereby effectively borrowing strength across curves while circumventing the computationally challenging task of estimating covariance matrices. These scores also provide a parsimonious representation of the major modes of joint variation of the curves and constitute interpretable scalar summaries that can be employed in follow-up analyses. Estimation is conducted using variational inference, ensuring that accurate posterior approximation and robust uncertainty quantification are achieved. The algorithm also introduces a novel variational message passing fragment for multivariate functional principal component Gaussian likelihood that enables modularity and reuse across models. Detailed simulations assess the effectiveness of the approach in sharing information from sparse and irregularly sampled multivariate curves. The methodology is also exploited to estimate the molecular disease courses of individual patients with SARS-CoV-2 infection and characterise patient heterogeneity in recovery outcomes; this study reveals key coordinated dynamics across the immune, inflammatory and metabolic systems, which are associated with long-COVID symptoms up to one year post disease onset. The approach is implemented in the R package bayesFPCA.
Citations: 0
Statistical modeling of Dengue transmission dynamics with environmental factors
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-08 DOI: 10.1016/j.csda.2024.108080
Lengyang Wang, Mingke Zhang
Dengue fever is one of the most common mosquito-borne infectious diseases in tropical regions. Understanding the dynamics of dengue transmission can help provide timely early warnings, thereby reducing mortality. However, previous studies have failed to simulate dengue dynamics faithfully or to answer questions pertinent to outbreaks. A new substantive model is proposed that incorporates environmental factors into a time-series-susceptible-infectious-recovered (TSIR) model in order to analyze their impact on transmission. The resulting environmental-time-series-susceptible-infectious-recovered (ETSIR) model can establish the statistical significance of these factors for dengue transmission, thus providing deeper insight into the transmission and addressing several epidemiological puzzles.
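To make the idea of an environment-driven TSIR model concrete, the following is a minimal deterministic sketch, not the authors' ETSIR specification. It assumes the standard TSIR recursion and lets the transmission rate depend on temperature through a log-linear link; the function name, the link, and every parameter value are hypothetical.

```python
import math

def simulate_etsir(temps, s0, i0, births, pop, b0, b1, alpha=0.97):
    """Deterministic TSIR recursion with a temperature-driven transmission rate.

    The log-linear link and all parameter names are illustrative assumptions.
    """
    s, i = float(s0), float(i0)
    susceptibles, infected = [s], [i]
    for temp in temps:
        beta = math.exp(b0 + b1 * temp)        # environment enters through beta_t
        new_i = beta * s * (i ** alpha) / pop  # expected new infections
        new_i = min(new_i, s)                  # cannot infect more than are susceptible
        s = s + births - new_i                 # susceptible bookkeeping
        i = new_i
        susceptibles.append(s)
        infected.append(i)
    return susceptibles, infected

# Hypothetical seasonal temperature series over 52 time steps.
temps = [24 + 4 * math.sin(2 * math.pi * t / 26) for t in range(52)]
S, I = simulate_etsir(temps, s0=5000, i0=10, births=20, pop=100000, b0=1.0, b1=0.08)
```

In a statistical version of this sketch, the coefficient `b1` would be estimated from case counts, and its significance would indicate whether the environmental factor matters for transmission.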
Citations: 0
Analysis of order-of-addition experiments
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-06 DOI: 10.1016/j.csda.2024.108077
Xueru Zhang, Dennis K.J. Lin, Min-Qian Liu, Jianbin Chen
The order-of-addition (OofA) experiment involves arranging components in a specific order to optimize a certain objective. Such experiments are attracting a great deal of attention in many disciplines, especially in the areas of biochemistry, scheduling, and engineering. Recent studies have highlighted their significance, and notable works have aimed to address NP-hard OofA problems from a statistical perspective. However, solving OofA problems presents challenges due to their complex nature and the presence of uncertainty, such as scheduling problems with uncertain processing times. These uncertainties affect processing times, which are not known with certainty in advance. They introduce heteroscedasticity into OofA experiments, where different orders result in varying dispersions. To address these challenges, a unified framework is proposed to analyze scheduling problems without making specific assumptions about the distribution of these uncertainties. It encompasses model development and optimization, encapsulating existing homoscedastic studies (where different orders produce the same dispersion value) as a specific instance. For heteroscedastic cases, a dual response optimization within an uncertainty set is proposed, aiming to minimize the dispersion of the response while keeping the location of the response at a predefined target value. However, solving the proposed non-linear minimax optimization is rather challenging. An equivalent optimization formulation with low computational cost is proposed for solving this problem. Theoretical support is established to ensure the tractability of the proposed method. Simulation studies are conducted to demonstrate the effectiveness of the proposed approach. With its solid theoretical support, ease of implementation, and ability to find an optimal order, the proposed approach offers a practical and competitive solution for general order-of-addition problems.
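The dual response idea (minimize dispersion while holding location near a target) can be illustrated with a toy scheduling problem. The sketch below is a brute-force enumeration, not the equivalent low-cost formulation of the paper, and it omits the uncertainty set. It assumes independent processing times with known means and variances and uses total flow time as the response; all names and numbers are hypothetical.

```python
from itertools import permutations

def flow_time_moments(order, mean, var):
    """Mean and variance of total flow time for independent processing times."""
    n = len(order)
    m = v = 0.0
    for pos, job in enumerate(order):
        k = n - pos            # number of completion times this job contributes to
        m += k * mean[job]
        v += k * k * var[job]
    return m, v

def dual_response_order(mean, var, target, tol):
    """Minimise dispersion over all orders whose location is within tol of target."""
    best = None
    for order in permutations(range(len(mean))):
        m, v = flow_time_moments(order, mean, var)
        if abs(m - target) <= tol and (best is None or v < best[2]):
            best = (order, m, v)
    return best

# Hypothetical processing-time means and variances for three jobs.
mean = [2.0, 3.0, 4.0]
var = [4.0, 0.5, 1.0]
best = dual_response_order(mean, var, target=17.0, tol=1.0)
print(best)  # ((1, 0, 2), 17.0, 21.5)
```

Because the variances differ across jobs, different orders give different dispersions, which is the heteroscedastic situation described above. Enumeration grows factorially with the number of components, which is why a tractable reformulation matters in practice.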
Citations: 0
A goodness-of-fit test for functional time series with applications to Ornstein-Uhlenbeck processes
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-05 DOI: 10.1016/j.csda.2024.108092
J. Álvarez-Liébana, A. López-Pérez, W. González-Manteiga, M. Febrero-Bande
High-frequency financial data can be collected as a sequence of time-ordered curves, such as intraday prices. The Functional Data Analysis (FDA) framework offers a powerful approach to uncover information embedded in the shape of the daily paths, often unavailable from classical statistical methods. A novel goodness-of-fit test for autoregressive Hilbertian (ARH) models is introduced, imposing only the Hilbert-Schmidt condition on the autocorrelation operator. The test statistic is formulated in terms of a Cramér–von Mises norm, with calibration achieved via a wild bootstrap resampling procedure. A simulation study examines the test's finite-sample performance in terms of power and size. Furthermore, a new specification test for diffusion models, including Ornstein-Uhlenbeck processes, is proposed, illustrated with an application to intraday currency exchange rates. Specifically, a two-stage methodology is proposed: first, the relationship between functional samples and their lagged values is assessed using an ARH(1) model; second, under linearity, a functional F-test is conducted.
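For readers unfamiliar with the objects named in this abstract, the standard textbook definitions are written out below. The notation is an assumption chosen for illustration and may differ from the paper's: the curves X_n are taken to be zero-mean elements of a separable Hilbert space H, and the Ornstein-Uhlenbeck process is given in its usual mean-reverting parametrisation.

```latex
% ARH(1): zero-mean functional time series X_n with values in a separable Hilbert space H
X_n = \rho(X_{n-1}) + \varepsilon_n, \qquad n \in \mathbb{Z},
% Hilbert-Schmidt condition on the operator \rho, for an orthonormal basis \{e_j\} of H
\|\rho\|_{\mathrm{HS}}^2 = \sum_{j \ge 1} \|\rho(e_j)\|_H^2 < \infty,
% Ornstein-Uhlenbeck diffusion (standard parametrisation; symbols are illustrative)
dX_t = \kappa(\mu - X_t)\,dt + \sigma\,dW_t .
```

Here the innovations are H-valued white noise, and W_t denotes a standard Brownian motion.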
Citations: 0
Weighted support vector machine for extremely imbalanced data
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-04 DOI: 10.1016/j.csda.2024.108078
Jongmin Mun, Sungwan Bang, Jaeoh Kim
Based on an asymptotically optimal weighted support vector machine (SVM) that introduces label shift, a systematic procedure is derived for applying oversampling and weighted SVM to extremely imbalanced datasets with a cluster-structured positive class. This method formalizes three intuitions: (i) oversampling should reflect the structure of the positive class; (ii) weights should account for both the imbalance and oversampling ratios; (iii) synthetic samples should carry less weight than the original samples. The proposed method generates synthetic samples from the estimated positive class distribution using a Gaussian mixture model. To prevent overfitting to excessive synthetic samples, different misclassification penalties are assigned to the original positive class, synthetic positive class, and negative class. The proposed method is numerically validated through simulations and an analysis of Republic of Korea Army artillery training data.
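The three-level penalty scheme can be sketched with a small linear SVM trained on a weighted hinge loss. This is only an illustration under stated assumptions: the synthetic positives are hard-coded rather than drawn from a fitted Gaussian mixture, the penalty values are arbitrary rather than derived from the imbalance and oversampling ratios, and plain subgradient descent stands in for a proper SVM solver. All names are hypothetical.

```python
def train_weighted_svm(X, y, kinds, penalties, lam=0.01, lr=0.05, epochs=200):
    """Linear SVM fitted by subgradient descent on a weighted hinge loss.

    kinds[i] is one of "pos", "syn", "neg"; penalties maps each kind to its
    misclassification cost. The optimiser and all values are illustrative.
    """
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [lam * wj for wj in w], 0.0   # gradient of the ridge term
        for xi, yi, ki in zip(X, y, kinds):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                     # sample violates the margin
                c = penalties[ki] / n
                for j in range(d):
                    gw[j] -= c * yi * xi[j]
                gb -= c * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Six negatives, one original positive, two synthetic positives (all hypothetical).
X = [(-2.0, 0.0), (-2.5, 0.5), (-1.5, -0.5), (-2.0, 1.0), (-3.0, 0.0), (-2.0, -1.0),
     (2.0, 0.0), (2.3, 0.2), (1.8, -0.3)]
y = [-1] * 6 + [1] * 3
kinds = ["neg"] * 6 + ["pos"] + ["syn"] * 2
# Original positives cost most, synthetic positives less, negatives least.
penalties = {"pos": 3.0, "syn": 1.5, "neg": 1.0}
w, b = train_weighted_svm(X, y, kinds, penalties)
```

Giving synthetic samples a smaller penalty than original positives reflects intuition (iii) above: generated points are less trustworthy than observed ones, so errors on them should influence the decision boundary less.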
Citations: 0
Cox regression model with doubly truncated and interval-censored data
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-11-04 DOI: 10.1016/j.csda.2024.108090
Pao-sheng Shen
Interval sampling is an efficient sampling scheme used in epidemiological studies. Doubly truncated (DT) data arise under this sampling scheme when the failure time can be observed exactly. In practice, the failure time may not be observed and might be recorded only within time intervals, leading to doubly truncated and interval censored (DTIC) data. This article considers regression analysis of DTIC data under the Cox proportional hazards (PH) model and develops the conditional maximum likelihood estimators (cMLEs) for the regression parameters and baseline cumulative hazard function of models. The cMLEs are shown to be consistent and asymptotically normal. Simulation results indicate that the cMLEs perform well for samples of moderate size.
Citations: 0
Accelerating computation: A pairwise fitting technique for multivariate probit models
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-31 DOI: 10.1016/j.csda.2024.108082
Margaux Delporte, Geert Verbeke, Steffen Fieuws, Geert Molenberghs
Fitting multivariate probit models via maximum likelihood presents considerable computational challenges, particularly in terms of computation time and convergence difficulties, even for small numbers of responses. These issues are exacerbated when dealing with ordinal data. An efficient computational approach is introduced, based on a pairwise fitting technique within a pseudo-likelihood framework. This methodology is applied to clinical case studies, specifically using a trivariate probit model. Additionally, the correlation structure among outcomes is allowed to depend on covariates, enhancing both the flexibility and interpretability of the model. By way of simulation and real data applications, the proposed approach demonstrates superior computational efficiency as the dimension of the outcome vector increases. The method's ability to capture covariate-dependent correlations makes it particularly useful in medical research, where understanding complex associations among health outcomes is of scientific importance.
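The bookkeeping behind a pairwise fitting approach can be sketched as follows. Instead of one joint fit, every pair of outcomes is fitted separately, and parameters that occur in several pairs are then combined, here by simple averaging. This is an assumption-laden illustration: the per-pair estimates below are made-up numbers standing in for the output of bivariate probit fits, and the paper's actual combination rule and standard-error calculation are not reproduced.

```python
from itertools import combinations
from collections import defaultdict

def combine_pairwise(pair_fits):
    """Average estimates of parameters that appear in more than one pair."""
    sums, counts = defaultdict(float), defaultdict(int)
    for estimates in pair_fits.values():
        for name, value in estimates.items():
            sums[name] += value
            counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}

outcomes = ["y1", "y2", "y3"]
pairs = list(combinations(outcomes, 2))  # 3 bivariate fits instead of 1 trivariate fit

# Hypothetical output of three separately fitted bivariate probit models.
pair_fits = {
    ("y1", "y2"): {"beta_y1": 0.50, "beta_y2": -0.20, "rho_y1_y2": 0.30},
    ("y1", "y3"): {"beta_y1": 0.54, "beta_y3": 1.10, "rho_y1_y3": 0.10},
    ("y2", "y3"): {"beta_y2": -0.24, "beta_y3": 1.00, "rho_y2_y3": -0.40},
}
combined = combine_pairwise(pair_fits)
```

Each bivariate fit only requires two-dimensional normal probabilities, and the fits are independent of one another, which is where the computational gain over a full joint likelihood comes from as the number of outcomes grows.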
Citations: 0
A unified consensus-based parallel algorithm for high-dimensional regression with combined regularizations
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-30 DOI: 10.1016/j.csda.2024.108081
Xiaofei Wu, Rongmei Liang, Zhimin Zhang, Zhenyu Cui
The parallel algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined regularization terms. These terms, such as elastic-net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. This paper proposes a unified constrained optimization formulation based on the consensus problem for these types of convex and nonconvex regression problems, and derives the corresponding parallel alternating direction method of multipliers (ADMM) algorithms. Furthermore, it is proven that the proposed algorithm not only has global convergence but also exhibits a linear convergence rate. It is worth noting that the computational complexity of the proposed algorithm remains the same for different regularization terms and losses, which implicitly demonstrates the universality of this algorithm. Extensive simulation experiments, along with a financial example, serve to demonstrate the reliability, stability, and scalability of our algorithm. The R package for implementing the proposed algorithm can be obtained at https://github.com/xfwu1016/CPADMM.
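A generic consensus ADMM iteration for one of the combined penalties mentioned above (the convex elastic net with least-squares loss) can be sketched in plain Python. This is the textbook consensus scheme, not the algorithm or API of the CPADMM package: each worker holds a data block and solves a small ridge-type system, a central step averages the local solutions and applies the elastic-net proximal operator, and dual variables enforce agreement. Function names and the toy data are hypothetical.

```python
def solve(M, r):
    """Solve M x = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    aug = [row[:] + [ri] for row, ri in zip(M, r)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(aug[i][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for i in range(c + 1, n):
            f = aug[i][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[i][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x

def soft(v, t):
    """Soft-thresholding operator."""
    return max(v - t, 0.0) - max(-v - t, 0.0)

def consensus_admm_enet(blocks, lam1, lam2, rho=1.0, iters=300):
    """blocks is a list of (A_k, b_k) pairs held by different workers."""
    K, p = len(blocks), len(blocks[0][0][0])
    grams, atbs = [], []
    for A, b in blocks:  # precompute A_k^T A_k + rho*I and A_k^T b_k once
        grams.append([[sum(row[i] * row[j] for row in A) + (rho if i == j else 0.0)
                       for j in range(p)] for i in range(p)])
        atbs.append([sum(row[i] * bi for row, bi in zip(A, b)) for i in range(p)])
    z = [0.0] * p
    us = [[0.0] * p for _ in range(K)]
    for _ in range(iters):
        # Local updates are independent, so they could run in parallel.
        xs = [solve(G, [atb[i] + rho * (z[i] - u[i]) for i in range(p)])
              for G, atb, u in zip(grams, atbs, us)]
        # Global update: average, then elastic-net proximal step.
        vbar = [sum(x[i] + u[i] for x, u in zip(xs, us)) / K for i in range(p)]
        z = [soft(v, lam1 / (K * rho)) / (1.0 + lam2 / (K * rho)) for v in vbar]
        # Dual updates push each local copy toward the consensus variable.
        us = [[u[i] + x[i] - z[i] for i in range(p)] for x, u in zip(xs, us)]
    return z

# Two workers with tiny hypothetical data blocks.
blocks = [
    ([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.1]),
    ([[1.0, 0.0], [0.0, 1.0]], [1.0, -0.1]),
]
z = consensus_admm_enet(blocks, lam1=0.5, lam2=0.5)
```

For this toy problem the minimizer can be worked out by hand as (1.4, 0): the first coefficient is shrunk and the second is set exactly to zero. Only the global proximal step depends on the penalty, which hints at why a consensus formulation can treat different regularizers in a uniform way.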
Citations: 0
Multi-model subset selection
IF 1.5 Zone 3 Mathematics Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-29 DOI: 10.1016/j.csda.2024.108073
Anthony-Alexander Christidis, Stefan Van Aelst, Ruben Zamar
The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the ℓ0-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by “blackbox” multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding ℓ0-penalized regression models by extending recent developments in ℓ0-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of predictors and the response variable. Empirical studies and theoretical knowledge about ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. The optimization algorithms are implemented in publicly available software packages.
Citations: 0
Vine copula based structural equation models
IF 1.5 Zone 3 Mathematics Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-26 DOI: 10.1016/j.csda.2024.108076
Claudia Czado
Gaussian linear structural equation models (SEMs) are often used as a statistical model associated with a directed acyclic graph (DAG), also known as a Bayesian network. However, such a model might not be able to represent the non-Gaussian dependence present in some data sets, which results in nonlinear, non-additive and non-Gaussian conditional distributions. Therefore, the use of the class of D-vine copula based regression models for the specification of the conditional distribution of a node given its parents is proposed. This class extends the class of standard linear regression models considerably. The approach also allows an importance order of the parents of each node to be created and gives the potential to remove edges from the starting DAG that are not supported by the data. Furthermore, the uncertainty of conditional estimates can be assessed, and fast generative simulation using the D-vine copula based SEM is available. The improvement over a Gaussian linear SEM is shown in simulations using random specifications of the D-vine based SEM, as is its ability to correctly remove edges not present in the data generation. An engineering application showcases the usefulness of the proposals.
Citations: 0