
Computational Statistics & Data Analysis: Latest Articles

A goodness-of-fit test for functional time series with applications to Ornstein-Uhlenbeck processes
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-11-05; DOI: 10.1016/j.csda.2024.108092
J. Álvarez-Liébana, A. López-Pérez, W. González-Manteiga, M. Febrero-Bande
High-frequency financial data can be collected as a sequence of time-ordered curves, such as intraday prices. The Functional Data Analysis (FDA) framework offers a powerful approach to uncover information embedded in the shape of the daily paths, often unavailable from classical statistical methods. A novel goodness-of-fit test for autoregressive Hilbertian (ARH) models is introduced, imposing only the Hilbert-Schmidt condition on the autocorrelation operator. The test statistic is formulated in terms of a Cramér–von Mises norm, with calibration achieved via a wild bootstrap resampling procedure. A simulation study examines the test's finite-sample performance in terms of power and size. Furthermore, a new specification test for diffusion models, including Ornstein-Uhlenbeck processes, is proposed and illustrated with an application to intraday currency exchange rates. Specifically, a two-stage methodology is used: first, the relationship between functional samples and their lagged values is assessed using an ARH(1) model; second, under linearity, a functional F-test is conducted.
Computational Statistics & Data Analysis, vol. 203, Article 108092.
Citations: 0
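The link between Ornstein-Uhlenbeck processes and autoregressive models in the entry above can be made concrete with a small sketch. It is not the authors' test: it only uses the well-known exact discretization of dX = theta (mu - X) dt + sigma dW, which is a Gaussian AR(1) recursion with coefficient exp(-theta * dt). The function names, parameter values, and the way one long path is cut into "daily" curves are assumptions made for illustration.

```python
import math
import random


def simulate_ou(theta, mu, sigma, x0, dt, n_steps, rng=None):
    """Simulate an Ornstein-Uhlenbeck path on a regular grid (theta > 0).

    Uses the exact transition of dX = theta * (mu - X) dt + sigma dW,
    which is a Gaussian AR(1) recursion with coefficient exp(-theta * dt).
    """
    rng = rng or random.Random(0)
    phi = math.exp(-theta * dt)  # autoregressive coefficient
    # Standard deviation of the exact one-step innovation.
    innov_sd = sigma * math.sqrt((1.0 - phi * phi) / (2.0 * theta))
    path = [x0]
    for _ in range(n_steps):
        path.append(mu + phi * (path[-1] - mu) + innov_sd * rng.gauss(0.0, 1.0))
    return path


def daily_curves(n_days, points_per_day, theta=2.0, mu=1.0, sigma=0.3):
    """Cut one long OU path into consecutive 'daily' curves (hypothetical setup).

    Each curve keeps both of its endpoints, so neighbouring curves share a point.
    """
    dt = 1.0 / points_per_day
    path = simulate_ou(theta, mu, sigma, mu, dt, n_days * points_per_day)
    return [path[d * points_per_day:(d + 1) * points_per_day + 1]
            for d in range(n_days)]
```

Calling `daily_curves(250, 48)` would give 250 curves of 49 points each, a toy stand-in for a functional time series of intraday paths on which a lag-one functional model could be examined.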
Weighted support vector machine for extremely imbalanced data
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-11-04; DOI: 10.1016/j.csda.2024.108078
Jongmin Mun, Sungwan Bang, Jaeoh Kim
Based on an asymptotically optimal weighted support vector machine (SVM) that introduces label shift, a systematic procedure is derived for applying oversampling and weighted SVM to extremely imbalanced datasets with a cluster-structured positive class. This method formalizes three intuitions: (i) oversampling should reflect the structure of the positive class; (ii) weights should account for both the imbalance and oversampling ratios; (iii) synthetic samples should carry less weight than the original samples. The proposed method generates synthetic samples from the estimated positive class distribution using a Gaussian mixture model. To prevent overfitting to excessive synthetic samples, different misclassification penalties are assigned to the original positive class, synthetic positive class, and negative class. The proposed method is numerically validated through simulations and an analysis of Republic of Korea Army artillery training data.
Computational Statistics & Data Analysis, vol. 203, Article 108078.
Citations: 0
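The idea in the entry above of giving different misclassification penalties to original positives, synthetic positives, and negatives can be illustrated with a weighted hinge-loss objective for a linear SVM. This is a minimal sketch, not the authors' procedure: the group labels (`"pos"`, `"syn"`, `"neg"`), the penalty values, and the function name are hypothetical, and the abstract does not state how the weights are actually derived.

```python
def weighted_hinge_objective(w, b, samples, penalties, lam=1.0):
    """Regularized hinge loss with one misclassification penalty per group.

    w, b      : weights and intercept of a linear classifier
    samples   : list of (x, y, group) with y in {-1, +1}
    penalties : {group: penalty}, e.g. original positive, synthetic positive, negative
    lam       : strength of the ridge term 0.5 * lam * ||w||^2
    """
    objective = 0.5 * lam * sum(wj * wj for wj in w)
    for x, y, group in samples:
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        # A sample contributes only when it violates the unit margin.
        objective += penalties[group] * max(0.0, 1.0 - margin)
    return objective


# Hypothetical penalties: synthetic positives count less than original ones.
example_penalties = {"pos": 4.0, "syn": 0.5, "neg": 1.0}
example_samples = [
    ([2.0, 0.0], +1, "pos"),    # margin 2.0, no loss
    ([0.5, 0.0], +1, "syn"),    # margin 0.5, hinge 0.5, penalty 0.5
    ([-0.5, 0.0], -1, "neg"),   # margin 0.5, hinge 0.5, penalty 1.0
]
example_value = weighted_hinge_objective([1.0, 0.0], 0.0, example_samples, example_penalties)
```

Lowering the `"syn"` penalty relative to `"pos"` is one way to express intuition (iii) of the abstract, that synthetic samples should carry less weight than the originals.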
Cox regression model with doubly truncated and interval-censored data
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-11-04; DOI: 10.1016/j.csda.2024.108090
Pao-sheng Shen
Interval sampling is an efficient sampling scheme used in epidemiological studies. Doubly truncated (DT) data arise under this sampling scheme when the failure time can be observed exactly. In practice, the failure time may not be observed exactly and might be recorded only within a time interval, leading to doubly truncated and interval-censored (DTIC) data. This article considers regression analysis of DTIC data under the Cox proportional hazards (PH) model and develops the conditional maximum likelihood estimators (cMLEs) for the regression parameters and the baseline cumulative hazard function of the model. The cMLEs are shown to be consistent and asymptotically normal. Simulation results indicate that the cMLEs perform well for samples of moderate size.
Computational Statistics & Data Analysis, vol. 203, Article 108090.
Citations: 0
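To make the notion of a conditional likelihood for doubly truncated, interval-censored data in the entry above more tangible, here is a hedged sketch of one subject's contribution. It assumes the usual ratio form P(left < T <= right | z) / P(u < T <= v | z) with a Cox survival function S(t | z) = exp(-Lambda0(t) exp(beta z)); the boundary conventions, the scalar covariate, and the function names are assumptions, and the paper's exact likelihood may differ.

```python
import math


def cox_survival(t, z, beta, baseline_cumhaz):
    """S(t | z) = exp(-Lambda0(t) * exp(beta * z)) under proportional hazards."""
    if math.isinf(t):
        return 0.0  # assumes the baseline cumulative hazard diverges
    return math.exp(-baseline_cumhaz(t) * math.exp(beta * z))


def dtic_log_contribution(left, right, u, v, z, beta, baseline_cumhaz):
    """Log conditional likelihood of one subject (illustrative form).

    The failure time is only known to lie in (left, right], and the subject
    is sampled only if the failure time falls in the truncation window (u, v].
    Requires u <= left < right <= v.
    """
    def surv(t):
        return cox_survival(t, z, beta, baseline_cumhaz)

    numerator = surv(left) - surv(right)   # P(left < T <= right | z)
    denominator = surv(u) - surv(v)        # P(u < T <= v | z)
    return math.log(numerator) - math.log(denominator)
```

Summing such terms over subjects and maximizing over `beta` and a parametrized `baseline_cumhaz` would give a conditional maximum likelihood fit in this simplified setting.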
Accelerating computation: A pairwise fitting technique for multivariate probit models
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-31; DOI: 10.1016/j.csda.2024.108082
Margaux Delporte, Geert Verbeke, Steffen Fieuws, Geert Molenberghs
Fitting multivariate probit models via maximum likelihood presents considerable computational challenges, particularly in terms of computation time and convergence difficulties, even for small numbers of responses. These issues are exacerbated when dealing with ordinal data. An efficient computational approach is introduced, based on a pairwise fitting technique within a pseudo-likelihood framework. This methodology is applied to clinical case studies, specifically using a trivariate probit model. Additionally, the correlation structure among outcomes is allowed to depend on covariates, enhancing both the flexibility and interpretability of the model. By way of simulation and real data applications, the proposed approach demonstrates superior computational efficiency as the dimension of the outcome vector increases. The method's ability to capture covariate-dependent correlations makes it particularly useful in medical research, where understanding complex associations among health outcomes is of scientific importance.
Computational Statistics & Data Analysis, vol. 203, Article 108082.
Citations: 0
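The bookkeeping behind a pairwise fitting strategy such as the one in the entry above can be sketched as follows: every pair of outcomes is fitted separately, and parameters that appear in several pairwise fits are then combined. The sketch assumes simple averaging as the combination rule and uses made-up parameter names and values; the bivariate probit fits themselves are not implemented here, and the abstract does not specify the combination rule.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(outcomes):
    """All unordered pairs of outcomes; there are m * (m - 1) / 2 of them."""
    return list(combinations(outcomes, 2))


def combine_pairwise_estimates(pair_fits):
    """Average every parameter over the pairwise fits in which it was estimated.

    pair_fits : {(outcome_a, outcome_b): {parameter_name: estimate}}
    """
    collected = defaultdict(list)
    for fit in pair_fits.values():
        for name, value in fit.items():
            collected[name].append(value)
    return {name: sum(values) / len(values) for name, values in collected.items()}


# Hypothetical results of three bivariate fits for a trivariate model.
example_fits = {
    ("y1", "y2"): {"beta_y1": 1.0, "beta_y2": -0.5, "rho_12": 0.30},
    ("y1", "y3"): {"beta_y1": 1.2, "beta_y3": 0.8, "rho_13": 0.10},
    ("y2", "y3"): {"beta_y2": -0.7, "beta_y3": 0.6, "rho_23": -0.20},
}
example_combined = combine_pairwise_estimates(example_fits)
```

Each pairwise fit involves only a two-dimensional integral, which is where the computational saving over a full multivariate likelihood would come from as the number of outcomes grows.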
A unified consensus-based parallel algorithm for high-dimensional regression with combined regularizations
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-30; DOI: 10.1016/j.csda.2024.108081
Xiaofei Wu, Rongmei Liang, Zhimin Zhang, Zhenyu Cui
The parallel algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solving statistical learning models. However, there is currently limited research on parallel algorithms specifically designed for high-dimensional regression with combined regularization terms. These terms, such as elastic-net, sparse group lasso, sparse fused lasso, and their nonconvex variants, have gained significant attention in various fields due to their ability to incorporate prior information and promote sparsity within specific groups or fused variables. The scarcity of parallel algorithms for combined regularizations can be attributed to the inherent nonsmoothness and complexity of these terms, as well as the absence of closed-form solutions for certain proximal operators associated with them. This paper proposes a unified constrained optimization formulation based on the consensus problem for these types of convex and nonconvex regression problems, and derives the corresponding parallel alternating direction method of multipliers (ADMM) algorithms. Furthermore, it is proven that the proposed algorithm not only has global convergence but also exhibits a linear convergence rate. It is worth noting that the computational complexity of the proposed algorithm remains the same for different regularization terms and losses, which implicitly demonstrates the universality of this algorithm. Extensive simulation experiments, along with a financial example, serve to demonstrate the reliability, stability, and scalability of our algorithm. The R package for implementing the proposed algorithm can be obtained at https://github.com/xfwu1016/CPADMM.
Computational Statistics & Data Analysis, vol. 203, Article 108081.
Citations: 0
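The consensus idea behind the parallel ADMM in the entry above can be shown on a deliberately tiny problem. The sketch below is the textbook global-consensus ADMM for a scalar lasso, minimizing the sum over i of 0.5 (a_i x - b_i)^2 plus lam |x|; it is not the CPADMM algorithm of the paper and handles only a single l1 penalty. All names and default values are illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |.|."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def consensus_admm_lasso(a, b, lam, rho=1.0, n_iter=200):
    """Minimize sum_i 0.5 * (a[i] * x - b[i])**2 + lam * |x| by consensus ADMM.

    Each 'worker' i keeps a local copy x[i]; z is the shared consensus value.
    """
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n  # scaled dual variables
    z = 0.0
    for _ in range(n_iter):
        # Local updates: independent across workers, so they could run in parallel.
        x = [(a[i] * b[i] + rho * (z - u[i])) / (a[i] ** 2 + rho) for i in range(n)]
        # Consensus update: prox of the penalty at the average of x[i] + u[i].
        avg = sum(x[i] + u[i] for i in range(n)) / n
        z = soft_threshold(avg, lam / (rho * n))
        # Dual updates push each local copy towards the consensus value.
        u = [u[i] + x[i] - z for i in range(n)]
    return z
```

For this scalar problem the minimizer is available in closed form as the soft-thresholded value of sum(a_i b_i) at level lam, divided by sum(a_i^2), which gives a convenient check of the iteration.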
Multi-model subset selection
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-29; DOI: 10.1016/j.csda.2024.108073
Anthony-Alexander Christidis, Stefan Van Aelst, Ruben Zamar
The two primary approaches for high-dimensional regression problems are sparse methods (e.g., best subset selection, which uses the $\ell_0$-norm in the penalty) and ensemble methods (e.g., random forests). Although sparse methods typically yield interpretable models, in terms of prediction accuracy they are often outperformed by “blackbox” multi-model ensemble methods. A regression ensemble is introduced which combines the interpretability of sparse methods with the high prediction accuracy of ensemble methods. An algorithm is proposed to solve the joint optimization of the corresponding $\ell_0$-penalized regression models by extending recent developments in $\ell_0$-optimization for sparse methods to multi-model regression ensembles. The sparse and diverse models in the ensemble are learned simultaneously from the data. Each of these models provides an explanation for the relationship between a subset of predictors and the response variable. Empirical studies and theoretical knowledge about ensembles are used to gain insight into the ensemble method's performance, focusing on the interplay between bias, variance, covariance, and variable selection. In prediction tasks, the ensembles can outperform state-of-the-art competitors on both simulated and real data. Forward stepwise regression is also generalized to multi-model regression ensembles and used to obtain an initial solution for the algorithm. The optimization algorithms are implemented in publicly available software packages.
Computational Statistics & Data Analysis, vol. 203, Article 108073.
Citations: 0
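One way to write down a joint problem of the kind described in the entry above is sketched below. It is an illustrative constrained formulation, not necessarily the one used in the paper: the number of models $G$, the sparsity bound $t$, and the sharing bound $u$ are assumed symbols, and the abstract speaks of penalized rather than constrained models.

```latex
\min_{\beta^{1},\dots,\beta^{G} \in \mathbb{R}^{p}}
  \sum_{g=1}^{G} \bigl\| y - X\beta^{g} \bigr\|_2^2
\quad \text{subject to} \quad
\bigl\| \beta^{g} \bigr\|_0 \le t \;\; (g = 1,\dots,G),
\qquad
\sum_{g=1}^{G} \mathbf{1}\bigl\{ \beta^{g}_{j} \neq 0 \bigr\} \le u \;\; (j = 1,\dots,p).
```

The first constraint keeps each model sparse, and the second limits how many models may use the same predictor, which is one way to encourage diversity. With $G = 1$ the problem reduces to ordinary best subset selection.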
Vine copula based structural equation models
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-26; DOI: 10.1016/j.csda.2024.108076
Claudia Czado
Gaussian linear structural equation models (SEMs) are often used as a statistical model associated with a directed acyclic graph (DAG), also known as a Bayesian network. However, such a model might not be able to represent the non-Gaussian dependence present in some data sets, which results in nonlinear, non-additive and non-Gaussian conditional distributions. Therefore, the use of the class of D-vine copula based regression models for the specification of the conditional distribution of a node given its parents is proposed. This class extends the class of standard linear regression models considerably. The approach also makes it possible to create an importance order of the parents of each node and to remove edges from the starting DAG that are not supported by the data. Furthermore, the uncertainty of conditional estimates can be assessed, and fast generative simulation using the D-vine copula based SEM is available. Simulations with random specifications of the D-vine based SEM show the improvement over a Gaussian linear SEM, as well as the ability to correctly remove edges not present in the data generation. An engineering application showcases the usefulness of the proposals.
Computational Statistics & Data Analysis, vol. 203, Article 108076.
Citations: 0
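As a point of reference for the entry above, the Gaussian linear SEM baseline on a DAG is easy to simulate: each node is a linear function of its parents plus independent Gaussian noise, generated in topological order. The sketch below covers only this baseline, not the D-vine copula extension; the argument layout and names are assumptions for illustration.

```python
import random


def simulate_linear_sem(order, parents, coef, noise_sd, rng=None):
    """Draw one observation from a Gaussian linear SEM on a DAG.

    order    : node names in a topological order (parents before children)
    parents  : {node: [parent, ...]}; nodes without an entry have no parents
    coef     : {(parent, node): edge coefficient}
    noise_sd : {node: standard deviation of the Gaussian noise term}
    """
    rng = rng or random.Random(0)
    values = {}
    for node in order:
        # Structural equation: linear in the parents, plus independent noise.
        mean = sum(coef[(p, node)] * values[p] for p in parents.get(node, []))
        values[node] = mean + rng.gauss(0.0, noise_sd[node])
    return values


# Hypothetical three-node DAG: a -> b, a -> c, b -> c.
example_draw = simulate_linear_sem(
    order=["a", "b", "c"],
    parents={"b": ["a"], "c": ["a", "b"]},
    coef={("a", "b"): 2.0, ("a", "c"): 1.0, ("b", "c"): -1.0},
    noise_sd={"a": 1.0, "b": 0.0, "c": 0.0},
)
```

In the copula-based approach the linear conditional mean and Gaussian noise of each node would be replaced by a more flexible conditional distribution given the parents, while the same node-by-node generative order would still apply.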
Bayesian grouping-Gibbs sampling estimation of high-dimensional linear model with non-sparsity
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-23; DOI: 10.1016/j.csda.2024.108072
Shanshan Qin, Guanlin Zhang, Yuehua Wu, Zhongyi Zhu
In high-dimensional linear regression models, common assumptions typically entail sparsity of the regression coefficients $\beta \in \mathbb{R}^p$. However, these assumptions may not hold when the majority, if not all, of the regression coefficients are nonzero. Statistical methods designed for sparse models may then lead to substantial bias in model estimation. Therefore, this article proposes a novel Bayesian Grouping-Gibbs Sampling (BGGS) method, which departs from the common sparsity assumptions in high-dimensional problems. The BGGS method leverages a grouping strategy that partitions $\beta$ into distinct groups, facilitating rapid sampling in high-dimensional space. The grouping number ($k$) can be determined using the ‘Elbow plot’, which operates efficiently and is robust against the initial value. Theoretical analysis, under some regular conditions, guarantees model selection and parameter estimation consistency, and provides a bound for the prediction error. Furthermore, three finite simulations are conducted to assess the competitive advantages of the proposed method in terms of parameter estimation and prediction accuracy. Finally, the BGGS method is applied to a financial dataset to explore its practical utility.
Computational Statistics & Data Analysis, vol. 203, Article 108072.
Citations: 0
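Why sampling the coefficients group by group can be fast, as the entry above suggests, is visible in the standard block-Gibbs conditional for a linear model. The display below assumes $y = X\beta + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and an independent Gaussian prior $\beta \sim \mathcal{N}(0, \tau^2 I)$; the prior actually used by the BGGS method is not stated in the abstract, so this is only an illustration. Here $X_g$ and $\beta_g$ denote the columns and coefficients of group $g$, and $X_{-g}$, $\beta_{-g}$ the remaining ones.

```latex
\beta_g \mid \beta_{-g}, y \;\sim\; \mathcal{N}\!\left(\mu_g, \Sigma_g\right),
\qquad
\Sigma_g = \left(\frac{X_g^{\top} X_g}{\sigma^{2}} + \frac{I}{\tau^{2}}\right)^{-1},
\qquad
\mu_g = \frac{1}{\sigma^{2}}\, \Sigma_g X_g^{\top} \bigl(y - X_{-g}\beta_{-g}\bigr).
```

Each update inverts only a matrix whose size equals the number of coefficients in the group, rather than a full $p \times p$ matrix, so the choice of the number of groups $k$ trades off per-step cost against how many blocks must be cycled through.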
A comparative analysis of different adjustment sets using propensity score based estimators
IF 1.5; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; Pub Date: 2024-10-21; DOI: 10.1016/j.csda.2024.108079
Shanshan Luo, Jiaqi Min, Wei Li, Xueli Wang, Zhi Geng
Propensity score based estimators are commonly employed in observational studies to address baseline confounders, without explicitly modeling their association with the outcome. In this paper, to fully leverage these estimators, we consider a series of regression models for improving estimation efficiency. The proposed estimators rely solely on a properly modeled propensity score and do not require the correct specification of outcome models. In addition, we conduct a comparative analysis by applying the proposed estimators to four different adjustment sets, each consisting of background covariates. The theoretical results imply that incorporating predictive covariates into both the propensity score and the regression model yields the lowest asymptotic variance. However, including instrumental variables in the propensity score may decrease the estimation efficiency of the proposed estimators. To evaluate the performance of the proposed estimators, we conduct simulation studies and provide a real data example.
Computational Statistics & Data Analysis, vol. 203, Article 108079.
Citations: 0
Unified specification tests in partially linear time series models
IF 1.5 Zone 3 (Mathematics) Q3 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS Pub Date: 2024-10-17 DOI: 10.1016/j.csda.2024.108074
Shuang Sun, Zening Song, Xiaojun Song
Based on a residual marked empirical process, Cramér–von Mises and Kolmogorov–Smirnov tests are proposed for the correct specification of the nonparametric components in partially linear time series models. The tests are unified in the sense that the asymptotic distribution of residual marked empirical process is invariant across different n^ν-consistent estimators in calculating residuals (where ν > 1/4) under the null. In addition, the residual marked empirical process has the same power property under the sequence of local alternatives regardless of the estimators used. Achieved through a projection method, these features also enable using a computationally convenient multiplier bootstrap to approximate the unified null distributions of the test statistics. Simulations show satisfactory finite-sample performance of the proposed method. The application to validate the parametric form of conditional variance in the ARCH-X model is also highlighted, along with an empirical analysis of the conditional variance of the FTSE 100 index return series.
Computational Statistics & Data Analysis, vol. 203, Article 108074.
Citations: 0