
Latest articles from Statistics and Computing

Sparse Bayesian learning using TMB (Template Model Builder)
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-28. DOI: 10.1007/s11222-024-10476-8
Ingvild M. Helgøy, Hans J. Skaug, Yushu Li

Sparse Bayesian Learning, and more specifically the Relevance Vector Machine (RVM), can be used in supervised learning for both classification and regression problems. Such methods are particularly useful when applied to big data in order to find a sparse (in weight space) representation of the model. This paper demonstrates that the Template Model Builder (TMB) is an accurate and flexible computational framework for implementing sparse Bayesian learning methods. The user of TMB is only required to specify the joint likelihood of the weights and the data, while the Laplace approximation of the marginal likelihood is automatically evaluated to numerical precision. This approximation is in turn used to estimate hyperparameters by maximum marginal likelihood. In order to reduce the computational cost of the Laplace approximation, we introduce the notion of an “active set” of weights, and we devise an algorithm for dynamically updating this set until convergence, similar to what is done in other RVM-type methods. We implement two different methods using TMB: the RVM and the Probabilistic Feature Selection and Classification Vector Machine method, where the latter also performs feature selection. Experiments based on benchmark data show that our TMB implementation performs comparably to the original implementation, but at a lower implementation cost. TMB can also calculate model and prediction uncertainty, by including estimation uncertainty from both the latent variables and the hyperparameters. In conclusion, we find that TMB is a flexible tool that facilitates implementation and prototyping of sparse Bayesian methods.

Citations: 0
A new maximum mean discrepancy based two-sample test for equal distributions in separable metric spaces
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-25. DOI: 10.1007/s11222-024-10483-9
Bu Zhou, Zhi Peng Ong, Jin-Ting Zhang

This paper presents a novel two-sample test for equal distributions in separable metric spaces, utilizing the maximum mean discrepancy (MMD). The test statistic is derived from the decomposition of the total variation of data in the reproducing kernel Hilbert space, and can be regarded as a V-statistic-based estimator of the squared MMD. The paper establishes the asymptotic null and alternative distributions of the test statistic. To approximate the null distribution accurately, a three-cumulant matched chi-squared approximation method is employed. The parameters for this approximation are consistently estimated from the data. Additionally, the paper introduces a new data-adaptive method based on the median absolute deviation to select the kernel width of the Gaussian kernel, and a new permutation test combining two different Gaussian kernel width selection methods, which improve the adaptability of the test to different data sets. Fast implementation of the test using matrix calculation is discussed. Extensive simulation studies and three real data examples are presented to demonstrate the good performance of the proposed test.
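The ingredients described above (a V-statistic estimate of the squared MMD, a data-adaptive Gaussian kernel width, and a permutation test) can be sketched in a few lines of NumPy. Here the kernel width uses the common median heuristic rather than the paper's MAD-based selector, and all function names are illustrative:

```python
import numpy as np

def gaussian_kernel(X, Y, width):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def mmd2(X, Y, width):
    """V-statistic (biased) estimate of the squared MMD."""
    return (gaussian_kernel(X, X, width).mean()
            + gaussian_kernel(Y, Y, width).mean()
            - 2.0 * gaussian_kernel(X, Y, width).mean())

def median_width(Z):
    """Median-heuristic kernel width (the paper uses a MAD-based variant)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.sqrt(np.median(d2[d2 > 0]) / 2.0)

def mmd_perm_test(X, Y, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples share one distribution."""
    rng = np.random.default_rng(seed)
    Z, n = np.vstack([X, Y]), len(X)
    width = median_width(Z)
    obs = mmd2(X, Y, width)
    exceed = sum(
        mmd2(Z[p[:n]], Z[p[n:]], width) >= obs
        for p in (rng.permutation(len(Z)) for _ in range(n_perm))
    )
    return obs, (exceed + 1) / (n_perm + 1)
```

For two clearly separated Gaussian samples the observed statistic dominates almost every permuted one, giving a small p-value; the matrix-based kernel evaluations correspond to the fast implementation the abstract mentions.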

Citations: 0
Wasserstein principal component analysis for circular measures
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-24. DOI: 10.1007/s11222-024-10473-x
Mario Beraha, Matteo Pegoraro

We consider the 2-Wasserstein space of probability measures supported on the unit-circle, and propose a framework for Principal Component Analysis (PCA) for data living in such a space. We build on a detailed investigation of the optimal transportation problem for measures on the unit-circle which might be of independent interest. In particular, building on previously obtained results, we derive an expression for optimal transport maps in (almost) closed form and propose an alternative definition of the tangent space at an absolutely continuous probability measure, together with fundamental characterizations of the associated exponential and logarithmic maps. PCA is performed by mapping data on the tangent space at the Wasserstein barycentre, which we approximate via an iterative scheme, and for which we establish a sufficient a posteriori condition to assess its convergence. Our methodology is illustrated on several simulated scenarios and a real data analysis of measurements of optical nerve thickness.

Citations: 0
Individualized causal mediation analysis with continuous treatment using conditional generative adversarial networks
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-23. DOI: 10.1007/s11222-024-10484-8
Cheng Huan, Xinyuan Song, Hongwei Yuan

Traditional methods used in causal mediation analysis with continuous treatment often focus on estimating average causal effects, limiting their applicability in precision medicine. Machine learning techniques have emerged as a powerful approach for precisely estimating individualized causal effects. This paper proposes a novel method called CGAN-ICMA-CT that leverages Conditional Generative Adversarial Networks (CGANs) to infer individualized causal effects with continuous treatment. We thoroughly investigate the convergence properties of CGAN-ICMA-CT and show that the estimated distribution of our inferential conditional generator converges to the true conditional distribution under mild conditions. We conduct numerical experiments to validate the effectiveness of CGAN-ICMA-CT and compare it with four commonly used methods: linear regression, support vector machine regression, decision tree, and random forest regression. The results demonstrate that CGAN-ICMA-CT outperforms these methods in both accuracy and precision. Furthermore, we apply the CGAN-ICMA-CT model to the real-world Job Corps dataset, showcasing its practical utility. By utilizing CGAN-ICMA-CT, we estimate the individualized causal effects of the Job Corps program on the number of arrests, providing insights into both direct effects and effects mediated through intermediate variables. Our findings confirm the potential of CGAN-ICMA-CT in advancing individualized causal mediation analysis with continuous treatment in precision medicine settings.

Citations: 0
Taming numerical imprecision by adapting the KL divergence to negative probabilities
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-13. DOI: 10.1007/s11222-024-10480-y
Simon Pfahler, Peter Georg, Rudolf Schill, Maren Klever, Lars Grasedyck, Rainer Spang, Tilo Wettig

The Kullback–Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations.
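The core idea, shifting entries so that a few small negative values no longer make the KL sum undefined, can be sketched as follows. This is one plausible per-entry-shift realization under assumptions of ours; the paper's exact sKL definition and its guidance on choosing the shift parameters should be consulted for the real construction.

```python
import numpy as np

def skl_divergence(p, q, shift):
    """Illustrative shifted-KL: add a shift to both vectors entrywise so
    that small negative entries from numerical approximation still give
    finite, well-defined log terms. With shift = 0 and strictly positive
    entries this reduces to the ordinary KL sum."""
    ps, qs = p + shift, q + shift
    if np.any(ps <= 0) or np.any(qs <= 0):
        raise ValueError("shift too small for these entries")
    return float(np.sum(ps * np.log(ps / qs)))
```

A vector such as `[1.0 + 1e-7, -1e-7]`, which would make the plain KL divergence undefined, yields a finite value once a small positive shift is applied.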

Citations: 0
A Bayesian approach to modeling finite element discretization error
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-09. DOI: 10.1007/s11222-024-10463-z
Anne Poot, Pierre Kerfriden, Iuri Rocha, Frans van der Meer

In this work, the uncertainty associated with the finite element discretization error is modeled following the Bayesian paradigm. First, a continuous formulation is derived, where a Gaussian process prior over the solution space is updated based on observations from a finite element discretization. To avoid the computation of intractable integrals, a second, finer, discretization is introduced that is assumed sufficiently dense to represent the true solution field. A prior distribution is assumed over the fine discretization, which is then updated based on observations from the coarse discretization. This yields a posterior distribution with a mean that serves as an estimate of the solution, and a covariance that models the uncertainty associated with this estimate. Two particular choices of prior are investigated: a prior defined implicitly by assigning a white noise distribution to the right-hand side term, and a prior whose covariance function is equal to the Green’s function of the partial differential equation. The former yields a posterior distribution with a mean close to the reference solution, but a covariance that contains little information regarding the finite element discretization error. The latter, on the other hand, yields a posterior distribution with a mean equal to the coarse finite element solution, and a covariance with a close connection to the discretization error. For both choices of prior a contradiction arises, since the discretization error depends on the right-hand side term, but the posterior covariance does not. We demonstrate how, by rescaling the eigenvalues of the posterior covariance, this independence can be avoided.

Citations: 0
AR-ADASYN: angle radius-adaptive synthetic data generation approach for imbalanced learning
IF 1.6, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-08. DOI: 10.1007/s11222-024-10479-5
Hyejoon Park, Hyunjoong Kim
Citations: 0
Roughness regularization for functional data analysis with free knots spline estimation
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-08. DOI: 10.1007/s11222-024-10474-w
Anna De Magistris, Valentina De Simone, Elvira Romano, Gerardo Toraldo

In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically, at distinct time intervals. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. Currently proposed FDA methods often encounter challenges, especially when dealing with curves of varying shapes. This can largely be attributed to their strong dependence on data approximation as a key aspect of the analysis process. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data.
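To make the idea of roughness-penalized spline estimation concrete, here is a minimal fixed-knot sketch in NumPy: a truncated-power basis fitted by least squares with a ridge-type penalty on the knot coefficients. This is a simpler baseline than the paper's estimator, which additionally optimizes the knot locations ("free knots") and uses two penalty terms; all names and the penalty structure here are illustrative assumptions.

```python
import numpy as np

def spline_basis(x, knots, degree=3):
    """Truncated-power spline basis: polynomial terms plus one truncated
    power per interior knot."""
    cols = [x ** d for d in range(degree + 1)]
    cols += [np.maximum(x - k, 0.0) ** degree for k in knots]
    return np.stack(cols, axis=1)

def penalized_spline_fit(x, y, knots, lam=1e-2, degree=3):
    """Roughness-penalized least squares with fixed knots: a ridge-type
    penalty shrinks the truncated-power coefficients, i.e. the jumps in
    the spline's third derivative at the knots."""
    B = spline_basis(x, knots, degree)
    pen = np.zeros(B.shape[1])
    pen[degree + 1:] = 1.0                     # penalize only the knot terms
    coef = np.linalg.solve(B.T @ B + lam * np.diag(pen), B.T @ y)
    return coef, B @ coef
```

Increasing `lam` trades fidelity for smoothness; a free-knot method would instead treat the entries of `knots` as parameters to be optimized alongside `coef`.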

Citations: 0
Learning variational autoencoders via MCMC speed measures
IF 2.2, CAS Zone 2 (Mathematics), Q2 COMPUTER SCIENCE, THEORY & METHODS. Pub Date: 2024-08-06. DOI: 10.1007/s11222-024-10481-x
Marcel Hirt, Vasileios Kreouzis, Petros Dellaportas

Variational autoencoders (VAEs) are popular likelihood-based generative models which can be efficiently trained by maximising an evidence lower bound. There has been much progress in improving the expressiveness of the variational distribution to obtain tighter variational bounds and increased generative performance. Whilst previous work has leveraged Markov chain Monte Carlo methods for constructing variational densities, gradient-based methods for adapting the proposal distributions for deep latent variable models have received less attention. This work suggests an entropy-based adaptation for a short-run metropolis-adjusted Langevin or Hamiltonian Monte Carlo (HMC) chain while optimising a tighter variational bound to the log-evidence. Experiments show that this approach yields higher held-out log-likelihoods as well as improved generative metrics. Our implicit variational density can adapt to complicated posterior geometries of latent hierarchical representations arising in hierarchical VAEs.

引用次数: 0
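The abstract above centres on running a short-run Metropolis-adjusted Langevin (MALA) or HMC chain to refine latent draws inside VAE training. As a rough illustration of that building block only — not the paper's entropy-based speed-measure adaptation — here is a minimal NumPy sketch of a short-run MALA chain pulling a poor latent draw toward a toy Gaussian posterior. The target `logp`, its gradient, the step size, and the chain length are all illustrative choices, not values from the paper.

```python
import numpy as np

def mala_short_run(logp, grad_logp, x0, step, n_steps, rng):
    """Short-run Metropolis-adjusted Langevin chain started at x0.

    Returns the final state; in a VAE this kind of chain would refine a
    latent drawn from the encoder before it enters the ELBO estimate.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        # Langevin proposal: gradient drift plus Gaussian noise.
        mean_fwd = x + step * grad_logp(x)
        prop = mean_fwd + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        mean_bwd = prop + step * grad_logp(prop)
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_q_fwd = -np.sum((prop - mean_fwd) ** 2) / (4.0 * step)
        log_q_bwd = -np.sum((x - mean_bwd) ** 2) / (4.0 * step)
        log_alpha = logp(prop) - logp(x) + log_q_bwd - log_q_fwd
        if np.log(rng.uniform()) < log_alpha:
            x = prop
    return x

# Toy target: standard 2-D Gaussian "posterior" over latents.
logp = lambda z: -0.5 * np.sum(z ** 2)
grad_logp = lambda z: -z

rng = np.random.default_rng(0)
z0 = np.array([5.0, -5.0])   # a poor encoder draw, far from the mode
z = mala_short_run(logp, grad_logp, z0, step=0.5, n_steps=50, rng=rng)
print(z)                     # refined latent, drawn in toward the mode
```

The paper's contribution is in how the chain's tuning parameters are adapted with entropy-based objectives while a tighter variational bound is optimised; this sketch only shows the untuned short-run chain itself.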
The COR criterion for optimal subset selection in distributed estimation
IF 2.2 CAS Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-08-02 DOI: 10.1007/s11222-024-10471-z
Guangbao Guo, Haoyue Song, Lixing Zhu

The problem of selecting an optimal subset in distributed regression is a crucial issue, as each distributed data subset may contain redundant information, which can be attributed to various sources such as outliers, dispersion, inconsistent duplicates, too many independent variables, and excessive data points, among others. Efficient reduction and elimination of this redundancy can help alleviate inconsistency issues for statistical inference. Therefore, it is imperative to track redundancy while measuring and processing data. We develop a criterion for optimal subset selection that is related to Covariance matrices, Observation matrices, and Response vectors (COR). We also derive a novel distributed interval estimation for the proposed criterion and establish the existence of optimal subset length. Finally, numerical experiments are conducted to verify the experimental feasibility of the proposed criterion.

{"title":"The COR criterion for optimal subset selection in distributed estimation","authors":"Guangbao Guo, Haoyue Song, Lixing Zhu","doi":"10.1007/s11222-024-10471-z","DOIUrl":"https://doi.org/10.1007/s11222-024-10471-z","url":null,"abstract":"<p>The problem of selecting an optimal subset in distributed regression is a crucial issue, as each distributed data subset may contain redundant information, which can be attributed to various sources such as outliers, dispersion, inconsistent duplicates, too many independent variables, and excessive data points, among others. Efficient reduction and elimination of this redundancy can help alleviate inconsistency issues for statistical inference. Therefore, it is imperative to track redundancy while measuring and processing data. We develop a criterion for optimal subset selection that is related to Covariance matrices, Observation matrices, and Response vectors (COR). We also derive a novel distributed interval estimation for the proposed criterion and establish the existence of optimal subset length. Finally, numerical experiments are conducted to verify the experimental feasibility of the proposed criterion.</p>","PeriodicalId":22058,"journal":{"name":"Statistics and Computing","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141882928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
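The abstract describes scoring distributed data blocks and estimating on an optimally chosen subset. The sketch below illustrates that general workflow under a stand-in scoring rule — the condition number of X^T X as a crude redundancy/collinearity proxy — rather than the paper's actual COR statistic; the data, block structure, and scoring choice are all hypothetical.

```python
import numpy as np

def select_subset(X_parts, y_parts):
    """Score each distributed block and return the index of the best one.

    Condition number of X^T X is used here only as an illustrative proxy
    for redundancy; the COR criterion in the paper combines covariance
    matrices, observation matrices, and response vectors differently.
    """
    scores = [np.linalg.cond(X.T @ X) for X in X_parts]
    return int(np.argmin(scores))

rng = np.random.default_rng(1)
beta = np.array([2.0, -1.0, 0.5])

# Simulate three distributed blocks; block 2 has nearly collinear columns,
# i.e. the kind of redundancy the abstract warns about.
X_parts, y_parts = [], []
for k in range(3):
    X = rng.standard_normal((200, 3))
    if k == 2:
        X[:, 1] = X[:, 0] + 1e-3 * rng.standard_normal(200)
    X_parts.append(X)
    y_parts.append(X @ beta + 0.1 * rng.standard_normal(200))

best = select_subset(X_parts, y_parts)
X, y = X_parts[best], y_parts[best]
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(best, beta_hat)  # a well-conditioned block is chosen, not block 2
```

The redundant block receives a very large score and is avoided, so the least-squares fit on the selected block recovers the coefficients accurately.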
Journal: Statistics and Computing