Statistics and Computing: Latest Articles
Individualized causal mediation analysis with continuous treatment using conditional generative adversarial networks
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-23 | DOI: 10.1007/s11222-024-10484-8
Cheng Huan, Xinyuan Song, Hongwei Yuan

Traditional methods used in causal mediation analysis with continuous treatment often focus on estimating average causal effects, limiting their applicability in precision medicine. Machine learning techniques have emerged as a powerful approach for precisely estimating individualized causal effects. This paper proposes a novel method called CGAN-ICMA-CT that leverages Conditional Generative Adversarial Networks (CGANs) to infer individualized causal effects with continuous treatment. We thoroughly investigate the convergence properties of CGAN-ICMA-CT and show that the estimated distribution of our inferential conditional generator converges to the true conditional distribution under mild conditions. We conduct numerical experiments to validate the effectiveness of CGAN-ICMA-CT and compare it with four commonly used methods: linear regression, support vector machine regression, decision tree, and random forest regression. The results demonstrate that CGAN-ICMA-CT outperforms these methods regarding accuracy and precision. Furthermore, we apply the CGAN-ICMA-CT model to the real-world Job Corps dataset, showcasing its practical utility. By utilizing CGAN-ICMA-CT, we estimate the individualized causal effects of the Job Corps program on the number of arrests, providing insights into both direct effects and effects mediated through intermediate variables. Our findings confirm the potential of CGAN-ICMA-CT in advancing individualized causal mediation analysis with continuous treatment in precision medicine settings.

Citations: 0
Taming numerical imprecision by adapting the KL divergence to negative probabilities
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-13 | DOI: 10.1007/s11222-024-10480-y
Simon Pfahler, Peter Georg, Rudolf Schill, Maren Klever, Lars Grasedyck, Rainer Spang, Tilo Wettig

The Kullback–Leibler (KL) divergence is frequently used in data science. For discrete distributions on large state spaces, approximations of probability vectors may result in a few small negative entries, rendering the KL divergence undefined. We address this problem by introducing a parameterized family of substitute divergence measures, the shifted KL (sKL) divergence measures. Our approach is generic and does not increase the computational overhead. We show that the sKL divergence shares important theoretical properties with the KL divergence and discuss how its shift parameters should be chosen. If Gaussian noise is added to a probability vector, we prove that the average sKL divergence converges to the KL divergence for small enough noise. We also show that our method solves the problem of negative entries in an application from computational oncology, the optimization of Mutual Hazard Networks for cancer progression using tensor-train approximations.
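
To make the failure mode concrete, here is a minimal sketch of why a single small negative entry breaks the KL divergence and how a shift can restore a usable value. The abstract does not give the sKL formula, so the form used here (one common shift `eps` added to both vectors, followed by the generalized KL divergence of the shifted entries) is an assumption for illustration and not necessarily the authors' definition. The names `kl`, `shifted_kl`, and `eps` are hypothetical.

```python
import math

def kl(p, q):
    """Plain KL divergence sum(p_i * log(p_i / q_i)); nan if an entry is negative."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi < 0 or qi < 0:
            return math.nan      # negative entries leave the divergence undefined
        if pi == 0:
            continue             # convention: 0 * log(0 / q) = 0
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def shifted_kl(p, q, eps):
    """Illustrative shifted divergence (assumed form, see lead-in).

    Both vectors are shifted by eps before the logarithm is taken. The extra
    terms (-ps + qs) keep the value non-negative even if the sums differ.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        ps, qs = pi + eps, qi + eps
        if ps <= 0 or qs <= 0:
            raise ValueError("eps must exceed the magnitude of the most negative entry")
        total += ps * math.log(ps / qs) - ps + qs
    return total

p = [0.5, 0.3, 0.2]
q = [0.7, 0.301, -0.001]         # an approximation with one small negative entry
print(kl(p, q))                  # nan
print(shifted_kl(p, q, 0.01))    # finite and non-negative
```

For strictly positive vectors with equal sums, this shifted form approaches the ordinary KL divergence as `eps` goes to zero, which mirrors the kind of consistency property the abstract describes.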

Citations: 0
A Bayesian approach to modeling finite element discretization error
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-09 | DOI: 10.1007/s11222-024-10463-z
Anne Poot, Pierre Kerfriden, Iuri Rocha, Frans van der Meer

In this work, the uncertainty associated with the finite element discretization error is modeled following the Bayesian paradigm. First, a continuous formulation is derived, where a Gaussian process prior over the solution space is updated based on observations from a finite element discretization. To avoid the computation of intractable integrals, a second, finer, discretization is introduced that is assumed sufficiently dense to represent the true solution field. A prior distribution is assumed over the fine discretization, which is then updated based on observations from the coarse discretization. This yields a posterior distribution with a mean that serves as an estimate of the solution, and a covariance that models the uncertainty associated with this estimate. Two particular choices of prior are investigated: a prior defined implicitly by assigning a white noise distribution to the right-hand side term, and a prior whose covariance function is equal to the Green’s function of the partial differential equation. The former yields a posterior distribution with a mean close to the reference solution, but a covariance that contains little information regarding the finite element discretization error. The latter, on the other hand, yields posterior distribution with a mean equal to the coarse finite element solution, and a covariance with a close connection to the discretization error. For both choices of prior a contradiction arises, since the discretization error depends on the right-hand side term, but the posterior covariance does not. We demonstrate how, by rescaling the eigenvalues of the posterior covariance, this independence can be avoided.
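
The update step described in this abstract (a Gaussian prior over a fine discretization, conditioned on observations from a coarse one) can be sketched with standard noise-free Gaussian conditioning. The operator `H`, the kernel used for `Sigma`, and the function name `condition_gaussian` are assumptions chosen for illustration; the paper's actual observation operator and priors are not reproduced here.

```python
import numpy as np

def condition_gaussian(Sigma, H, y):
    """Condition a zero-mean Gaussian u ~ N(0, Sigma) on exact linear data y = H u.

    Returns the posterior mean and covariance of u.
    """
    S = H @ Sigma @ H.T                        # covariance of the observations
    gain = np.linalg.solve(S, H @ Sigma).T     # Sigma H^T S^{-1} (S is symmetric)
    mean = gain @ y
    cov = Sigma - gain @ H @ Sigma             # note: y does not appear here
    return mean, cov

# Hypothetical setup: 6 fine degrees of freedom, 3 coarse observations,
# each coarse value being the average of two neighbouring fine values.
x = np.linspace(0.0, 1.0, 6)
Sigma = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.3) ** 2)
H = np.kron(np.eye(3), [[0.5, 0.5]])
y = np.array([1.0, -0.5, 2.0])

mean, cov = condition_gaussian(Sigma, H, y)
print(mean)
print(np.diag(cov))
```

The posterior covariance in this formula never involves `y`. That is the mechanism behind the contradiction the abstract points out: the data, and hence the right-hand side, shift the mean but leave the covariance unchanged.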

Citations: 0
Roughness regularization for functional data analysis with free knots spline estimation
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-08 | DOI: 10.1007/s11222-024-10474-w
Anna De Magistris, Valentina De Simone, Elvira Romano, Gerardo Toraldo

In the era of big data, an ever-growing volume of information is recorded, either continuously over time or sporadically, at distinct time intervals. Functional Data Analysis (FDA) stands at the cutting edge of this data revolution, offering a powerful framework for handling and extracting meaningful insights from such complex datasets. The currently proposed FDA methods can often encounter challenges, especially when dealing with curves of varying shapes. This can largely be attributed to the method’s strong dependence on data approximation as a key aspect of the analysis process. In this work, we propose a free knots spline estimation method for functional data with two penalty terms and demonstrate its performance by comparing the results of several clustering methods on simulated and real data.

Citations: 0
Learning variational autoencoders via MCMC speed measures
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-06 | DOI: 10.1007/s11222-024-10481-x
Marcel Hirt, Vasileios Kreouzis, Petros Dellaportas

Variational autoencoders (VAEs) are popular likelihood-based generative models which can be efficiently trained by maximising an evidence lower bound. There has been much progress in improving the expressiveness of the variational distribution to obtain tighter variational bounds and increased generative performance. Whilst previous work has leveraged Markov chain Monte Carlo methods for constructing variational densities, gradient-based methods for adapting the proposal distributions for deep latent variable models have received less attention. This work suggests an entropy-based adaptation for a short-run metropolis-adjusted Langevin or Hamiltonian Monte Carlo (HMC) chain while optimising a tighter variational bound to the log-evidence. Experiments show that this approach yields higher held-out log-likelihoods as well as improved generative metrics. Our implicit variational density can adapt to complicated posterior geometries of latent hierarchical representations arising in hierarchical VAEs.
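
For readers unfamiliar with the sampler named in this abstract, the following is a minimal sketch of a short-run Metropolis-adjusted Langevin (MALA) chain with a fixed step size. It shows only the textbook MALA transition; the entropy-based adaptation of the proposal and the variational bound proposed in the paper are not implemented. The function names and the standard normal target are assumptions for illustration.

```python
import numpy as np

def mala_step(z, log_prob, grad_log_prob, step, rng):
    """One MALA transition targeting the density proportional to exp(log_prob)."""
    def drift(x):
        # Mean of the Langevin proposal started at x.
        return x + 0.5 * step**2 * grad_log_prob(x)

    proposal = drift(z) + step * rng.standard_normal(z.shape)
    # Log densities of the forward and reverse Gaussian proposals
    # (normalising constants cancel in the acceptance ratio).
    log_q_fwd = -np.sum((proposal - drift(z)) ** 2) / (2 * step**2)
    log_q_bwd = -np.sum((z - drift(proposal)) ** 2) / (2 * step**2)
    log_alpha = log_prob(proposal) - log_prob(z) + log_q_bwd - log_q_fwd
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return z, False

def run_chain(z0, log_prob, grad_log_prob, step, n_steps, seed=0):
    """Run n_steps MALA transitions; return the samples and the acceptance rate."""
    rng = np.random.default_rng(seed)
    z, samples, accepted = np.asarray(z0, dtype=float), [], 0
    for _ in range(n_steps):
        z, ok = mala_step(z, log_prob, grad_log_prob, step, rng)
        samples.append(z)
        accepted += ok
    return np.array(samples), accepted / n_steps

# Hypothetical target: a 2-D standard normal standing in for a latent posterior.
samples, rate = run_chain(
    np.zeros(2), lambda z: -0.5 * np.sum(z**2), lambda z: -z, step=0.8, n_steps=4000
)
print(rate, samples.mean(axis=0), samples.var(axis=0))
```

In a VAE setting the target would be the posterior over latents given a data point, and the step size (or a more general preconditioner) is the quantity a gradient-based adaptation scheme would tune.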

Citations: 0
The COR criterion for optimal subset selection in distributed estimation
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-02 | DOI: 10.1007/s11222-024-10471-z
Guangbao Guo, Haoyue Song, Lixing Zhu

The problem of selecting an optimal subset in distributed regression is a crucial issue, as each distributed data subset may contain redundant information, which can be attributed to various sources such as outliers, dispersion, inconsistent duplicates, too many independent variables, and excessive data points, among others. Efficient reduction and elimination of this redundancy can help alleviate inconsistency issues for statistical inference. Therefore, it is imperative to track redundancy while measuring and processing data. We develop a criterion for optimal subset selection that is related to Covariance matrices, Observation matrices, and Response vectors (COR). We also derive a novel distributed interval estimation for the proposed criterion and establish the existence of optimal subset length. Finally, numerical experiments are conducted to verify the experimental feasibility of the proposed criterion.

Citations: 0
High-dimensional missing data imputation via undirected graphical model
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-01 | DOI: 10.1007/s11222-024-10475-9
Yoonah Lee, Seongoh Park

Multiple imputation is a practical approach in analyzing incomplete data, with multiple imputation by chained equations (MICE) being popularly used. MICE specifies a conditional distribution for each variable to be imputed, but estimating it is inherently a high-dimensional problem for large-scale data. Existing approaches propose to utilize regularized regression models, such as lasso. However, the estimation of them occurs iteratively across all incomplete variables, leading to a considerable increase in computational burden, as demonstrated in our simulation study. To overcome this computational bottleneck, we propose a novel method that estimates the conditional independence structure among variables before the imputation procedure. We extract such information from an undirected graphical model, leveraging the graphical lasso method based on the inverse probability weighting estimator. Our simulation study verifies the proposed method is way faster against the existing methods, while still maintaining comparable imputation performance.
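
The core idea (learn the conditional independence structure once, then let each imputation model use only a variable's graph neighbours) can be sketched as follows. This is a simplified illustration under stated assumptions: the precision matrix `Omega` is taken as given rather than estimated by graphical lasso with an inverse probability weighting estimator, and `impute_pass` performs a single deterministic regression pass instead of the repeated random draws of a full MICE procedure. All function names are hypothetical.

```python
import numpy as np

def neighbors_from_precision(Omega, tol=1e-8):
    """Graph neighbours of each variable: nonzero off-diagonal precision entries."""
    p = Omega.shape[0]
    return {
        j: [k for k in range(p) if k != j and abs(Omega[j, k]) > tol]
        for j in range(p)
    }

def impute_pass(X, neighbors):
    """Fill NaNs in X by regressing each variable on its graph neighbours only."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    # Start from column-mean imputation so predictors are always complete.
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for j, nbrs in neighbors.items():
        rows = miss[:, j]
        if not rows.any() or not nbrs:
            continue
        A = np.column_stack([np.ones(len(X)), filled[:, nbrs]])
        # Fit on rows where variable j is observed, predict where it is missing.
        beta, *_ = np.linalg.lstsq(A[~rows], X[~rows, j], rcond=None)
        filled[rows, j] = A[rows] @ beta
    return filled

# Hypothetical chain graph 0 - 1 - 2 encoded by a tridiagonal precision matrix.
Omega = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Omega), size=200)
X_miss = X.copy()
X_miss[rng.uniform(size=X.shape) < 0.1] = np.nan
print(neighbors_from_precision(Omega))
print(np.isnan(impute_pass(X_miss, neighbors_from_precision(Omega))).sum())
```

Because the neighbour sets are fixed before imputation starts, each regression involves only a handful of predictors, which is where the computational saving described in the abstract would come from.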

Citations: 0
Distributed subsampling for multiplicative regression
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-01 | DOI: 10.1007/s11222-024-10477-7
Xiaoyan Li, Xiaochao Xia, Zhimin Zhang

Multiplicative regression is a useful alternative tool in modeling positive response data. This paper proposes two distributed estimators for multiplicative error model on distributed system with non-randomly distributed massive data. We first present a Poisson subsampling procedure to obtain a subsampling estimator based on the least product relative error (LPRE) loss, which is effective on a distributed system. Theoretically, we justify the subsampling estimator by establishing its convergence rate, asymptotic normality and deriving the optimal subsampling probabilities in terms of the L-optimality criterion. Then, we provide a distributed LPRE estimator based on the Poisson subsampling (DLPRE-P), which is communication-efficient since it needs to transmit a very small subsample from local machines to the central site, which is empirically feasible, together with the gradient of the loss. Practically, due to the use of Newton–Raphson iteration, the Hessian matrix can be computed more robustly using the subsampled data than using one local dataset. We also show that the DLPRE-P estimator is statistically efficient as the global estimator, which is based on putting all the datasets together. Furthermore, we propose a distributed regularized LPRE estimator (DRLPRE-P) to consider the variable selection problem in high dimension. A distributed algorithm based on the alternating direction method of multipliers (ADMM) is developed for implementing the DRLPRE-P. The oracle property holds for DRLPRE-P. Finally, simulation experiments and two real-world data analyses are conducted to illustrate the performance of our methods.
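
A minimal single-machine sketch of the first ingredient, Poisson subsampling followed by a weighted LPRE fit via Newton-Raphson, is given below. It assumes the usual form of the LPRE loss for the model y = exp(x'beta) * error, namely the sum over observations of y * exp(-x'beta) + exp(x'beta) / y - 2, and it uses uniform inclusion probabilities. The L-optimal probabilities, the distributed DLPRE-P aggregation, and the ADMM-based regularized estimator from the paper are not reproduced. Function names are hypothetical.

```python
import numpy as np

def poisson_subsample(probs, rng):
    """Include point i independently with probability probs[i].

    Returns the selected indices and their inverse-probability weights.
    """
    keep = rng.uniform(size=len(probs)) < probs
    idx = np.flatnonzero(keep)
    return idx, 1.0 / probs[idx]

def lpre_newton(X, y, w, iters=50, tol=1e-10):
    """Minimise the weighted LPRE loss sum_i w_i * (y_i e^{-eta_i} + e^{eta_i} / y_i - 2)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta
        a, b = y * np.exp(-eta), np.exp(eta) / y
        grad = X.T @ (w * (b - a))
        hess = (X * (w * (a + b))[:, None]).T @ X   # positive definite: loss is convex
        step = np.linalg.solve(hess, grad)
        beta -= step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Hypothetical data: positive responses with multiplicative log-normal errors.
rng = np.random.default_rng(1)
n = 20000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.5, -0.3])
y = np.exp(X @ beta_true + 0.2 * rng.standard_normal(n))

idx, w = poisson_subsample(np.full(n, 0.1), rng)   # expected subsample size 2000
print(len(idx), lpre_newton(X[idx], y[idx], w))
```

With uniform probabilities the weights are constant and do not change the minimiser; they matter once the inclusion probabilities differ across observations, as they would under an optimality criterion.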

Citations: 0
Detection of spatiotemporal changepoints: a generalised additive model approach
IF 2.2 | Zone 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-08-01 | DOI: 10.1007/s11222-024-10478-6
Michael J. Hollaway, Rebecca Killick

The detection of changepoints in spatio-temporal datasets has been receiving increased focus in recent years and is utilised in a wide range of fields. With temporal data observed at different spatial locations, the current approach is typically to use univariate changepoint methods in a marginal sense with the detected changepoint being representative of a single location only. We present a spatio-temporal changepoint method that utilises a generalised additive model (GAM) dependent on the 2D spatial location and the observation time to account for the underlying spatio-temporal process. We use the full likelihood of the GAM in conjunction with the pruned linear exact time (PELT) changepoint search algorithm to detect multiple changepoints across spatial locations in a computationally efficient manner. When compared to a univariate marginal approach our method is shown to perform more efficiently in simulation studies at detecting true changepoints and demonstrates less evidence of overfitting. Furthermore, as the approach explicitly models spatio-temporal dependencies between spatial locations, any changepoints detected are common across the locations. We demonstrate an application of the method to an air quality dataset covering the COVID-19 lockdown in the United Kingdom.
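
The PELT search mentioned in this abstract can be sketched in a few lines. In the paper the segment cost comes from the likelihood of a GAM fitted to the spatio-temporal data; the sketch below swaps in a much simpler cost, the within-segment sum of squared deviations from the mean of a univariate series, purely to show the dynamic programme and the pruning step. The function names and the penalty value are assumptions.

```python
from itertools import accumulate

def make_mean_cost(y):
    """Return cost(s, t): sum of squared deviations from the mean of y[s:t]."""
    s1 = [0.0] + list(accumulate(y))
    s2 = [0.0] + list(accumulate(v * v for v in y))

    def cost(s, t):
        total = s1[t] - s1[s]
        return (s2[t] - s2[s]) - total * total / (t - s)

    return cost

def pelt(n, cost, penalty):
    """Pruned exact linear time search; returns changepoint positions in 1..n-1."""
    F = [-penalty] + [0.0] * n     # F[t]: optimal penalised cost of the first t points
    last = [0] * (n + 1)           # last[t]: start of the final segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        vals = [F[s] + cost(s, t) + penalty for s in candidates]
        F[t] = min(vals)
        last[t] = candidates[vals.index(F[t])]
        # Pruning: drop any s that can never again be the optimal last changepoint.
        candidates = [s for s, v in zip(candidates, vals) if v - penalty <= F[t]]
        candidates.append(t)
    cps, t = [], n
    while last[t] > 0:             # backtrack through the stored segment starts
        t = last[t]
        cps.append(t)
    return cps[::-1]

y = [0.0] * 50 + [5.0] * 50        # one level shift after the 50th observation
print(pelt(len(y), make_mean_cost(y), penalty=10.0))
```

Any cost function with the signature `cost(s, t)` can be plugged in, which is how a model-based likelihood cost such as the one described in the abstract would replace the squared-error cost used here.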

Citations: 0
A Mallows-type model averaging estimator for ridge regression with randomly right censored data
IF 2.2 | Zone 2, Mathematics | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-29 | DOI: 10.1007/s11222-024-10472-y
Jie Zeng, Guozhi Hu, Weihu Cheng

Instead of selecting a single ridge parameter in ridge regression, this paper considers a frequentist model averaging approach that combines a set of ridge estimators with different ridge parameters when the response is randomly right censored. Within this context, we propose a weighted least squares ridge estimator for the unknown regression parameter. A new Mallows-type weight choice criterion is then developed to allocate model weights, where the unknown distribution function of the censoring random variable is replaced by the Kaplan–Meier estimator and the covariance matrix of random errors is replaced by its averaging estimator. Under some mild conditions, we show that when the fitting model is misspecified, the resulting model averaging estimator achieves optimality in terms of minimizing the loss function. When the fitting model is correctly specified, by contrast, the model averaging estimator of the regression parameter is root-n consistent. Additionally, for the weight vector obtained by minimizing the new criterion, we establish its rate of convergence to the infeasible optimal weight vector. Simulation results show that our method is better than some existing methods. A real dataset is analyzed for illustration as well.
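
For orientation, the uncensored analogue of this averaging scheme can be written out explicitly. The block below is a sketch under stated assumptions: fully observed responses y, an n × p design matrix X, M candidate ridge parameters λ_1, ..., λ_M, and a known error variance σ². It shows the classical Mallows-type criterion for averaged linear smoothers, not the censoring-adjusted criterion developed in the paper, which additionally involves the Kaplan–Meier estimator.

```latex
\hat\beta(\lambda_m) = (X^\top X + \lambda_m I_p)^{-1} X^\top y, \qquad
\hat\beta(w) = \sum_{m=1}^{M} w_m \,\hat\beta(\lambda_m), \qquad
w \in \mathcal{W} = \Big\{ w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1 \Big\}

P(w) = \sum_{m=1}^{M} w_m \, X (X^\top X + \lambda_m I_p)^{-1} X^\top, \qquad
C(w) = \lVert y - P(w)\, y \rVert^2 + 2 \sigma^2 \operatorname{tr} P(w), \qquad
\hat w = \arg\min_{w \in \mathcal{W}} C(w)
```

Here C(w) trades the in-sample fit of the averaged estimator against its effective number of parameters, tr P(w). Since C(w) is quadratic in w, choosing the weights in this simplified setting amounts to a quadratic program over the simplex.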

Citations: 0