Statistics and Computing: Latest Articles

High-dimensional missing data imputation via undirected graphical model
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-01 DOI: 10.1007/s11222-024-10475-9
Yoonah Lee, Seongoh Park

Multiple imputation is a practical approach to analyzing incomplete data, and multiple imputation by chained equations (MICE) is widely used. MICE specifies a conditional distribution for each variable to be imputed, but estimating these distributions is inherently a high-dimensional problem for large-scale data. Existing approaches use regularized regression models such as the lasso. However, these models must be estimated iteratively across all incomplete variables, which considerably increases the computational burden, as demonstrated in our simulation study. To overcome this computational bottleneck, we propose a novel method that estimates the conditional independence structure among variables before the imputation procedure. We extract this information from an undirected graphical model, using the graphical lasso method based on the inverse probability weighting estimator. Our simulation study shows that the proposed method is substantially faster than existing methods while maintaining comparable imputation performance.
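
As a rough illustration of how a conditional independence structure can narrow each imputation model, the sketch below reads predictor sets off a sparse precision matrix: variable j keeps variable k as a predictor only if entry (j, k) is nonzero. The function name `predictor_sets`, the tolerance `tol`, and the toy matrix are assumptions made for illustration; the paper's own estimator (graphical lasso with inverse probability weighting) is not reproduced here.

```python
def predictor_sets(precision, tol=1e-8):
    """Map each variable index to the indices it is conditionally dependent on.

    In a Gaussian graphical model, a zero off-diagonal entry of the precision
    matrix means the two variables are conditionally independent given the rest.
    """
    p = len(precision)
    return {
        j: [k for k in range(p) if k != j and abs(precision[j][k]) > tol]
        for j in range(p)
    }


# Hypothetical 3-variable precision matrix (for example, a graphical lasso fit).
theta = [
    [2.0, -0.5, 0.0],
    [-0.5, 2.0, 0.3],
    [0.0, 0.3, 1.0],
]
sets = predictor_sets(theta)
print(sets)
```

Each imputation regression would then use only the listed neighbours instead of all other variables, which is where the computational saving would come from under this reading of the abstract.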

Citations: 0
Distributed subsampling for multiplicative regression
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-01 DOI: 10.1007/s11222-024-10477-7
Xiaoyan Li, Xiaochao Xia, Zhimin Zhang

Multiplicative regression is a useful alternative for modeling positive response data. This paper proposes two distributed estimators for the multiplicative error model on a distributed system with non-randomly distributed massive data. We first present a Poisson subsampling procedure that yields a subsampling estimator based on the least product relative error (LPRE) loss and is effective on a distributed system. Theoretically, we justify the subsampling estimator by establishing its convergence rate and asymptotic normality and by deriving the optimal subsampling probabilities under the L-optimality criterion. We then provide a distributed LPRE estimator based on Poisson subsampling (DLPRE-P). It is communication-efficient because each local machine transmits only a very small subsample, together with the gradient of the loss, to the central site, which is empirically feasible. Practically, because Newton–Raphson iteration is used, the Hessian matrix can be computed more robustly from the subsampled data than from a single local dataset. We also show that the DLPRE-P estimator is as statistically efficient as the global estimator, which is based on pooling all the datasets. Furthermore, we propose a distributed regularized LPRE estimator (DRLPRE-P) to address variable selection in high dimensions. A distributed algorithm based on the alternating direction method of multipliers (ADMM) is developed to implement DRLPRE-P, for which the oracle property holds. Finally, simulation experiments and two real-world data analyses illustrate the performance of our methods.
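
The two ingredients named above can be sketched in a few lines, assuming the usual form of the LPRE criterion, sum_i (y_i exp(-x_i'beta) + exp(x_i'beta)/y_i - 2), and inverse-probability weights 1/p_i for the Poisson subsample. The uniform probabilities, the function names, and the simulated data are placeholders for illustration; the paper's L-optimal probabilities and distributed aggregation are not reproduced.

```python
import math
import random


def lpre_loss(beta, xs, ys, weights=None):
    """Weighted LPRE loss: sum_i w_i * (y_i*exp(-x_i'beta) + exp(x_i'beta)/y_i - 2)."""
    if weights is None:
        weights = [1.0] * len(ys)
    total = 0.0
    for x, y, w in zip(xs, ys, weights):
        eta = sum(b * v for b, v in zip(beta, x))
        total += w * (y * math.exp(-eta) + math.exp(eta) / y - 2.0)
    return total


def poisson_subsample(probs, rng):
    """Keep index i independently with probability probs[i]; return (index, 1/p_i) pairs."""
    return [(i, 1.0 / p) for i, p in enumerate(probs) if rng.random() < p]


rng = random.Random(0)
beta_true = [0.5, -1.0]
xs = [[1.0, rng.uniform(-1, 1)] for _ in range(200)]
# Noise-free positive responses, so the loss at beta_true is (numerically) zero.
ys = [math.exp(sum(b * v for b, v in zip(beta_true, x))) for x in xs]

picked = poisson_subsample([0.1] * len(xs), rng)  # uniform probabilities, illustration only
sub_xs = [xs[i] for i, _ in picked]
sub_ys = [ys[i] for i, _ in picked]
sub_w = [w for _, w in picked]
print(len(picked), lpre_loss(beta_true, sub_xs, sub_ys, sub_w))
```

Unlike sampling with a fixed subsample size, Poisson subsampling decides on each observation independently, so the subsample size is random and each machine can draw its share without coordinating with the others.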

Citations: 0
Detection of spatiotemporal changepoints: a generalised additive model approach
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-08-01 DOI: 10.1007/s11222-024-10478-6
Michael J. Hollaway, Rebecca Killick

The detection of changepoints in spatio-temporal datasets has received increasing attention in recent years and is used in a wide range of fields. With temporal data observed at different spatial locations, the current approach is typically to apply univariate changepoint methods marginally, so that each detected changepoint is representative of a single location only. We present a spatio-temporal changepoint method that uses a generalised additive model (GAM) dependent on the 2D spatial location and the observation time to account for the underlying spatio-temporal process. We use the full likelihood of the GAM in conjunction with the pruned exact linear time (PELT) changepoint search algorithm to detect multiple changepoints across spatial locations in a computationally efficient manner. Compared with a univariate marginal approach, our method is shown in simulation studies to detect true changepoints more efficiently and to show less evidence of overfitting. Furthermore, because the approach explicitly models spatio-temporal dependencies between spatial locations, any changepoints detected are common across the locations. We demonstrate an application of the method to an air quality dataset covering the COVID-19 lockdown in the United Kingdom.
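
To make the PELT search concrete, here is a minimal sketch of the recursion and its pruning step. It swaps in a much simpler segment cost (squared error around the segment mean of a single series) in place of the GAM likelihood described above, so it illustrates only the search, not the authors' model; `pelt_mean` and `penalty` are names chosen for this example.

```python
def pelt_mean(y, penalty):
    """PELT for changes in mean; returns the sorted list of changepoint positions.

    A changepoint at position c means y[:c] and y[c:] lie in different segments.
    """
    n = len(y)
    # Prefix sums give each segment cost in O(1).
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(a, b):
        """Sum of squared deviations from the mean over y[a:b], with b > a."""
        tot = s1[b] - s1[a]
        return (s2[b] - s2[a]) - tot * tot / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalised cost of y[:t]
    best[0] = -penalty
    last = [0] * (n + 1)     # last[t]: start of the final segment in that optimum
    candidates = [0]
    for t in range(1, n + 1):
        vals = [(best[tau] + cost(tau, t) + penalty, tau) for tau in candidates]
        best[t], last[t] = min(vals)
        # Pruning: drop any tau that can never again be the optimal last changepoint.
        candidates = [tau for v, tau in vals if v - penalty <= best[t]] + [t]

    cps, t = [], n
    while last[t] > 0:
        cps.append(last[t])
        t = last[t]
    return sorted(cps)


series = [0.0] * 10 + [5.0] * 10
print(pelt_mean(series, penalty=1.0))
```

The pruning line is what separates PELT from plain optimal partitioning: candidates whose cost already exceeds the current optimum are discarded, which keeps the candidate set small when changepoints occur regularly.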

Citations: 0
A Mallows-type model averaging estimator for ridge regression with randomly right censored data
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-29 DOI: 10.1007/s11222-024-10472-y
Jie Zeng, Guozhi Hu, Weihu Cheng

Rather than selecting a single ridge parameter in ridge regression, this paper considers a frequentist model averaging approach that combines a set of ridge estimators with different ridge parameters when the response is randomly right censored. In this setting, we propose a weighted least squares ridge estimator for the unknown regression parameter. A new Mallows-type weight choice criterion is then developed to allocate model weights, in which the unknown distribution function of the censoring random variable is replaced by the Kaplan–Meier estimator and the covariance matrix of the random errors is replaced by its averaging estimator. Under some mild conditions, we show that when the fitting model is misspecified, the resulting model averaging estimator is optimal in the sense of minimizing the loss function. When the fitting model is correctly specified, by contrast, the model averaging estimator of the regression parameter is root-n consistent. Additionally, for the weight vector obtained by minimizing the new criterion, we establish its rate of convergence to the infeasible optimal weight vector. Simulation results show that our method outperforms some existing methods. A real dataset is also analyzed for illustration.
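
The Kaplan–Meier step mentioned above can be sketched as follows. To estimate the distribution of the censoring variable rather than of the response, the roles are swapped: a censored observation counts as the "event". The function name, the toy data, and the handling of ties (events and non-events at the same time are treated together) are assumptions for illustration, not the paper's exact construction.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    events[i] is 1 if the event of interest occurred at times[i], else 0.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        d = 0  # events at time t
        m = 0  # all observations at time t
        while i < len(order) and times[order[i]] == t:
            d += events[order[i]]
            m += 1
            i += 1
        if d > 0:
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= m
    return curve


# delta[i] = 1 means the response was observed, 0 means it was right censored.
obs_time = [1.0, 2.0, 3.0, 4.0]
delta = [1, 0, 1, 0]
# For the censoring distribution, a censored observation is the "event".
censor_events = [1 - d for d in delta]
g_hat = kaplan_meier(obs_time, censor_events)
print(g_hat)
```

In weighted least squares for censored responses, an estimate like `g_hat` is typically used to build inverse-probability-of-censoring weights for the uncensored observations.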

Citations: 0
Byzantine-robust and efficient distributed sparsity learning: a surrogate composite quantile regression approach
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-22 DOI: 10.1007/s11222-024-10470-0
Canyi Chen, Zhengtian Zhu

Distributed statistical learning has gained significant traction recently, mainly due to the availability of unprecedentedly massive datasets. The objective of distributed statistical learning is to learn models by effectively utilizing data scattered across various machines. However, its performance can be impeded by three significant challenges: arbitrary noise, high dimensionality, and machine failures, the last of which is referred to as Byzantine failure. To address the first two challenges, we propose combining composite quantile regression with the \(\ell_1\) penalty. However, this combination introduces a doubly nonsmooth objective function, posing new challenges. In such scenarios, most existing Byzantine-robust methods exhibit slow sublinear convergence rates and fail to achieve near-optimal statistical convergence rates. To fill this gap, we introduce a novel smoothing procedure that effectively handles the nonsmooth aspects. This innovation allows us to develop a Byzantine-robust sparsity learning algorithm that provably converges linearly to the near-optimal statistical rate. Moreover, we establish support recovery guarantees for our proposed methods. We substantiate the effectiveness of our approaches through comprehensive empirical analyses.
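
The "doubly nonsmooth" objective can be written down directly: the quantile check loss has a kink at zero, and so does the absolute value in the penalty. The sketch below evaluates an \(\ell_1\)-penalized composite quantile regression objective on toy data; the averaging over observations and quantile levels, the function names, and the data are assumptions for illustration, and the paper's smoothing procedure and Byzantine-robust aggregation are not reproduced.

```python
def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


def cqr_objective(beta, intercepts, taus, xs, ys, lam):
    """Average composite quantile loss over all levels plus an l1 penalty on beta.

    All quantile levels share the slope vector beta; each has its own intercept.
    """
    n = len(ys)
    loss = 0.0
    for b_k, tau in zip(intercepts, taus):
        for x, y in zip(xs, ys):
            fitted = b_k + sum(b * v for b, v in zip(beta, x))
            loss += check_loss(y - fitted, tau)
    penalty = lam * sum(abs(b) for b in beta)
    return loss / (n * len(taus)) + penalty


taus = [0.25, 0.5, 0.75]
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.1, 0.9, 2.2, 2.8]
value = cqr_objective([1.0], [-0.1, 0.0, 0.1], taus, xs, ys, lam=0.05)
print(value)
```

Sharing one slope vector across several quantile levels is what makes the loss "composite"; it is the reason the method can remain efficient under heavy-tailed noise without assuming a particular error distribution.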

Citations: 0
ForLion: a new algorithm for D-optimal designs under general parametric statistical models with mixed factors
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-18 DOI: 10.1007/s11222-024-10465-x
Yifei Huang, Keren Li, Abhyuday Mandal, Jie Yang

In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs.
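
The role of the general equivalence theorem can be illustrated on a textbook case that is far simpler than the models treated in the paper: simple linear regression on [-1, 1] with regressors f(x) = (1, x). For a model with p parameters, an approximate design is D-optimal exactly when the sensitivity function f(x)' M^{-1} f(x) never exceeds p on the design space. The sketch below checks this on a grid; the function names and the grid are choices made for this example and have nothing to do with the ForLion implementation.

```python
def info_matrix(points, weights):
    """Entries of M = sum_i w_i f(x_i) f(x_i)^T for f(x) = (1, x)."""
    m00 = sum(weights)
    m01 = sum(w * x for x, w in zip(points, weights))
    m11 = sum(w * x * x for x, w in zip(points, weights))
    return m00, m01, m11


def sensitivity(x, points, weights):
    """d(x, xi) = f(x)^T M^{-1} f(x) for the 2x2 symmetric matrix M."""
    m00, m01, m11 = info_matrix(points, weights)
    det = m00 * m11 - m01 * m01
    # The inverse of [[m00, m01], [m01, m11]] is [[m11, -m01], [-m01, m00]] / det.
    return (m11 - 2.0 * m01 * x + m00 * x * x) / det


# Candidate design: half of the runs at each end of the interval.
points, weights = [-1.0, 1.0], [0.5, 0.5]
grid = [-1.0 + 0.01 * i for i in range(201)]
worst = max(sensitivity(x, points, weights) for x in grid)
print(worst)  # should not exceed p = 2 if the design is D-optimal
```

For this design the sensitivity is 1 + x^2, which peaks at the two support points and equals p = 2 there, so the check passes. Adding a third support point at 0 with equal weights pushes the sensitivity at x = 1 above 2, which flags that design as suboptimal.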

Citations: 0
Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-17 DOI: 10.1007/s11222-024-10467-9
Louis Ohl, Pierre-Alexandre Mattei, Charles Bouveyron, Mickaël Leclercq, Arnaud Droit, Frédéric Precioso

Feature selection in clustering is a hard task that involves simultaneously discovering relevant clusters and the variables relevant to these clusters. While feature selection algorithms are often model-based, relying on optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model that maximises a geometry-aware generalisation of the mutual information, called GEMINI, with a simple \(\ell_1\) penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and scales easily to high-dimensional data and large numbers of samples while requiring only the design of a discriminative clustering model. We demonstrate the performance of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and can select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.

Citations: 0
Optimal designs for nonlinear mixed-effects models using competitive swarm optimizer with mutated agents
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-17 DOI: 10.1007/s11222-024-10468-8
Elvis Han Cui, Zizhao Zhang, Weng Kee Wong

Nature-inspired meta-heuristic algorithms are increasingly used in many disciplines to tackle challenging optimization problems. Our focus is to apply a newly proposed nature-inspired meta-heuristic algorithm called CSO-MA to solve challenging design problems in the biosciences and to demonstrate its flexibility in finding various types of optimal approximate or exact designs for nonlinear mixed models with one or several interacting factors and with or without random effects. We show that CSO-MA is efficient and can frequently outperform other algorithms in terms of either speed or accuracy. The algorithm, like other meta-heuristic algorithms, is free of technical assumptions and flexible in that it can incorporate a cost structure or multiple user-specified constraints, such as a fixed number of measurements per subject in a longitudinal study. When possible, we confirm with theory that some of the CSO-MA-generated designs are optimal by developing innovative theory-based plots. Our applications include searching for optimal designs to estimate (i) parameters in mixed nonlinear models with correlated random effects, (ii) a function of parameters for a count model in a dose combination study, and (iii) parameters in an HIV dynamic model. In each case, we show the advantages of using a meta-heuristic approach to solve the optimization problem, and the added benefits of the generated designs.
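
For readers unfamiliar with swarm-type optimizers, the following is a minimal sketch of a competitive swarm optimizer with a mutation step, written from the general description of the technique rather than from the authors' code. Particles are paired at random, the loser of each pair moves toward the winner and the swarm mean, and one coordinate of one randomly chosen loser is then reset to a bound. The mutation rule as stated here, the clipping to the box, the parameter defaults, and the sphere test function are all assumptions for illustration; a design criterion would replace `f` in an actual application.

```python
import random


def cso_ma(f, lower, upper, n_particles=20, n_iter=200, phi=0.1, seed=0):
    """Minimise f over a box with a competitive swarm optimizer plus a mutation step."""
    rng = random.Random(seed)
    dim = len(lower)
    xs = [[rng.uniform(lower[d], upper[d]) for d in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    best = min(xs, key=f)[:]
    for _ in range(n_iter):
        mean = [sum(x[d] for x in xs) / n_particles for d in range(dim)]
        idx = list(range(n_particles))
        rng.shuffle(idx)
        losers = []
        for a, b in zip(idx[::2], idx[1::2]):
            win, lose = (a, b) if f(xs[a]) <= f(xs[b]) else (b, a)
            for d in range(dim):
                r1, r2, r3 = rng.random(), rng.random(), rng.random()
                vs[lose][d] = (r1 * vs[lose][d]
                               + r2 * (xs[win][d] - xs[lose][d])
                               + phi * r3 * (mean[d] - xs[lose][d]))
                # Clip to the box so the particle stays feasible.
                xs[lose][d] = min(upper[d], max(lower[d], xs[lose][d] + vs[lose][d]))
            losers.append(lose)
        # Mutation: push one coordinate of one random loser to a bound.
        p, q = rng.choice(losers), rng.randrange(dim)
        xs[p][q] = rng.choice([lower[q], upper[q]])
        candidate = min(xs, key=f)
        if f(candidate) < f(best):
            best = candidate[:]
    return best


def sphere(x):
    return sum(v * v for v in x)


best = cso_ma(sphere, [-5.0, -5.0], [5.0, 5.0])
print(best, sphere(best))
```

Only losers are updated in each round, so winners carry over unchanged; the mutation step is intended to keep some diversity in the swarm and to let it reach the boundary of the search space, where optimal design points often lie.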

Citations: 0
A mixture of experts regression model for functional response with functional covariates
IF 2.2 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-07-11 DOI: 10.1007/s11222-024-10455-z
Jean Steve Tamo Tchomgui, Julien Jacques, Guillaume Fraysse, Vincent Barriac, Stéphane Chretien

Due to the fast growth of data that are measured on a continuous scale, functional data analysis has undergone many developments in recent years. Regression models with a functional response involving functional covariates, also called “function-on-function”, are thus becoming very common. Studying this type of model in the presence of heterogeneous data can be particularly useful in various practical situations. In this work, we mainly develop a function-on-function Mixture of Experts (FFMoE) regression model. Like most inference approaches for models on functional data, we use basis expansion (B-splines) for both covariates and parameters. A regularized inference approach is also proposed; it accurately smooths functional parameters in order to provide interpretable estimators. Numerical studies on simulated data illustrate the good performance of FFMoE compared with competitors. The usefulness of the proposed model is illustrated on two data sets: the reference Canadian weather data set, in which precipitation is modeled according to temperature, and a Cycling data set, in which the developed power is explained by the speed, the cyclist's heart rate and the slope of the road.

Citations: 0
Correction to: Explainable generalized additive neural networks with independent neural network training
IF 1.6 Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2024-07-08 DOI: 10.1007/s11222-024-10461-1
Ines Ortega-Fernandez, M. Sestelo, Nora M. Villanueva
Citations: 0