
Latest Publications in Statistics and Computing

Byzantine-robust and efficient distributed sparsity learning: a surrogate composite quantile regression approach
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-22 | DOI: 10.1007/s11222-024-10470-0
Canyi Chen, Zhengtian Zhu

Distributed statistical learning has gained significant traction recently, mainly due to the availability of unprecedentedly massive datasets. The objective of distributed statistical learning is to learn models by effectively utilizing data scattered across various machines. However, its performance can be impeded by three significant challenges: arbitrary noise, high dimensionality, and machine failures—the latter being specifically referred to as Byzantine failure. To address the first two challenges, we propose leveraging the potential of composite quantile regression in conjunction with the ℓ₁ penalty. However, this combination introduces a doubly nonsmooth objective function, posing new challenges. In such scenarios, most existing Byzantine-robust methods exhibit slow sublinear convergence rates and fail to achieve near-optimal statistical convergence rates. To fill this gap, we introduce a novel smoothing procedure that effectively handles the nonsmooth aspects. This innovation allows us to develop a Byzantine-robust sparsity learning algorithm that provably converges linearly to the near-optimal statistical rate. Moreover, we establish support recovery guarantees for our proposed methods. We substantiate the effectiveness of our approaches through comprehensive empirical analyses.
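As a toy illustration of the doubly nonsmooth objective (pinball losses over several quantile levels plus an ℓ₁ penalty), the following sketch simply evaluates it on simulated heavy-tailed data; the function names, data, and penalty level are our own, and it omits the paper's smoothing and distributed machinery:

```python
import numpy as np

def pinball(u, tau):
    # Quantile (check) loss: rho_tau(u) = u * (tau - 1{u < 0})
    return u * (tau - (u < 0))

def cqr_l1_objective(beta, b, X, y, taus, lam):
    # Composite quantile regression with an l1 penalty: average pinball
    # loss over the quantile levels (one intercept b[k] per level),
    # plus lam * ||beta||_1. Nonsmooth in both the loss and the penalty.
    n = len(y)
    loss = sum(pinball(y - X @ beta - b[k], tau).sum() / n
               for k, tau in enumerate(taus))
    return loss / len(taus) + lam * np.abs(beta).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.0])
y = X @ beta_true + rng.standard_t(df=2, size=100)   # heavy-tailed noise

taus = [0.25, 0.5, 0.75]
obj_true = cqr_l1_objective(beta_true, np.zeros(3), X, y, taus, lam=0.05)
obj_zero = cqr_l1_objective(np.zeros(5), np.zeros(3), X, y, taus, lam=0.05)
print(obj_true, obj_zero)
```

The true coefficient vector yields a lower objective value than the all-zero fit, which is the property any estimation procedure for this model exploits.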

Citations: 0
ForLion: a new algorithm for D-optimal designs under general parametric statistical models with mixed factors
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-18 | DOI: 10.1007/s11222-024-10465-x
Yifei Huang, Keren Li, Abhyuday Mandal, Jie Yang

In this paper, we address the problem of designing an experimental plan with both discrete and continuous factors under fairly general parametric statistical models. We propose a new algorithm, named ForLion, to search for locally optimal approximate designs under the D-criterion. The algorithm performs an exhaustive search in a design space with mixed factors while keeping high efficiency and reducing the number of distinct experimental settings. Its optimality is guaranteed by the general equivalence theorem. We present the relevant theoretical results for multinomial logit models (MLM) and generalized linear models (GLM), and demonstrate the superiority of our algorithm over state-of-the-art design algorithms using real-life experiments under MLM and GLM. Our simulation studies show that the ForLion algorithm could reduce the number of experimental settings by 25% or improve the relative efficiency of the designs by 17.5% on average. Our algorithm can help the experimenters reduce the time cost, the usage of experimental devices, and thus the total cost of their experiments while preserving high efficiencies of the designs.
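For intuition, the D-criterion being maximized is the log-determinant of the design's Fisher information. A minimal sketch for a one-factor logistic model follows (this is only the criterion evaluation, not the ForLion search; the design points and weights are illustrative):

```python
import numpy as np

def d_criterion_logistic(points, weights, beta):
    # log det of the Fisher information M(xi) = sum_i w_i p_i (1 - p_i) f_i f_i^T
    # for a logistic model with linear predictor beta0 + beta1 * x.
    M = np.zeros((2, 2))
    for x, w in zip(points, weights):
        f = np.array([1.0, x])
        p = 1.0 / (1.0 + np.exp(-(f @ beta)))
        M += w * p * (1.0 - p) * np.outer(f, f)
    return np.linalg.slogdet(M)[1]

beta = np.array([0.0, 1.0])
# Classical result: the locally D-optimal two-point design for this model
# puts weight 1/2 at linear-predictor values +/- 1.5434.
opt = d_criterion_logistic([-1.5434, 1.5434], [0.5, 0.5], beta)
wide = d_criterion_logistic([-3.0, 3.0], [0.5, 0.5], beta)
print(opt, wide)
```

A design-search algorithm such as ForLion climbs this criterion over the (here one-dimensional) design space; the wider two-point design scores strictly worse.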

Citations: 0
Sparse and geometry-aware generalisation of the mutual information for joint discriminative clustering and feature selection
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-17 | DOI: 10.1007/s11222-024-10467-9
Louis Ohl, Pierre-Alexandre Mattei, Charles Bouveyron, Mickaël Leclercq, Arnaud Droit, Frédéric Precioso

Feature selection in clustering is a hard task that involves simultaneously discovering relevant clusters as well as the variables relevant to those clusters. While feature selection algorithms are often model-based, relying on optimised model selection or strong assumptions on the data distribution, we introduce a discriminative clustering model that tries to maximise a geometry-aware generalisation of the mutual information, called GEMINI, with a simple ℓ₁ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and is easily scalable to high-dimensional data and large numbers of samples while designing only a discriminative clustering model. We demonstrate the performance of Sparse GEMINI on synthetic datasets and large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm and has the ability to select relevant subsets of variables with respect to the clustering without using relevance criteria or prior hypotheses.
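GEMINI comes in several variants; the sketch below estimates a one-vs-all MMD form (our choice, for illustration) from soft cluster assignments, showing why well-separated clusters score higher than uninformative assignments. The ℓ₁-penalized feature-selection layer is omitted:

```python
import numpy as np

def rbf_kernel(X, gamma=0.5):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd_gemini_ova(tau, K):
    # One-vs-all MMD GEMINI: sum_k p(k) * MMD(p(x|k), p(x)), estimated
    # from soft assignments tau (n x K) and a kernel matrix K (n x n).
    n, n_clusters = tau.shape
    pk = tau.mean(axis=0)
    total = 0.0
    for k in range(n_clusters):
        a = tau[:, k] / (n * pk[k])     # sample weights for p(x | k)
        d = a - np.full(n, 1.0 / n)     # minus uniform weights for p(x)
        total += pk[k] * np.sqrt(max(d @ K @ d, 0.0))
    return total

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
K = rbf_kernel(X)
tau_true = np.repeat(np.eye(2), 30, axis=0)   # correct hard assignments
tau_flat = np.full((60, 2), 0.5)              # uninformative assignments
g_true, g_flat = mmd_gemini_ova(tau_true, K), mmd_gemini_ova(tau_flat, K)
print(g_true, g_flat)
```

Correct assignments give a strictly positive score, while assignments carrying no information about the geometry give zero, which is what makes the quantity a usable clustering objective.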

Citations: 0
Optimal designs for nonlinear mixed-effects models using competitive swarm optimizer with mutated agents
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-17 | DOI: 10.1007/s11222-024-10468-8
Elvis Han Cui, Zizhao Zhang, Weng Kee Wong

Nature-inspired meta-heuristic algorithms are increasingly used in many disciplines to tackle challenging optimization problems. Our focus is to apply a newly proposed nature-inspired meta-heuristic algorithm called CSO-MA to solve challenging design problems in the biosciences and demonstrate its flexibility in finding various types of optimal approximate or exact designs for nonlinear mixed models with one or several interacting factors and with or without random effects. We show that CSO-MA is efficient and can frequently outperform other algorithms in terms of speed or accuracy. The algorithm, like other meta-heuristic algorithms, is free of technical assumptions and flexible in that it can incorporate cost structure or multiple user-specified constraints, such as a fixed number of measurements per subject in a longitudinal study. When possible, we confirm with theory that some of the CSO-MA-generated designs are optimal by developing theory-based innovative plots. Our applications include searching for optimal designs to estimate (i) parameters in mixed nonlinear models with correlated random effects, (ii) a function of parameters for a count model in a dose combination study, and (iii) parameters in an HIV dynamic model. In each case, we show the advantages of using a meta-heuristic approach to solve the optimization problem, and the added benefits of the generated designs.
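A bare-bones competitive swarm optimizer illustrates the pairwise-competition update at CSO's core; this sketch omits the mutated-agents component of CSO-MA and any design-criterion objective, and all constants are our choices:

```python
import numpy as np

def cso_minimize(f, dim, n=40, iters=200, phi=0.1, seed=0):
    # Minimal competitive swarm optimizer: particles are paired at random;
    # the loser of each pairwise fitness comparison learns from the winner
    # and from the swarm mean, while winners pass through unchanged.
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5.0, 5.0, (n, dim))
    V = np.zeros((n, dim))
    for _ in range(iters):
        fit = np.array([f(x) for x in X])
        order = rng.permutation(n)
        xbar = X.mean(axis=0)
        for i in range(0, n, 2):
            a, b = order[i], order[i + 1]
            win, lose = (a, b) if fit[a] <= fit[b] else (b, a)
            r1, r2, r3 = rng.random((3, dim))
            V[lose] = (r1 * V[lose] + r2 * (X[win] - X[lose])
                       + phi * r3 * (xbar - X[lose]))
            X[lose] = X[lose] + V[lose]
    fit = np.array([f(x) for x in X])
    return X[fit.argmin()], fit.min()

# Sanity check on a convex test function (sphere), not a design criterion.
x_best, f_best = cso_minimize(lambda x: float((x ** 2).sum()), dim=5)
print(f_best)
```

For optimal design, `f` would be replaced by the negative design criterion (e.g., negative log-determinant of the information matrix) evaluated at the encoded design points and weights.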

Citations: 0
A mixture of experts regression model for functional response with functional covariates
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-11 | DOI: 10.1007/s11222-024-10455-z
Jean Steve Tamo Tchomgui, Julien Jacques, Guillaume Fraysse, Vincent Barriac, Stéphane Chretien

Due to the fast growth of data that are measured on a continuous scale, functional data analysis has undergone many developments in recent years. Regression models with a functional response involving functional covariates, also called "function-on-function" models, are thus becoming very common. Studying this type of model in the presence of heterogeneous data can be particularly useful in various practical situations. In this work we develop a function-on-function Mixture of Experts (FFMoE) regression model. Like most inference approaches for models on functional data, we use basis expansion (B-splines) both for covariates and parameters. A regularized inference approach is also proposed; it accurately smooths functional parameters in order to provide interpretable estimators. Numerical studies on simulated data illustrate the good performance of FFMoE compared with competitors. Usefulness of the proposed model is illustrated on two data sets: the reference Canadian weather data set, in which precipitation is modeled as a function of temperature, and a cycling data set, in which the power developed is explained by the speed, the cyclist's heart rate, and the slope of the road.
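The basis-expansion step can be sketched with SciPy's `BSpline`: build a clamped cubic B-spline design matrix and fit a curve by penalized least squares. A plain ridge penalty stands in here for the paper's regularized inference; the knot layout and penalty level are our assumptions:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_design(x, n_basis=12, degree=3):
    # Clamped (open uniform) knots on [0, 1]; the resulting matrix has
    # n_basis columns, one per B-spline basis function.
    inner = np.linspace(0.0, 1.0, n_basis - degree + 1)
    knots = np.r_[[0.0] * degree, inner, [1.0] * degree]
    return BSpline.design_matrix(x, knots, degree).toarray()

x = np.linspace(0.0, 1.0, 200)
B = bspline_design(x)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)

lam = 1e-4   # ridge penalty on the basis coefficients
coef = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T @ y)
fit = B @ coef
print(np.mean((fit - np.sin(2 * np.pi * x)) ** 2))
```

In a function-on-function model, both the functional covariates and the bivariate coefficient surface get such expansions, reducing the infinite-dimensional problem to a finite coefficient estimation.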

Citations: 0
A limit formula and a series expansion for the bivariate Normal tail probability
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-08 | DOI: 10.1007/s11222-024-10466-w
Siu-Kui Au

This work presents a limit formula for the bivariate Normal tail probability. It only requires the larger threshold to grow indefinitely, but otherwise places no restrictions on how the thresholds grow. The correlation parameter can change and may depend on the thresholds. The formula is applicable regardless of Savage's condition. Asymptotically, it reduces to Ruben's formula and Hashorva's formula under the corresponding conditions, and can therefore be considered a generalisation. Under a mild condition, it satisfies Plackett's identity on the derivative with respect to the correlation parameter. Motivated by the limit formula, a series expansion is also obtained for the exact tail probability using derivatives of the univariate Mills ratio. Under conditions similar to those for the limit formula, the series converges and its truncated approximation has a small remainder term for large thresholds. To take advantage of this, a simple procedure is developed for the general case by remapping the parameters so that they satisfy the conditions. Examples are presented to illustrate the theoretical findings.
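The quantity being approximated can be pinned down numerically: the upper tail probability follows from the joint CDF by inclusion-exclusion. The sketch below is a SciPy-based sanity check of that identity, not the paper's limit formula or series:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def bvn_tail(b1, b2, rho):
    # P(X > b1, Y > b2) for a standard bivariate normal with correlation
    # rho, via inclusion-exclusion on the joint CDF:
    # P = 1 - Phi(b1) - Phi(b2) + F(b1, b2).
    F = multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, rho], [rho, 1.0]]).cdf([b1, b2])
    return 1.0 - norm.cdf(b1) - norm.cdf(b2) + F

p_indep = bvn_tail(1.0, 2.0, 0.0)   # equals the product of the marginal tails
p_pos = bvn_tail(1.0, 2.0, 0.5)     # positive correlation inflates the tail
print(p_indep, p_pos)
```

The zero-correlation case factorizes into the product of univariate tails, and increasing the correlation increases the joint tail (Slepian's inequality), both of which any asymptotic formula for this probability must respect.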

Citations: 0
Classifier-dependent feature selection via greedy methods
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-06 | DOI: 10.1007/s11222-024-10460-2
Fabiana Camattari, Sabrina Guastavino, Francesco Marchetti, Michele Piana, Emma Perracchione

The purpose of this study is to introduce a new approach to feature ranking for classification tasks, referred to in what follows as greedy feature selection. In statistical learning, feature selection is usually realized by means of methods that are independent of the classifier used to perform the prediction with the reduced number of features. In contrast, greedy feature selection identifies the most important feature at each step according to the selected classifier. The benefits of such a scheme are investigated in terms of model capacity indicators, such as the Vapnik-Chervonenkis dimension or kernel alignment. This theoretical study proves that the iterative greedy algorithm is able to construct classifiers whose complexity capacity grows at each step. The proposed method is then tested numerically on various datasets and compared to state-of-the-art techniques. The results show that our iterative scheme is able to truly capture only a few relevant features, and may improve, especially for real and noisy data, the accuracy scores of other techniques. The greedy scheme is also applied to the challenging application of predicting geo-effective manifestations of the active Sun.
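The classifier-in-the-loop idea can be sketched with a deliberately simple learner; here we use a nearest-centroid rule of our own choosing, scored by training accuracy, whereas the paper works with capacity indicators and stronger classifiers:

```python
import numpy as np

def centroid_accuracy(X, y, cols):
    # Accuracy of a nearest-centroid classifier restricted to features `cols`.
    Xs = X[:, cols]
    cents = np.array([Xs[y == c].mean(axis=0) for c in np.unique(y)])
    pred = np.argmin(((Xs[:, None, :] - cents[None]) ** 2).sum(-1), axis=1)
    return (pred == y).mean()

def greedy_select(X, y, n_select):
    # Classifier-dependent greedy ranking: at each step, add the feature
    # that most improves the chosen classifier's accuracy.
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        best = max(remaining,
                   key=lambda j: centroid_accuracy(X, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, 6))
X[:, 2] += 3.0 * y        # only feature 2 carries class information
sel = greedy_select(X, y, 2)
print(sel)
```

Because the score is computed through the classifier itself, the one informative feature is ranked first; a classifier-independent filter could, in principle, rank features differently from how the final classifier uses them.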

Citations: 0
Locally sparse and robust partial least squares in scalar-on-function regression
IF 2.2 | CAS Tier 2 (Mathematics) | Q2 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-06 | DOI: 10.1007/s11222-024-10464-y
Sude Gurer, Han Lin Shang, Abhijit Mandal, Ufuk Beyaztas

We present a novel approach for estimating a scalar-on-function regression model, leveraging a functional partial least squares methodology. Our proposed method computes the functional partial least squares components through sparse partial robust M-regression, facilitating robust and locally sparse estimation of the regression coefficient function. This strategy delivers a robust decomposition of the functional predictor and the regression coefficient function. After the decomposition, model parameters are estimated using a weighted loss function, incorporating robustness through iterative reweighting of the partial least squares components. The robust decomposition feature of our proposed method enables robust estimation of the model parameters in the scalar-on-function regression model, ensuring reliable predictions in the presence of outliers and leverage points. Moreover, it accurately identifies the zero and nonzero sub-regions of the estimated slope function, even in the presence of outliers and leverage points. We assess our proposed method's estimation and predictive performance through a series of Monte Carlo experiments and an empirical dataset, namely data collected on oriented strand board. Compared to existing methods, our proposed method performs favorably. Notably, our robust procedure exhibits superior performance in the presence of outliers while remaining competitive in their absence. Our method has been implemented in the robsfplsr package.
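For orientation, a plain (non-robust) PLS1 component extraction looks as follows; the paper's contribution replaces this covariance-maximizing step with sparse partial robust M-regression, which this sketch does not implement:

```python
import numpy as np

def pls1_components(X, y, n_comp):
    # Plain NIPALS-style PLS1: each weight vector w maximizes Cov(X w, y);
    # X and y are deflated so successive score vectors t are orthogonal.
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    T, W, P = [], [], []
    for _ in range(n_comp):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        Xk = Xk - np.outer(t, p)            # deflate the predictors
        yk = yk - t * (yk @ t) / (t @ t)    # deflate the response
        T.append(t); W.append(w); P.append(p)
    return np.array(T).T, np.array(W).T, np.array(P).T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + rng.normal(0.0, 0.1, 50)
T, W, P = pls1_components(X, y, 2)
print(T[:, 0] @ T[:, 1])   # scores are orthogonal by construction
```

Because `w` here is a least-squares-type covariance direction, a single outlying observation can dominate it; that sensitivity is precisely what the robust M-regression step is designed to remove.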

Citations: 0
Efficient Shapley performance attribution for least-squares regression
IF 2.2 CAS Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-04 DOI: 10.1007/s11222-024-10459-9
Logan Bell, Nikhil Devanathan, Stephen Boyd

We consider the performance of a least-squares regression model, as judged by out-of-sample (R^2). Shapley values give a fair attribution of the performance of a model to its input features, taking into account interdependencies between features. Evaluating the Shapley values exactly requires solving a number of regression problems that is exponential in the number of features, so a Monte Carlo-type approximation is typically used. We focus on the special case of least-squares regression models, where several tricks can be used to compute and evaluate regression models efficiently. These tricks give a substantial speed up, allowing many more Monte Carlo samples to be evaluated, achieving better accuracy. We refer to our method as least-squares Shapley performance attribution (LS-SPA), and describe our open-source implementation.
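The Monte Carlo permutation-sampling scheme the abstract alludes to can be sketched as follows. This is the plain baseline estimator, not the accelerated LS-SPA method, and every function and variable name here is illustrative rather than taken from the authors' code.

```python
# A minimal Monte Carlo sketch of Shapley attribution of out-of-sample
# R^2 across features of a least-squares model: average each feature's
# marginal R^2 gain over random feature orderings. Illustrative only,
# not the LS-SPA implementation.
import numpy as np

def r2_with(features, X_tr, y_tr, X_te, y_te):
    """Out-of-sample R^2 of a least-squares fit on a feature subset."""
    if not features:
        pred = np.full_like(y_te, y_tr.mean())  # baseline: train mean
    else:
        coef, *_ = np.linalg.lstsq(X_tr[:, features], y_tr, rcond=None)
        pred = X_te[:, features] @ coef
    ss_res = np.sum((y_te - pred) ** 2)
    ss_tot = np.sum((y_te - y_te.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def shapley_r2(X_tr, y_tr, X_te, y_te, n_samples=200, seed=0):
    """Monte Carlo Shapley attribution of out-of-sample R^2."""
    rng = np.random.default_rng(seed)
    p = X_tr.shape[1]
    phi = np.zeros(p)
    for _ in range(n_samples):
        perm = rng.permutation(p)
        prev = r2_with([], X_tr, y_tr, X_te, y_te)
        seen = []
        for j in perm:                  # add features one at a time
            seen.append(j)
            cur = r2_with(seen, X_tr, y_tr, X_te, y_te)
            phi[j] += cur - prev        # marginal contribution of j
            prev = cur
    return phi / n_samples
```

Each permutation requires `p + 1` regression fits, which is exactly the cost the paper's least-squares tricks attack: by telescoping, the attributions always sum to the full-model R² minus the baseline R², regardless of how many permutations are sampled.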

Citations: 0
On weak convergence of quantile-based empirical likelihood process for ROC curves
IF 2.2 CAS Zone 2 (Mathematics) Q2 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-07-04 DOI: 10.1007/s11222-024-10457-x
Hu Jiang, Liu Yiming, Zhou Wang

The empirical likelihood (EL) method possesses desirable qualities such as automatically determining confidence regions and circumventing the need for variance estimation. As an extension, a quantile-based EL (QEL) method is considered, which results in a simpler form. In this paper, we explore the framework of the QEL method. Firstly, we explore the weak convergence of the −2 log empirical likelihood ratio for ROC curves. We also introduce a novel statistic for testing the entire ROC curve and the equality of two distributions. To validate our approach, we conduct simulation studies and analyze real data from hepatitis C patients, comparing our method with existing ones.
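The quantile-based construction underlying the QEL framework can be illustrated with the textbook empirical ROC estimator, ROC(t) = 1 - G(F^{-1}(1 - t)), where F is the control-score distribution and G the case-score distribution. The sketch below is a standard estimator for context, not code from the paper, and its names are illustrative.

```python
# A minimal sketch of the quantile-based empirical ROC curve:
# threshold at the (1 - t)-quantile of the control scores, then report
# the fraction of case scores exceeding it. Illustrative only.
import numpy as np

def empirical_roc(controls, cases, grid):
    """Empirical ROC values at the false-positive rates in `grid`."""
    controls = np.asarray(controls, float)
    cases = np.asarray(cases, float)
    roc = np.empty(len(grid))
    for i, t in enumerate(grid):
        # threshold = empirical (1 - t)-quantile of the control scores
        thr = np.quantile(controls, 1.0 - t)
        # true-positive rate: fraction of cases above the threshold
        roc[i] = np.mean(cases > thr)
    return roc
```

When the two samples come from the same distribution the curve hugs the diagonal ROC(t) ≈ t, which is the null situation the paper's equality-of-distributions test targets; perfectly separated samples give ROC(t) = 1 for all t > 0.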

Citations: 0