
J. Mach. Learn. Res. — Latest Publications

Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics
Pub Date : 2022-10-12 DOI: 10.48550/arXiv.2210.06226
Kamélia Daudel, Joe Benton, Yuyang Shi, A. Doucet
Several algorithms involving the Variational Rényi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
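A minimal numerical illustration of the bound family discussed above: the sketch below estimates an importance-weighted bound with reparameterized samples for a one-dimensional toy model. The Rényi-style reweighting with parameter `alpha` (with `alpha = 0` assumed to recover the standard IWAE bound) is our reading of the abstract, not the paper's exact formulation, and the Gaussian prior, likelihood and variational family are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_joint(z, x):
    # Toy unnormalized target log p(x, z): standard-normal prior, Gaussian likelihood.
    return (-0.5 * z**2 - 0.5 * np.log(2 * np.pi)
            - 0.5 * (x - z)**2 - 0.5 * np.log(2 * np.pi))

def log_q(z, mu, sigma):
    # Gaussian variational density log q(z | x).
    return -0.5 * ((z - mu) / sigma)**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

def vr_iwae_bound(x, mu, sigma, N=16, alpha=0.5, n_mc=1000):
    """Monte-Carlo estimate of an IWAE-style bound with N importance samples.

    alpha = 0 is assumed to give the standard IWAE bound log((1/N) sum_i w_i);
    alpha in (0, 1) applies a Renyi-style reweighting (hedged reading of the abstract)."""
    vals = []
    for _ in range(n_mc):
        z = mu + sigma * rng.standard_normal(N)        # reparameterized samples
        log_w = log_joint(z, x) - log_q(z, mu, sigma)  # log importance weights
        if alpha == 0.0:
            vals.append(np.log(np.mean(np.exp(log_w))))
        else:
            vals.append(np.log(np.mean(np.exp((1 - alpha) * log_w))) / (1 - alpha))
    return np.mean(vals)

print(vr_iwae_bound(x=1.0, mu=0.5, sigma=1.0, N=16, alpha=0.5))
```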
Citations: 1
Multi-Task Dynamical Systems
Pub Date : 2022-10-08 DOI: 10.48550/arXiv.2210.04023
Alex Bird
Time series datasets are often composed of a variety of sequences from the same domain, but from different entities, such as individuals, products, or organizations. We are interested in how time series models can be specialized to individual sequences (capturing the specific characteristics) while still retaining statistical power by sharing commonalities across the sequences. This paper describes the multi-task dynamical system (MTDS), a general methodology for extending multi-task learning (MTL) to time series models. Our approach endows dynamical systems with a set of hierarchical latent variables which can modulate all model parameters. To our knowledge, this is a novel development of MTL, and applies to time series both with and without control inputs. We apply the MTDS to motion-capture data of people walking in various styles using a multi-task recurrent neural network (RNN), and to patient drug-response data using a multi-task pharmacodynamic model.
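As a rough picture of the modulation idea described above, the numpy sketch below maps a per-sequence latent vector through a shared affine map to the full parameter vector of a small vanilla RNN, so every entity gets its own dynamics while the map itself is shared. All names, dimensions, and the affine form of the map are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

d_latent, d_hidden, d_obs = 2, 8, 3
n_params = d_hidden * d_hidden + d_hidden * d_obs + d_hidden   # W_h, W_x, b

# Shared hyper-parameters: an affine map from the task latent z to the RNN weights.
H = rng.standard_normal((n_params, d_latent)) * 0.1
b0 = rng.standard_normal(n_params) * 0.1

def unpack(theta):
    """Split the flat parameter vector into RNN weight matrices."""
    i = 0
    W_h = theta[i:i + d_hidden * d_hidden].reshape(d_hidden, d_hidden); i += d_hidden * d_hidden
    W_x = theta[i:i + d_hidden * d_obs].reshape(d_hidden, d_obs); i += d_hidden * d_obs
    b = theta[i:i + d_hidden]
    return W_h, W_x, b

def rollout(z_task, x_seq):
    """Run the task-specific RNN obtained by modulating all parameters with z_task."""
    W_h, W_x, b = unpack(H @ z_task + b0)
    h = np.zeros(d_hidden)
    hs = []
    for x_t in x_seq:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hs.append(h)
    return np.stack(hs)

x_seq = rng.standard_normal((5, d_obs))                      # one toy sequence
print(rollout(rng.standard_normal(d_latent), x_seq).shape)   # (5, 8)
```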
Citations: 0
A Unified Framework for Optimization-Based Graph Coarsening
Pub Date : 2022-10-02 DOI: 10.48550/arXiv.2210.00437
Manoj Kumar, Anurag Sharma, Surinder Kumar
Graph coarsening is a widely used dimensionality reduction technique for approaching large-scale graph machine learning problems. Given a large graph, graph coarsening aims to learn a smaller, tractable graph while preserving the properties of the originally given graph. Graph data consist of node features and a graph matrix (e.g., adjacency or Laplacian). The existing graph coarsening methods ignore the node features and rely solely on a graph matrix to simplify graphs. In this paper, we introduce a novel optimization-based framework for graph dimensionality reduction. The proposed framework lies in the unification of graph learning and dimensionality reduction. It takes both the graph matrix and the node features as the input and learns the coarsened graph matrix and the coarsened feature matrix jointly while ensuring desired properties. The proposed optimization formulation is a multi-block non-convex optimization problem, which is solved efficiently by leveraging block majorization-minimization, $\log$-determinant, Dirichlet energy, and regularization frameworks. The proposed algorithms are provably convergent and practically amenable to numerous tasks. It is also established that the learned coarsened graph is $\epsilon \in (0,1)$ similar to the original graph. Extensive experiments elucidate the efficacy of the proposed framework for real-world applications.
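The small numpy example below spells out the objects the abstract refers to: a graph Laplacian $L$, node features $X$, a coarsening matrix $P$ mapping nodes to super-nodes, the coarsened matrices $P^\top L P$ and $P^\top X$, and the Dirichlet energy $\mathrm{tr}(X^\top L X)$ measuring feature smoothness. The joint optimization of coarsening and features proposed in the paper is not reproduced; $P$ is simply fixed by hand here.

```python
import numpy as np

# Toy graph: 4 nodes on a path, 2-dimensional node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A                     # combinatorial Laplacian
X = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]])

# A fixed partition matrix mapping 4 nodes to 2 super-nodes (columns normalized).
P = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
P /= np.sqrt((P**2).sum(axis=0, keepdims=True))

L_c = P.T @ L @ P                                  # coarsened graph matrix
X_c = P.T @ X                                      # coarsened node features

dirichlet = np.trace(X.T @ L @ X)                  # feature smoothness on the original graph
dirichlet_c = np.trace(X_c.T @ L_c @ X_c)          # ... and on the coarsened graph
print(L_c, dirichlet, dirichlet_c)
```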
Citations: 1
Faster Randomized Interior Point Methods for Tall/Wide Linear Programs
Pub Date : 2022-09-19 DOI: 10.48550/arXiv.2209.08722
Agniva Chowdhury, Gregory Dexter, Palma London, H. Avron, P. Drineas
Linear programming (LP) is an extremely useful tool which has been successfully applied to solve various problems in a wide range of areas, including operations research, engineering, economics, or even more abstract mathematical areas such as combinatorics. It is also used in many machine learning applications, such as $\ell_1$-regularized SVMs, basis pursuit, nonnegative matrix factorization, etc. Interior Point Methods (IPMs) are one of the most popular methods to solve LPs both in theory and in practice. Their underlying complexity is dominated by the cost of solving a system of linear equations at each iteration. In this paper, we consider both feasible and infeasible IPMs for the special case where the number of variables is much larger than the number of constraints. Using tools from Randomized Linear Algebra, we present a preconditioning technique that, when combined with the iterative solvers such as Conjugate Gradient or Chebyshev Iteration, provably guarantees that IPM algorithms (suitably modified to account for the error incurred by the approximate solver), converge to a feasible, approximately optimal solution, without increasing their iteration complexity. Our empirical evaluations verify our theoretical results on both real-world and synthetic data.
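A hedged sketch of the computational step described above for a wide LP ($d \gg n$): each IPM iteration solves a system with the matrix $A D^2 A^\top$, and a randomized sketch of $AD$ with far fewer columns yields a preconditioner used inside conjugate gradient. The Gaussian sketch, the sketch size, and the hand-rolled PCG below are illustrative choices, not the exact construction of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 50, 2000                        # wide LP: few constraints, many variables
A = rng.standard_normal((n, d))
Dv = rng.uniform(0.5, 2.0, size=d)     # diagonal of D at the current IPM iterate
rhs = rng.standard_normal(n)

def matvec(x):
    # Applies (A D^2 A^T) x without forming the n x n matrix explicitly.
    return A @ (Dv**2 * (A.T @ x))

# Randomized preconditioner: sketch the d columns of A D down to s << d columns.
s = 4 * n
S = rng.standard_normal((d, s)) / np.sqrt(s)
B = (A * Dv) @ S                                       # n x s sketch of A D
Q_inv = np.linalg.inv(B @ B.T + 1e-8 * np.eye(n))      # small ridge for safety

def pcg(op, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient for a symmetric positive definite operator."""
    x = np.zeros_like(b)
    r = b - op(x)
    z = precond @ r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = op(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = precond @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

x = pcg(matvec, rhs, Q_inv)
print(np.linalg.norm(matvec(x) - rhs))   # residual of the normal-equations solve
```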
Citations: 3
Small Transformers Compute Universal Metric Embeddings
Pub Date : 2022-09-14 DOI: 10.48550/arXiv.2209.06788
Anastasis Kratsios, Valentin Debarnot, Ivan Dokmanić
We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures with a transport metric (Delon and Desolneux 2020). We derive embedding guarantees for feature maps implemented by small neural networks called probabilistic transformers. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-Hölder embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If $\mathcal{X}$'s geometry is sufficiently regular, we obtain stronger, bi-Lipschitz guarantees for all points in the dataset. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers can compute bi-Hölder embeddings with arbitrarily small distortion.
Citations: 6
Statistical Comparisons of Classifiers by Generalized Stochastic Dominance
Pub Date : 2022-09-05 DOI: 10.48550/arXiv.2209.01857
C. Jansen, Malte Nalenz, G. Schollmeyer, Thomas Augustin
Although this is a crucial question for the development of machine learning algorithms, there is still no consensus on how to compare classifiers over multiple data sets with respect to several criteria. Every comparison framework is confronted with (at least) three fundamental challenges: the multiplicity of quality criteria, the multiplicity of data sets, and the randomness of the selection of data sets. In this paper, we add a fresh view to the vivid debate by adopting recent developments in decision theory. Based on so-called preference systems, our framework ranks classifiers by a generalized concept of stochastic dominance, which powerfully circumvents the cumbersome, and often even self-contradictory, reliance on aggregates. Moreover, we show that generalized stochastic dominance can be operationalized by solving easy-to-handle linear programs and statistically tested by employing an adapted two-sample observation-randomization test. This yields a powerful framework for the statistical comparison of classifiers over multiple data sets with respect to multiple quality criteria simultaneously. We illustrate and investigate our framework in a simulation study and with a set of standard benchmark data sets.
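To make the dominance idea concrete, the sketch below checks only the classical special case: empirical first-order stochastic dominance of one classifier over another on a single criterion (accuracy across data sets), by comparing empirical CDFs. The generalized, multi-criteria notion and the linear-programming and randomization tests of the paper are not implemented here, and the numbers are made up.

```python
import numpy as np

def dominates_first_order(scores_a, scores_b):
    """Empirical first-order stochastic dominance of A over B on one criterion:
    the empirical CDF of A lies at or below that of B at every threshold
    (A puts at least as much mass on high scores everywhere)."""
    grid = np.union1d(scores_a, scores_b)
    cdf_a = np.array([(scores_a <= t).mean() for t in grid])
    cdf_b = np.array([(scores_b <= t).mean() for t in grid])
    return bool(np.all(cdf_a <= cdf_b))

# Accuracies of two classifiers over the same collection of data sets (toy numbers).
acc_a = np.array([0.81, 0.77, 0.92, 0.88, 0.73])
acc_b = np.array([0.79, 0.71, 0.90, 0.85, 0.70])
print(dominates_first_order(acc_a, acc_b))   # True: A dominates B on these toy numbers
```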
Citations: 5
Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization
Pub Date : 2022-08-22 DOI: 10.48550/arXiv.2208.10025
Zhize Li, Jian Li
We propose and analyze several stochastic gradient algorithms for finding stationary points or local minima in nonconvex, possibly nonsmooth-regularized, finite-sum and online optimization problems. First, we propose a simple proximal stochastic gradient algorithm based on variance reduction called ProxSVRG+. We provide a clean and tight analysis of ProxSVRG+, which shows that it outperforms the deterministic proximal gradient descent (ProxGD) for a wide range of minibatch sizes, hence solving an open problem proposed in Reddi et al. (2016b). Also, ProxSVRG+ uses far fewer proximal oracle calls than ProxSVRG (Reddi et al., 2016b) and extends to the online setting by avoiding full gradient computations. Then, we further propose an optimal algorithm, called SSRGD, based on SARAH (Nguyen et al., 2017) and show that SSRGD further improves the gradient complexity of ProxSVRG+ and achieves the optimal upper bound, matching the known lower bound of (Fang et al., 2018; Li et al., 2021). Moreover, we show that both ProxSVRG+ and SSRGD enjoy automatic adaptation to local structure of the objective function, such as the Polyak-Łojasiewicz (PL) condition for nonconvex functions in the finite-sum case, i.e., we prove that both of them can automatically switch to faster global linear convergence without performing any restart as in the prior work ProxSVRG (Reddi et al., 2016b). Finally, we focus on the more challenging problem of finding an $(\epsilon, \delta)$-local minimum instead of just finding an $\epsilon$-approximate (first-order) stationary point (which may be a bad unstable saddle point). We show that SSRGD can find an $(\epsilon, \delta)$-local minimum by simply adding some random perturbations. Our algorithm is almost as simple as its counterpart for finding stationary points, and achieves similar optimal rates.
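As a concrete, hedged illustration of a variance-reduced proximal stochastic gradient loop in the spirit of ProxSVRG+, the numpy sketch below solves a small $\ell_1$-regularized least-squares problem, where the proximal operator is soft-thresholding. The epoch/minibatch schedule, step size, and the SVRG-style full-gradient snapshot are illustrative simplifications rather than the algorithm as analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Finite-sum problem: (1/n) sum_i 0.5 * (a_i^T x - b_i)^2 + lam * ||x||_1
n, d, lam = 200, 20, 0.05
A = rng.standard_normal((n, d))
b = A @ (rng.standard_normal(d) * (rng.random(d) < 0.3)) + 0.1 * rng.standard_normal(n)

def grad_batch(x, idx):
    # Minibatch gradient of the smooth part over the rows in idx.
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)

def prox_l1(x, t):
    # Proximal operator of t * lam * ||.||_1 (soft-thresholding).
    return np.sign(x) * np.maximum(np.abs(x) - t * lam, 0.0)

def prox_svrg_plus(epochs=20, m=50, batch=10, eta=0.05):
    x = np.zeros(d)
    for _ in range(epochs):
        x_ref = x.copy()
        full_grad = grad_batch(x_ref, np.arange(n))   # snapshot full gradient
        for _ in range(m):
            idx = rng.integers(0, n, size=batch)
            # Variance-reduced gradient estimator.
            v = grad_batch(x, idx) - grad_batch(x_ref, idx) + full_grad
            x = prox_l1(x - eta * v, eta)
        # (ProxSVRG+ is said to avoid full gradients in the online setting;
        #  this sketch keeps the simpler SVRG-style snapshot.)
    return x

print(np.round(prox_svrg_plus(), 3))
```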
Citations: 5
Lifted Bregman Training of Neural Networks
Pub Date : 2022-08-18 DOI: 10.48550/arXiv.2208.08772
Xiaoyu Wang, M. Benning
We introduce a novel mathematical formulation for the training of feed-forward neural networks with (potentially non-smooth) proximal maps as activation functions. This formulation is based on Bregman distances, and a key advantage is that its partial derivatives with respect to the network's parameters do not require the computation of derivatives of the network's activation functions. Instead of estimating the parameters with a combination of a first-order optimisation method and back-propagation (as is the state of the art), we propose the use of non-smooth first-order optimisation methods that exploit the specific structure of the novel formulation. We present several numerical results which demonstrate that these training approaches can be equally well or even better suited for the training of neural network-based classifiers and (denoising) autoencoders with sparse coding, compared to more conventional training frameworks.
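A tiny check of the abstract's starting point that common activations are proximal maps: ReLU coincides with the proximal map (Euclidean projection) of the indicator function of the nonnegative orthant, and soft-thresholding is the proximal map of a scaled $\ell_1$ norm, the kind of sparse-coding activation alluded to above. The lifted Bregman training objective itself is not reproduced here.

```python
import numpy as np

def prox_nonneg(v):
    # Proximal map of the indicator of {x >= 0}: Euclidean projection onto it.
    return np.maximum(v, 0.0)

def relu(v):
    return np.maximum(v, 0.0)

def soft_threshold(v, t=0.5):
    # Proximal map of t * ||.||_1: the shrinkage activation used in sparse coding.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

v = np.array([-1.5, -0.2, 0.0, 0.7, 2.3])
assert np.allclose(relu(v), prox_nonneg(v))   # ReLU is itself a proximal map
print(prox_nonneg(v), soft_threshold(v))
```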
Citations: 3
Robust methods for high-dimensional linear learning
Pub Date : 2022-08-10 DOI: 10.48550/arXiv.2208.05447
Ibrahim Merad, Stéphane Gaïffas
We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting, where the number of features $d$ may exceed the sample size $n$. We employ, in a generic learning setting, two algorithms depending on whether the considered loss function is gradient-Lipschitz or not. Then, we instantiate our framework on several applications including vanilla sparse, group-sparse and low-rank matrix recovery. This leads, for each application, to efficient and robust learning algorithms that reach near-optimal estimation rates under heavy-tailed distributions and in the presence of outliers. For vanilla $s$-sparsity, we are able to reach the $s\log(d)/n$ rate under heavy tails and $\eta$-corruption, at a computational cost comparable to that of non-robust analogs. We provide an efficient implementation of our algorithms in an open-source $\mathtt{Python}$ library called $\mathtt{linlearn}$, by means of which we carry out numerical experiments that confirm our theoretical findings, together with a comparison to other recent approaches proposed in the literature.
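The abstract does not spell out its robust estimators, so the sketch below plugs one standard choice, a median-of-means gradient estimator, into plain gradient descent for linear regression with heavy-tailed noise and a few gross outliers, purely to illustrate the kind of robustness being claimed. It is not the paper's algorithm and not the $\mathtt{linlearn}$ API.

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 500, 10
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + rng.standard_t(df=2.0, size=n)   # heavy-tailed noise
y[:10] += 50.0                                    # a few gross outliers

def mom_gradient(w, n_blocks=20):
    """Median-of-means estimate of the squared-loss gradient: split samples into
    blocks, average within each block, take the coordinatewise median."""
    idx = rng.permutation(n)
    blocks = np.array_split(idx, n_blocks)
    grads = [X[b].T @ (X[b] @ w - y[b]) / len(b) for b in blocks]
    return np.median(np.stack(grads), axis=0)

w = np.zeros(d)
for _ in range(300):
    w -= 0.05 * mom_gradient(w)
print(np.linalg.norm(w - w_true))   # estimation error despite heavy tails and outliers
```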
Citations: 1
Mappings for Marginal Probabilities with Applications to Models in Statistical Physics
Pub Date : 2022-08-10 DOI: 10.48550/arXiv.2208.05333
Mehdi Molkaraie
We present local mappings that relate the marginal probabilities of a global probability mass function represented by its primal normal factor graph to the corresponding marginal probabilities in its dual normal factor graph. The mapping is based on the Fourier transform of the local factors of the models. Details of the mapping are provided for the Ising model, where it is proved that the local extrema of the fixed points are attained at the phase transition of the two-dimensional nearest-neighbor Ising model. The results are further extended to the Potts model, to the clock model, and to Gaussian Markov random fields. By employing the mapping, we can transform simultaneously all the estimated marginal probabilities from the dual domain to the primal domain (and vice versa), which is advantageous if estimating the marginals can be carried out more efficiently in the dual domain. An example of particular significance is the ferromagnetic Ising model in a positive external magnetic field. For this model, there exists a rapidly mixing Markov chain (called the subgraphs-world process) to generate configurations in the dual normal factor graph of the model. Our numerical experiments illustrate that the proposed procedure can provide more accurate estimates of marginal probabilities of a global probability mass function in various settings.
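A worked micro-example of the local transform the abstract relies on, for the Ising case: the $2\times 2$ Hadamard matrix is the Fourier transform on $\{+1,-1\}$, and applying it on both sides of a pairwise factor table $\exp(J x_i x_j)$ gives a diagonal dual factor proportional to $\mathrm{diag}(\cosh J, \sinh J)$, whose ratio $\tanh J$ is the familiar weight of the high-temperature (dual) expansion. Normalization conventions may differ from the paper's.

```python
import numpy as np

J = 0.4                                          # coupling of one Ising edge
H2 = np.array([[1, 1], [1, -1]], dtype=float)    # Hadamard = Fourier transform on {+1, -1}

# Local factor table f(x_i, x_j) = exp(J * x_i * x_j), rows/cols indexed by (+1, -1).
f = np.array([[np.exp(J), np.exp(-J)],
              [np.exp(-J), np.exp(J)]])

f_dual = H2 @ f @ H2                             # 2-D Fourier transform of the local factor
print(f_dual)
# Diagonal, proportional to diag(cosh(J), sinh(J)); their ratio is tanh(J).
print(np.allclose(f_dual, np.diag([4 * np.cosh(J), 4 * np.sinh(J)])))
```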
Citations: 0