
Proceedings of the ... International Conference on Machine Learning: Latest Publications

Active Policy Improvement from Multiple Black-box Oracles
Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew R. Walter, Yuxin Chen
Reinforcement learning (RL) has made significant strides in various complex domains. However, identifying an effective policy via RL often necessitates extensive exploration. Imitation learning aims to mitigate this issue by using expert demonstrations to guide exploration. In real-world scenarios, one often has access to multiple suboptimal black-box experts, rather than a single optimal oracle. These experts do not universally outperform each other across all states, presenting a challenge in actively deciding which oracle to use and in which state. We introduce MAPS and MAPS-SE, a class of policy improvement algorithms that perform imitation learning from multiple suboptimal oracles. In particular, MAPS actively selects which of the oracles to imitate and improves their value function estimates, and MAPS-SE additionally leverages an active state exploration criterion to determine which states one should explore. We provide a comprehensive theoretical analysis and demonstrate that MAPS and MAPS-SE enjoy a sample-efficiency advantage over state-of-the-art policy improvement algorithms. Empirical results show that MAPS-SE significantly accelerates policy optimization via state-wise imitation learning from multiple oracles across a broad spectrum of control tasks in the DeepMind Control Suite. Our code is publicly available at: https://github.com/ripl/maps.
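The active oracle-selection step can be illustrated with a small UCB-style sketch: maintain a running value estimate per oracle and imitate the one whose optimistic estimate is highest. This is a toy stand-in, not the MAPS algorithm itself; the bandit framing, constants, and reward model below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def select_oracle(value_means, value_counts, t, c=1.0):
    # Optimistic (UCB-style) pick of the oracle to imitate next; rarely
    # queried oracles get an exploration bonus. Illustrative only: the
    # actual MAPS selection rule operates on per-state value estimates.
    bonus = c * np.sqrt(np.log(max(t, 2)) / np.maximum(value_counts, 1))
    return int(np.argmax(value_means + bonus))

true_values = np.array([0.4, 0.6])   # two suboptimal black-box oracles (toy)
means, counts = np.zeros(2), np.zeros(2)
for t in range(1, 201):
    k = select_oracle(means, counts, t)
    reward = true_values[k] + 0.1 * rng.standard_normal()
    counts[k] += 1
    means[k] += (reward - means[k]) / counts[k]   # running mean update

print(counts[1] > counts[0])   # the better oracle ends up imitated more often
```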
DOI: 10.48550/arXiv.2306.10259 | pp. 22320-22337 | Published 2023-06-17
Citations: 1
Bootstrapped Representations in Reinforcement Learning
Charline Le Lan, Stephen Tu, Mark Rowland, A. Harutyunyan, Rishabh Agarwal, Marc G. Bellemare, Will Dabney
In reinforcement learning (RL), state representations are key to dealing with large or continuous state spaces. While one of the promises of deep learning algorithms is to automatically construct features well-tuned for the task they try to solve, such a representation might not emerge from end-to-end training of deep RL agents. To mitigate this issue, auxiliary objectives are often incorporated into the learning process and help shape the learnt state representation. Bootstrapping methods are today's method of choice to make these additional predictions. Yet, it is unclear which features these algorithms capture and how they relate to those from other auxiliary-task-based approaches. In this paper, we address this gap and provide a theoretical characterization of the state representation learnt by temporal difference learning (Sutton, 1988). Surprisingly, we find that this representation differs from the features learned by Monte Carlo and residual gradient algorithms for most transition structures of the environment in the policy evaluation setting. We describe the efficacy of these representations for policy evaluation, and use our theoretical analysis to design new auxiliary learning rules. We complement our theoretical results with an empirical comparison of these learning rules for different cumulant functions on classic domains such as the four-room domain (Sutton et al., 1999) and Mountain Car (Moore, 1990).
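The temporal-difference learning whose representation the paper characterizes can be made concrete: TD(0) bootstraps each value estimate from the next state's estimate rather than waiting for the full Monte Carlo return. A minimal sketch on a three-state chain (our toy example, not one from the paper):

```python
import numpy as np

# TD(0) policy evaluation on a 3-state chain: s0 -> s1 -> s2 (terminal),
# reward 1 on entering the terminal state, discount 1. True values: [1, 1, 0].
def td0(episodes=2000, alpha=0.1):
    V = np.zeros(3)
    for _ in range(episodes):
        s = 0
        while s < 2:
            s_next = s + 1
            r = 1.0 if s_next == 2 else 0.0
            # bootstrapped target: r + V(s'), not the full observed return
            V[s] += alpha * (r + V[s_next] - V[s])
            s = s_next
    return V

V = td0()
print(np.round(V, 2).tolist())  # approaches [1.0, 1.0, 0.0]
```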
DOI: 10.48550/arXiv.2306.10171 | pp. 18686-18713 | Published 2023-06-16
Citations: 1
From Hypergraph Energy Functions to Hypergraph Neural Networks
Yuxin Wang, Quan Gan, Xipeng Qiu, Xuanjing Huang, D. Wipf
Hypergraphs are a powerful abstraction for representing higher-order interactions between entities of interest. To exploit these relationships in making downstream predictions, a variety of hypergraph neural network architectures have recently been proposed, in large part building upon precursors from the more traditional graph neural network (GNN) literature. Somewhat differently, in this paper we begin by presenting an expressive family of parameterized, hypergraph-regularized energy functions. We then demonstrate how minimizers of these energies effectively serve as node embeddings that, when paired with a parameterized classifier, can be trained end-to-end via a supervised bilevel optimization process. Later, we draw parallels between the implicit architecture of the predictive models emerging from the proposed bilevel hypergraph optimization, and existing GNN architectures in common use. Empirically, we demonstrate state-of-the-art results on various hypergraph node classification benchmarks. Code is available at https://github.com/yxzwang/PhenomNN.
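As a much-simplified instance of a hypergraph-regularized energy, the sketch below minimizes a fidelity term plus a hyperedge-variance penalty by gradient descent and uses the minimizer as node embeddings. The energy form, names, and data are illustrative assumptions; the paper's parameterized energies and bilevel training loop are not reproduced here.

```python
import numpy as np

def hypergraph_energy_embed(X, hyperedges, lam=1.0, steps=200, lr=0.1):
    # Gradient descent on E(Y) = ||Y - X||^2 + lam * sum_e sum_{i in e} ||Y_i - mean_e||^2,
    # i.e. stay close to the input features while shrinking the spread of each
    # hyperedge. The minimizer Y serves as the node embedding.
    Y = X.astype(float).copy()
    for _ in range(steps):
        grad = 2.0 * (Y - X)
        for e in hyperedges:
            mu = Y[e].mean(axis=0)
            grad[e] += 2.0 * lam * (Y[e] - mu)   # exact gradient of the edge term
        Y -= lr * grad
    return Y

X = np.array([[0.0], [1.0], [10.0]])
Y = hypergraph_energy_embed(X, hyperedges=[[0, 1]])   # one hyperedge {0, 1}
# nodes 0 and 1 are pulled together; node 2, in no hyperedge, stays put
print(Y[2].item() == 10.0, abs(Y[0].item() - Y[1].item()) < 1.0)
```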
DOI: 10.48550/arXiv.2306.09623 | pp. 35605-35623 | Published 2023-06-16
Citations: 0
Subset Selection Based On Multiple Rankings in the Presence of Bias: Effectiveness of Fairness Constraints for Multiwinner Voting Score Functions
Niclas Boehmer, L. E. Celis, Lingxiao Huang, Anay Mehrotra, Nisheeth K. Vishnoi
We consider the problem of subset selection where one is given multiple rankings of items and the goal is to select the highest "quality" subset. Score functions from the multiwinner voting literature have been used to aggregate rankings into quality scores for subsets. We study this setting of subset selection problems when, in addition, rankings may contain systemic or unconscious biases toward a group of items. For a general model of input rankings and biases, we show that requiring the selected subset to satisfy group fairness constraints can improve the quality of the selection with respect to unbiased rankings. Importantly, we show that for fairness constraints to be effective, different multiwinner score functions may require a drastically different number of rankings: While for some functions, fairness constraints need an exponential number of rankings to recover a close-to-optimal solution, for others, this dependency is only polynomial. This result relies on a novel notion of "smoothness" of submodular functions in this setting that quantifies how well a function can "correctly" assess the quality of items in the presence of bias. The results in this paper can be used to guide the choice of multiwinner score functions for the subset selection setting considered here; we additionally provide a tool to empirically enable this.
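The setting can be made concrete with a brute-force sketch: aggregate the rankings with a Borda-style multiwinner score and pick the best size-k subset subject to group lower-bound fairness constraints. The score function and the tiny instance are our illustrative choices, not the paper's general model.

```python
from itertools import combinations

def borda_utility(subset, rankings, m):
    # Each ranking gives the item at position p a score of m - 1 - p;
    # the subset's aggregate score is the sum over rankings and members.
    return sum(m - 1 - r.index(i) for r in rankings for i in subset)

def fair_select(rankings, groups, k, lower_bounds):
    # Brute-force best size-k subset subject to per-group lower bounds,
    # a toy stand-in for fairness-constrained multiwinner selection.
    m = len(rankings[0])
    feasible = (
        s for s in combinations(range(m), k)
        if all(sum(groups[i] == g for i in s) >= lb for g, lb in lower_bounds.items())
    )
    return max(feasible, key=lambda s: borda_utility(s, rankings, m))

# Two (possibly biased) rankings over 4 items; items 2 and 3 belong to group "B".
rankings = [[0, 1, 2, 3], [1, 0, 2, 3]]
groups = ["A", "A", "B", "B"]
print(fair_select(rankings, groups, k=2, lower_bounds={}))        # unconstrained
print(fair_select(rankings, groups, k=2, lower_bounds={"B": 1}))  # >= 1 from group B
```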
DOI: 10.48550/arXiv.2306.09835 | pp. 2641-2688 | Published 2023-06-16
Citations: 1
Understanding the Role of Feedback in Online Learning with Switching Costs
Duo Cheng, Xingyu Zhou, Bo Ji
In this paper, we study the role of feedback in online learning with switching costs. It has been shown that the minimax regret is $\widetilde{\Theta}(T^{2/3})$ under bandit feedback and improves to $\widetilde{\Theta}(\sqrt{T})$ under full-information feedback, where $T$ is the length of the time horizon. However, it remains largely unknown how the amount and type of feedback generally impact regret. To this end, we first consider the setting of bandit learning with extra observations; that is, in addition to the typical bandit feedback, the learner can freely make a total of $B_{\mathrm{ex}}$ extra observations. We fully characterize the minimax regret in this setting, which exhibits an interesting phase-transition phenomenon: when $B_{\mathrm{ex}} = O(T^{2/3})$, the regret remains $\widetilde{\Theta}(T^{2/3})$, but when $B_{\mathrm{ex}} = \Omega(T^{2/3})$, it becomes $\widetilde{\Theta}(T/\sqrt{B_{\mathrm{ex}}})$, which improves as the budget $B_{\mathrm{ex}}$ increases. To design algorithms that can achieve the minimax regret, it is instructive to consider a more general setting where the learner has a budget of $B$ total observations. We fully characterize the minimax regret in this setting as well and show that it is $\widetilde{\Theta}(T/\sqrt{B})$, which scales smoothly with the total budget $B$. Furthermore, we propose a generic algorithmic framework, which enables us to design different learning algorithms that can achieve matching upper bounds for both settings based on the amount and type of feedback. One interesting finding is that while bandit feedback can still guarantee optimal regret when the budget is relatively limited, it no longer suffices to achieve optimal regret when the budget is relatively large.
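The phase transition described in the abstract can be written down directly as a rate calculation (ignoring logarithmic factors); the function below restates the bounds, with our own toy numbers.

```python
import math

def minimax_regret_rate(T, B_ex):
    # Minimax regret rate (up to log factors) for bandit feedback plus B_ex
    # free extra observations: T^(2/3) while B_ex = O(T^(2/3)), then
    # T / sqrt(B_ex) once the budget exceeds that threshold.
    return min(T ** (2 / 3), T / math.sqrt(max(B_ex, 1)))

T = 10 ** 6
print(minimax_regret_rate(T, 100) == T ** (2 / 3))    # small budget: bandit rate
print(minimax_regret_rate(T, T) == T / math.sqrt(T))  # large budget: improved rate
```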
DOI: 10.48550/arXiv.2306.09588 | pp. 5521-5543 | Published 2023-06-16
Citations: 0
Nearly-Optimal Hierarchical Clustering for Well-Clustered Graphs
Steinar Laenen, Bogdan-Adrian Manghiuc, He Sun
This paper presents two efficient hierarchical clustering (HC) algorithms with respect to Dasgupta's cost function. For any input graph $G$ with a clear cluster-structure, our designed algorithms run in nearly-linear time in the input size of $G$, and return an $O(1)$-approximate HC tree with respect to Dasgupta's cost function. We compare the performance of our algorithms against the previous state-of-the-art on synthetic and real-world datasets and show that they produce comparable or better HC trees with much lower running time.
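Dasgupta's cost function, the objective these algorithms approximate, is easy to state: each graph edge pays its weight times the number of leaves under the lowest common ancestor of its endpoints in the HC tree. A minimal sketch on nested-tuple trees (our own encoding, chosen for brevity):

```python
def leaves(t):
    # Leaf set of a nested-tuple HC tree such as ((0, 1), 2).
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def dasgupta_cost(tree, edges):
    # Dasgupta's cost: each graph edge (i, j, w) pays w times the number of
    # leaves under the lowest common ancestor of i and j in the HC tree.
    if not isinstance(tree, tuple):
        return 0.0
    L, R = leaves(tree[0]), leaves(tree[1])
    crossing = sum(w for i, j, w in edges if (i in L) != (j in L))
    return (crossing * (len(L) + len(R))
            + dasgupta_cost(tree[0], [e for e in edges if e[0] in L and e[1] in L])
            + dasgupta_cost(tree[1], [e for e in edges if e[0] in R and e[1] in R]))

# Toy triangle with one heavy edge; merging the heavy pair first is cheaper.
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 1.0)]
print(dasgupta_cost(((0, 1), 2), edges))  # 10.0
print(dasgupta_cost(((0, 2), 1), edges))  # 11.0
```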
DOI: 10.48550/arXiv.2306.09950 | pp. 18207-18249 | Published 2023-06-16
Citations: 2
Simplified Temporal Consistency Reinforcement Learning
Yi Zhao, Wenshuai Zhao, Rinu Boney, Juho Kannala, J. Pajarinen
Reinforcement learning is able to solve complex sequential decision-making tasks but is currently limited by sample efficiency and required computation. To improve sample efficiency, recent work focuses on model-based RL, which interleaves model learning with planning. Recent methods further utilize policy learning, value estimation, and self-supervised learning as auxiliary objectives. In this paper we show that, surprisingly, a simple representation learning approach relying only on a latent dynamics model trained by latent temporal consistency is sufficient for high-performance RL. This applies when using pure planning with a dynamics model conditioned on the representation, but also when utilizing the representation as policy and value function features in model-free RL. In experiments, our approach learns an accurate dynamics model to solve challenging high-dimensional locomotion tasks with online planners while being 4.1 times faster to train compared to ensemble-based methods. With model-free RL without planning, especially on high-dimensional tasks, such as the DeepMind Control Suite Humanoid and Dog tasks, our approach outperforms model-free methods by a large margin and matches model-based methods' sample efficiency while training 2.4 times faster.
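The latent temporal-consistency objective can be sketched in a linear toy setting: fit a latent dynamics matrix so that the predicted next latent matches the (stop-gradient) encoding of the true next state. Shapes, names, and the linear encoder below are our simplifications, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear environment: s' = A s (deterministic, reward-free for the sketch).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
S = rng.standard_normal((256, 2))          # batch of states
S_next = S @ A.T

W = 0.5 * rng.standard_normal((2, 2))      # fixed linear "encoder" z = W s
Z, Z_target = S @ W.T, S_next @ W.T        # targets held constant (stop-gradient)
M = np.eye(2)                              # latent dynamics model, to be learned

def loss(M):
    # Latent temporal consistency: predicted next latent vs. target encoding.
    return np.mean((Z @ M.T - Z_target) ** 2)

loss0 = loss(M)
for _ in range(500):
    err = Z @ M.T - Z_target
    M -= 0.05 * (err.T @ Z) / len(S)       # gradient step on the consistency loss
print(loss(M) < loss0)                     # consistency improves with training
```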
DOI: 10.48550/arXiv.2306.09466 | pp. 42227-42246 | Published 2023-06-15
Citations: 0
Feed Two Birds with One Scone: Exploiting Wild Data for Both Out-of-Distribution Generalization and Detection
Haoyue Bai, Gregory H. Canal, Xuefeng Du, Jeongyeol Kwon, R. Nowak, Yixuan Li
Modern machine learning models deployed in the wild can encounter both covariate and semantic shifts, giving rise to the problems of out-of-distribution (OOD) generalization and OOD detection respectively. While both problems have received significant research attention lately, they have been pursued independently. This may not be surprising, since the two tasks have seemingly conflicting goals. This paper provides a new unified approach that is capable of simultaneously generalizing to covariate shifts while robustly detecting semantic shifts. We propose a margin-based learning framework that exploits freely available unlabeled data in the wild that captures the environmental test-time OOD distributions under both covariate and semantic shifts. We show both empirically and theoretically that the proposed margin constraint is the key to achieving both OOD generalization and detection. Extensive experiments show the superiority of our framework, outperforming competitive baselines that specialize in either OOD generalization or OOD detection. Code is publicly available at https://github.com/deeplearning-wisc/scone.
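The margin idea can be sketched with a hinge-style penalty: in-distribution samples should score below one threshold and wild samples above another. This is a simplified stand-in for the paper's margin constraint; the thresholds, the scalar "energy" scores, and the functional form are illustrative assumptions.

```python
import numpy as np

def margin_ood_loss(energy_id, energy_wild, m_in=-5.0, m_out=-1.0):
    # Squared-hinge margin penalty: in-distribution samples should score
    # below m_in, wild/OOD samples above m_out. Zero loss once both margins
    # are satisfied; a toy stand-in for the paper's margin constraint.
    loss_id = np.maximum(0.0, energy_id - m_in) ** 2
    loss_wild = np.maximum(0.0, m_out - energy_wild) ** 2
    return loss_id.mean() + loss_wild.mean()

e_id = np.array([-7.0, -6.0])      # confidently in-distribution: no penalty
e_wild = np.array([0.0, -0.5])     # above m_out: no penalty
print(margin_ood_loss(e_id, e_wild))            # 0.0
print(margin_ood_loss(e_id, np.array([-3.0])))  # 4.0: wild sample inside the margin
```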
DOI: 10.48550/arXiv.2306.09158 | pp. 1454-1471 | Published 2023-06-15
Citations: 3
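As a rough illustration of the margin idea in the abstract above — not the paper's actual training objective — one can score samples with a free-energy function and flag semantic OOD inputs whose energy exceeds a threshold. The threshold value below is hypothetical; the paper instead learns a margin constraint from wild unlabeled data.

```python
import numpy as np

def energy_score(logits, T=1.0):
    """Free-energy OOD score: E(x) = -T * logsumexp(logits / T).
    Lower energy means the classifier is more confident, i.e. the
    input is more likely in-distribution."""
    m = logits.max()  # subtract the max for numerical stability
    return -T * (m / T + np.log(np.sum(np.exp((logits - m) / T))))

def is_ood(logits, margin=0.0):
    """Flag a sample as semantically OOD when its energy exceeds a
    margin threshold (hypothetical fixed threshold for illustration)."""
    return energy_score(logits) > margin

confident = np.array([10.0, 0.0, 0.0])  # peaked logits -> low energy
uniform = np.array([0.0, 0.0, 0.0])     # flat logits   -> higher energy
```

With a margin of, say, -5.0, the flat-logit sample is flagged as OOD while the confident one is not; in the paper the margin is enforced during training rather than fixed post hoc.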
A Gromov-Wasserstein Geometric View of Spectrum-Preserving Graph Coarsening
Yifan Chen, Rentian Yao, Yun Yang, Jie Chen
Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element in a metric space equipped with the Gromov--Wasserstein (GW) distance, and bound the difference between the GW distance of two graphs and that of their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with a proper choice of the kernel. The study includes a set of experiments that support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening.
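As background for the coarsening operation the abstract refers to, here is a minimal sketch (not the paper's method): given a clustering of the nodes — e.g. one produced by weighted kernel $K$-means — the coarsened adjacency is obtained from the partition matrix as $A_c = P^T A P$.

```python
import numpy as np

def coarsen_graph(A, labels):
    """Coarsen adjacency matrix A given a node clustering.

    labels[i] is the cluster of node i.  With partition matrix P
    (P[i, c] = 1 iff node i is in cluster c), the coarsened adjacency
    is A_c = P^T A P, which sums the edge weights between (and
    within) clusters."""
    n = A.shape[0]
    k = labels.max() + 1
    P = np.zeros((n, k))
    P[np.arange(n), labels] = 1.0
    return P.T @ A @ P

# A 4-node graph coarsened into 2 super-nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
A_c = coarsen_graph(A, labels)
```

The diagonal of `A_c` accumulates intra-cluster edge weight and the off-diagonal entries the inter-cluster weight; the paper's contribution is choosing the clustering so that GW distances between graphs are preserved under this map.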
{"title":"A Gromov-Wasserstein Geometric View of Spectrum-Preserving Graph Coarsening","authors":"Yifan Chen, Rentian Yao, Yun Yang, Jie Chen","doi":"10.48550/arXiv.2306.08854","DOIUrl":"https://doi.org/10.48550/arXiv.2306.08854","url":null,"abstract":"Graph coarsening is a technique for solving large-scale graph problems by working on a smaller version of the original graph, and possibly interpolating the results back to the original graph. It has a long history in scientific computing and has recently gained popularity in machine learning, particularly in methods that preserve the graph spectrum. This work studies graph coarsening from a different perspective, developing a theory for preserving graph distances and proposing a method to achieve this. The geometric approach is useful when working with a collection of graphs, such as in graph classification and regression. In this study, we consider a graph as an element on a metric space equipped with the Gromov--Wasserstein (GW) distance, and bound the difference between the distance of two graphs and their coarsened versions. Minimizing this difference can be done using the popular weighted kernel $K$-means method, which improves existing spectrum-preserving methods with the proper choice of the kernel. The study includes a set of experiments to support the theory and method, including approximating the GW distance, preserving the graph spectrum, classifying graphs using spectral information, and performing regression using graph convolutional networks. Code is available at https://github.com/ychen-stat-ml/GW-Graph-Coarsening .","PeriodicalId":74529,"journal":{"name":"Proceedings of the ... International Conference on Machine Learning. 
International Conference on Machine Learning","volume":"1 1","pages":"5257-5281"},"PeriodicalIF":0.0,"publicationDate":"2023-06-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83745708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Nearly Optimal Algorithms with Sublinear Computational Complexity for Online Kernel Regression
Junfan Li, Shizhong Liao
The trade-off between regret and computational cost is a fundamental problem in online kernel regression, and previous algorithms that address this trade-off cannot maintain optimal regret bounds at a sublinear computational complexity. In this paper, we propose two new algorithms, AOGD-ALD and NONS-ALD, which keep nearly optimal regret bounds at a sublinear computational complexity, and we give sufficient conditions under which our algorithms work. Both algorithms dynamically maintain a set of nearly orthogonal basis functions used to approximate the kernel mapping, and keep nearly optimal regret bounds by controlling the approximation error. The number of basis functions depends on the approximation error and on the decay rate of the eigenvalues of the kernel matrix. If the eigenvalues decay exponentially, then AOGD-ALD and NONS-ALD achieve regrets of $O(\sqrt{L(f)})$ and $O(\mathrm{d}_{\mathrm{eff}}(\mu)\ln{T})$, respectively, at a computational complexity in $O(\ln^2{T})$. If the eigenvalues decay polynomially with degree $p\geq 1$, then our algorithms keep the same regret bounds at a computational complexity in $o(T)$ in the cases of $p>4$ and $p\geq 10$, respectively. Here $L(f)$ is the cumulative loss of $f$ and $\mathrm{d}_{\mathrm{eff}}(\mu)$ is the effective dimension of the problem. The two regret bounds are nearly optimal and are not comparable.
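The "ALD" suffix in the algorithm names suggests an approximate-linear-dependence test for maintaining the nearly orthogonal basis. The following is a generic sketch of such a test — the kernel, threshold, and admission rule are illustrative, not the paper's exact procedure: a new point is admitted to the dictionary only if its feature-space projection residual onto the current basis exceeds a tolerance.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def ald_admits(dictionary, x, nu=0.1, gamma=1.0):
    """Approximate-linear-dependence test: admit x as a new basis
    element only if phi(x) cannot be approximated by the span of the
    current dictionary within squared residual nu."""
    if not dictionary:
        return True
    K = np.array([[gaussian_kernel(a, b, gamma) for b in dictionary]
                  for a in dictionary])
    k = np.array([gaussian_kernel(a, x, gamma) for a in dictionary])
    # best coefficients for projecting phi(x) onto the dictionary span
    coef = np.linalg.solve(K + 1e-10 * np.eye(len(dictionary)), k)
    residual = gaussian_kernel(x, x, gamma) - k @ coef
    return residual > nu

dictionary = []
for x in [np.array([0.0]), np.array([0.01]), np.array([5.0])]:
    if ald_admits(dictionary, x):
        dictionary.append(x)
# x = 0.01 is nearly dependent on x = 0.0 and is discarded;
# x = 5.0 is nearly orthogonal to the basis and is kept.
```

Keeping only nearly orthogonal elements is what bounds the dictionary size — and hence the per-round cost — in terms of the eigenvalue decay of the kernel matrix, as the abstract describes.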
{"title":"Nearly Optimal Algorithms with Sublinear Computational Complexity for Online Kernel Regression","authors":"Junfan Li, Shizhong Liao","doi":"10.48550/arXiv.2306.08320","DOIUrl":"https://doi.org/10.48550/arXiv.2306.08320","url":null,"abstract":"The trade-off between regret and computational cost is a fundamental problem for online kernel regression, and previous algorithms worked on the trade-off can not keep optimal regret bounds at a sublinear computational complexity. In this paper, we propose two new algorithms, AOGD-ALD and NONS-ALD, which can keep nearly optimal regret bounds at a sublinear computational complexity, and give sufficient conditions under which our algorithms work. Both algorithms dynamically maintain a group of nearly orthogonal basis used to approximate the kernel mapping, and keep nearly optimal regret bounds by controlling the approximate error. The number of basis depends on the approximate error and the decay rate of eigenvalues of the kernel matrix. If the eigenvalues decay exponentially, then AOGD-ALD and NONS-ALD separately achieves a regret of $O(sqrt{L(f)})$ and $O(mathrm{d}_{mathrm{eff}}(mu)ln{T})$ at a computational complexity in $O(ln^2{T})$. If the eigenvalues decay polynomially with degree $pgeq 1$, then our algorithms keep the same regret bounds at a computational complexity in $o(T)$ in the case of $p>4$ and $pgeq 10$, respectively. $L(f)$ is the cumulative losses of $f$ and $mathrm{d}_{mathrm{eff}}(mu)$ is the effective dimension of the problem. The two regret bounds are nearly optimal and are not comparable.","PeriodicalId":74529,"journal":{"name":"Proceedings of the ... International Conference on Machine Learning. 
International Conference on Machine Learning","volume":"79 1","pages":"19743-19766"},"PeriodicalIF":0.0,"publicationDate":"2023-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86065078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0