
International Conference on Algorithmic Learning Theory: Latest Publications

On Computable Online Learning
Pub Date : 2023-02-08 DOI: 10.48550/arXiv.2302.04357
Niki Hasrati, S. Ben-David
We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.
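Since the abstract builds on Littlestone's Standard Optimal Algorithm (SOA), a minimal illustrative sketch may help. It assumes a finite hypothesis class over a finite domain and computes the Littlestone dimension by brute-force recursion; the toy class and online sequence are made up for illustration and are not the paper's construction.

```python
from itertools import product

def ldim(H, domain):
    # Littlestone dimension of a finite class H (a set of label tuples indexed by the domain).
    # Convention: Ldim(empty class) = -1, Ldim(singleton) = 0.
    if len(H) <= 1:
        return len(H) - 1
    best = 0
    for x in domain:
        H0 = frozenset(h for h in H if h[x] == 0)
        H1 = frozenset(h for h in H if h[x] == 1)
        if not H0 or not H1:
            continue                      # splitting on x gains nothing here
        best = max(best, 1 + min(ldim(H0, domain), ldim(H1, domain)))
    return best

def soa_predict(V, domain, x):
    # SOA: predict the label whose consistent restriction keeps the larger Littlestone dimension.
    V0 = frozenset(h for h in V if h[x] == 0)
    V1 = frozenset(h for h in V if h[x] == 1)
    return 0 if ldim(V0, domain) >= ldim(V1, domain) else 1

# toy run: 3-point domain, class of "at most one positive label" hypotheses
domain = range(3)
H = frozenset(h for h in product([0, 1], repeat=3) if sum(h) <= 1)
V, mistakes = H, 0
for x, y in [(0, 1), (1, 0), (2, 0)]:     # an arbitrary online sequence of (point, true label)
    if soa_predict(V, domain, x) != y:
        mistakes += 1
    V = frozenset(h for h in V if h[x] == y)   # keep only hypotheses consistent with the feedback
print("mistakes:", mistakes, "Ldim of the class:", ldim(H, domain))
```

SOA's mistake count never exceeds the Littlestone dimension of the class, which is exactly the optimality notion the abstract's computable variants examine.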
Citations: 3
SQ Lower Bounds for Random Sparse Planted Vector Problem
Pub Date : 2023-01-26 DOI: 10.48550/arXiv.2301.11124
Jingqiu Ding, Yiding Hua
Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $R^n$. A classical question is how to recover this planted vector given a random basis of this subspace. A recent result by [ZSWB21] showed that the lattice basis reduction algorithm can recover the planted vector when $n \geq d+1$. Although the algorithm is not expected to tolerate an inverse-polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low-degree polynomials when $n \ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive a Statistical Query (SQ) lower bound matching the previous low-degree lower bound in [MW21]. This would (i) imply that the SQ lower bound can be surpassed by lattice-based algorithms, and (ii) predict the computational hardness when the planted vector is perturbed by an inverse-polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that a super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n \ll \rho^2 d^{2}$ and $\rho \gg \frac{1}{\sqrt{d}}$. The most notable technique we use to derive the SQ lower bound is the almost-equivalence between SQ lower bounds and low-degree lower bounds [BBH+20, MW21].
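For concreteness, here is a small sketch of how a planted instance of this form can be generated. The scaling of the Rademacher entries and the particular way the basis is randomized are simplifying assumptions of this sketch, not the paper's exact model.

```python
import numpy as np

def planted_instance(n, d, rho, seed=0):
    # Plant a rho-sparse Rademacher vector in a random d-dimensional subspace of R^n and
    # return a randomly rotated orthonormal basis of that subspace (the observation),
    # together with the hidden planted direction.
    rng = np.random.default_rng(seed)
    support = rng.random(n) < rho
    v = np.where(support, rng.choice([-1.0, 1.0], size=n), 0.0)   # sparse Rademacher direction
    G = rng.standard_normal((n, d - 1))                           # d - 1 Gaussian directions
    Q, _ = np.linalg.qr(np.column_stack([v, G]))                  # orthonormal basis of span(v, G)
    rot, _ = np.linalg.qr(rng.standard_normal((d, d)))            # random rotation hides the planting
    return Q @ rot, v

basis, v = planted_instance(n=200, d=10, rho=0.05)
print(basis.shape, int(np.count_nonzero(v)), "nonzero coordinates in the planted vector")
```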
Citations: 2
Complexity Analysis of a Countable-armed Bandit Problem
Pub Date : 2023-01-18 DOI: 10.48550/arXiv.2301.07243
Anand Kalvit, A. Zeevi
We consider a stochastic multi-armed bandit (MAB) problem motivated by "large" action spaces, endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of the reward distributions as well as to the population-level distribution of the different arm-types, and is also precluded from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and the complexity of the problem suggest a great degree of similarity to the classical MAB problem, the properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.
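A minimal simulation of this setting (not the paper's rate-optimal algorithm) is sketched below: the learner samples a few arms from the population, never observes their types, and runs plain UCB1 on the sampled arms. The type means, population proportions, and number of sampled arms are made-up values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
type_means = [0.3, 0.7]        # K = 2 latent arm-types, each with a distinct mean reward
type_probs = [0.5, 0.5]        # population-level distribution over types (unknown to the learner)

def draw_arm():
    # Querying the arm population returns a fresh arm; its type is never revealed to the learner.
    return type_means[rng.choice(len(type_means), p=type_probs)]

# A naive baseline: sample a handful of arms, then run UCB1 restricted to them.
arms = [draw_arm() for _ in range(4)]
pulls, totals, horizon = np.zeros(4), np.zeros(4), 2000
for t in range(1, horizon + 1):
    ucb = np.where(pulls > 0,
                   totals / np.maximum(pulls, 1) + np.sqrt(2 * np.log(t) / np.maximum(pulls, 1)),
                   np.inf)
    a = int(np.argmax(ucb))
    r = float(rng.random() < arms[a])      # Bernoulli reward with the arm's hidden mean
    pulls[a] += 1
    totals[a] += r
print("realized regret of the baseline:", round(horizon * max(type_means) - totals.sum(), 1))
```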
Citations: 0
Adversarial Online Multi-Task Reinforcement Learning
Pub Date : 2023-01-11 DOI: 10.48550/arXiv.2301.04268
Quan Nguyen, Nishant A. Mehta
We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set $\mathcal{M}$ of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called the 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains a $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and a $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $\frac{1}{\lambda^2}$ is tight.
Citations: 0
Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization
Pub Date : 2022-12-27 DOI: 10.48550/arXiv.2212.13556
Mahdi Haghifam, Borja Rodr'iguez-G'alvez, R. Thobaben, M. Skoglund, Daniel M. Roy, G. Dziugaite
To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
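A sketch of the noisy "surrogate" construction mentioned above, instantiated for a least-squares objective as an illustrative assumption: run gradient descent as usual, and add Gaussian noise only to the final iterate.

```python
import numpy as np

def noisy_gd_surrogate(X, y, eta=0.1, T=200, sigma=0.01, seed=0):
    # Gradient descent on an empirical least-squares objective, with the final iterate
    # perturbed by Gaussian noise (the noisy "surrogate" construction the abstract refers to).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grad = X.T @ (X @ w - y) / n           # gradient of (1/2n) * ||Xw - y||^2
        w = w - eta * grad
    return w + sigma * rng.standard_normal(d)  # Gaussian perturbation of the output only

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))
w_star = rng.standard_normal(5)
y = X @ w_star + 0.1 * rng.standard_normal(50)
print("distance to the planted parameter:", np.linalg.norm(noisy_gd_surrogate(X, y) - w_star))
```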
Citations: 8
Variance-Reduced Conservative Policy Iteration
Pub Date : 2022-12-12 DOI: 10.48550/arXiv.2212.06283
Naman Agarwal, Brian Bullins, Karan Singh
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and are thus unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
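The classical Conservative Policy Iteration update that the paper's variant builds on is the mixture step sketched below. The variance-reduced algorithm changes how the greedy/ERM subproblem is estimated from samples, which this sketch does not attempt to reproduce; the toy Q-values are made up.

```python
import numpy as np

def cpi_step(pi, Q, alpha):
    # One Conservative Policy Iteration update: mix the current policy with the policy
    # that is greedy for the current Q estimate, pi <- (1 - alpha) * pi + alpha * greedy(Q).
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), Q.argmax(axis=1)] = 1.0
    return (1 - alpha) * pi + alpha * greedy

# toy usage: 3 states, 2 actions, uniform initial policy, and some Q estimate
pi = np.full((3, 2), 0.5)
Q = np.array([[1.0, 0.2], [0.1, 0.4], [0.3, 0.3]])
print(cpi_step(pi, Q, alpha=0.25))
```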
Citations: 2
Linear Reinforcement Learning with Ball Structure Action Space
Pub Date : 2022-11-14 DOI: 10.48550/arXiv.2211.07419
Zeyu Jia, Randy Jia, Dhruv Madeka, Dean Phillips Foster
We study the problem of Reinforcement Learning (RL) with linear function approximation, i.e., assuming the optimal action-value function is linear in a known $d$-dimensional feature mapping. Unfortunately, based on this assumption alone, the worst-case sample complexity has been shown to be exponential, even under a generative model. Instead of making further assumptions on the MDP or value functions, we assume that our action space is such that there always exist playable actions to explore any direction of the feature space. We formalize this assumption as a "ball structure" action space, and show that being able to freely explore the feature space allows for efficient RL. In particular, we propose a sample-efficient RL algorithm (BallRL) that learns an $\epsilon$-optimal policy using only $\tilde{O}\left(\frac{H^5d^3}{\epsilon^3}\right)$ trajectories.
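A toy illustration of why a ball-structured action set makes exploration easy: if, as a simplifying assumption of this sketch, the feature of an action is the action vector itself, then for any parameter estimate the greedy direction (and indeed any direction of feature space) is directly playable. This is not the BallRL algorithm, only the geometric intuition behind the assumption.

```python
import numpy as np

def greedy_ball_action(theta, radius=1.0):
    # With a ball-structure action set {a : ||a|| <= radius} and (as an assumption here)
    # features phi(s, a) = a, the linear value <phi(s, a), theta> is maximized by the
    # scaled direction of theta, so every direction of feature space is playable.
    norm = np.linalg.norm(theta)
    return np.zeros_like(theta) if norm == 0.0 else radius * theta / norm

theta_hat = np.array([0.3, -1.2, 0.5])          # a hypothetical estimate of the linear parameter
a = greedy_ball_action(theta_hat)
print(a, float(a @ theta_hat))                  # value of the greedy action = radius * ||theta_hat||
```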
Citations: 1
Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization
Pub Date : 2022-10-21 DOI: 10.48550/arXiv.2210.12057
Gergely Neu, Nneka Okolo
We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines at runtime.
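The compactly represented output policy mentioned above can be sketched as follows; the feature map and parameter values are hypothetical, and the primal-dual planning algorithm itself is not reproduced here.

```python
import numpy as np

def softmax_policy(theta, phi):
    # A softmax policy compactly represented by a d-dimensional parameter vector theta:
    # pi(a | s) is proportional to exp(<phi(s, a), theta>) for a feature map phi.
    def pi(s, actions):
        logits = np.array([phi(s, a) @ theta for a in actions])
        logits -= logits.max()                  # shift logits for numerical stability
        weights = np.exp(logits)
        return weights / weights.sum()
    return pi

# hypothetical 2-dimensional features over 3 discrete actions
phi = lambda s, a: np.array([float(a == s), 1.0 / (1 + a)])
pi = softmax_policy(theta=np.array([2.0, -0.5]), phi=phi)
print(pi(s=1, actions=[0, 1, 2]))
```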
Citations: 3
Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path
Pub Date : 2022-10-10 DOI: 10.48550/arXiv.2210.04946
Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, A. Lazaric
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(S A B_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSP with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning can be made possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).
Citations: 1
Fisher information lower bounds for sampling
Pub Date : 2022-10-05 DOI: 10.48550/arXiv.2210.02482
Sinho Chewi, P. Gerber, Holden Lee, Chen Lu
We prove two lower bounds for the complexity of non-log-concave sampling within the framework of Balasubramanian et al. (2022), who introduced the use of Fisher information (FI) bounds as a notion of approximate first-order stationarity in sampling. Our first lower bound shows that averaged LMC is optimal for the regime of large FI by reducing the problem of finding stationary points in non-convex optimization to sampling. Our second lower bound shows that in the regime of small FI, obtaining an FI of at most $\varepsilon^2$ from the target distribution requires $\text{poly}(1/\varepsilon)$ queries, which is surprising as it rules out the existence of high-accuracy algorithms (e.g., algorithms using Metropolis-Hastings filters) in this context.
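For reference, a sketch of Langevin Monte Carlo (LMC), the algorithm the first lower bound shows to be optimal in the large-FI regime. The uniform choice of an output iterate is this sketch's reading of "averaged" LMC, and the Gaussian target is purely illustrative.

```python
import numpy as np

def averaged_lmc(grad_V, x0, eta=1e-2, n_steps=5000, seed=0):
    # Unadjusted Langevin Monte Carlo for a target proportional to exp(-V):
    #   x_{k+1} = x_k - eta * grad_V(x_k) + sqrt(2 * eta) * xi_k,  with xi_k ~ N(0, I).
    # Returning an iterate chosen uniformly along the trajectory is one way to realize
    # an averaged output distribution (an assumption of this sketch).
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    iterates = []
    for _ in range(n_steps):
        x = x - eta * grad_V(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        iterates.append(x.copy())
    return iterates[rng.integers(n_steps)]

# standard Gaussian target: V(x) = ||x||^2 / 2, so grad_V(x) = x
print(averaged_lmc(lambda x: x, x0=np.zeros(2)))
```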
Citations: 7