On Computable Online Learning
Niki Hasrati, S. Ben-David
Pub Date: 2023-02-08 · DOI: 10.48550/arXiv.2302.04357
We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for "optimality" in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of "optimality" and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with a finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.
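For readers unfamiliar with the baseline the abstract generalizes, below is a minimal illustrative sketch (not from the paper) of Littlestone's Standard Optimal Algorithm for a finite hypothesis class over a finite domain; the dict-based class representation and all function names are our own, and the paper's subject is computable analogues of this generally non-effective procedure.

```python
def ldim(H, X):
    """Littlestone dimension of a finite class H (list of dicts x -> {0,1}) over domain X."""
    if not H:
        return -1                       # convention: empty class has dimension -1
    best = 0
    for x in X:
        H0 = [h for h in H if h[x] == 0]
        H1 = [h for h in H if h[x] == 1]
        # the mistake tree grows only if both label continuations stay realizable
        best = max(best, 1 + min(ldim(H0, X), ldim(H1, X)))
    return best

class SOA:
    """Standard Optimal Algorithm: predict the label whose version space keeps the larger Ldim."""
    def __init__(self, H, X):
        self.V, self.X = list(H), list(X)   # current version space and domain

    def predict(self, x):
        V0 = [h for h in self.V if h[x] == 0]
        V1 = [h for h in self.V if h[x] == 1]
        return 0 if ldim(V0, self.X) >= ldim(V1, self.X) else 1

    def update(self, x, y):
        self.V = [h for h in self.V if h[x] == y]   # realizable setting: never empties

# toy usage: thresholds h_t(x) = 1[x >= t] on X = [0, 1, 2]
# X = [0, 1, 2]
# H = [{x: int(x >= t) for x in X} for t in range(4)]
# learner = SOA(H, X)   # makes at most ldim(H, X) mistakes on any realizable sequence
```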
{"title":"On Computable Online Learning","authors":"Niki Hasrati, S. Ben-David","doi":"10.48550/arXiv.2302.04357","DOIUrl":"https://doi.org/10.48550/arXiv.2302.04357","url":null,"abstract":"We initiate a study of computable online (c-online) learning, which we analyze under varying requirements for\"optimality\"in terms of the mistake bound. Our main contribution is to give a necessary and sufficient condition for optimal c-online learning and show that the Littlestone dimension no longer characterizes the optimal mistake bound of c-online learning. Furthermore, we introduce anytime optimal (a-optimal) online learning, a more natural conceptualization of\"optimality\"and a generalization of Littlestone's Standard Optimal Algorithm. We show the existence of a computational separation between a-optimal and optimal online learning, proving that a-optimal online learning is computationally more difficult. Finally, we consider online learning with no requirements for optimality, and show, under a weaker notion of computability, that the finiteness of the Littlestone dimension no longer characterizes whether a class is c-online learnable with finite mistake bound. A potential avenue for strengthening this result is suggested by exploring the relationship between c-online and CPAC learning, where we show that c-online learning is as difficult as improper CPAC learning.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125877141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SQ Lower Bounds for Random Sparse Planted Vector Problem
Jingqiu Ding, Yiding Hua
Pub Date: 2023-01-26 · DOI: 10.48550/arXiv.2301.11124
Consider the setting where a $\rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $\mathbb{R}^n$. A classical question is how to recover this planted vector given a random basis of this subspace. A recent result by [ZSWB21] showed that the lattice basis reduction algorithm can recover the planted vector when $n \geq d+1$. Although the algorithm is not expected to tolerate an inverse-polynomial amount of noise, the result is surprising because it was previously shown that recovery cannot be achieved by low-degree polynomials when $n \ll \rho^2 d^{2}$ [MW21]. A natural question is whether we can derive a Statistical Query (SQ) lower bound matching the previous low-degree lower bound in [MW21]. This would (i) imply that the SQ lower bound can be surpassed by lattice-based algorithms, and (ii) predict the computational hardness when the planted vector is perturbed by an inverse-polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that a super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $n \ll \rho^2 d^{2}$ and $\rho \gg \frac{1}{\sqrt{d}}$. The most notable technique we use to derive the SQ lower bound is the near-equivalence between SQ lower bounds and low-degree lower bounds [BBH+20, MW21].
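To make the planted-vector model concrete, here is a small numpy sketch of the instance generation described above (not the recovery algorithm of [ZSWB21], and all names are ours): a $\rho$-sparse Rademacher vector is completed to a basis of a random $d$-dimensional subspace, and the learner only sees a randomly rotated basis of that subspace.

```python
import numpy as np

def planted_subspace_instance(n, d, rho, rng=None):
    """Hide a rho-sparse Rademacher vector v in a random d-dimensional subspace of R^n."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # rho-sparse Rademacher vector: coordinate is 0 w.p. 1 - rho, otherwise uniform in {-1, +1}
    mask = rng.random(n) < rho
    v = mask * rng.choice([-1.0, 1.0], size=n)
    # complete v to a basis of a random d-dimensional subspace containing it
    G = rng.standard_normal((n, d - 1))
    basis = np.column_stack([v, G])
    # hand out a randomly rotated basis, so v is no longer an explicit column
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return v, basis @ Q

# v, B = planted_subspace_instance(n=50, d=10, rho=0.2)
# recovery task: find (a scalar multiple of) v given only the columns of B
```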
{"title":"SQ Lower Bounds for Random Sparse Planted Vector Problem","authors":"Jingqiu Ding, Yiding Hua","doi":"10.48550/arXiv.2301.11124","DOIUrl":"https://doi.org/10.48550/arXiv.2301.11124","url":null,"abstract":"Consider the setting where a $rho$-sparse Rademacher vector is planted in a random $d$-dimensional subspace of $R^n$. A classical question is how to recover this planted vector given a random basis in this subspace. A recent result by [ZSWB21] showed that the Lattice basis reduction algorithm can recover the planted vector when $ngeq d+1$. Although the algorithm is not expected to tolerate inverse polynomial amount of noise, it is surprising because it was previously shown that recovery cannot be achieved by low degree polynomials when $nll rho^2 d^{2}$ [MW21]. A natural question is whether we can derive an Statistical Query (SQ) lower bound matching the previous low degree lower bound in [MW21]. This will - imply that the SQ lower bound can be surpassed by lattice based algorithms; - predict the computational hardness when the planted vector is perturbed by inverse polynomial amount of noise. In this paper, we prove such an SQ lower bound. In particular, we show that super-polynomial number of VSTAT queries is needed to solve the easier statistical testing problem when $nll rho^2 d^{2}$ and $rhogg frac{1}{sqrt{d}}$. The most notable technique we used to derive the SQ lower bound is the almost equivalence relationship between SQ lower bound and low degree lower bound [BBH+20, MW21].","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116735093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Complexity Analysis of a Countable-armed Bandit Problem
Anand Kalvit, A. Zeevi
Pub Date: 2023-01-18 · DOI: 10.48550/arXiv.2301.07243
We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of the reward distributions as well as the population-level distribution of arm-types, and is also precluded from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left( \log n \right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left( \sqrt{n} \right)$ when $K=2$. While the order of regret and the complexity of the problem suggest a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity and the analysis tools needed to study them.
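As a point of reference for the $\mathcal{O}(\log n)$ instance-dependent benchmark, the sketch below is the classical UCB1 strategy for the standard $K$-armed problem; it is not the paper's algorithm (the abstract stresses that the countable-armed setting calls for different design), and the function and parameter names are ours.

```python
import math, random

def ucb1(pull, K, horizon):
    """pull(k) returns a reward in [0, 1]; plays `horizon` rounds and returns the total reward."""
    counts, means, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, horizon + 1):
        if t <= K:
            arm = t - 1   # play each arm once before using confidence bounds
        else:
            arm = max(range(K), key=lambda k: means[k] + math.sqrt(2 * math.log(t) / counts[k]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # incremental mean update
        total += r
    return total

# toy usage: two arm types with mean rewards 0.5 and 0.6
# random.seed(0)
# print(ucb1(lambda k: float(random.random() < (0.5, 0.6)[k]), K=2, horizon=10_000))
```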
{"title":"Complexity Analysis of a Countable-armed Bandit Problem","authors":"Anand Kalvit, A. Zeevi","doi":"10.48550/arXiv.2301.07243","DOIUrl":"https://doi.org/10.48550/arXiv.2301.07243","url":null,"abstract":"We consider a stochastic multi-armed bandit (MAB) problem motivated by ``large'' action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $mathcal{O}left( log n right)$. We also show that the instance-independent (minimax) regret is $tilde{mathcal{O}}left( sqrt{n} right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128870221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adversarial Online Multi-Task Reinforcement Learning
Quan Nguyen, Nishant A. Mehta
Pub Date: 2023-01-11 · DOI: 10.48550/arXiv.2301.04268
We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $\mathcal{M}$ are well-separated under a notion of $\lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $\Omega(K\sqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $\Omega(\frac{K}{\lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called the 2-JAO MDP to prove the instance-specific lower bound. The lower bounds are complemented with a polynomial-time algorithm that obtains an $\tilde{O}(\frac{K}{\lambda^2})$ sample complexity guarantee for the clustering phase and an $\tilde{O}(\sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependence on $K$ and $\frac{1}{\lambda^2}$ is tight.
{"title":"Adversarial Online Multi-Task Reinforcement Learning","authors":"Quan Nguyen, Nishant A. Mehta","doi":"10.48550/arXiv.2301.04268","DOIUrl":"https://doi.org/10.48550/arXiv.2301.04268","url":null,"abstract":"We consider the adversarial online multi-task reinforcement learning setting, where in each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models. The learner's objective is to minimize its regret with respect to the optimal policy for each task. We assume the MDPs in $mathcal{M}$ are well-separated under a notion of $lambda$-separability, and show that this notion generalizes many task-separability notions from previous works. We prove a minimax lower bound of $Omega(Ksqrt{DSAH})$ on the regret of any learning algorithm and an instance-specific lower bound of $Omega(frac{K}{lambda^2})$ in sample complexity for a class of uniformly-good cluster-then-learn algorithms. We use a novel construction called 2-JAO MDP for proving the instance-specific lower bound. The lower bounds are complemented with a polynomial time algorithm that obtains $tilde{O}(frac{K}{lambda^2})$ sample complexity guarantee for the clustering phase and $tilde{O}(sqrt{MK})$ regret guarantee for the learning phase, indicating that the dependency on $K$ and $frac{1}{lambda^2}$ is tight.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116291437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization
Mahdi Haghifam, Borja Rodríguez-Gálvez, R. Thobaben, M. Skoglund, Daniel M. Roy, G. Dziugaite
Pub Date: 2022-12-27 · DOI: 10.48550/arXiv.2212.13556
To date, no "information-theoretic" frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy "surrogate" algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.
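For intuition, the toy numpy sketch below (not from the paper) illustrates the surrogate construction the abstract refers to: run plain gradient descent on a convex objective and perturb the final iterate with Gaussian noise, yielding the noisy output to which information-theoretic bounds would then be applied. The step size, noise level, and the least-squares example are arbitrary choices of ours.

```python
import numpy as np

def noisy_gd_surrogate(grad, w0, eta=0.1, steps=100, sigma=0.01, rng=None):
    """Plain gradient descent followed by a Gaussian perturbation of the final iterate."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w -= eta * grad(w)                               # deterministic gradient step
    return w + sigma * rng.standard_normal(w.shape)      # the noisy "surrogate" output

# example: least squares on a fixed sample (A, b)
# A = np.random.default_rng(1).standard_normal((20, 5))
# b = np.random.default_rng(2).standard_normal(20)
# w_hat = noisy_gd_surrogate(lambda w: A.T @ (A @ w - b) / len(b), w0=np.zeros(5))
```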
{"title":"Limitations of Information-Theoretic Generalization Bounds for Gradient Descent Methods in Stochastic Convex Optimization","authors":"Mahdi Haghifam, Borja Rodr'iguez-G'alvez, R. Thobaben, M. Skoglund, Daniel M. Roy, G. Dziugaite","doi":"10.48550/arXiv.2212.13556","DOIUrl":"https://doi.org/10.48550/arXiv.2212.13556","url":null,"abstract":"To date, no\"information-theoretic\"frameworks for reasoning about generalization error have been shown to establish minimax rates for gradient descent in the setting of stochastic convex optimization. In this work, we consider the prospect of establishing such rates via several existing information-theoretic frameworks: input-output mutual information bounds, conditional mutual information bounds and variants, PAC-Bayes bounds, and recent conditional variants thereof. We prove that none of these bounds are able to establish minimax rates. We then consider a common tactic employed in studying gradient methods, whereby the final iterate is corrupted by Gaussian noise, producing a noisy\"surrogate\"algorithm. We prove that minimax rates cannot be established via the analysis of such surrogates. Our results suggest that new ideas are required to analyze gradient descent using information-theoretic techniques.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133903912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variance-Reduced Conservative Policy Iteration
Naman Agarwal, Brian Bullins, Karan Singh
Pub Date: 2022-12-12 · DOI: 10.48550/arXiv.2212.06283
We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reduction-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing an $\varepsilon$-functional local optimum from $O(\varepsilon^{-4})$ to $O(\varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $\varepsilon$-global optimality after sampling $O(\varepsilon^{-2})$ times, improving upon the previously established $O(\varepsilon^{-3})$ sample requirement.
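For context, the conservative update at the core of Conservative Policy Iteration (the classical building block, not the paper's variance-reduced estimator) mixes an approximately greedy policy $\pi'_k$ into the current one rather than replacing it: $\pi_{k+1} = (1-\alpha_k)\,\pi_k + \alpha_k\,\pi'_k$ with $\alpha_k \in (0,1]$, so successive policies stay close as functions; this is the sense in which the local convergence above is in the function space rather than the parameter space.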
{"title":"Variance-Reduced Conservative Policy Iteration","authors":"Naman Agarwal, Brian Bullins, Karan Singh","doi":"10.48550/arXiv.2212.06283","DOIUrl":"https://doi.org/10.48550/arXiv.2212.06283","url":null,"abstract":"We study the sample complexity of reducing reinforcement learning to a sequence of empirical risk minimization problems over the policy space. Such reductions-based algorithms exhibit local convergence in the function space, as opposed to the parameter space for policy gradient algorithms, and thus are unaffected by the possibly non-linear or discontinuous parameterization of the policy class. We propose a variance-reduced variant of Conservative Policy Iteration that improves the sample complexity of producing a $varepsilon$-functional local optimum from $O(varepsilon^{-4})$ to $O(varepsilon^{-3})$. Under state-coverage and policy-completeness assumptions, the algorithm enjoys $varepsilon$-global optimality after sampling $O(varepsilon^{-2})$ times, improving upon the previously established $O(varepsilon^{-3})$ sample requirement.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"354 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122798224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Linear Reinforcement Learning with Ball Structure Action Space
Zeyu Jia, Randy Jia, Dhruv Madeka, Dean Phillips Foster
Pub Date: 2022-11-14 · DOI: 10.48550/arXiv.2211.07419
We study the problem of Reinforcement Learning (RL) with linear function approximation, i.e., assuming the optimal action-value function is linear in a known $d$-dimensional feature mapping. Unfortunately, based on this assumption alone, the worst-case sample complexity has been shown to be exponential, even under a generative model. Instead of making further assumptions on the MDP or value functions, we assume that our action space is such that there always exist playable actions to explore any direction of the feature space. We formalize this assumption as a ``ball structure'' action space, and show that being able to freely explore the feature space allows for efficient RL. In particular, we propose a sample-efficient RL algorithm (BallRL) that learns an $\epsilon$-optimal policy using only $\tilde{O}\left(\frac{H^5 d^3}{\epsilon^3}\right)$ trajectories.
{"title":"Linear Reinforcement Learning with Ball Structure Action Space","authors":"Zeyu Jia, Randy Jia, Dhruv Madeka, Dean Phillips Foster","doi":"10.48550/arXiv.2211.07419","DOIUrl":"https://doi.org/10.48550/arXiv.2211.07419","url":null,"abstract":"We study the problem of Reinforcement Learning (RL) with linear function approximation, i.e. assuming the optimal action-value function is linear in a known $d$-dimensional feature mapping. Unfortunately, however, based on only this assumption, the worst case sample complexity has been shown to be exponential, even under a generative model. Instead of making further assumptions on the MDP or value functions, we assume that our action space is such that there always exist playable actions to explore any direction of the feature space. We formalize this assumption as a ``ball structure'' action space, and show that being able to freely explore the feature space allows for efficient RL. In particular, we propose a sample-efficient RL algorithm (BallRL) that learns an $epsilon$-optimal policy using only $tilde{O}left(frac{H^5d^3}{epsilon^3}right)$ number of trajectories.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125941211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization
Gergely Neu, Nneka Okolo
Pub Date: 2022-10-21 · DOI: 10.48550/arXiv.2210.12057
We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines at runtime.
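A small sketch (not the paper's planning algorithm) of the kind of output object described above: a softmax policy fully specified by a low-dimensional parameter vector $\theta$ acting on state-action features $\varphi(s,a)$. The feature map, dimensions, and names below are placeholders of ours.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """pi(a | s) proportional to exp(theta . phi(s, a)); only theta needs to be stored."""
    scores = np.array([theta @ phi(state, a) for a in actions])
    scores -= scores.max()            # numerical stabilization
    p = np.exp(scores)
    return p / p.sum()

# toy usage with a fixed random d-dimensional feature map
# d, actions = 8, [0, 1, 2]
# rng = np.random.default_rng(0)
# features = {(s, a): rng.standard_normal(d) for s in range(4) for a in actions}
# phi = lambda s, a: features[(s, a)]
# print(softmax_policy(np.zeros(d), phi, state=0, actions=actions))   # uniform when theta = 0
```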
{"title":"Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization","authors":"Gergely Neu, Nneka Okolo","doi":"10.48550/arXiv.2210.12057","DOIUrl":"https://doi.org/10.48550/arXiv.2210.12057","url":null,"abstract":"We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines in runtime.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116721604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path
Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, A. Lazaric
Pub Date: 2022-10-10 · DOI: 10.48550/arXiv.2210.04946
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(S A B_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSPs with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning can be made possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).
{"title":"Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path","authors":"Liyu Chen, Andrea Tirinzoni, Matteo Pirotta, A. Lazaric","doi":"10.48550/arXiv.2210.04946","DOIUrl":"https://doi.org/10.48550/arXiv.2210.04946","url":null,"abstract":"We study the sample complexity of learning an $epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{min}$, and maximum expected cost of the optimal policy over all states $B_{star}$, where any algorithm requires at least $Omega(SAB_{star}^3/(c_{min}epsilon^2))$ samples to return an $epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $epsilon$-optimal polices in SSP with generative models. We also initiate the study of learning $epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning can be made possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130666906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fisher information lower bounds for sampling
Sinho Chewi, P. Gerber, Holden Lee, Chen Lu
Pub Date: 2022-10-05 · DOI: 10.48550/arXiv.2210.02482
We prove two lower bounds for the complexity of non-log-concave sampling within the framework of Balasubramanian et al. (2022), who introduced the use of Fisher information (FI) bounds as a notion of approximate first-order stationarity in sampling. Our first lower bound shows that averaged LMC is optimal for the regime of large FI by reducing the problem of finding stationary points in non-convex optimization to sampling. Our second lower bound shows that, in the regime of small FI, obtaining an FI of at most $\varepsilon^2$ from the target distribution requires $\text{poly}(1/\varepsilon)$ queries, which is surprising as it rules out the existence of high-accuracy algorithms (e.g., algorithms using Metropolis-Hastings filters) in this context.
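For readers unfamiliar with the algorithm named above, here is a minimal sketch (not from the paper) of the Langevin Monte Carlo iteration for a target density proportional to $\exp(-V)$. "Averaged" is rendered as returning a uniformly chosen iterate from the trajectory, one common convention; the paper's exact averaging may differ, and the step size and iteration count below are placeholders.

```python
import numpy as np

def averaged_lmc(grad_V, x0, eta=1e-2, steps=10_000, rng=None):
    """Unadjusted Langevin chain; returns one iterate chosen uniformly from the trajectory."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    iterates = []
    for _ in range(steps):
        # Langevin step: gradient descent on V plus injected Gaussian noise
        x = x - eta * grad_V(x) + np.sqrt(2 * eta) * rng.standard_normal(x.shape)
        iterates.append(x.copy())
    return iterates[rng.integers(len(iterates))]

# example target: standard Gaussian, V(x) = ||x||^2 / 2, so grad_V(x) = x
# sample = averaged_lmc(lambda x: x, x0=np.zeros(2))
```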
{"title":"Fisher information lower bounds for sampling","authors":"Sinho Chewi, P. Gerber, Holden Lee, Chen Lu","doi":"10.48550/arXiv.2210.02482","DOIUrl":"https://doi.org/10.48550/arXiv.2210.02482","url":null,"abstract":"We prove two lower bounds for the complexity of non-log-concave sampling within the framework of Balasubramanian et al. (2022), who introduced the use of Fisher information (FI) bounds as a notion of approximate first-order stationarity in sampling. Our first lower bound shows that averaged LMC is optimal for the regime of large FI by reducing the problem of finding stationary points in non-convex optimization to sampling. Our second lower bound shows that in the regime of small FI, obtaining a FI of at most $varepsilon^2$ from the target distribution requires $text{poly}(1/varepsilon)$ queries, which is surprising as it rules out the existence of high-accuracy algorithms (e.g., algorithms using Metropolis-Hastings filters) in this context.","PeriodicalId":267197,"journal":{"name":"International Conference on Algorithmic Learning Theory","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133028227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}