Hierarchical approximate policy iteration with binary-tree state space decomposition.

IEEE Transactions on Neural Networks. Pub Date: 2011-12-01. Epub Date: 2011-10-10. DOI: 10.1109/TNN.2011.2168422
Xin Xu, Chunming Liu, Simon X Yang, Dewen Hu
Volume 22, Issue 12, Pages 1863-77. Publication type: Journal Article.
Citations: 37

Abstract


In recent years, approximate policy iteration (API) methods, such as least-squares policy iteration (LSPI) and its kernelized version, kernel-based LSPI (KLSPI), have attracted increasing attention in reinforcement learning (RL). However, it remains difficult for API algorithms to obtain near-optimal policies for Markov decision processes (MDPs) with large or continuous state spaces. To address this problem, this paper presents a hierarchical API (HAPI) method with binary-tree state space decomposition for RL in a class of absorbing MDPs that can be formulated as time-optimal learning control tasks. In the proposed method, samples are first collected adaptively in the state space of the original MDP, and a learning-based strategy for decomposing the sample set then implements the binary-tree state space decomposition. Next, API algorithms are applied to the sample subsets to approximate locally optimal policies of the sub-MDPs. Because the original MDP is decomposed into a binary-tree structure of absorbing sub-MDPs, constructed during the learning process, the local near-optimal policies are approximated by API algorithms with reduced complexity and higher precision. Furthermore, because the local policies are of higher quality, the combined global policy performs better than the near-optimal policy obtained by a single API algorithm in the original MDP. Three learning control problems, including path-tracking control of a real mobile robot, were studied to evaluate the performance of the HAPI method. With the same settings for basis function selection and sample collection, the proposed HAPI obtained better near-optimal policies than previous API methods such as LSPI and KLSPI.
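
The abstract names LSPI as a representative API algorithm but gives no implementation details. As a rough illustration only, the sketch below runs a tabular LSPI loop (LSTD-Q policy evaluation followed by greedy policy improvement) on a hypothetical six-state corridor with an absorbing goal and a cost of -1 per step, so that the best policy is the time-optimal one. The corridor, the one-hot features, the discount factor, the small ridge term, and every function name are assumptions made for this example; it does not reproduce the paper's HAPI decomposition or its experiments.

```python
N_STATES = 6          # assumed toy corridor: states 0..5, state 5 is the absorbing goal
ACTIONS = (-1, +1)    # move left / move right
GAMMA = 0.95          # assumed discount factor
DIM = N_STATES * len(ACTIONS)

def step(s, a):
    """One deterministic transition; every step costs -1 (time-optimal task)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return -1.0, s2, s2 == N_STATES - 1

def phi(s, a):
    """One-hot feature vector over (state, action) pairs."""
    v = [0.0] * DIM
    v[s * len(ACTIONS) + ACTIONS.index(a)] = 1.0
    return v

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def greedy(w, s):
    """Action with the highest approximate Q-value in state s."""
    return max(ACTIONS, key=lambda a: sum(wi * fi for wi, fi in zip(w, phi(s, a))))

def lstdq(samples, w):
    """Policy evaluation: fit Q-weights of the policy that is greedy in w."""
    A = [[1e-6 if i == j else 0.0 for j in range(DIM)] for i in range(DIM)]  # ridge term
    b = [0.0] * DIM
    for s, a, r, s2, done in samples:
        f = phi(s, a)
        # Absorbing next states contribute zero future value.
        f2 = [0.0] * DIM if done else phi(s2, greedy(w, s2))
        for i in range(DIM):
            if f[i]:
                b[i] += f[i] * r
                for j in range(DIM):
                    A[i][j] += f[i] * (f[j] - GAMMA * f2[j])
    return solve(A, b)

def lspi(samples, iters=20):
    """Alternate LSTD-Q evaluation and greedy improvement until the weights settle."""
    w = [0.0] * DIM
    for _ in range(iters):
        w_new = lstdq(samples, w)
        converged = max(abs(x - y) for x, y in zip(w, w_new)) < 1e-8
        w = w_new
        if converged:
            break
    return w

# One sample per (state, action) pair of the non-goal states.
samples = [(s, a, *step(s, a)) for s in range(N_STATES - 1) for a in ACTIONS]
w = lspi(samples)
print([greedy(w, s) for s in range(N_STATES - 1)])
```

On this toy problem the learned greedy policy should move right in every non-goal state. In HAPI as described in the abstract, a learner of this kind would be run on each sample subset of the binary tree rather than on the full sample set; how the subsets are chosen is not specified in the abstract and is not modeled here.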
