{"title":"利用行动排序时差学习实现数据高效深度强化学习","authors":"Qi Liu;Yanjie Li;Yuecheng Liu;Ke Lin;Jianqi Gao;Yunjiang Lou","doi":"10.1109/TETCI.2024.3369641","DOIUrl":null,"url":null,"abstract":"In value-based deep reinforcement learning (RL), value function approximation errors lead to suboptimal policies. Temporal difference (TD) learning is one of the most important methodologies to approximate state-action (\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n) value function. In TD learning, it is critical to estimate \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n values of greedy actions more accurately because a more accurate target \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n value enhances the estimation accuracy of \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n value. To improve the estimation accuracy of \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n value, we propose an action-ranked TD learning method to enhance the performance of deep RL by weighting each TD error according to the rank of its corresponding state-action pair's value among all the \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n values on a state. The proposed method can provide more accurate target values for TD learning, making the estimation of the \n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\n value more accurate. We apply the proposed method to a representative value-based deep RL algorithm, and results show that the proposed method outperforms baselines on 31 out of 40 Atari games. Furthermore, we extend the proposed method to multi-agent deep RL. To adaptively determine the hyperparameter in action-ranked TD learning, we propose a meta action-ranked TD learning. A series of experiments quantitatively verify that our methods outperform baselines on Atari games, StarCraft-II, and Grid World environments.","PeriodicalId":13135,"journal":{"name":"IEEE Transactions on Emerging Topics in Computational Intelligence","volume":"8 4","pages":"2949-2961"},"PeriodicalIF":5.3000,"publicationDate":"2024-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Data Efficient Deep Reinforcement Learning With Action-Ranked Temporal Difference Learning\",\"authors\":\"Qi Liu;Yanjie Li;Yuecheng Liu;Ke Lin;Jianqi Gao;Yunjiang Lou\",\"doi\":\"10.1109/TETCI.2024.3369641\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In value-based deep reinforcement learning (RL), value function approximation errors lead to suboptimal policies. Temporal difference (TD) learning is one of the most important methodologies to approximate state-action (\\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n) value function. In TD learning, it is critical to estimate \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n values of greedy actions more accurately because a more accurate target \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n value enhances the estimation accuracy of \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n value. To improve the estimation accuracy of \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n value, we propose an action-ranked TD learning method to enhance the performance of deep RL by weighting each TD error according to the rank of its corresponding state-action pair's value among all the \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n values on a state. 
The proposed method can provide more accurate target values for TD learning, making the estimation of the \\n<inline-formula><tex-math>$Q$</tex-math></inline-formula>\\n value more accurate. We apply the proposed method to a representative value-based deep RL algorithm, and results show that the proposed method outperforms baselines on 31 out of 40 Atari games. Furthermore, we extend the proposed method to multi-agent deep RL. To adaptively determine the hyperparameter in action-ranked TD learning, we propose a meta action-ranked TD learning. A series of experiments quantitatively verify that our methods outperform baselines on Atari games, StarCraft-II, and Grid World environments.\",\"PeriodicalId\":13135,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"volume\":\"8 4\",\"pages\":\"2949-2961\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2024-03-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computational Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10466624/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computational Intelligence","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10466624/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
In value-based deep reinforcement learning (RL), value function approximation errors lead to suboptimal policies. Temporal difference (TD) learning is one of the most important methodologies for approximating the state-action ($Q$) value function. In TD learning, it is critical to estimate the $Q$ values of greedy actions more accurately, because a more accurate target $Q$ value improves the accuracy of the overall $Q$-value estimate. To this end, we propose an action-ranked TD learning method that enhances the performance of deep RL by weighting each TD error according to the rank of its corresponding state-action pair's value among all the $Q$ values of that state. The proposed method provides more accurate target values for TD learning, and thereby more accurate $Q$-value estimates. We apply the method to a representative value-based deep RL algorithm, and results show that it outperforms the baselines on 31 out of 40 Atari games. Furthermore, we extend the method to multi-agent deep RL. To adaptively determine the hyperparameter in action-ranked TD learning, we propose a meta action-ranked TD learning method. A series of experiments quantitatively verifies that our methods outperform the baselines on Atari games, StarCraft II, and Grid World environments.
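The abstract does not give the exact weighting formula, so the following is only a minimal sketch of the general idea in PyTorch: each sampled transition's TD error is scaled by a weight derived from the rank of $Q(s, a)$ among all $Q(s, \cdot)$ for that state, so that TD errors of higher-ranked (closer-to-greedy) actions contribute more to the loss. The linear rank-to-weight mapping and the exponent hyperparameter `alpha` are assumptions made here for illustration, not the paper's formulation.

```python
# Illustrative sketch of rank-weighted TD errors in a DQN-style setting.
# NOT the authors' exact method: the rank-to-weight mapping and `alpha` are assumed.
import torch

def action_rank_weights(q_values: torch.Tensor, actions: torch.Tensor,
                        alpha: float = 1.0) -> torch.Tensor:
    """Weight each transition by the rank of Q(s, a) among all Q(s, .) for that state.

    q_values: (batch, num_actions) current Q estimates for the sampled states.
    actions:  (batch,) indices of the actions actually taken.
    Returns a (batch,) tensor of weights in (0, 1], larger for higher-ranked actions.
    """
    # Double argsort yields ranks: 0 = smallest Q, num_actions - 1 = greedy action.
    ranks = q_values.argsort(dim=1).argsort(dim=1)
    taken_rank = ranks.gather(1, actions.unsqueeze(1)).squeeze(1).float()
    num_actions = q_values.shape[1]
    # Assumed mapping: linear in rank, sharpened by alpha (alpha = 0 gives uniform weights).
    weights = ((taken_rank + 1.0) / num_actions) ** alpha
    return weights.detach()

def rank_weighted_td_loss(q_net, target_net, batch, gamma: float = 0.99,
                          alpha: float = 1.0) -> torch.Tensor:
    """DQN-style TD loss with per-sample rank-based weights."""
    states, actions, rewards, next_states, dones = batch
    q_all = q_net(states)                                        # (batch, num_actions)
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a) for taken actions
    with torch.no_grad():
        # Standard one-step bootstrapped target from the target network.
        target = rewards + gamma * (1.0 - dones) * target_net(next_states).max(dim=1).values
        weights = action_rank_weights(q_all, actions, alpha)
    td_error = q_taken - target
    return (weights * td_error.pow(2)).mean()
```

In this sketch, the greedy action's TD error keeps full weight while lower-ranked actions are down-weighted; setting alpha to 0 recovers the standard uniformly weighted TD loss.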
Journal Introduction:
The IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI) publishes original articles on emerging aspects of computational intelligence, including theory, applications, and surveys.
TETCI is an electronic-only publication and publishes six issues per year.
Authors are encouraged to submit manuscripts in any emerging topic in computational intelligence, especially nature-inspired computing topics not covered by other IEEE Computational Intelligence Society journals. A few such illustrative examples are glial cell networks, computational neuroscience, Brain Computer Interface, ambient intelligence, non-fuzzy computing with words, artificial life, cultural learning, artificial endocrine networks, social reasoning, artificial hormone networks, computational intelligence for the IoT and Smart-X technologies.