Authors: Bang Giang Le; Viet Cuong Ta
Journal: IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 2, pp. 1425-1438
Published: 2024-12-04
DOI: 10.1109/TSMC.2024.3505613
URL: https://ieeexplore.ieee.org/document/10777063/
On the Effectiveness of Regularization Methods for Soft Actor-Critic in Discrete-Action Domains
Soft actor-critic (SAC) is a reinforcement learning algorithm that employs the maximum entropy framework to train a stochastic policy. This work examines a specific failure case of SAC in which the stochastic policy is trained to maximize the expected entropy in a sparse-reward environment. We demonstrate that the over-exploration of SAC can make the entropy temperature collapse, followed by unstable updates to the actor. Based on our analyses, we introduce Reg-SAC, an improved version of SAC, to mitigate the detrimental effects of the entropy temperature on the learning stability of the stochastic policy. Reg-SAC incorporates a clipping value to prevent the entropy temperature collapse and regularizes the gradient updates of the policy via Kullback-Leibler divergence. In experiments on discrete benchmarks, our proposed Reg-SAC outperforms the standard SAC in sparse-reward grid world environments while maintaining competitive performance on the dense-reward Atari benchmark. The results highlight that our regularized version makes the stochastic policy of SAC more stable in discrete-action domains.
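The two ingredients named in the abstract, a floor on the entropy temperature and a Kullback-Leibler penalty on the policy update, can be illustrated with a minimal NumPy sketch. This is one possible reading of the abstract, not the authors' implementation: the function names, the floor value `alpha_min`, the weight `kl_coef`, and the direction of the KL term (previous policy against current policy) are all assumptions made for illustration.

```python
import numpy as np

EPS = 1e-8  # numerical guard inside logarithms


def softmax(logits):
    """Row-wise softmax over the action dimension."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def clipped_temperature(log_alpha, alpha_min):
    """Temperature alpha = exp(log_alpha), floored at alpha_min (assumed form of the clipping)."""
    return max(float(np.exp(log_alpha)), alpha_min)


def kl_divergence(p, q):
    """KL(p || q) for categorical distributions stored row-wise."""
    return np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=-1)


def regularized_actor_loss(logits, old_logits, q_values, log_alpha,
                           alpha_min=0.01, kl_coef=1.0):
    """Discrete-action SAC actor loss plus a KL penalty toward the previous policy.

    alpha_min and kl_coef are hypothetical hyperparameters, not values from the paper.
    """
    pi = softmax(logits)
    pi_old = softmax(old_logits)
    alpha = clipped_temperature(log_alpha, alpha_min)
    # Standard discrete SAC actor objective: E_a~pi[alpha * log pi(a|s) - Q(s, a)]
    sac_term = np.sum(pi * (alpha * np.log(pi + EPS) - q_values), axis=-1)
    # Assumed regularizer: penalize drift of the new policy away from the old one
    kl_term = kl_divergence(pi_old, pi)
    return float(np.mean(sac_term + kl_coef * kl_term))
```

With the floor in place, a very negative `log_alpha` no longer drives the entropy weight to zero, and the KL term adds a cost whenever the updated action distribution moves away from the previous one.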
About the journal:
The IEEE Transactions on Systems, Man, and Cybernetics: Systems encompasses the fields of systems engineering, covering issue formulation, analysis, and modeling throughout the systems engineering lifecycle phases. It addresses decision-making, issue interpretation, systems management, processes, and various methods such as optimization, modeling, and simulation in the development and deployment of large systems.