{"title":"通过自适应优先体验重放改进探索与开发之间的权衡","authors":"Hossein Hassani, Soodeh Nikan, Abdallah Shami","doi":"10.1016/j.neucom.2024.128836","DOIUrl":null,"url":null,"abstract":"<div><div>Experience replay is an indispensable part of deep reinforcement learning algorithms that enables the agent to revisit and reuse its past and recent experiences to update the network parameters. In many baseline off-policy algorithms, such as deep Q-networks (DQN), transitions in the replay buffer are typically sampled uniformly. This uniform sampling is not optimal for accelerating the agent’s training towards learning the optimal policy. A more selective and prioritized approach to experience sampling can yield improved learning efficiency and performance. In this regard, this work is devoted to the design of a novel prioritizing strategy to adaptively adjust the sampling probabilities of stored transitions in the replay buffer. Unlike existing sampling methods, the proposed algorithm takes into consideration the exploration–exploitation trade-off (EET) to rank transitions, which is of utmost importance in learning an optimal policy. Specifically, this approach utilizes temporal difference and Bellman errors as criteria for sampling priorities. To maintain balance in EET throughout training, the weights associated with both criteria are dynamically adjusted when constructing the sampling priorities. Additionally, any bias introduced by this sample prioritization is mitigated through assigning importance-sampling weight to each transition in the buffer. The efficacy of this prioritization scheme is assessed through training the DQN algorithm across various OpenAI Gym environments. The results obtained underscore the significance and superiority of our proposed algorithm over state-of-the-art methods. This is evidenced by its accelerated learning pace, greater cumulative reward, and higher success rate.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128836"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Improved exploration–exploitation trade-off through adaptive prioritized experience replay\",\"authors\":\"Hossein Hassani, Soodeh Nikan, Abdallah Shami\",\"doi\":\"10.1016/j.neucom.2024.128836\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Experience replay is an indispensable part of deep reinforcement learning algorithms that enables the agent to revisit and reuse its past and recent experiences to update the network parameters. In many baseline off-policy algorithms, such as deep Q-networks (DQN), transitions in the replay buffer are typically sampled uniformly. This uniform sampling is not optimal for accelerating the agent’s training towards learning the optimal policy. A more selective and prioritized approach to experience sampling can yield improved learning efficiency and performance. In this regard, this work is devoted to the design of a novel prioritizing strategy to adaptively adjust the sampling probabilities of stored transitions in the replay buffer. Unlike existing sampling methods, the proposed algorithm takes into consideration the exploration–exploitation trade-off (EET) to rank transitions, which is of utmost importance in learning an optimal policy. Specifically, this approach utilizes temporal difference and Bellman errors as criteria for sampling priorities. 
To maintain balance in EET throughout training, the weights associated with both criteria are dynamically adjusted when constructing the sampling priorities. Additionally, any bias introduced by this sample prioritization is mitigated through assigning importance-sampling weight to each transition in the buffer. The efficacy of this prioritization scheme is assessed through training the DQN algorithm across various OpenAI Gym environments. The results obtained underscore the significance and superiority of our proposed algorithm over state-of-the-art methods. This is evidenced by its accelerated learning pace, greater cumulative reward, and higher success rate.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"614 \",\"pages\":\"Article 128836\"},\"PeriodicalIF\":5.5000,\"publicationDate\":\"2024-11-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231224016072\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224016072","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Improved exploration–exploitation trade-off through adaptive prioritized experience replay
Experience replay is an indispensable part of deep reinforcement learning algorithms that enables the agent to revisit and reuse its past and recent experiences to update the network parameters. In many baseline off-policy algorithms, such as deep Q-networks (DQN), transitions in the replay buffer are typically sampled uniformly. This uniform sampling is not optimal for accelerating the agent’s training towards learning the optimal policy. A more selective and prioritized approach to experience sampling can yield improved learning efficiency and performance. In this regard, this work is devoted to the design of a novel prioritizing strategy to adaptively adjust the sampling probabilities of stored transitions in the replay buffer. Unlike existing sampling methods, the proposed algorithm takes into consideration the exploration–exploitation trade-off (EET) to rank transitions, which is of utmost importance in learning an optimal policy. Specifically, this approach utilizes temporal difference and Bellman errors as criteria for sampling priorities. To maintain balance in EET throughout training, the weights associated with both criteria are dynamically adjusted when constructing the sampling priorities. Additionally, any bias introduced by this sample prioritization is mitigated through assigning importance-sampling weight to each transition in the buffer. The efficacy of this prioritization scheme is assessed through training the DQN algorithm across various OpenAI Gym environments. The results obtained underscore the significance and superiority of our proposed algorithm over state-of-the-art methods. This is evidenced by its accelerated learning pace, greater cumulative reward, and higher success rate.
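For illustration, the sketch below shows one way such an adaptive prioritization could be wired into a standard replay buffer: each transition's priority is a weighted mix of its TD error and its Bellman error, the mixing weight can be rescheduled as training progresses, and importance-sampling weights offset the bias introduced by non-uniform sampling. The class name, the linear mixing rule, and the alpha/beta exponents follow the familiar prioritized-experience-replay recipe and are assumptions made for this sketch; the paper's exact prioritization and weight-adaptation formulas may differ.

```python
import numpy as np

class AdaptivePrioritizedReplay:
    """Minimal sketch (not the paper's exact method): transitions are ranked by a
    weighted combination of two error criteria, and importance-sampling weights
    correct the bias of prioritized sampling."""

    def __init__(self, capacity, alpha=0.6, beta=0.4, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities shape the sampling distribution
        self.beta = beta        # strength of the importance-sampling correction
        self.eps = eps          # keeps every priority strictly positive
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error, bellman_error, w_td):
        # w_td in [0, 1] is the (adaptively scheduled) weight on the TD criterion;
        # the remaining (1 - w_td) goes to the Bellman-error criterion.
        p = w_td * abs(td_error) + (1.0 - w_td) * abs(bellman_error) + self.eps
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(p)

    def sample(self, batch_size):
        pri = np.asarray(self.priorities) ** self.alpha
        probs = pri / pri.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights down-weight frequently drawn transitions.
        is_w = (len(self.buffer) * probs[idx]) ** (-self.beta)
        is_w /= is_w.max()
        return [self.buffer[i] for i in idx], idx, is_w
```

In a DQN-style training loop, the returned importance-sampling weights would typically multiply each sampled transition's loss term before the gradient step, so that high-priority transitions drawn more often do not dominate the parameter update.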
Journal description:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics are neurocomputing theory, practice, and applications.