{"title":"Windows deep transformer Q-networks: an extended variance reduction architecture for partially observable reinforcement learning","authors":"Zijian Wang, Bin Wang, Hongbo Dou, Zhongyuan Liu","doi":"10.1007/s10489-024-05867-3","DOIUrl":null,"url":null,"abstract":"<p>Partial Observability Markov Desicion Process (POMDP) is always worth studying in reinforcement learning (RL) due to its universality in the real world. Compared with Markov Decision Processes (MDP), agents in POMDP cannot fully receive information from the environment, which is an obstacle to traditional RL algorithms. One solution is to establishes a sequence-to-sequence model. As the core of deep Q-networks, Transformer has achieved certain outperformed results in dealing with partial observability problems. Nevertheless, deep Q-network has the issue of over-estimation of Q-value, which leads to unstable input data quality in Transformer. With the accumulation of deviation fast, model performance may decline drastically, resulting in severe errors that are fatal to policy learning. In this paper, we note that the previous Q-value overestimation mitigation model is not suitable for Deep Transformer Q-Networks (DTQN) framework, for DTQN is a sequence-to-sequence model, not merely a value optimization model in traditional RL. Therefore, we propose Windows DTQN, based on the reduction of Q-value variance via the synergistic effect of shallow and deep windows. In particular, Windows DTQN ensembles the historical Q-networks through the shallow windows, and estimates the uncertainty of the Q-networks through the deep windows for weight allocation. Our experiments conducted on gridverse environments demonstrate that our model achieves better results than the current mainstream DQN algorithms in POMDP. Compared to DTQN, Windows DTQN increases the average success rate by 5.1% and the average return by 1.11.</p>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 1","pages":""},"PeriodicalIF":3.5000,"publicationDate":"2024-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Intelligence","FirstCategoryId":"94","ListUrlMain":"https://link.springer.com/article/10.1007/s10489-024-05867-3","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Partially Observable Markov Decision Processes (POMDPs) remain worth studying in reinforcement learning (RL) because of their ubiquity in the real world. Compared with Markov Decision Processes (MDPs), agents in a POMDP cannot fully observe the state of the environment, which is an obstacle to traditional RL algorithms. One solution is to establish a sequence-to-sequence model. As the core of Deep Transformer Q-Networks, the Transformer has achieved strong results on partial-observability problems. Nevertheless, deep Q-networks suffer from overestimation of Q-values, which degrades the quality of the inputs fed to the Transformer. As this deviation accumulates quickly, model performance can decline drastically, producing severe errors that are fatal to policy learning. In this paper, we observe that previous models for mitigating Q-value overestimation are not suitable for the Deep Transformer Q-Network (DTQN) framework, because DTQN is a sequence-to-sequence model rather than merely a value-optimization model in traditional RL. We therefore propose Windows DTQN, which reduces the variance of Q-values through the synergy of shallow and deep windows. In particular, Windows DTQN ensembles historical Q-networks through the shallow windows and estimates the uncertainty of those Q-networks through the deep windows to allocate their weights. Our experiments on gridverse environments demonstrate that our model outperforms current mainstream DQN algorithms in POMDPs. Compared with DTQN, Windows DTQN increases the average success rate by 5.1% and the average return by 1.11.
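The abstract gives only a high-level description of the shallow/deep window mechanism. The sketch below illustrates one plausible reading of it: recent Q-network snapshots (the shallow window) form an ensemble, and a longer history (the deep window) supplies an uncertainty proxy used to weight the ensemble members. The window sizes, the deviation-from-consensus uncertainty measure, and the inverse-uncertainty weighting are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def windowed_q_estimate(q_history, shallow_k=3, deep_k=10, eps=1e-8):
    """Minimal sketch (assumed details) of a shallow/deep-window Q estimate.

    q_history: array of shape (n_snapshots, n_actions) holding the Q-values
    produced by historical network snapshots for one state, oldest first.
    """
    q_history = np.asarray(q_history, dtype=float)
    shallow = q_history[-shallow_k:]   # newest snapshots: the ensemble members
    deep = q_history[-deep_k:]         # longer window: uncertainty reference

    # Assumed uncertainty proxy: each ensemble member's mean absolute
    # deviation from the deep-window consensus estimate.
    consensus = deep.mean(axis=0)
    deviation = np.abs(shallow - consensus).mean(axis=1)  # one score per member

    # Assumed weighting scheme: weight members inversely to their
    # estimated uncertainty, then combine into a single Q-value vector.
    weights = 1.0 / (deviation + eps)
    weights /= weights.sum()
    return weights @ shallow           # weighted ensemble Q-values per action

# Toy usage: 12 stored snapshots, 4 actions.
history = np.random.randn(12, 4)
print(windowed_q_estimate(history))
```

The intended effect, per the abstract, is variance reduction: averaging over several historical networks smooths out individual overestimates, while down-weighting snapshots that stray far from the longer-horizon consensus keeps unstable networks from dominating the target.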
About the journal
With a focus on research in artificial intelligence and neural networks, this journal addresses issues involving solutions to real-life manufacturing, defense, management, government and industrial problems that are too complex to be solved through conventional approaches and that require the simulation of intelligent thought processes, heuristics, applications of knowledge, and distributed and parallel processing. The integration of these multiple approaches in solving complex problems is of particular importance.
The journal presents new and original research and technological developments addressing real, complex problems. It provides a medium for exchanging scientific research and technological achievements accomplished by the international community.