Adaptive moving average Q-learning

IF 2.5 4区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Knowledge and Information Systems Pub Date : 2024-08-12 DOI:10.1007/s10115-024-02190-8
Tao Tan, Hong Xie, Yunni Xia, Xiaoyu Shi, Mingsheng Shang
{"title":"Adaptive moving average Q-learning","authors":"Tao Tan, Hong Xie, Yunni Xia, Xiaoyu Shi, Mingsheng Shang","doi":"10.1007/s10115-024-02190-8","DOIUrl":null,"url":null,"abstract":"<p>A variety of algorithms have been proposed to address the long-standing overestimation bias problem of Q-learning. Reducing this overestimation bias may lead to an underestimation bias, such as double Q-learning. However, it is still unclear how to make a good balance between overestimation and underestimation. We present a simple yet effective algorithm to fill in this gap and call Moving Average Q-learning. Specifically, we maintain two dependent Q-estimators. The first one is used to estimate the maximum expected Q-value. The second one is used to select the optimal action. In particular, the second estimator is the moving average of historical Q-values generated by the first estimator. The second estimator has only one hyperparameter, namely the moving average parameter. This parameter controls the dependence between the second estimator and the first estimator, ranging from independent to identical. Based on Moving Average Q-learning, we design an adaptive strategy to select the moving average parameter, resulting in AdaMA (<u>Ada</u>ptive <u>M</u>oving <u>A</u>verage) Q-learning. This adaptive strategy is a simple function, where the moving average parameter increases monotonically with the number of state–action pairs visited. Moreover, we extend AdaMA Q-learning to AdaMA DQN in high-dimensional environments. Extensive experiment results reveal why Moving Average Q-learning and AdaMA Q-learning can mitigate the overestimation bias, and also show that AdaMA Q-learning and AdaMA DQN outperform SOTA baselines drastically. In particular, when compared with the overestimated value of 1.66 in Q-learning, AdaMA Q-learning underestimates by 0.196, resulting in an improvement of 88.19%.</p>","PeriodicalId":54749,"journal":{"name":"Knowledge and Information Systems","volume":"372 1","pages":""},"PeriodicalIF":2.5000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge and Information Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s10115-024-02190-8","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

A variety of algorithms have been proposed to address the long-standing overestimation bias problem of Q-learning. Reducing this overestimation bias may lead to an underestimation bias, such as double Q-learning. However, it is still unclear how to make a good balance between overestimation and underestimation. We present a simple yet effective algorithm to fill in this gap and call Moving Average Q-learning. Specifically, we maintain two dependent Q-estimators. The first one is used to estimate the maximum expected Q-value. The second one is used to select the optimal action. In particular, the second estimator is the moving average of historical Q-values generated by the first estimator. The second estimator has only one hyperparameter, namely the moving average parameter. This parameter controls the dependence between the second estimator and the first estimator, ranging from independent to identical. Based on Moving Average Q-learning, we design an adaptive strategy to select the moving average parameter, resulting in AdaMA (Adaptive Moving Average) Q-learning. This adaptive strategy is a simple function, where the moving average parameter increases monotonically with the number of state–action pairs visited. Moreover, we extend AdaMA Q-learning to AdaMA DQN in high-dimensional environments. Extensive experiment results reveal why Moving Average Q-learning and AdaMA Q-learning can mitigate the overestimation bias, and also show that AdaMA Q-learning and AdaMA DQN outperform SOTA baselines drastically. In particular, when compared with the overestimated value of 1.66 in Q-learning, AdaMA Q-learning underestimates by 0.196, resulting in an improvement of 88.19%.

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
自适应移动平均 Q 学习
为了解决 Q-learning 长期存在的高估偏差问题,人们提出了多种算法。减少高估偏差可能会导致低估偏差,如双 Q 学习。然而,如何在高估和低估之间取得良好的平衡仍不清楚。我们提出了一种简单而有效的算法来填补这一空白,并称之为移动平均 Q-learning。具体来说,我们保留了两个相互依赖的 Q 值估计器。第一个用于估计最大预期 Q 值。第二个用于选择最优行动。具体来说,第二个估计器是第一个估计器生成的历史 Q 值的移动平均值。第二个估计器只有一个超参数,即移动平均参数。该参数控制着第二个估计器和第一个估计器之间的依赖关系,范围从独立到相同。在移动平均 Q-learning 的基础上,我们设计了一种自适应策略来选择移动平均参数,这就是 AdaMA(自适应移动平均)Q-learning。这种自适应策略是一个简单的函数,其中移动平均参数随访问的状态-动作对数量的增加而单调增加。此外,我们还将 AdaMA Q-learning 扩展到了高维环境下的 AdaMA DQN。广泛的实验结果揭示了移动平均 Q-learning 和 AdaMA Q-learning 能够减轻高估偏差的原因,同时也表明 AdaMA Q-learning 和 AdaMA DQN 的性能大大优于 SOTA 基线。其中,与 Q-learning 的高估值 1.66 相比,AdaMA Q-learning 的低估值为 0.196,提高了 88.19%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Knowledge and Information Systems
Knowledge and Information Systems 工程技术-计算机:人工智能
CiteScore
5.70
自引率
7.40%
发文量
152
审稿时长
7.2 months
期刊介绍: Knowledge and Information Systems (KAIS) provides an international forum for researchers and professionals to share their knowledge and report new advances on all topics related to knowledge systems and advanced information systems. This monthly peer-reviewed archival journal publishes state-of-the-art research reports on emerging topics in KAIS, reviews of important techniques in related areas, and application papers of interest to a general readership.
期刊最新文献
Dynamic evolution of causal relationships among cryptocurrencies: an analysis via Bayesian networks Deep multi-semantic fuzzy K-means with adaptive weight adjustment Class incremental named entity recognition without forgetting Spectral clustering with scale fairness constraints Supervised kernel-based multi-modal Bhattacharya distance learning for imbalanced data classification
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1