Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang
{"title":"用于离线强化学习的 Q 值正则化决策 ConvFormer","authors":"Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang","doi":"arxiv-2409.08062","DOIUrl":null,"url":null,"abstract":"As a data-driven paradigm, offline reinforcement learning (Offline RL) has\nbeen formulated as sequence modeling, where the Decision Transformer (DT) has\ndemonstrated exceptional capabilities. Unlike previous reinforcement learning\nmethods that fit value functions or compute policy gradients, DT adjusts the\nautoregressive model based on the expected returns, past states, and actions,\nusing a causally masked Transformer to output the optimal action. However, due\nto the inconsistency between the sampled returns within a single trajectory and\nthe optimal returns across multiple trajectories, it is challenging to set an\nexpected return to output the optimal action and stitch together suboptimal\ntrajectories. Decision ConvFormer (DC) is easier to understand in the context\nof modeling RL trajectories within a Markov Decision Process compared to DT. We\npropose the Q-value Regularized Decision ConvFormer (QDC), which combines the\nunderstanding of RL trajectories by DC and incorporates a term that maximizes\naction values using dynamic programming methods during training. This ensures\nthat the expected returns of the sampled actions are consistent with the\noptimal returns. QDC achieves excellent performance on the D4RL benchmark,\noutperforming or approaching the optimal level in all tested environments. It\nparticularly demonstrates outstanding competitiveness in trajectory stitching\ncapability.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning\",\"authors\":\"Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang\",\"doi\":\"arxiv-2409.08062\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As a data-driven paradigm, offline reinforcement learning (Offline RL) has\\nbeen formulated as sequence modeling, where the Decision Transformer (DT) has\\ndemonstrated exceptional capabilities. Unlike previous reinforcement learning\\nmethods that fit value functions or compute policy gradients, DT adjusts the\\nautoregressive model based on the expected returns, past states, and actions,\\nusing a causally masked Transformer to output the optimal action. However, due\\nto the inconsistency between the sampled returns within a single trajectory and\\nthe optimal returns across multiple trajectories, it is challenging to set an\\nexpected return to output the optimal action and stitch together suboptimal\\ntrajectories. Decision ConvFormer (DC) is easier to understand in the context\\nof modeling RL trajectories within a Markov Decision Process compared to DT. We\\npropose the Q-value Regularized Decision ConvFormer (QDC), which combines the\\nunderstanding of RL trajectories by DC and incorporates a term that maximizes\\naction values using dynamic programming methods during training. This ensures\\nthat the expected returns of the sampled actions are consistent with the\\noptimal returns. QDC achieves excellent performance on the D4RL benchmark,\\noutperforming or approaching the optimal level in all tested environments. 
It\\nparticularly demonstrates outstanding competitiveness in trajectory stitching\\ncapability.\",\"PeriodicalId\":501031,\"journal\":{\"name\":\"arXiv - CS - Robotics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Robotics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.08062\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Robotics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.08062","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning
As a data-driven paradigm, offline reinforcement learning (Offline RL) has
been formulated as sequence modeling, where the Decision Transformer (DT) has
demonstrated exceptional capabilities. Unlike previous reinforcement learning
methods that fit value functions or compute policy gradients, DT adjusts the
autoregressive model based on the expected returns, past states, and actions,
using a causally masked Transformer to output the optimal action. However,
because the returns sampled within a single trajectory are inconsistent with
the optimal returns achievable across multiple trajectories, it is challenging
to set an expected return that yields the optimal action and to stitch together
suboptimal trajectories. Compared to DT, the Decision ConvFormer (DC) is easier
to interpret as a model of RL trajectories within a Markov Decision Process. We
propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's
modeling of RL trajectories with a term, trained via dynamic programming, that
maximizes action values. This keeps the expected returns of the sampled actions
consistent with the optimal returns. QDC achieves excellent performance on the
D4RL benchmark, outperforming or approaching the best level in all tested
environments, and it is particularly competitive in trajectory stitching.
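
The core idea described above, a DC-style sequence-modeling loss regularized by a term that maximizes estimated action values, can be illustrated with a minimal sketch. The function names, signatures, and the weighting coefficient eta below are hypothetical placeholders rather than the paper's implementation, and the Q-network is assumed to be trained separately with a temporal-difference (dynamic-programming) objective.

import torch
import torch.nn.functional as F

def qdc_training_loss(policy, q_network, returns_to_go, states, actions,
                      timesteps, eta=1.0):
    """Hypothetical combined objective: sequence modeling + Q-value regularization."""
    # Sequence-modeling prediction: the policy autoregressively outputs actions
    # conditioned on returns-to-go, past states, and past actions (DC-style).
    pred_actions = policy(returns_to_go, states, actions, timesteps)

    # Behavior-cloning term: stay close to the actions in the offline dataset.
    bc_loss = F.mse_loss(pred_actions, actions)

    # Q-value regularization term: push predicted actions toward high estimated
    # value; q_network is assumed to be learned with a standard TD objective.
    q_loss = -q_network(states, pred_actions).mean()

    return bc_loss + eta * q_loss

In this sketch, eta balances staying close to the dataset actions against maximizing the learned action value, which is what keeps the expected return of the sampled actions consistent with the optimal return and enables stitching of suboptimal trajectories.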