{"title":"Visual Grounding for Object-Level Generalization in Reinforcement Learning","authors":"Haobin Jiang, Zongqing Lu","doi":"arxiv-2408.01942","DOIUrl":null,"url":null,"abstract":"Generalization is a pivotal challenge for agents following natural language\ninstructions. To approach this goal, we leverage a vision-language model (VLM)\nfor visual grounding and transfer its vision-language knowledge into\nreinforcement learning (RL) for object-centric tasks, which makes the agent\ncapable of zero-shot generalization to unseen objects and instructions. By\nvisual grounding, we obtain an object-grounded confidence map for the target\nobject indicated in the instruction. Based on this map, we introduce two routes\nto transfer VLM knowledge into RL. Firstly, we propose an object-grounded\nintrinsic reward function derived from the confidence map to more effectively\nguide the agent towards the target object. Secondly, the confidence map offers\na more unified, accessible task representation for the agent's policy, compared\nto language embeddings. This enables the agent to process unseen objects and\ninstructions through comprehensible visual confidence maps, facilitating\nzero-shot object-level generalization. Single-task experiments prove that our\nintrinsic reward significantly improves performance on challenging skill\nlearning. In multi-task experiments, through testing on tasks beyond the\ntraining set, we show that the agent, when provided with the confidence map as\nthe task representation, possesses better generalization capabilities than\nlanguage-based conditioning. The code is available at\nhttps://github.com/PKU-RL/COPL.","PeriodicalId":501479,"journal":{"name":"arXiv - CS - Artificial Intelligence","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.01942","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Generalization is a pivotal challenge for agents that follow natural language instructions. To address this challenge, we leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning (RL) for object-centric tasks, enabling the agent to generalize zero-shot to unseen objects and instructions. Through visual grounding, we obtain an object-grounded confidence map for the target object indicated in the instruction. Based on this map, we introduce two routes to transfer VLM knowledge into RL. First, we propose an object-grounded intrinsic reward function derived from the confidence map that more effectively guides the agent towards the target object. Second, the confidence map offers a more unified, accessible task representation for the agent's policy than language embeddings. This enables the agent to process unseen objects and instructions through comprehensible visual confidence maps, facilitating zero-shot object-level generalization. Single-task experiments show that our intrinsic reward significantly improves performance on challenging skill learning. In multi-task experiments on tasks beyond the training set, we show that conditioning the agent on the confidence map as the task representation yields better generalization than language-based conditioning. The code is available at https://github.com/PKU-RL/COPL.
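
To make the two routes concrete, the sketch below illustrates one plausible way a VLM-produced confidence map could drive both an intrinsic reward and the policy's task representation. This is a minimal illustration, not the paper's COPL implementation: the specific reward shaping (rewarding increases in the map's maximum confidence) and the channel-stacking of the map onto the RGB observation are assumptions made for clarity, and the function and variable names are hypothetical.

```python
import numpy as np


def intrinsic_reward(conf_map: np.ndarray, prev_best: float) -> tuple[float, float]:
    """Reward the agent for bringing the target object into clearer view.

    conf_map  : (H, W) array in [0, 1]; the VLM's per-pixel confidence that the
                target object named in the instruction is present.
    prev_best : highest confidence observed so far in the episode.

    Returns the intrinsic reward and the updated running best. Rewarding only
    the *increase* in confidence (one possible shaping choice) avoids paying
    the agent repeatedly for an object it has already found.
    """
    best = float(conf_map.max())
    reward = max(0.0, best - prev_best)
    return reward, max(best, prev_best)


def policy_observation(rgb: np.ndarray, conf_map: np.ndarray) -> np.ndarray:
    """Stack the confidence map onto the RGB frame as a fourth channel.

    The policy then conditions on the task through this visual map rather than
    a language embedding, so an unseen object name still yields an input of a
    form the network has learned to interpret.
    """
    assert rgb.shape[:2] == conf_map.shape
    return np.concatenate([rgb, conf_map[..., None]], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.random((64, 64, 3)).astype(np.float32)
    conf = rng.random((64, 64)).astype(np.float32) * 0.3  # weak detection of the target
    r, best = intrinsic_reward(conf, prev_best=0.0)
    obs = policy_observation(rgb, conf)
    print(f"intrinsic reward: {r:.3f}, obs shape: {obs.shape}")
```

Because the map is a purely visual signal, swapping in a new instruction only changes which object the VLM highlights; the reward computation and the policy's input format stay the same, which is the property the abstract attributes to zero-shot object-level generalization.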