Offline prompt reinforcement learning method based on feature extraction.

IF 3.5 4区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE PeerJ Computer Science Pub Date : 2025-01-02 eCollection Date: 2025-01-01 DOI:10.7717/peerj-cs.2490

Tianlei Yao, Xiliang Chen, Yi Yao, Weiye Huang, Zhaoyang Chen

{"title":"Offline prompt reinforcement learning method based on feature extraction.","authors":"Tianlei Yao, Xiliang Chen, Yi Yao, Weiye Huang, Zhaoyang Chen","doi":"10.7717/peerj-cs.2490","DOIUrl":null,"url":null,"abstract":"<p><p>Recent studies have shown that combining Transformer and conditional strategies to deal with offline reinforcement learning can bring better results. However, in a conventional reinforcement learning scenario, the agent can receive a single frame of observations one by one according to its natural chronological sequence, but in Transformer, a series of observations are received at each step. Individual features cannot be extracted efficiently to make more accurate decisions, and it is still difficult to generalize effectively for data outside the distribution. We focus on the characteristic of few-shot learning in pre-trained models, and combine prompt learning to enhance the ability of real-time policy adjustment. By sampling the specific information in the offline dataset as trajectory samples, the task information is encoded to help the pre-trained model quickly understand the task characteristics and the sequence generation paradigm to quickly adapt to the downstream tasks. In order to understand the dependencies in the sequence more accurately, we also divide the fixed-size state information blocks in the input trajectory, extract the features of the segmented sub-blocks respectively, and finally encode the whole sequence into the GPT model to generate decisions more accurately. Experiments show that the proposed method achieves better performance than the baseline method in related tasks, can be generalized to new environments and tasks better, and effectively improves the stability and accuracy of agent decision making.</p>","PeriodicalId":54224,"journal":{"name":"PeerJ Computer Science","volume":"11 ","pages":"e2490"},"PeriodicalIF":3.5000,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11784719/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"PeerJ Computer Science","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.7717/peerj-cs.2490","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Recent studies have shown that combining Transformer and conditional strategies to deal with offline reinforcement learning can bring better results. However, in a conventional reinforcement learning scenario, the agent can receive a single frame of observations one by one according to its natural chronological sequence, but in Transformer, a series of observations are received at each step. Individual features cannot be extracted efficiently to make more accurate decisions, and it is still difficult to generalize effectively for data outside the distribution. We focus on the characteristic of few-shot learning in pre-trained models, and combine prompt learning to enhance the ability of real-time policy adjustment. By sampling the specific information in the offline dataset as trajectory samples, the task information is encoded to help the pre-trained model quickly understand the task characteristics and the sequence generation paradigm to quickly adapt to the downstream tasks. In order to understand the dependencies in the sequence more accurately, we also divide the fixed-size state information blocks in the input trajectory, extract the features of the segmented sub-blocks respectively, and finally encode the whole sequence into the GPT model to generate decisions more accurately. Experiments show that the proposed method achieves better performance than the baseline method in related tasks, can be generalized to new environments and tasks better, and effectively improves the stability and accuracy of agent decision making.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

求助全文

约1分钟内获得全文去求助

来源期刊

PeerJ Computer Science Computer Science-General Computer Science

CiteScore

6.10

自引率

5.30%

发文量

332

审稿时长

10 weeks

期刊介绍： PeerJ Computer Science is the new open access journal covering all subject areas in computer science, with the backing of a prestigious advisory board and more than 300 academic editors.