{"title":"Mild evaluation policy via dataset constraint for offline reinforcement learning","authors":"Xue Li , Xinghong Ling","doi":"10.1016/j.eswa.2025.126842","DOIUrl":null,"url":null,"abstract":"<div><div>Offline reinforcement learning (RL) agents seek optimal policies from fixed datasets. Policy constraints are typically employed to adhere to the behavior policy, thereby stabilizing value learning and mitigating the selection of out-of-distribution (OOD) actions. Conventional approaches apply identical constraints for both value learning and test time inference. However, the constraints suitable for value estimation may in fact be excessively restrictive for action selection during test time. To address this issue, we propose a mild evaluation policy via dataset constraint (MEDC) for test time inference with a more constrained target policy for value estimation. MEDC introduces a dual-policy constraint, comprising a target policy and an evaluation policy. The evaluation policy regularize the policy towards the nearest state–action pair, with behavior cloning performed on the target policy. The distributional shift is effectively addressed through the combination of dataset constraint and behavior cloning. The TD3 is employed to direct the policy in selecting actions that maximize the return. Moreover, MEDC achieves state-of-the-art performance compared with existing methods on the D4RL datasets.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"274 ","pages":"Article 126842"},"PeriodicalIF":7.5000,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0957417425004646","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Offline reinforcement learning (RL) agents seek optimal policies from fixed datasets. Policy constraints are typically employed to keep the learned policy close to the behavior policy, thereby stabilizing value learning and mitigating the selection of out-of-distribution (OOD) actions. Conventional approaches apply identical constraints to both value learning and test-time inference. However, constraints suitable for value estimation may be excessively restrictive for action selection at test time. To address this issue, we propose a mild evaluation policy via dataset constraint (MEDC) for test-time inference, paired with a more tightly constrained target policy for value estimation. MEDC introduces a dual-policy constraint comprising a target policy and an evaluation policy: the evaluation policy is regularized toward the nearest in-dataset state-action pair, while behavior cloning is performed on the target policy. The combination of the dataset constraint and behavior cloning effectively addresses distributional shift, and TD3 is employed to direct the policy toward actions that maximize return. MEDC achieves state-of-the-art performance compared with existing methods on the D4RL benchmarks.
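As a rough illustration of the dual-policy constraint described in the abstract, the following PyTorch-style sketch pairs a TD3+BC-style target-policy loss (behavior cloning toward the batch actions, used for value estimation) with a milder evaluation-policy loss regularized toward the action of the nearest in-dataset state. All names here (`nearest_dataset_action`, `medc_losses`, `actor_target`, `actor_eval`, the weights `alpha` and `beta`) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of a dual-policy constraint in the spirit of MEDC,
# assuming a TD3+BC-style backbone. Illustrative only.
import torch


def nearest_dataset_action(states, data_states, data_actions):
    """For each query state, return the action of the closest dataset state."""
    dists = torch.cdist(states, data_states)  # (B, N) pairwise distances
    idx = dists.argmin(dim=1)                 # index of nearest dataset state
    return data_actions[idx]                  # (B, act_dim)


def medc_losses(actor_target, actor_eval, critic, batch, alpha=2.5, beta=1.0):
    s, a = batch["states"], batch["actions"]
    ds, da = batch["data_states"], batch["data_actions"]

    # Target policy: TD3+BC-style objective (Q maximization plus behavior
    # cloning toward the batch actions); used only for value estimation.
    pi_t = actor_target(s)
    q_t = critic(s, pi_t)
    lam = alpha / q_t.abs().mean().detach()   # TD3+BC normalization term
    loss_target = -lam * q_t.mean() + ((pi_t - a) ** 2).mean()

    # Evaluation policy: milder dataset constraint, pulled toward the action
    # of the nearest in-dataset state-action pair; used at test time.
    pi_e = actor_eval(s)
    nn_a = nearest_dataset_action(s, ds, da)
    q_e = critic(s, pi_e)
    loss_eval = -q_e.mean() + beta * ((pi_e - nn_a) ** 2).mean()

    return loss_target, loss_eval
```

Under these assumptions, the target policy stays tightly anchored to the behavior policy so that value estimates remain in-distribution, while the evaluation policy only needs to stay near *some* dataset state-action pair, giving it more freedom when selecting actions at test time.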
Journal introduction:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.