Multi-objective ω-Regular Reinforcement Learning

Formal Aspects of Computing · IF 1.4 · CAS Tier 4 (Computer Science) · JCR Q3 (Computer Science, Software Engineering) · Pub Date: 2023-06-26 · DOI: 10.1145/3605950
E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak
{"title":"多目标ω-规则强化学习","authors":"E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak","doi":"10.1145/3605950","DOIUrl":null,"url":null,"abstract":"The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.","PeriodicalId":50432,"journal":{"name":"Formal Aspects of Computing","volume":null,"pages":null},"PeriodicalIF":1.4000,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-objective ω-Regular Reinforcement Learning\",\"authors\":\"E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak\",\"doi\":\"10.1145/3605950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). 
We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.\",\"PeriodicalId\":50432,\"journal\":{\"name\":\"Formal Aspects of Computing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Formal Aspects of Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3605950\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Formal Aspects of Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3605950","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
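The two preference schemes described in the abstract can be illustrated with a small sketch. The following Python snippet is hypothetical and not part of the paper or of Mungojerrie; all names and numbers are made up for illustration. It compares two candidate policies, each summarised by a vector of per-objective satisfaction probabilities, once under a weighted preference and once under a lexicographic one.

# Hypothetical sketch of the two preference schemes (not from the paper).
# Each policy is summarised by a vector of satisfaction probabilities,
# one entry per omega-regular objective.

def weighted_value(sat_probs, weights):
    """Weighted preference: scalarise by a weighted sum of satisfaction
    probabilities; the policy with the larger value is preferred."""
    return sum(w * p for w, p in zip(weights, sat_probs))

def lexicographic_better(a, b, eps=1e-9):
    """Lexicographic preference: compare objectives in priority order;
    any gain on a higher-priority objective outweighs all lower ones."""
    for pa, pb in zip(a, b):
        if pa > pb + eps:
            return True
        if pb > pa + eps:
            return False
    return False  # equal on every objective

# Example: two policies evaluated against three objectives.
policy_x = [0.90, 0.40, 0.99]
policy_y = [0.85, 0.80, 0.10]
print(weighted_value(policy_x, [0.5, 0.3, 0.2]))   # 0.768
print(weighted_value(policy_y, [0.5, 0.3, 0.2]))   # 0.685
print(lexicographic_better(policy_x, policy_y))    # True: 0.90 > 0.85 on the top-priority objective

Under these illustrative numbers both schemes prefer policy_x, but for different reasons: the weighted view trades objectives off through the scalar weights, whereas the lexicographic view decides on the highest-priority objective alone.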
Source journal: Formal Aspects of Computing (Engineering & Technology, Computer Science: Software Engineering)
CiteScore: 3.30
Self-citation rate: 0.00%
Number of articles: 17
Review time: >12 weeks
Journal description: This journal aims to publish contributions at the junction of theory and practice. The objective is to disseminate applicable research. Thus new theoretical contributions are welcome where they are motivated by potential application; applications of existing formalisms are of interest if they show something novel about the approach or application. In particular, the scope of Formal Aspects of Computing includes: well-founded notations for the description of systems; verifiable design methods; elucidation of fundamental computational concepts; approaches to fault-tolerant design; theorem-proving support; state-exploration tools; formal underpinning of widely used notations and methods; formal approaches to requirements analysis.
Latest articles in this journal:
ω-Regular Energy Problems
A Calculus for the Specification, Design, and Verification of Distributed Concurrent Systems
Does Every Computer Scientist Need to Know Formal Methods?
Specification and Verification of Multi-clock Systems using a Temporal Logic with Clock Constraints
SMT based parameter identifiable combination detection for non-linear continuous and hybrid dynamics