Multi-objective ω-Regular Reinforcement Learning

Formal Aspects of Computing · IF 1.4 · CAS Tier 4 (Computer Science) · JCR Q3 (Computer Science, Software Engineering) · Pub Date: 2023-06-26 · DOI: 10.1145/3605950
E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak
{"title":"多目标ω-规则强化学习","authors":"E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak","doi":"10.1145/3605950","DOIUrl":null,"url":null,"abstract":"The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.","PeriodicalId":50432,"journal":{"name":"Formal Aspects of Computing","volume":null,"pages":null},"PeriodicalIF":1.4000,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-objective ω-Regular Reinforcement Learning\",\"authors\":\"E. M. Hahn, Mateo Perez, S. Schewe, F. Somenzi, Ashutosh Trivedi, D. Wojtczak\",\"doi\":\"10.1145/3605950\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). 
We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.\",\"PeriodicalId\":50432,\"journal\":{\"name\":\"Formal Aspects of Computing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2023-06-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Formal Aspects of Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3605950\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Formal Aspects of Computing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3605950","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
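The two preference schemes described in the abstract can be illustrated with a small sketch. The following Python snippet is hypothetical and not part of the paper or of Mungojerrie; all names and numbers are made up for illustration. It compares two candidate policies, each summarised by a vector of per-objective satisfaction probabilities, once under a weighted preference and once under a lexicographic one.

# Hypothetical sketch of the two preference schemes (not from the paper).
# Each policy is summarised by a vector of satisfaction probabilities,
# one entry per omega-regular objective.

def weighted_value(sat_probs, weights):
    """Weighted preference: scalarise by a weighted sum of satisfaction
    probabilities; the policy with the larger value is preferred."""
    return sum(w * p for w, p in zip(weights, sat_probs))

def lexicographic_better(a, b, eps=1e-9):
    """Lexicographic preference: compare objectives in priority order;
    any gain on a higher-priority objective outweighs all lower ones."""
    for pa, pb in zip(a, b):
        if pa > pb + eps:
            return True
        if pb > pa + eps:
            return False
    return False  # equal on every objective

# Example: two policies evaluated against three objectives.
policy_x = [0.90, 0.40, 0.99]
policy_y = [0.85, 0.80, 0.10]
print(weighted_value(policy_x, [0.5, 0.3, 0.2]))   # 0.768
print(weighted_value(policy_y, [0.5, 0.3, 0.2]))   # 0.685
print(lexicographic_better(policy_x, policy_y))    # True: 0.90 > 0.85 on the top-priority objective

Under these illustrative numbers both schemes prefer policy_x, but for different reasons: the weighted view trades objectives off through the scalar weights, whereas the lexicographic view decides on the highest-priority objective alone.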
Source journal: Formal Aspects of Computing (Engineering & Technology, Computer Science: Software Engineering)
CiteScore: 3.30
Self-citation rate: 0.00%
Number of articles: 17
Review time: >12 weeks
Journal description: This journal aims to publish contributions at the junction of theory and practice. The objective is to disseminate applicable research. Thus new theoretical contributions are welcome where they are motivated by potential application; applications of existing formalisms are of interest if they show something novel about the approach or application. In particular, the scope of Formal Aspects of Computing includes: well-founded notations for the description of systems; verifiable design methods; elucidation of fundamental computational concepts; approaches to fault-tolerant design; theorem-proving support; state-exploration tools; formal underpinning of widely used notations and methods; formal approaches to requirements analysis.
Latest articles in this journal:
ω-Regular Energy Problems
A Calculus for the Specification, Design, and Verification of Distributed Concurrent Systems
Does Every Computer Scientist Need to Know Formal Methods?
Specification and Verification of Multi-clock Systems using a Temporal Logic with Clock Constraints
SMT based parameter identifiable combination detection for non-linear continuous and hybrid dynamics