KnowPC: Knowledge-Driven Programmatic Reinforcement Learning for Zero-shot Coordination
Yin Gu, Qi Liu, Zhi Li, Kai Zhang
arXiv - CS - Artificial Intelligence (arxiv-2408.04336), published 2024-08-08
Abstract
Zero-shot coordination (ZSC), which aims to train an agent that can cooperate
with an unseen partner in training environments or even novel environments,
remains a major challenge in cooperative AI. In recent years, a popular
ZSC solution paradigm has been deep reinforcement learning (DRL) combined with
advanced self-play or population-based methods to enhance the neural policy's
ability to handle unseen partners. Despite some success, these approaches
usually rely on black-box neural networks as the policy function. However,
neural networks typically lack interpretability and logic, making the learned
policies difficult for partners (e.g., humans) to understand and limiting their
generalization ability. These shortcomings hinder the application of
reinforcement learning methods in diverse cooperative scenarios. We propose to
represent the agent's policy with an interpretable program. Unlike neural
networks, programs contain stable logic, but they are non-differentiable and
difficult to optimize. To automatically learn such programs, we introduce
Knowledge-driven Programmatic reinforcement learning for zero-shot Coordination
(KnowPC). We first define a foundational Domain-Specific Language (DSL),
including program structures, conditional primitives, and action primitives. A
significant challenge is the vast program search space, making it difficult to
find high-performing programs efficiently. To address this, KnowPC integrates
an extractor and a reasoner. The extractor discovers environmental transition
knowledge from multi-agent interaction trajectories, while the reasoner deduces
the preconditions of each action primitive based on the transition knowledge.
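
The extractor/reasoner idea can be made concrete with a toy sketch. Everything below is an illustrative assumption, not the paper's DSL or algorithm: states are encoded as sets of symbolic facts, the fact and action names (`hands_empty`, `pick_onion`, and so on) are invented, the "extractor" simply keeps the pre-action state of every step that changed the state, and the "reasoner" takes the facts common to all of those states as an action's precondition. A program is then an ordered list of IF-condition-THEN-action rules. The abstract describes a search over programs; this sketch skips the search and only shows how deduced preconditions could supply the conditions that a rule is allowed to use.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

State = FrozenSet[str]                  # hypothetical encoding: the set of facts that hold
Transition = Tuple[State, str, State]   # (facts before, action primitive, facts after)


def extract_transitions(trajectories: List[List[Transition]]) -> Dict[str, List[State]]:
    """Extractor stand-in: for each action, collect the pre-action states of
    every step in which the action actually changed the state."""
    knowledge: Dict[str, List[State]] = {}
    for trajectory in trajectories:
        for before, action, after in trajectory:
            if before != after:  # ignore steps where the action had no effect
                knowledge.setdefault(action, []).append(before)
    return knowledge


def deduce_preconditions(knowledge: Dict[str, List[State]]) -> Dict[str, State]:
    """Reasoner stand-in: a precondition is the set of facts that held
    before every effective use of the action."""
    return {a: frozenset.intersection(*states) for a, states in knowledge.items()}


@dataclass(frozen=True)
class Rule:
    condition: State   # conjunction of conditional primitives
    action: str        # action primitive


def build_program(preconditions: Dict[str, State], priority: List[str]) -> List[Rule]:
    """Program structure: an ordered IF-THEN list, one rule per action."""
    return [Rule(preconditions[a], a) for a in priority if a in preconditions]


def run_program(program: List[Rule], state: State, default: str = "noop") -> str:
    """Return the action of the first rule whose condition holds in `state`."""
    for rule in program:
        if rule.condition <= state:
            return rule.action
    return default


# Invented trajectories from two agents in a made-up kitchen task.
trajectories = [
    [
        (frozenset({"hands_empty", "onion_on_counter"}), "pick_onion",
         frozenset({"holding_onion"})),
        (frozenset({"holding_onion", "pot_has_space"}), "put_onion_in_pot",
         frozenset({"hands_empty"})),
    ],
    [
        (frozenset({"hands_empty", "onion_on_counter", "pot_has_space"}), "pick_onion",
         frozenset({"holding_onion", "pot_has_space"})),
        # Failed attempt: the state did not change, so the extractor skips it.
        (frozenset({"holding_onion"}), "put_onion_in_pot",
         frozenset({"holding_onion"})),
        (frozenset({"holding_onion", "pot_has_space", "partner_nearby"}), "put_onion_in_pot",
         frozenset({"hands_empty", "partner_nearby"})),
    ],
]

preconditions = deduce_preconditions(extract_transitions(trajectories))
program = build_program(preconditions, priority=["put_onion_in_pot", "pick_onion"])
```

Calling `run_program(program, frozenset({"hands_empty", "onion_on_counter"}))` selects `pick_onion`, because only that rule's deduced condition is satisfied. Intersecting observed states is a deliberately crude reasoner: with few trajectories it can keep incidental facts as preconditions, which is one reason a real system would need richer transition knowledge.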