Games for AI Control: Models of Safety Evaluations of AI Deployment Protocols
Charlie Griffin, Louis Thomson, Buck Shlegeris, Alessandro Abate
arXiv - CS - Artificial Intelligence, 2024-09-12, arXiv:2409.07985
Abstract
To evaluate the safety and usefulness of deployment protocols for untrusted
AIs, AI Control uses a red-teaming exercise played between a protocol designer
and an adversary. This paper introduces AI-Control Games, a formal
decision-making model of the red-teaming exercise as a multi-objective,
partially observable, stochastic game. We also introduce methods for finding
optimal protocols in AI-Control Games, by reducing them to a set of zero-sum
partially observable stochastic games. We apply our formalism to model,
evaluate and synthesise protocols for deploying untrusted language models as
programming assistants, focusing on Trusted Monitoring protocols, which use
weaker language models and limited human assistance. Finally, we demonstrate
the utility of our formalism by showcasing improvements over empirical studies
in existing settings, evaluating protocols in new settings, and analysing how
modelling assumptions affect the safety and usefulness of protocols.
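
As a rough illustration of what reducing a two-objective problem to a set of zero-sum games can look like, the sketch below scalarises safety and usefulness with a weight w and solves a tiny matrix game for each w. This is a toy example under stated assumptions, not the method of the paper: the weighted-sum scalarisation, the restriction to pure strategies, the assumption that usefulness depends only on the protocol, and all names and numbers (SAFETY, USEFULNESS, best_protocol, sweep) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy data (not from the paper).
# Rows: three candidate protocols, from most to least cautious.
# Columns: two adversary strategies.
# SAFETY[i][j]: assumed probability that no harmful output gets through.
SAFETY: List[List[float]] = [
    [0.9, 0.8],
    [0.7, 0.6],
    [0.3, 0.2],
]
# USEFULNESS[i]: assumed to depend only on the protocol, not on the adversary.
USEFULNESS: List[float] = [0.5, 0.8, 1.0]


def best_protocol(
    safety: Sequence[Sequence[float]],
    usefulness: Sequence[float],
    w: float,
) -> Tuple[int, float]:
    """Solve one scalarised zero-sum game over pure strategies.

    The designer maximises w * safety + (1 - w) * usefulness while the
    adversary picks the column that minimises safety. Returns the index
    of the maximin protocol and its worst-case scalarised value.
    """
    best_index, best_value = -1, float("-inf")
    for i, row in enumerate(safety):
        worst_case_safety = min(row)  # adversary's best response to protocol i
        value = w * worst_case_safety + (1.0 - w) * usefulness[i]
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value


def sweep(
    safety: Sequence[Sequence[float]],
    usefulness: Sequence[float],
    weights: Sequence[float],
) -> List[Tuple[float, int, float]]:
    """Solve one zero-sum game per weight, tracing out a set of trade-offs."""
    return [(w, *best_protocol(safety, usefulness, w)) for w in weights]


if __name__ == "__main__":
    for w, index, value in sweep(SAFETY, USEFULNESS, [0.0, 0.5, 1.0]):
        print(f"w={w:.1f}: protocol {index}, worst-case value {value:.2f}")
```

Sweeping w from 0 to 1 moves the selected protocol from the most useful to the safest one in this toy data. A realistic treatment would need mixed or memory-dependent strategies and partial observability, which this sketch deliberately leaves out.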