Athena: Safe Autonomous Agents with Verbal Contrastive Learning
Tanmana Sadhu, Ali Pesaranghader, Yanan Chen, Dong Hoon Yi
arXiv:2408.11021 · 2024-08-20
Abstract
Due to their emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their environments, and execute complex tasks using a selection of tools available to them. As the capabilities of agents expand, ensuring their safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework, which leverages the concept of verbal contrastive learning, where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety while fulfilling a given task. The framework also incorporates a critiquing mechanism that guides the agent away from risky actions at every step. Furthermore, due to the lack of existing benchmarks for the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across 8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source LLMs, indicates that verbal contrastive learning and interaction-level critiquing significantly improve the safety rate.
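
To make the two ideas in the abstract concrete, the sketch below shows one plausible way to assemble past safe and unsafe trajectories as in-context contrastive examples and to let a per-step critic veto risky actions before they are executed. All names here (Trajectory, build_contrastive_prompt, critique_action, propose_action) are hypothetical placeholders introduced for illustration; this is a minimal sketch under those assumptions, not the Athena implementation or its prompt format.

```python
# Illustrative sketch of verbal contrastive prompting plus interaction-level
# critiquing. The structures and callables are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    steps: List[str]
    safe: bool  # label from a past (safe or unsafe) execution


def build_contrastive_prompt(task: str, memory: List[Trajectory], k: int = 2) -> str:
    """Prepend up to k safe and k unsafe past trajectories as in-context examples."""
    safe = [t for t in memory if t.safe][:k]
    unsafe = [t for t in memory if not t.safe][:k]

    def fmt(t: Trajectory, label: str) -> str:
        return f"[{label} example] Task: {t.task}\n" + "\n".join(t.steps)

    examples = [fmt(t, "SAFE") for t in safe] + [fmt(t, "UNSAFE") for t in unsafe]
    return "\n\n".join(examples + [f"Current task: {task}"])


def run_agent(task: str,
              memory: List[Trajectory],
              propose_action: Callable[[str], str],   # e.g. an LLM call
              critique_action: Callable[[str], bool],  # e.g. a critic LLM call
              max_steps: int = 10) -> List[str]:
    """Propose actions step by step; the critic blocks risky ones before execution."""
    prompt = build_contrastive_prompt(task, memory)
    executed: List[str] = []
    for _ in range(max_steps):
        action = propose_action(prompt + "\n" + "\n".join(executed))
        if action == "FINISH":
            break
        if not critique_action(action):  # interaction-level critique at every step
            executed.append(f"(blocked as unsafe) {action}")
            continue
        executed.append(action)
    return executed
```

In this framing, the contrastive examples shape the agent's initial plan, while the critic provides a second, step-wise safety check; how the real framework retrieves trajectories, formats prompts, or implements the critic is specified in the paper itself.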