PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

Ilya Gusev
{"title":"PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation","authors":"Ilya Gusev","doi":"arxiv-2409.06820","DOIUrl":null,"url":null,"abstract":"We introduce a novel benchmark for evaluating the role-playing capabilities\nof language models. Our approach leverages language models themselves to\nemulate users in dynamic, multi-turn conversations and to assess the resulting\ndialogues. The framework consists of three main components: a player model\nassuming a specific character role, an interrogator model simulating user\nbehavior, and a judge model evaluating conversation quality. We conducted\nexperiments comparing automated evaluations with human annotations to validate\nour approach, demonstrating strong correlations across multiple criteria. This\nwork provides a foundation for a robust and dynamic evaluation of model\ncapabilities in interactive scenarios.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06820","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
乒乓球:带用户模拟和多模型评估的角色扮演语言模型基准
我们介绍了一种评估语言模型角色扮演能力的新基准。我们的方法利用语言模型本身来模拟动态、多回合对话中的用户,并评估由此产生的对话。该框架由三个主要部分组成:扮演特定角色的玩家模型、模拟用户行为的审讯者模型以及评估对话质量的评判者模型。我们对自动评估和人工注释进行了比较实验,以验证我们的方法,结果表明在多个标准之间存在很强的相关性。这项工作为在交互场景中对模型能力进行稳健而动态的评估奠定了基础。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
LLMs + Persona-Plug = Personalized LLMs MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts Extract-and-Abstract: Unifying Extractive and Abstractive Summarization within Single Encoder-Decoder Framework Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources Human-like Affective Cognition in Foundation Models
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1