{"title":"PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation","authors":"Ilya Gusev","doi":"arxiv-2409.06820","DOIUrl":null,"url":null,"abstract":"We introduce a novel benchmark for evaluating the role-playing capabilities\nof language models. Our approach leverages language models themselves to\nemulate users in dynamic, multi-turn conversations and to assess the resulting\ndialogues. The framework consists of three main components: a player model\nassuming a specific character role, an interrogator model simulating user\nbehavior, and a judge model evaluating conversation quality. We conducted\nexperiments comparing automated evaluations with human annotations to validate\nour approach, demonstrating strong correlations across multiple criteria. This\nwork provides a foundation for a robust and dynamic evaluation of model\ncapabilities in interactive scenarios.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06820","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
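
The abstract describes a three-component loop: an interrogator emulates the user, a player answers in character, and a judge scores the finished dialogue. The Python sketch below illustrates one plausible shape of that loop under stated assumptions; the function names, prompts, turn count, and scoring criteria are illustrative guesses, not the paper's actual implementation.

```python
# Illustrative sketch of the player / interrogator / judge loop described
# in the abstract. Names, prompts, and criteria here are assumptions.

from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": ..., "content": ...}
ChatFn = Callable[[List[Message]], str]     # any chat-completion backend


def run_role_play_episode(
    player: ChatFn,          # model playing the assigned character
    interrogator: ChatFn,    # model emulating the user
    judge: ChatFn,           # model scoring the finished dialogue
    character_card: str,     # system prompt describing the character
    user_persona: str,       # system prompt describing the emulated user
    num_turns: int = 4,      # assumed conversation length
) -> str:
    """Run a multi-turn conversation and return the judge's verdict."""
    dialogue: List[Message] = []

    for _ in range(num_turns):
        # The interrogator sees the conversation with roles mirrored,
        # so the player's lines look like incoming messages to it.
        user_view = [{"role": "system", "content": user_persona}] + [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in dialogue
        ]
        user_msg = interrogator(user_view)
        dialogue.append({"role": "user", "content": user_msg})

        # The player answers in character.
        player_view = [{"role": "system", "content": character_card}] + dialogue
        reply = player(player_view)
        dialogue.append({"role": "assistant", "content": reply})

    # The judge rates the full transcript; the criteria wording below is
    # a placeholder for whatever rubric the benchmark actually uses.
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in dialogue)
    return judge([
        {"role": "system",
         "content": "Rate the assistant's role-play on each criterion."},
        {"role": "user", "content": transcript},
    ])
```

Any chat-completion backend can be plugged in as the three callables, so the same episode driver can compare different player models while holding the interrogator and judge fixed, which matches the multi-model evaluation setup the title suggests.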