{"title":"PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation","authors":"Ilya Gusev","doi":"arxiv-2409.06820","DOIUrl":null,"url":null,"abstract":"We introduce a novel benchmark for evaluating the role-playing capabilities\nof language models. Our approach leverages language models themselves to\nemulate users in dynamic, multi-turn conversations and to assess the resulting\ndialogues. The framework consists of three main components: a player model\nassuming a specific character role, an interrogator model simulating user\nbehavior, and a judge model evaluating conversation quality. We conducted\nexperiments comparing automated evaluations with human annotations to validate\nour approach, demonstrating strong correlations across multiple criteria. This\nwork provides a foundation for a robust and dynamic evaluation of model\ncapabilities in interactive scenarios.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06820","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce a novel benchmark for evaluating the role-playing capabilities of language models. Our approach leverages language models themselves to emulate users in dynamic, multi-turn conversations and to assess the resulting dialogues. The framework consists of three main components: a player model assuming a specific character role, an interrogator model simulating user behavior, and a judge model evaluating conversation quality. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of model capabilities in interactive scenarios.
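
The abstract describes a three-component loop: an interrogator emulates the user, a player answers in character, and a judge scores the finished dialogue. The Python sketch below illustrates one plausible shape of that loop under stated assumptions; the function names, prompts, turn count, and scoring criteria are illustrative guesses, not the paper's actual implementation.

```python
# Illustrative sketch of the player / interrogator / judge loop described
# in the abstract. Names, prompts, and criteria here are assumptions.

from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": ..., "content": ...}
ChatFn = Callable[[List[Message]], str]     # any chat-completion backend


def run_role_play_episode(
    player: ChatFn,          # model playing the assigned character
    interrogator: ChatFn,    # model emulating the user
    judge: ChatFn,           # model scoring the finished dialogue
    character_card: str,     # system prompt describing the character
    user_persona: str,       # system prompt describing the emulated user
    num_turns: int = 4,      # assumed conversation length
) -> str:
    """Run a multi-turn conversation and return the judge's verdict."""
    dialogue: List[Message] = []

    for _ in range(num_turns):
        # The interrogator sees the conversation with roles mirrored,
        # so the player's lines look like incoming messages to it.
        user_view = [{"role": "system", "content": user_persona}] + [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in dialogue
        ]
        user_msg = interrogator(user_view)
        dialogue.append({"role": "user", "content": user_msg})

        # The player answers in character.
        player_view = [{"role": "system", "content": character_card}] + dialogue
        reply = player(player_view)
        dialogue.append({"role": "assistant", "content": reply})

    # The judge rates the full transcript; the criteria wording below is
    # a placeholder for whatever rubric the benchmark actually uses.
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in dialogue)
    return judge([
        {"role": "system",
         "content": "Rate the assistant's role-play on each criterion."},
        {"role": "user", "content": transcript},
    ])
```

Any chat-completion backend can be plugged in as the three callables, so the same episode driver can compare different player models while holding the interrogator and judge fixed, which matches the multi-model evaluation setup the title suggests.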