SimulBench：用创意模拟任务评估语言模型

arXiv - CS - Computation and Language Pub Date : 2024-09-11 DOI:arxiv-2409.07641

Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

{"title":"SimulBench：用创意模拟任务评估语言模型","authors":"Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin","doi":"arxiv-2409.07641","DOIUrl":null,"url":null,"abstract":"We introduce SimulBench, a benchmark designed to evaluate large language\nmodels (LLMs) across a diverse collection of creative simulation scenarios,\nsuch as acting as a Linux terminal or playing text games with users. While\nthese simulation tasks serve as effective measures of an LLM's general\nintelligence, they are seldom incorporated into existing benchmarks. A major\nchallenge is to develop an evaluation framework for testing different LLMs\nfairly while preserving the multi-round interactive nature of simulation tasks\nbetween users and AI. To tackle this issue, we suggest using a fixed LLM as a\nuser agent to engage with an LLM to collect dialogues first under different\ntasks. Then, challenging dialogue scripts are extracted for evaluating\ndifferent target LLMs. To facilitate automatic assessment on \\DataName{}, GPT-4\nis employed as the evaluator, tasked with reviewing the quality of the final\nresponse generated by the target LLMs given multi-turn dialogue scripts. Our\ncomprehensive experiments indicate that these simulation tasks continue to pose\na significant challenge with their unique natures and show the gap between\nproprietary models and the most advanced open LLMs. For example, GPT-4-turbo\noutperforms LLaMA-3-70b-Chat on 18.55\\% more cases.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"11 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SimulBench: Evaluating Language Models with Creative Simulation Tasks\",\"authors\":\"Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin\",\"doi\":\"arxiv-2409.07641\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce SimulBench, a benchmark designed to evaluate large language\\nmodels (LLMs) across a diverse collection of creative simulation scenarios,\\nsuch as acting as a Linux terminal or playing text games with users. While\\nthese simulation tasks serve as effective measures of an LLM's general\\nintelligence, they are seldom incorporated into existing benchmarks. A major\\nchallenge is to develop an evaluation framework for testing different LLMs\\nfairly while preserving the multi-round interactive nature of simulation tasks\\nbetween users and AI. To tackle this issue, we suggest using a fixed LLM as a\\nuser agent to engage with an LLM to collect dialogues first under different\\ntasks. Then, challenging dialogue scripts are extracted for evaluating\\ndifferent target LLMs. To facilitate automatic assessment on \\\\DataName{}, GPT-4\\nis employed as the evaluator, tasked with reviewing the quality of the final\\nresponse generated by the target LLMs given multi-turn dialogue scripts. Our\\ncomprehensive experiments indicate that these simulation tasks continue to pose\\na significant challenge with their unique natures and show the gap between\\nproprietary models and the most advanced open LLMs. For example, GPT-4-turbo\\noutperforms LLaMA-3-70b-Chat on 18.55\\\\% more cases.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"11 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.07641\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07641","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

我们介绍了 SimulBench，它是一种用于评估大型语言模型（LLM）的基准，可以在各种不同的创造性模拟场景中进行评估，例如充当 Linux 终端或与用户玩文字游戏。虽然这些模拟任务可以有效衡量 LLM 的综合智能，但它们很少被纳入现有的基准。一个主要的挑战是开发一个评估框架，在公平测试不同 LLM 的同时，保留模拟任务在用户和人工智能之间的多轮交互特性。为了解决这个问题，我们建议使用一个固定的 LLM 作为用户代理，与 LLM 进行互动，首先收集不同任务下的对话。然后，提取具有挑战性的对话脚本，用于评估不同的目标 LLM。为了便于对 \DataName{} 进行自动评估，GPT-4 被用作评估者，其任务是在给出多轮对话脚本的情况下，审查目标 LLM 生成的最终响应的质量。我们的综合实验表明，这些模拟任务以其独特的性质继续构成重大挑战，并显示了专有模型与最先进的开放式 LLM 之间的差距。例如，GPT-4-turbo 在 18.55% 以上的案例上优于 LLaMA-3-70b-Chat。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

SimulBench: Evaluating Language Models with Creative Simulation Tasks

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop an evaluation framework for testing different LLMs fairly while preserving the multi-round interactive nature of simulation tasks between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with an LLM to collect dialogues first under different tasks. Then, challenging dialogue scripts are extracted for evaluating different target LLMs. To facilitate automatic assessment on \DataName{}, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the final response generated by the target LLMs given multi-turn dialogue scripts. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge with their unique natures and show the gap between proprietary models and the most advanced open LLMs. For example, GPT-4-turbo outperforms LLaMA-3-70b-Chat on 18.55\% more cases.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

arXiv - CS - Computation and Language

自引率

0.00%

发文量