LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":null,"url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\nmultiple-choice questions or human annotators for large language model (LLM)\nevaluation. Their efficacy shines in evaluating long-form responses, serving a\ncritical role as evaluators of leaderboards and as proxies to align LLMs via\nreinforcement learning. However, despite their popularity, their effectiveness\noutside of English remains largely unexplored. In this paper, we conduct a\ncomprehensive analysis on automated evaluators, reporting key findings on their\nbehavior in a non-English environment. First, we discover that English\nevaluation capabilities significantly influence language-specific capabilities,\noften more than the language proficiency itself, enabling evaluators trained in\nEnglish to easily transfer their skills to other languages. Second, we identify\ncritical shortcomings, where LLMs fail to detect and penalize errors, such as\nfactual inaccuracies, cultural misrepresentations, and the presence of unwanted\nlanguage. Finally, we release Kudge, the first non-English meta-evaluation\ndataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions and human annotators for large language model (LLM) evaluation. They are particularly effective at evaluating long-form responses, serving a critical role as leaderboard evaluators and as proxies for aligning LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting key findings on their behavior in non-English environments. First, we discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, often more than proficiency in the target language itself, enabling evaluators trained in English to transfer their skills to other languages with ease. Second, we identify critical shortcomings: LLMs fail to detect and penalize errors such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset, containing 5,012 human annotations in Korean.
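
To make the setup concrete, below is a minimal sketch of the pairwise LLM-as-a-Judge pattern and of the judge-versus-human agreement measurement that a meta-evaluation dataset such as Kudge enables. Everything here is an illustrative assumption rather than the paper's protocol: `call_model` is a stand-in for whatever chat-completion API is in use, the prompt wording is generic, and the dataset schema (`instruction`, `response_a`, `response_b`, `human_label`) is hypothetical.

```python
# Illustrative sketch only: the prompt wording, labels, and dataset schema
# below are assumptions, not the protocol or data format from the paper.

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to
the instruction below and answer with exactly "A" or "B".

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Better response:"""


def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call (wire this to a chat-completion endpoint)."""
    raise NotImplementedError


def judge_pair(instruction: str, response_a: str, response_b: str) -> str | None:
    """Ask the judge model which of two responses is better ('A' or 'B')."""
    verdict = call_model(JUDGE_PROMPT.format(
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )).strip()
    return verdict if verdict in ("A", "B") else None  # drop malformed verdicts


def meta_evaluate(dataset) -> float:
    """Fraction of examples where the judge agrees with the human label.

    `dataset` is an iterable of dicts with keys 'instruction', 'response_a',
    'response_b', and 'human_label' ('A' or 'B'), a hypothetical schema for
    a pairwise meta-evaluation set such as Kudge.
    """
    matches, total = 0, 0
    for ex in dataset:
        verdict = judge_pair(ex["instruction"], ex["response_a"], ex["response_b"])
        if verdict is not None:
            matches += verdict == ex["human_label"]
            total += 1
    return matches / total if total else 0.0
```

Agreement with human annotations, as computed by `meta_evaluate`, is the usual meta-evaluation metric; a dataset of Korean human annotations makes exactly this comparison possible outside English.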