LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong
{"title":"LLM-as-a-Judge & Reward Model: What They Can and Cannot Do","authors":"Guijin Son, Hyunwoo Ko, Hoyoung Lee, Yewon Kim, Seunghyeok Hong","doi":"arxiv-2409.11239","DOIUrl":null,"url":null,"abstract":"LLM-as-a-Judge and reward models are widely used alternatives of\nmultiple-choice questions or human annotators for large language model (LLM)\nevaluation. Their efficacy shines in evaluating long-form responses, serving a\ncritical role as evaluators of leaderboards and as proxies to align LLMs via\nreinforcement learning. However, despite their popularity, their effectiveness\noutside of English remains largely unexplored. In this paper, we conduct a\ncomprehensive analysis on automated evaluators, reporting key findings on their\nbehavior in a non-English environment. First, we discover that English\nevaluation capabilities significantly influence language-specific capabilities,\noften more than the language proficiency itself, enabling evaluators trained in\nEnglish to easily transfer their skills to other languages. Second, we identify\ncritical shortcomings, where LLMs fail to detect and penalize errors, such as\nfactual inaccuracies, cultural misrepresentations, and the presence of unwanted\nlanguage. Finally, we release Kudge, the first non-English meta-evaluation\ndataset containing 5,012 human annotations in Korean.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

LLM-as-a-Judge and reward models are widely used alternatives to multiple-choice questions and human annotators for large language model (LLM) evaluation. They are particularly effective at evaluating long-form responses, serving a critical role as leaderboard evaluators and as proxies for aligning LLMs via reinforcement learning. However, despite their popularity, their effectiveness outside of English remains largely unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting key findings on their behavior in non-English environments. First, we discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, often more than proficiency in the target language itself, enabling evaluators trained in English to transfer their skills to other languages with ease. Second, we identify critical shortcomings: LLMs fail to detect and penalize errors such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we release Kudge, the first non-English meta-evaluation dataset, containing 5,012 human annotations in Korean.
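
To make the setup concrete, below is a minimal sketch of the pairwise LLM-as-a-Judge pattern and of the judge-versus-human agreement measurement that a meta-evaluation dataset such as Kudge enables. Everything here is an illustrative assumption rather than the paper's protocol: `call_model` is a stand-in for whatever chat-completion API is in use, the prompt wording is generic, and the dataset schema (`instruction`, `response_a`, `response_b`, `human_label`) is hypothetical.

```python
# Illustrative sketch only: the prompt wording, labels, and dataset schema
# below are assumptions, not the protocol or data format from the paper.

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to
the instruction below and answer with exactly "A" or "B".

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Better response:"""


def call_model(prompt: str) -> str:
    """Stand-in for an LLM API call (wire this to a chat-completion endpoint)."""
    raise NotImplementedError


def judge_pair(instruction: str, response_a: str, response_b: str) -> str | None:
    """Ask the judge model which of two responses is better ('A' or 'B')."""
    verdict = call_model(JUDGE_PROMPT.format(
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )).strip()
    return verdict if verdict in ("A", "B") else None  # drop malformed verdicts


def meta_evaluate(dataset) -> float:
    """Fraction of examples where the judge agrees with the human label.

    `dataset` is an iterable of dicts with keys 'instruction', 'response_a',
    'response_b', and 'human_label' ('A' or 'B'), a hypothetical schema for
    a pairwise meta-evaluation set such as Kudge.
    """
    matches, total = 0, 0
    for ex in dataset:
        verdict = judge_pair(ex["instruction"], ex["response_a"], ex["response_b"])
        if verdict is not None:
            matches += verdict == ex["human_label"]
            total += 1
    return matches / total if total else 0.0
```

Agreement with human annotations, as computed by `meta_evaluate`, is the usual meta-evaluation metric; a dataset of Korean human annotations makes exactly this comparison possible outside English.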