LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

Minsuk Kahng, Ian Tenney, Mahima Pushkarna, Michael Xieyang Liu, James Wexler, Emily Reif, Krystal Kallarackal, Minsuk Chang, Michael Terry, Lucas Dixon
{"title":"LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models","authors":"Minsuk Kahng;Ian Tenney;Mahima Pushkarna;Michael Xieyang Liu;James Wexler;Emily Reif;Krystal Kallarackal;Minsuk Chang;Michael Terry;Lucas Dixon","doi":"10.1109/TVCG.2024.3456354","DOIUrl":null,"url":null,"abstract":"Evaluating large language models (LLMs) presents unique challenges. While automatic side-by-side evaluation, also known as LLM-as-a-judge, has become a promising solution, model developers and researchers face difficulties with scalability and interpretability when analyzing these evaluation outcomes. To address these challenges, we introduce LLM Comparator, a new visual analytics tool designed for side-by-side evaluations of LLMs. This tool provides analytical workflows that help users understand when and why one LLM outperforms or underperforms another, and how their responses differ. Through close collaboration with practitioners developing LLMs at Google, we have iteratively designed, developed, and refined the tool. Qualitative feedback from these users highlights that the tool facilitates in-depth analysis of individual examples while enabling users to visually overview and flexibly slice data. This empowers users to identify undesirable patterns, formulate hypotheses about model behavior, and gain insights for model improvement. LLM Comparator has been integrated into Google's LLM evaluation platforms and open-sourced.","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"31 1","pages":"503-513"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10670495","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10670495/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Evaluating large language models (LLMs) presents unique challenges. While automatic side-by-side evaluation, also known as LLM-as-a-judge, has become a promising solution, model developers and researchers face difficulties with scalability and interpretability when analyzing these evaluation outcomes. To address these challenges, we introduce LLM Comparator, a new visual analytics tool designed for side-by-side evaluations of LLMs. This tool provides analytical workflows that help users understand when and why one LLM outperforms or underperforms another, and how their responses differ. Through close collaboration with practitioners developing LLMs at Google, we have iteratively designed, developed, and refined the tool. Qualitative feedback from these users highlights that the tool facilitates in-depth analysis of individual examples while enabling users to visually overview and flexibly slice data. This empowers users to identify undesirable patterns, formulate hypotheses about model behavior, and gain insights for model improvement. LLM Comparator has been integrated into Google's LLM evaluation platforms and open-sourced.
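To make the kind of data the tool analyzes concrete, the following is a minimal sketch (in Python, not LLM Comparator's actual data model or API) of a pairwise, LLM-as-a-judge evaluation record and a simple per-category aggregation of the judge's verdicts. The record fields, scoring convention, and the `win_rates_by_category` helper are illustrative assumptions, chosen only to show how side-by-side outcomes might be sliced by prompt category.

```python
# Illustrative sketch only: one record per prompt, holding the two model
# responses and an LLM judge's verdict. Not LLM Comparator's real schema.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class SideBySideRecord:
    prompt: str
    category: str          # prompt category used for slicing (assumed field)
    response_a: str        # response from model A
    response_b: str        # response from model B
    judge_score: float     # convention assumed here: >0 favors A, <0 favors B, 0 is a tie
    judge_rationale: str   # free-text explanation from the judge model


def win_rates_by_category(records):
    """Aggregate per-category win/tie/loss counts for model A (hypothetical helper)."""
    counts = defaultdict(lambda: {"a_wins": 0, "ties": 0, "b_wins": 0})
    for r in records:
        if r.judge_score > 0:
            counts[r.category]["a_wins"] += 1
        elif r.judge_score < 0:
            counts[r.category]["b_wins"] += 1
        else:
            counts[r.category]["ties"] += 1
    return dict(counts)


if __name__ == "__main__":
    records = [
        SideBySideRecord("Summarize this email.", "summarization",
                         "Short summary...", "Long summary...",
                         1.0, "Response A is more concise."),
        SideBySideRecord("Write a haiku about rain.", "creative",
                         "Rain falls softly...", "Drops on the window...",
                         -0.5, "Response B follows the 5-7-5 form."),
    ]
    print(win_rates_by_category(records))
```

Grouping judge verdicts by a slice key like this is one way to surface "when and why one LLM outperforms another"; the paper's tool additionally supports interactive overviews and drill-down into individual examples.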