Benchmark suites instead of leaderboards for evaluating AI fairness.

Patterns · Published 2024-11-08 · DOI: 10.1016/j.patter.2024.101080 · IF 6.7, Q1 (Computer Science, Artificial Intelligence)
Angelina Wang, Aaron Hertzmann, Olga Russakovsky
Patterns 5(11): 101080 (2024). Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11573903/pdf/
Citations: 0

Abstract

Benchmarks and leaderboards are commonly used to track the fairness impacts of artificial intelligence (AI) models. Many critics argue against this practice, since it incentivizes optimizing for metrics in an attempt to build the "most fair" AI model. However, this is an inherently impossible task since different applications have different considerations. While we agree with the critiques against leaderboards, we believe that the use of benchmarks can be reformed. Thus far, the critiques of leaderboards and benchmarks have become unhelpfully entangled. However, benchmarks, when not used for leaderboards, offer important tools for understanding a model. We advocate for collecting benchmarks into carefully curated "benchmark suites," which can provide researchers and practitioners with tools for understanding the wide range of potential harms and trade-offs among different aspects of fairness. We describe the research needed to build these benchmark suites so that they can better assess different usage modalities, cover potential harms, and reflect diverse perspectives. By moving away from leaderboards and instead thoughtfully designing and compiling benchmark suites, we can better monitor and improve the fairness impacts of AI technology.
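To make the contrast concrete, the authors' proposal can be illustrated with a minimal sketch (hypothetical, not code from the paper): a "benchmark suite" surfaces every fairness metric side by side, rather than collapsing them into a single leaderboard score. The metric choices, group labels, and data below are illustrative assumptions only.

```python
# Hypothetical sketch of a "benchmark suite" report: multiple fairness
# metrics are computed and presented together, and no single aggregate
# ranking number is produced.

def demographic_parity_gap(preds, groups):
    """Absolute difference in positive-prediction rates between groups A and B."""
    rate = lambda g: sum(p for p, grp in zip(preds, groups) if grp == g) / groups.count(g)
    return abs(rate("A") - rate("B"))

def false_negative_rate_gap(preds, labels, groups):
    """Absolute difference in false-negative rates between groups A and B."""
    def fnr(g):
        positives = [(p, y) for p, y, grp in zip(preds, labels, groups)
                     if grp == g and y == 1]
        return sum(1 for p, y in positives if p == 0) / len(positives)
    return abs(fnr("A") - fnr("B"))

def run_suite(preds, labels, groups):
    # The suite reports each metric separately; trade-offs stay visible
    # instead of being hidden inside one "fairness score".
    return {
        "demographic_parity_gap": demographic_parity_gap(preds, groups),
        "false_negative_rate_gap": false_negative_rate_gap(preds, labels, groups),
    }

# Toy binary predictions for two groups.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 1, 0, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

report = run_suite(preds, labels, groups)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

A leaderboard would rank models by one number; here a model with a small demographic-parity gap but a large false-negative-rate gap is visibly different from one with the reverse profile, which is exactly the application-dependent trade-off the authors argue a single ranking obscures.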

Source journal: Patterns (Decision Sciences, all)
CiteScore: 10.60 · Self-citation rate: 4.60% · Annual articles: 153 · Review time: 19 weeks