{"title":"Benchmark suites instead of leaderboards for evaluating AI fairness.","authors":"Angelina Wang, Aaron Hertzmann, Olga Russakovsky","doi":"10.1016/j.patter.2024.101080","DOIUrl":null,"url":null,"abstract":"<p><p>Benchmarks and leaderboards are commonly used to track the fairness impacts of artificial intelligence (AI) models. Many critics argue against this practice, since it incentivizes optimizing for metrics in an attempt to build the \"most fair\" AI model. However, this is an inherently impossible task since different applications have different considerations. While we agree with the critiques against leaderboards, we believe that the use of benchmarks can be reformed. Thus far, the critiques of leaderboards and benchmarks have become unhelpfully entangled. However, benchmarks, when not used for leaderboards, offer important tools for understanding a model. We advocate for collecting benchmarks into carefully curated \"benchmark suites,\" which can provide researchers and practitioners with tools for understanding the wide range of potential harms and trade-offs among different aspects of fairness. We describe the research needed to build these benchmark suites so that they can better assess different usage modalities, cover potential harms, and reflect diverse perspectives. By moving away from leaderboards and instead thoughtfully designing and compiling benchmark suites, we can better monitor and improve the fairness impacts of AI technology.</p>","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"5 11","pages":"101080"},"PeriodicalIF":6.7000,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11573903/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Patterns","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.patter.2024.101080","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Benchmarks and leaderboards are commonly used to track the fairness impacts of artificial intelligence (AI) models. Many critics argue against this practice, since it incentivizes optimizing for metrics in an attempt to build the "most fair" AI model. However, this is an inherently impossible task since different applications have different considerations. While we agree with the critiques against leaderboards, we believe that the use of benchmarks can be reformed. Thus far, the critiques of leaderboards and benchmarks have become unhelpfully entangled. However, benchmarks, when not used for leaderboards, offer important tools for understanding a model. We advocate for collecting benchmarks into carefully curated "benchmark suites," which can provide researchers and practitioners with tools for understanding the wide range of potential harms and trade-offs among different aspects of fairness. We describe the research needed to build these benchmark suites so that they can better assess different usage modalities, cover potential harms, and reflect diverse perspectives. By moving away from leaderboards and instead thoughtfully designing and compiling benchmark suites, we can better monitor and improve the fairness impacts of AI technology.
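To make the contrast concrete, the following is a minimal hypothetical sketch, not taken from the paper, of the distinction the abstract draws: a benchmark suite reports a profile of per-benchmark results for practitioners to inspect, rather than collapsing everything into a single leaderboard score. All class names, benchmark names, and scores below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sketch of a "benchmark suite": it returns per-benchmark scores
# and deliberately provides no single aggregate number to rank models by.
# Every name and value here is an illustrative assumption, not an artifact
# from the paper.

@dataclass
class Benchmark:
    name: str                             # which potential harm or usage modality it probes
    evaluate: Callable[[object], float]   # scores one model on this benchmark

class BenchmarkSuite:
    def __init__(self, benchmarks: List[Benchmark]):
        self.benchmarks = benchmarks

    def profile(self, model: object) -> Dict[str, float]:
        """Return the full per-benchmark profile; no aggregation, no ranking."""
        return {b.name: b.evaluate(model) for b in self.benchmarks}

# Usage with stand-in evaluation functions (constants in place of real metrics):
suite = BenchmarkSuite([
    Benchmark("representational_harm_probe", lambda m: 0.12),
    Benchmark("allocational_gap_probe", lambda m: 0.07),
    Benchmark("subgroup_robustness_probe", lambda m: 0.31),
])
print(suite.profile(model=None))  # practitioners read the whole profile, not a rank
```

The design choice the sketch encodes is the one the abstract argues for: because different applications weigh fairness considerations differently, the suite exposes the trade-offs across benchmarks instead of implying that a single "most fair" model exists.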