fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement

Minds and Machines · Impact Factor 4.2 · CAS Tier 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence) · Publication date: 2024-09-04 · DOI: 10.1007/s11023-024-09695-9
Dominik Bachmann, Oskar van der Wal, Edita Chvojka, Willem H. Zuidema, Leendert van Maanen, Katrin Schulz
{"title":"fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement","authors":"Dominik Bachmann, Oskar van der Wal, Edita Chvojka, Willem H. Zuidema, Leendert van Maanen, Katrin Schulz","doi":"10.1007/s11023-024-09695-9","DOIUrl":null,"url":null,"abstract":"<p>To prevent ordinary people from being harmed by natural language processing (NLP) technology, finding ways to measure the extent to which a language model is biased (e.g., regarding gender) has become an active area of research. One popular class of NLP bias measures are bias benchmark datasets—collections of test items that are meant to assess a language model’s preference for stereotypical versus non-stereotypical language. In this paper, we argue that such bias benchmarks should be assessed with models from the psychometric framework of item response theory (IRT). Specifically, we tie an introduction to basic IRT concepts and models with a discussion of how they could be relevant to the evaluation, interpretation and improvement of bias benchmark datasets. Regarding evaluation, IRT provides us with methodological tools for assessing the quality of both individual test items (e.g., the extent to which an item can differentiate highly biased from less biased language models) as well as benchmarks as a whole (e.g., the extent to which the benchmark allows us to assess not only severe but also subtle levels of model bias). Through such diagnostic tools, the quality of benchmark datasets could be improved, for example by deleting or reworking poorly performing items. Finally, in regards to interpretation, we argue that IRT models’ estimates for language model bias are conceptually superior to traditional accuracy-based evaluation metrics, as the former take into account more information than just whether or not a language model provided a biased response.</p>","PeriodicalId":51133,"journal":{"name":"Minds and Machines","volume":"40 1","pages":""},"PeriodicalIF":4.2000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Minds and Machines","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11023-024-09695-9","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

To prevent ordinary people from being harmed by natural language processing (NLP) technology, finding ways to measure the extent to which a language model is biased (e.g., regarding gender) has become an active area of research. One popular class of NLP bias measures is bias benchmark datasets—collections of test items that are meant to assess a language model’s preference for stereotypical versus non-stereotypical language. In this paper, we argue that such bias benchmarks should be assessed with models from the psychometric framework of item response theory (IRT). Specifically, we pair an introduction to basic IRT concepts and models with a discussion of how they could be relevant to the evaluation, interpretation and improvement of bias benchmark datasets. Regarding evaluation, IRT provides us with methodological tools for assessing the quality of both individual test items (e.g., the extent to which an item can differentiate highly biased from less biased language models) and benchmarks as a whole (e.g., the extent to which the benchmark allows us to assess not only severe but also subtle levels of model bias). Through such diagnostic tools, the quality of benchmark datasets could be improved, for example by deleting or reworking poorly performing items. Finally, with regard to interpretation, we argue that IRT models’ estimates of language model bias are conceptually superior to traditional accuracy-based evaluation metrics, as the former take into account more information than just whether or not a language model provided a biased response.
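
For readers unfamiliar with IRT, the sketch below (ours, not from the paper) illustrates the two-parameter logistic (2PL) model commonly used in IRT analyses, applied to hypothetical bias-benchmark responses: each item has a discrimination parameter a (how well it separates more-biased from less-biased models) and a difficulty parameter b (how biased a model must be before it tends to pick the stereotypical answer), and a language model's latent bias level θ is estimated from its full pattern of answers rather than from raw accuracy alone. All item parameters, response data, and function names here are illustrative assumptions.

```python
# Minimal 2PL IRT sketch (illustrative; hypothetical data and names).
import numpy as np
from scipy.optimize import minimize_scalar

def p_biased(theta, a, b):
    """2PL item response function: probability that a model with latent bias
    level `theta` gives the stereotypical answer to an item with
    discrimination `a` and difficulty `b`."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b):
    """Maximum-likelihood estimate of a model's latent bias, given its 0/1
    responses to the benchmark items (1 = stereotypical choice)."""
    def neg_log_lik(theta):
        p = p_biased(theta, a, b)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical benchmark of 5 items with known parameters, and one model's responses.
a = np.array([1.2, 0.8, 2.0, 0.5, 1.5])   # discrimination: ability to separate biased from unbiased models
b = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])  # difficulty: how biased a model must be to answer stereotypically
responses = np.array([1, 1, 0, 0, 0])     # 1 = stereotypical answer, 0 = non-stereotypical
print(estimate_theta(responses, a, b))    # latent bias estimate, weighting informative items more heavily
```

Because the estimate weights items by their discrimination, two models with the same number of stereotypical answers can receive different bias scores, which is the sense in which an IRT estimate uses more information than a plain accuracy-based metric.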

Source journal
Minds and Machines
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 12.60
Self-citation rate: 2.70%
Articles published: 30
Review time: >12 weeks
Journal introduction: Minds and Machines, affiliated with the Society for Machines and Mentality, serves as a platform for fostering critical dialogue between the AI and philosophical communities. With a focus on problems of shared interest, the journal actively encourages discussions on the philosophical aspects of computer science. Offering a global forum, Minds and Machines provides a space to debate and explore important and contentious issues within its editorial focus. The journal presents special editions dedicated to specific topics, invites critical responses to previously published works, and features review essays addressing current problem scenarios. By facilitating a diverse range of perspectives, Minds and Machines encourages a reevaluation of the status quo and the development of new insights. Through this collaborative approach, the journal aims to bridge the gap between AI and philosophy, fostering a tradition of critique and ensuring these fields remain connected and relevant.
Latest articles in this journal
Mapping the Ethics of Generative AI: A Comprehensive Scoping Review
A Justifiable Investment in AI for Healthcare: Aligning Ambition with Reality
fl-IRT-ing with Psychometrics to Improve NLP Bias Measurement
Artificial Intelligence for the Internal Democracy of Political Parties
A Causal Analysis of Harm