Evaluating and Addressing Demographic Disparities in Medical Large Language Models: A Systematic Review

medRxiv - Health Systems and Quality Improvement. Pub Date: 2024-09-09. DOI: 10.1101/2024.09.09.24313295

Mahmud Omar, Vera Sorin, Donald U Apakama, Ali Soroush, Ankit Sakhuja, Robert Freeman, Carol R Horowitz, Lynne D Richardson, Girish Nadkarni, Eyal Klang



Abstract

Background: Large language models (LLMs) are increasingly evaluated for use in healthcare. However, concerns about their impact on disparities persist. This study reviews current research on demographic biases in LLMs to identify prevalent bias types, assess measurement methods, and evaluate mitigation strategies.

Methods: We conducted a systematic review, searching publications from January 2018 to July 2024 across five databases. We included peer-reviewed studies evaluating demographic biases in LLMs, focusing on gender, race, ethnicity, age, and other factors. Study quality was assessed using the Joanna Briggs Institute Critical Appraisal Tools.

Results: Our review included 24 studies. Of these, 22 (91.7%) identified biases in LLMs. Gender bias was the most prevalent, reported in 15 of 16 studies (93.7%). Racial or ethnic biases were observed in 10 of 11 studies (90.9%). Only two studies found minimal or no bias in certain contexts. Mitigation strategies mainly included prompt engineering, with varying effectiveness. However, these findings are tempered by a potential publication bias, as studies with negative results are less frequently published.

Conclusion: Biases are observed in LLMs across various medical domains. While bias detection is improving, effective mitigation strategies are still developing. As LLMs increasingly influence critical decisions, addressing these biases and their resultant disparities is essential for ensuring fair AI systems. Future research should focus on a wider range of demographic factors, intersectional analyses, and non-Western cultural contexts.
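The proportions quoted in the Results can be recomputed from the study counts given in the abstract. The following is a minimal sketch for checking that arithmetic only; the dictionary labels and the `percent` helper are illustrative names, not part of the review's methods.

```python
# Counts (studies reporting bias, studies examined) as stated in the abstract.
# The labels and the helper below are hypothetical, chosen for illustration.
reported = {
    "any bias": (22, 24),
    "gender bias": (15, 16),
    "racial or ethnic bias": (10, 11),
}


def percent(found: int, total: int) -> float:
    """Return the share of studies reporting bias, as a percentage."""
    return 100 * found / total


for label, (found, total) in reported.items():
    print(f"{label}: {found}/{total} = {percent(found, total):.2f}%")
```

Note that 15 of 16 is exactly 93.75%, so the 93.7% in the abstract is consistent with truncating to one decimal place; rounding would give 93.8%.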


