Assessing racial bias in healthcare predictive models: Practical lessons from an empirical evaluation of 30-day hospital readmission models

Journal of Biomedical Informatics (IF 4.5; Zone 2, Medicine; JCR Q2, Computer Science, Interdisciplinary Applications). Pub Date: 2024-06-24. DOI: 10.1016/j.jbi.2024.104683

H. Echo Wang, Jonathan P. Weiner, Suchi Saria, Harold Lehmann, Hadi Kharrazi

Journal of Biomedical Informatics, volume 156, Article 104683. Full text: https://www.sciencedirect.com/science/article/pii/S1532046424001011

Citation count: 0

Abstract

Objective

Despite the increased availability of methodologies to identify algorithmic bias, the operationalization of bias evaluation for healthcare predictive models is still limited. Therefore, this study proposes a process for bias evaluation through an empirical assessment of common hospital readmission models. The process includes selecting bias measures, interpreting them, determining the impact of disparities, and identifying potential mitigations.

Methods

This retrospective analysis evaluated the racial bias of four common models predicting 30-day unplanned readmission: the LACE Index, the HOSPITAL Score, and the CMS readmission measure, both applied as is and retrained. The models were assessed using 2.4 million adult inpatient discharges in Maryland from 2016 to 2019. Fairness metrics that are model-agnostic, easy to compute, and interpretable were implemented and appraised to select the most appropriate bias measures. The impact of changing the models’ risk thresholds on these measures was then assessed to guide the selection of optimal thresholds to control and mitigate bias.
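The threshold analysis described above can be illustrated with a minimal Python sketch. It is not the study's code: the function names (`rates_at_threshold`, `sweep`), the group labels, the integer risk scores, and the candidate thresholds are all hypothetical toy values chosen for illustration. For each threshold, the sketch binarizes the scores and reports overall accuracy alongside the between-group gaps in false negative rate and false positive rate.

```python
NAN = float("nan")


def rates_at_threshold(scores, y_true, threshold):
    """Flag score >= threshold as high risk; return (fnr, fpr, accuracy)."""
    tp = fp = tn = fn = 0
    for score, y in zip(scores, y_true):
        predicted = score >= threshold
        if y and predicted:
            tp += 1
        elif y and not predicted:
            fn += 1
        elif not y and predicted:
            fp += 1
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else NAN
    fpr = fp / (fp + tn) if (fp + tn) else NAN
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return fnr, fpr, accuracy


def sweep(scores, y_true, groups, thresholds, group_a, group_b):
    """For each threshold, report overall accuracy and group_a - group_b gaps."""
    rows = []
    for t in thresholds:
        _, _, accuracy = rates_at_threshold(scores, y_true, t)
        per_group = {}
        for g in (group_a, group_b):
            s = [sc for sc, grp in zip(scores, groups) if grp == g]
            y = [yt for yt, grp in zip(y_true, groups) if grp == g]
            per_group[g] = rates_at_threshold(s, y, t)
        rows.append({
            "threshold": t,
            "accuracy": accuracy,
            "fnr_gap": per_group[group_a][0] - per_group[group_b][0],
            "fpr_gap": per_group[group_a][1] - per_group[group_b][1],
        })
    return rows


# Hypothetical toy data: integer risk scores, observed readmissions, group labels.
scores = [12, 8, 5, 7, 14, 11, 10, 3]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rows = sweep(scores, y_true, groups, thresholds=[8, 10, 12],
             group_a="A", group_b="B")
for row in rows:
    print(row)
```

In this toy data the lowest threshold gives the highest accuracy but a nonzero FPR gap, while the highest threshold removes both gaps at lower accuracy, which is the kind of trade-off a threshold sweep is meant to surface.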

Results

Four bias measures were selected for the predictive task: zero-one-loss difference, false negative rate (FNR) parity, false positive rate (FPR) parity, and generalized entropy index. Based on these measures, the HOSPITAL Score and the retrained CMS measure demonstrated the lowest racial bias. White patients had a higher FNR, while Black patients had a higher FPR and zero-one loss. As the models’ risk thresholds changed, trade-offs between fairness and overall performance were observed, and the assessment showed that all models’ default thresholds were reasonable for balancing accuracy and bias.
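As a rough illustration of the four measures named above, the following Python sketch computes per-group FNR, FPR, and zero-one loss, plus a generalized entropy index. The abstract does not give the study's exact definitions, so this assumes one commonly used formulation: individual benefit b = prediction − label + 1, and GE(α) = Σ[(b/μ)^α − 1] / (n·α·(α − 1)) with α = 2. The function names, group labels, and toy data are hypothetical.

```python
from collections import defaultdict

NAN = float("nan")


def group_rates(y_true, y_pred, groups):
    """Return {group: {"fnr", "fpr", "zero_one_loss"}} for binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for y, p, g in zip(y_true, y_pred, groups):
        if y:
            counts[g]["tp" if p else "fn"] += 1
        else:
            counts[g]["fp" if p else "tn"] += 1
    rates = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[g] = {
            "fnr": c["fn"] / positives if positives else NAN,
            "fpr": c["fp"] / negatives if negatives else NAN,
            "zero_one_loss": (c["fp"] + c["fn"]) / (positives + negatives),
        }
    return rates


def generalized_entropy_index(y_true, y_pred, alpha=2):
    """Assumed formulation: benefit b = pred - label + 1; alpha not in {0, 1}.

    Undefined (division by zero) if every benefit is 0, i.e. all false negatives.
    """
    benefits = [p - y + 1 for y, p in zip(y_true, y_pred)]
    n = len(benefits)
    mu = sum(benefits) / n
    return sum((b / mu) ** alpha - 1 for b in benefits) / (n * alpha * (alpha - 1))


# Hypothetical toy data: observed readmissions, binary predictions, group labels.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

rates = group_rates(y_true, y_pred, groups)
gaps = {m: rates["A"][m] - rates["B"][m] for m in ("fnr", "fpr", "zero_one_loss")}
gei = generalized_entropy_index(y_true, y_pred)
print(rates)
print(gaps)
print(gei)
```

The toy data also shows why several measures are reported together: the two groups have the same zero-one loss, yet one has the higher FNR and the other the higher FPR.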

Conclusions

This study proposes an Applied Framework to Assess Fairness of Predictive Models (AFAFPM) and demonstrates the process using 30-day hospital readmission models as the example. It suggests the feasibility of applying algorithmic bias assessment to determine optimized risk thresholds so that predictive models can be used more equitably and accurately. A combination of qualitative and quantitative methods and a multidisciplinary team are necessary to identify, understand, and respond to algorithmic bias in real-world healthcare settings. Users should also apply multiple bias measures to ensure a more comprehensive, tailored, and balanced view. The results of bias measures, however, must be interpreted with caution and in light of the larger operational, clinical, and policy context.




Source journal

Journal of Biomedical Informatics (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 8.90

Self-citation rate: 6.70%

Articles published: 243

Review time: 32 days

About the journal: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.