Interpretable Bias Mitigation for Textual Data: Reducing Genderization in Patient Notes While Maintaining Classification Performance

J. Minot, N. Cheney, Marc E. Maier, Danne C. Elbers, C. Danforth, P. Dodds

ACM Transactions on Computing for Healthcare, pp. 1–41. Published 2021-03-10. DOI: 10.1145/3524887
{"title":"文本数据的可解释偏见缓解:在保持分类性能的同时减少患者笔记中的性别化","authors":"J. Minot, N. Cheney, Marc E. Maier, Danne C. Elbers, C. Danforth, P. Dodds","doi":"10.1145/3524887","DOIUrl":null,"url":null,"abstract":"Medical systems in general, and patient treatment decisions and outcomes in particular, can be affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models—statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how differences in gender-specific word frequency distributions and language models interact with regards to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low- to medium-levels of dataset bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce biases in natural language processing pipelines.","PeriodicalId":72043,"journal":{"name":"ACM transactions on computing for healthcare","volume":"240 1","pages":"1 - 41"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Interpretable Bias Mitigation for Textual Data: Reducing Genderization in Patient Notes While Maintaining Classification Performance\",\"authors\":\"J. Minot, N. Cheney, Marc E. Maier, Danne C. Elbers, C. Danforth, P. Dodds\",\"doi\":\"10.1145/3524887\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Medical systems in general, and patient treatment decisions and outcomes in particular, can be affected by bias based on gender and other demographic elements. As language models are increasingly applied to medicine, there is a growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models—statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how differences in gender-specific word frequency distributions and language models interact with regards to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low- to medium-levels of dataset bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. 
This work outlines an interpretable approach for using data augmentation to identify and reduce biases in natural language processing pipelines.\",\"PeriodicalId\":72043,\"journal\":{\"name\":\"ACM transactions on computing for healthcare\",\"volume\":\"240 1\",\"pages\":\"1 - 41\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-03-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM transactions on computing for healthcare\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3524887\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM transactions on computing for healthcare","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3524887","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Interpretable Bias Mitigation for Textual Data: Reducing Genderization in Patient Notes While Maintaining Classification Performance
Medical systems in general, and patient treatment decisions and outcomes in particular, can be affected by bias based on gender and other demographic factors. As language models are increasingly applied to medicine, there is growing interest in building algorithmic fairness into processes impacting patient care. Much of the work addressing this question has focused on biases encoded in language models: statistical estimates of the relationships between concepts derived from distant reading of corpora. Building on this work, we investigate how differences in gender-specific word frequency distributions and language models interact with regard to bias. We identify and remove gendered language from two clinical-note datasets and describe a new debiasing procedure using BERT-based gender classifiers. We show minimal degradation in health condition classification tasks for low to medium levels of dataset bias removal via data augmentation. Finally, we compare the bias semantically encoded in the language models with the bias empirically observed in health records. This work outlines an interpretable approach for using data augmentation to identify and reduce biases in natural language processing pipelines.
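To make the first step of the procedure concrete, here is a minimal sketch in Python of removing gendered language from a clinical note. The GENDERED_TERMS list and the remove_gendered_language function are illustrative assumptions, not the authors' code: the paper derives its gendered-word lists from gender-specific word frequency distributions rather than a fixed lexicon.

```python
import re

# Hypothetical, minimal gendered-term list for illustration only; the
# paper builds its lists from gender-specific word frequency
# distributions, not a hand-picked lexicon like this one.
GENDERED_TERMS = [
    "he", "she", "him", "her", "his", "hers",
    "man", "woman", "male", "female", "mr", "mrs", "ms",
]

# Match any listed term as a whole word, case-insensitively.
_PATTERN = re.compile(r"\b(?:" + "|".join(GENDERED_TERMS) + r")\b", re.IGNORECASE)

def remove_gendered_language(note: str) -> str:
    """Return a copy of the note with gendered tokens removed,
    collapsing any whitespace left behind."""
    stripped = _PATTERN.sub("", note)
    return re.sub(r"\s{2,}", " ", stripped).strip()

if __name__ == "__main__":
    print(remove_gendered_language(
        "Mr Smith reports chest pain; he denies shortness of breath."
    ))
    # -> "Smith reports chest pain; denies shortness of breath."
```

Applied to every note in a dataset, this yields a debiased variant of the corpus, which is the data-augmentation ingredient the abstract refers to.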
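A second hedged sketch illustrates the evaluation idea: retrain a gender classifier on the debiased notes and check how much predictive signal remains. The paper uses BERT-based gender classifiers; to keep this sketch small and runnable, a bag-of-words logistic regression stands in for BERT, and the gender_signal helper and toy mini-corpus are fabricated for illustration. It reuses remove_gendered_language from the sketch above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def gender_signal(notes: list[str], genders: list[int]) -> float:
    """Mean cross-validated accuracy of predicting recorded gender from
    note text; accuracy near chance (0.5) suggests little residual
    gender signal in the notes."""
    features = CountVectorizer().fit_transform(notes)
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, features, genders, cv=3).mean()

# Toy, fabricated mini-corpus (labels: 1 = female, 0 = male).
raw_notes = [
    "she reports headache", "he reports nausea",
    "she denies fever", "he denies cough",
    "she has migraine", "he has vertigo",
]
labels = [1, 0, 1, 0, 1, 0]
debiased_notes = [remove_gendered_language(n) for n in raw_notes]

# Accuracy should drop toward chance once gendered tokens are removed.
print(gender_signal(raw_notes, labels), gender_signal(debiased_notes, labels))
```

The same comparison run on the downstream health-condition labels, rather than gender, is what lets the authors verify that classification performance is maintained while gender signal is reduced.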