Trustworthy machine learning for health care: scalable data valuation with the Shapley value

Proceedings of the ACM Conference on Health, Inference, and Learning
Pub Date: 2021-04-08
DOI: 10.1145/3450439.3451861

Konstantin D. Pandl, Fabian Feiland, Scott Thiebes, A. Sunyaev

Citations: 10

Abstract

Collecting data from many sources is an essential approach to generate large data sets required for the training of machine learning models. Trustworthy machine learning requires incentives, guarantees of data quality, and information privacy. Applying recent advancements in data valuation methods for machine learning can help to enable these. In this work, we analyze the suitability of three different data valuation methods for medical image classification tasks, specifically pleural effusion, on an extensive data set of chest X-ray scans. Our results reveal that a heuristic for calculating the Shapley valuation scheme based on a k-nearest neighbor classifier can successfully value large quantities of data instances. We also demonstrate possible applications for incentivizing data sharing, the efficient detection of mislabeled data, and summarizing data sets to exclude private information. Thereby, this work contributes to developing modern data infrastructures for trustworthy machine learning in health care.
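The abstract does not spell out the k-nearest-neighbor heuristic. As a rough illustration only, the sketch below implements the closed-form KNN-Shapley recursion published by Jia et al. (2019) for an unweighted KNN classifier; whether the paper uses exactly this variant is an assumption. The function name `knn_shapley`, its parameters, and the choice of Euclidean distance on raw feature vectors are hypothetical and chosen for this example.

```python
import numpy as np


def knn_shapley(x_train, y_train, x_test, y_test, k=5):
    """Sketch of per-instance Shapley values under an unweighted KNN utility.

    Assumption: the utility of a training subset for one test point is the
    fraction of its (up to) k nearest neighbours that carry the test label.
    Returns one value per training instance, averaged over all test points.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_test = np.asarray(x_test, dtype=float)
    y_test = np.asarray(y_test)

    n = len(x_train)
    values = np.zeros(n)

    for x, y in zip(x_test, y_test):
        # Rank training points from nearest to farthest (Euclidean distance).
        distances = np.linalg.norm(x_train - x, axis=1)
        order = np.argsort(distances, kind="stable")
        match = (y_train[order] == y).astype(float)

        # Recursion runs from the farthest point back to the nearest one.
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

        # Map the sorted values back to the original training indices.
        values[order] += s

    return values / len(x_test)
```

Under this utility, a training instance that lies close to test points but carries a different label tends to receive a low or negative value, which is the property that applications such as flagging possibly mislabeled data rely on. The cost per test point is dominated by the sort, so the sketch scales to large training sets far better than enumerating subsets.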

Source journal

Proceedings of the ACM Conference on Health, Inference, and Learning

Self-citation rate

0.00%

Latest papers in this journal

Explaining a machine learning decision to physicians via counterfactuals
Rare Life Event Detection via Mobile Sensing Using Multi-Task Learning
PTGB: Pre-Train Graph Neural Networks for Brain Network Analysis
Large-Scale Study of Temporal Shift in Health Insurance Claims
Token Imbalance Adaptation for Radiology Report Generation