Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis

IF 15.1 | Zone 1, Medicine | Q1 HEALTH CARE SCIENCES & SERVICES | NPJ Digital Medicine | Pub Date: 2025-02-15 | DOI: 10.1038/s41746-025-01499-0

Raissa Souza, Emma A. M. Stanley, Anthony J. Winder, Chris Kang, Kimberly Amador, Erik Y. Ohara, Gabrielle Dagasso, Richard Camicioli, Oury Monchi, Zahinoor Ismail, Matthias Wilms, Nils D. Forkert


Citations: 0

Abstract



Distributed learning enables collaborative machine learning model training without requiring cross-institutional data sharing, thereby addressing privacy concerns. However, local quality control variability can negatively impact model performance while systematic human visual inspection is time-consuming and may violate the goal of keeping data inaccessible outside acquisition centers. This work proposes a novel self-supervised method to identify and eliminate harmful data during distributed learning model training fully-automatically. Harmful data is defined as samples that, when included in training, increase misdiagnosis rates. The method was tested using neuroimaging data from 83 centers for Parkinson’s disease classification with simulated inclusion of a few harmful data samples. The proposed method reliably identified harmful images, with centers providing only harmful datasets being easier to identify than single harmful images within otherwise good datasets. While only evaluated using neuroimaging data, the presented method is application-agnostic and presents a step towards automated quality control in distributed learning.
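The abstract's definition of harmful data (samples that raise misdiagnosis rates when included in training) can be made concrete with a toy example. The sketch below is only an illustration of that definition and is not the paper's self-supervised method: it assumes a hypothetical one-dimensional feature, a nearest-centroid classifier, three made-up centers named "A", "B", and "C" (with "C" carrying flipped labels), and a centrally pooled leave-one-center-out comparison that a real distributed setup would not permit.

```python
# Toy illustration of the "harmful data" definition only.
# NOT the paper's method: all data, names, and the classifier are assumptions.
from statistics import mean

# Hypothetical centers holding (feature, label) pairs; "C" has flipped labels.
centers = {
    "A": [(0.0, 0), (1.0, 0), (4.0, 1), (5.0, 1)],
    "B": [(0.5, 0), (1.5, 0), (4.5, 1), (5.5, 1)],
    "C": [(0.0, 1), (0.5, 1), (1.0, 1), (0.2, 1), (0.8, 1), (1.5, 1),
          (4.0, 0), (4.5, 0), (5.0, 0), (5.5, 0), (4.2, 0), (4.8, 0)],
}
# Hypothetical clean validation samples used to measure misclassification.
validation = [(0.3, 0), (1.2, 0), (4.2, 1), (5.3, 1)]


def fit_centroids(samples):
    """Nearest-centroid 'model': the mean feature value per class label."""
    return {label: mean(x for x, y in samples if y == label) for label in (0, 1)}


def error_rate(centroids, samples):
    """Fraction of samples whose nearest centroid has the wrong label."""
    wrong = sum(
        1 for x, y in samples
        if min(centroids, key=lambda c: abs(x - centroids[c])) != y
    )
    return wrong / len(samples)


def train_and_score(names):
    """Pool the named centers, fit the model, return the validation error."""
    pooled = [sample for name in names for sample in centers[name]]
    return error_rate(fit_centroids(pooled), validation)


# Error with every center included, then with each center left out in turn.
baseline = train_and_score(centers)
harm = {
    name: baseline - train_and_score([m for m in centers if m != name])
    for name in centers
}
# A center counts as harmful here if removing it lowers the error rate.
flagged = [name for name, score in harm.items() if score > 0]
print(harm, flagged)
```

In this toy data only center "C" is flagged, because leaving it out is the only change that lowers the validation error. The comparison needs pooled data and a trusted validation set, which is exactly the kind of access the abstract says distributed learning tries to avoid.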


Source journal

NPJ Digital Medicine Multiple-

CiteScore: 25.10
Self-citation rate: 3.30%
Articles published: 170
Review time: 15 weeks

About the journal: npj Digital Medicine is an online open-access journal that focuses on publishing peer-reviewed research in the field of digital medicine. The journal covers various aspects of digital medicine, including the application and implementation of digital and mobile technologies in clinical settings, virtual healthcare, and the use of artificial intelligence and informatics. The primary goal of the journal is to support innovation and the advancement of healthcare through the integration of new digital and mobile technologies. When determining if a manuscript is suitable for publication, the journal considers four important criteria: novelty, clinical relevance, scientific rigor, and digital innovation.