利用 ComBat 对计算病理学中的机器学习应用进行深度特征批量校正

Pierre Murchan , Pilib Ó Broin , Anne-Marie Baird , Orla Sheils , Stephen P Finn
{"title":"利用 ComBat 对计算病理学中的机器学习应用进行深度特征批量校正","authors":"Pierre Murchan ,&nbsp;Pilib Ó Broin ,&nbsp;Anne-Marie Baird ,&nbsp;Orla Sheils ,&nbsp;Stephen P Finn","doi":"10.1016/j.jpi.2024.100396","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Developing artificial intelligence (AI) models for digital pathology requires large datasets from multiple sources. However, without careful implementation, AI models risk learning confounding site-specific features in datasets instead of clinically relevant information, leading to overestimated performance, poor generalizability to real-world data, and potential misdiagnosis.</div></div><div><h3>Methods</h3><div>Whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA) colon (COAD), and stomach adenocarcinoma datasets were selected for inclusion in this study. Patch embeddings were obtained using three feature extraction models, followed by ComBat harmonization. Attention-based multiple instance learning models were trained to predict tissue-source site (TSS), as well as clinical and genetic attributes, using raw, Macenko normalized, and Combat-harmonized patch embeddings.</div></div><div><h3>Results</h3><div>TSS prediction achieved high accuracy (AUROC &gt; 0.95) with all three feature extraction models. ComBat harmonization significantly reduced the AUROC for TSS prediction, with mean AUROCs dropping to approximately 0.5 for most models, indicating successful mitigation of batch effects (e.g., CCL-ResNet50 in TCGA-COAD: Pre-ComBat AUROC = 0.960, Post-ComBat AUROC = 0.506, <em>p &lt;</em> 0.001). Clinical attributes associated with TSS, such as race and treatment response, showed decreased predictability post-harmonization. Notably, the prediction of genetic features like MSI status remained robust after harmonization (e.g., MSI in TCGA-COAD: Pre-ComBat AUROC = 0.667, Post-ComBat AUROC = 0.669, <em>p</em>=0.952), indicating the preservation of true histological signals.</div></div><div><h3>Conclusion</h3><div>ComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates. This approach is promising for the integration of large-scale digital pathology datasets.</div></div>","PeriodicalId":37769,"journal":{"name":"Journal of Pathology Informatics","volume":"15 ","pages":"Article 100396"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Deep feature batch correction using ComBat for machine learning applications in computational pathology\",\"authors\":\"Pierre Murchan ,&nbsp;Pilib Ó Broin ,&nbsp;Anne-Marie Baird ,&nbsp;Orla Sheils ,&nbsp;Stephen P Finn\",\"doi\":\"10.1016/j.jpi.2024.100396\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Background</h3><div>Developing artificial intelligence (AI) models for digital pathology requires large datasets from multiple sources. However, without careful implementation, AI models risk learning confounding site-specific features in datasets instead of clinically relevant information, leading to overestimated performance, poor generalizability to real-world data, and potential misdiagnosis.</div></div><div><h3>Methods</h3><div>Whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA) colon (COAD), and stomach adenocarcinoma datasets were selected for inclusion in this study. Patch embeddings were obtained using three feature extraction models, followed by ComBat harmonization. Attention-based multiple instance learning models were trained to predict tissue-source site (TSS), as well as clinical and genetic attributes, using raw, Macenko normalized, and Combat-harmonized patch embeddings.</div></div><div><h3>Results</h3><div>TSS prediction achieved high accuracy (AUROC &gt; 0.95) with all three feature extraction models. ComBat harmonization significantly reduced the AUROC for TSS prediction, with mean AUROCs dropping to approximately 0.5 for most models, indicating successful mitigation of batch effects (e.g., CCL-ResNet50 in TCGA-COAD: Pre-ComBat AUROC = 0.960, Post-ComBat AUROC = 0.506, <em>p &lt;</em> 0.001). Clinical attributes associated with TSS, such as race and treatment response, showed decreased predictability post-harmonization. Notably, the prediction of genetic features like MSI status remained robust after harmonization (e.g., MSI in TCGA-COAD: Pre-ComBat AUROC = 0.667, Post-ComBat AUROC = 0.669, <em>p</em>=0.952), indicating the preservation of true histological signals.</div></div><div><h3>Conclusion</h3><div>ComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates. This approach is promising for the integration of large-scale digital pathology datasets.</div></div>\",\"PeriodicalId\":37769,\"journal\":{\"name\":\"Journal of Pathology Informatics\",\"volume\":\"15 \",\"pages\":\"Article 100396\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Pathology Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S215335392400035X\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pathology Informatics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S215335392400035X","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

摘要

背景为数字病理学开发人工智能(AI)模型需要来自多个来源的大型数据集。然而,如果不仔细实施,人工智能模型就有可能学习数据集中的混杂部位特异性特征,而不是临床相关信息,从而导致性能被高估、对真实世界数据的普适性差以及潜在的误诊。使用三种特征提取模型获得斑块嵌入,然后进行 ComBat 协调。使用原始、Macenko 归一化和 Combat 协调的斑块嵌入,训练了基于注意力的多实例学习模型,以预测组织来源部位(TSS)以及临床和遗传属性。ComBat 协调大大降低了 TSS 预测的 AUROC,大多数模型的平均 AUROC 降至 0.5 左右,这表明批次效应得到了成功缓解(例如,TCGA-COAD 中的 CCL-ResNet50 模型,其平均 AUROC 降至 0.5 左右):CCL-ResNet50 in TCGA-COAD:Pre-ComBat AUROC = 0.960, Post-ComBat AUROC = 0.506, p < 0.001)。与 TSS 相关的临床属性(如种族和治疗反应)在协调后的可预测性有所下降。值得注意的是,MSI 状态等遗传特征的预测能力在协调后仍然很强(例如,TCGA-COAD 中的 MSI:ConclusionComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates.这种方法在整合大规模数字病理数据集方面大有可为。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Deep feature batch correction using ComBat for machine learning applications in computational pathology

Background

Developing artificial intelligence (AI) models for digital pathology requires large datasets from multiple sources. However, without careful implementation, AI models risk learning confounding site-specific features in datasets instead of clinically relevant information, leading to overestimated performance, poor generalizability to real-world data, and potential misdiagnosis.

Methods

Whole-slide images (WSIs) from The Cancer Genome Atlas (TCGA) colon (COAD), and stomach adenocarcinoma datasets were selected for inclusion in this study. Patch embeddings were obtained using three feature extraction models, followed by ComBat harmonization. Attention-based multiple instance learning models were trained to predict tissue-source site (TSS), as well as clinical and genetic attributes, using raw, Macenko normalized, and Combat-harmonized patch embeddings.

Results

TSS prediction achieved high accuracy (AUROC > 0.95) with all three feature extraction models. ComBat harmonization significantly reduced the AUROC for TSS prediction, with mean AUROCs dropping to approximately 0.5 for most models, indicating successful mitigation of batch effects (e.g., CCL-ResNet50 in TCGA-COAD: Pre-ComBat AUROC = 0.960, Post-ComBat AUROC = 0.506, p < 0.001). Clinical attributes associated with TSS, such as race and treatment response, showed decreased predictability post-harmonization. Notably, the prediction of genetic features like MSI status remained robust after harmonization (e.g., MSI in TCGA-COAD: Pre-ComBat AUROC = 0.667, Post-ComBat AUROC = 0.669, p=0.952), indicating the preservation of true histological signals.

Conclusion

ComBat harmonization of deep learning-derived histology features effectively reduces the risk of AI models learning confounding features in WSIs, ensuring more reliable performance estimates. This approach is promising for the integration of large-scale digital pathology datasets.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Pathology Informatics
Journal of Pathology Informatics Medicine-Pathology and Forensic Medicine
CiteScore
3.70
自引率
0.00%
发文量
2
审稿时长
18 weeks
期刊介绍: The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.
期刊最新文献
Improving the generalizability of white blood cell classification with few-shot domain adaptation Pathology Informatics Summit 2024 Abstracts Ann Arbor Marriott at Eagle Crest Resort May 20-23, 2024 Ann Arbor, Michigan Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images. Enhancing human phenotype ontology term extraction through synthetic case reports and embedding-based retrieval: A novel approach for improved biomedical data annotation. Prioritizing cases from a multi-institutional cohort for a dataset of pathologist annotations.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1