Publicly available datasets of breast histopathology H&E whole-slide images: A scoping review

Journal of Pathology Informatics (JCR Q2, Medicine) | Pub Date: 2024-02-01 | DOI: 10.1016/j.jpi.2024.100363

Masoud Tafavvoghi, Lars Ailo Bongo, Nikita Shvetsov, Lill-Tove Rasmussen Busund, Kajsa Møllersen

Journal of Pathology Informatics, volume 15, Article 100363 (2024). Full text: https://www.sciencedirect.com/science/article/pii/S2153353924000026

Citation count: 0

Abstract

Advances in digital pathology and computing resources have had a significant impact on computational pathology for breast cancer diagnosis and treatment. However, limited access to high-quality labeled histopathological images of breast cancer remains a major challenge that hinders the development of accurate and robust deep learning models. In this scoping review, we identified the publicly available datasets of breast H&E-stained whole-slide images (WSIs) that can be used to develop deep learning algorithms. We systematically searched 9 scientific literature databases and 9 research data repositories and found 17 publicly available datasets containing 10 385 H&E WSIs of breast cancer. We also report image metadata and characteristics for each dataset to help researchers select suitable datasets for specific tasks in breast cancer computational pathology. In addition, we compiled 2 lists, of breast H&E patches and of private datasets, as supplementary resources for researchers. Notably, only 28% of the included articles used multiple datasets, and only 14% used an external validation set, suggesting that the performance of other developed models may be susceptible to overestimation. The TCGA-BRCA dataset was used in 52% of the selected studies; it has considerable selection bias that can affect the robustness and generalizability of trained algorithms. Metadata reporting for breast WSI datasets is also inconsistent, which can hinder the development of accurate deep learning models and indicates the need for explicit guidelines for documenting breast WSI dataset characteristics and metadata.




Source journal

Journal of Pathology Informatics (Medicine: Pathology and Forensic Medicine)

CiteScore

3.70

Self-citation rate

0.00%

Articles published

Review time

18 weeks

About the journal: The Journal of Pathology Informatics (JPI) is an open access peer-reviewed journal dedicated to the advancement of pathology informatics. This is the official journal of the Association for Pathology Informatics (API). The journal aims to publish broadly about pathology informatics and freely disseminate all articles worldwide. This journal is of interest to pathologists, informaticians, academics, researchers, health IT specialists, information officers, IT staff, vendors, and anyone with an interest in informatics. We encourage submissions from anyone with an interest in the field of pathology informatics. We publish all types of papers related to pathology informatics including original research articles, technical notes, reviews, viewpoints, commentaries, editorials, symposia, meeting abstracts, book reviews, and correspondence to the editors. All submissions are subject to rigorous peer review by the well-regarded editorial board and by expert referees in appropriate specialties.