Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency

Medical Image Analysis · Published: 2024-08-12 · DOI: 10.1016/j.media.2024.103298
Tim G.W. Boers , Kiki N. Fockens , Joost A. van der Putten , Tim J.M. Jaspers , Carolus H.J. Kusters , Jelmer B. Jukema , Martijn R. Jong , Maarten R. Struyvenberg , Jeroen de Groof , Jacques J. Bergman , Peter H.N. de With , Fons van der Sommen
{"title":"Foundation models in gastrointestinal endoscopic AI: Impact of architecture, pre-training approach and data efficiency","authors":"Tim G.W. Boers ,&nbsp;Kiki N. Fockens ,&nbsp;Joost A. van der Putten ,&nbsp;Tim J.M. Jaspers ,&nbsp;Carolus H.J. Kusters ,&nbsp;Jelmer B. Jukema ,&nbsp;Martijn R. Jong ,&nbsp;Maarten R. Struyvenberg ,&nbsp;Jeroen de Groof ,&nbsp;Jacques J. Bergman ,&nbsp;Peter H.N. de With ,&nbsp;Fons van der Sommen","doi":"10.1016/j.media.2024.103298","DOIUrl":null,"url":null,"abstract":"<div><p>Pre-training deep learning models with large data sets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to <em>training from scratch</em>, due to the scarcity of high-quality medical imagery and labels. However, it is still unknown whether the learned features on natural imagery provide an optimal starting point for the downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could lead to better-suited feature representations. This study evaluates whether leveraging in-domain pre-training in gastrointestinal endoscopic image analysis has potential benefits compared to pre-training on natural images.</p><p>To this end, we present a dataset comprising of 5,014,174 gastrointestinal endoscopic images from eight different medical centers (GastroNet-5M), and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared to features learned on natural images derived with multiple methods, and variable amounts of data and/or labels (e.g. Billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The effects of the evaluation is performed on five downstream data sets, particularly designed for a variety of gastrointestinal tasks, for example, GIANA for angiodyplsia detection and Kvasir-SEG for polyp segmentation.</p><p>The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results into better performing models compared to any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, utilizing self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.</p><p>Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.), where the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO achieved on average 1.28% and 3.55% higher on the performance metrics, compared to the best performance found for pre-trained models on natural images.</p><p>Overall, this study highlights the importance of in-domain pre-training for improving the generic nature, scalability and performance of deep learning for medical image analysis. 
The GastroNet-5M pre-trained weights are made publicly available in our repository: <span><span>huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"98 ","pages":"Article 103298"},"PeriodicalIF":10.7000,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1361841524002238/pdfft?md5=25ff2f1e7dfbb3491c0a72c80dc8e023&pid=1-s2.0-S1361841524002238-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical image analysis","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1361841524002238","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Pre-training deep learning models on large datasets of natural images, such as ImageNet, has become the standard for endoscopic image analysis. This approach is generally superior to training from scratch, owing to the scarcity of high-quality medical imagery and labels. However, it remains unknown whether features learned on natural imagery provide an optimal starting point for downstream medical endoscopic imaging tasks. Intuitively, pre-training with imagery closer to the target domain could yield better-suited feature representations. This study evaluates whether in-domain pre-training offers benefits for gastrointestinal endoscopic image analysis compared with pre-training on natural images.

To this end, we present GastroNet-5M, a dataset comprising 5,014,174 gastrointestinal endoscopic images from eight different medical centers, and exploit self-supervised learning with SimCLRv2, MoCov2 and DINO to learn relevant features for in-domain downstream tasks. The learned features are compared with features learned on natural images using multiple methods and varying amounts of data and/or labels (e.g., billion-scale semi-weakly supervised learning and supervised learning on ImageNet-21k). The evaluation is performed on five downstream datasets specifically designed for a variety of gastrointestinal tasks, for example GIANA for angiodysplasia detection and Kvasir-SEG for polyp segmentation.
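
For readers unfamiliar with the contrastive pre-training frameworks mentioned above, the sketch below illustrates the core idea: two augmented views of the same unlabeled frame are pulled together by an NT-Xent loss, in the style of SimCLR. It is a minimal illustration only; the augmentations, hyper-parameters, and the `EndoscopyDataset`-style data loading are placeholders and do not reproduce the SimCLRv2, MoCov2 or DINO configurations used for GastroNet-5M.

```python
# Minimal SimCLR-style self-supervised pre-training sketch on unlabeled
# endoscopic frames. All hyper-parameters and augmentations are illustrative,
# not the settings from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms

# Two random augmented "views" of the same frame form a positive pair.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.GaussianBlur(23),
    transforms.ToTensor(),
])

class ProjectionHead(nn.Module):
    """Maps backbone features into the space where the contrastive loss is applied."""
    def __init__(self, in_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, out_dim))
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent loss over a batch of positive pairs (z1[i], z2[i])."""
    z = torch.cat([z1, z2], dim=0)            # (2N, D), already L2-normalized
    sim = z @ z.T / temperature               # pairwise cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float('-inf'))     # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

backbone = models.resnet50(weights=None)
backbone.fc = nn.Identity()                   # expose 2048-d features
head = ProjectionHead()
optimizer = torch.optim.AdamW(list(backbone.parameters()) + list(head.parameters()),
                              lr=1e-3)

# One training step on a list of unlabeled PIL images `imgs`:
#   view1 = torch.stack([augment(im) for im in imgs])
#   view2 = torch.stack([augment(im) for im in imgs])
#   loss = nt_xent(head(backbone(view1)), head(backbone(view2)))
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```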

The findings indicate that self-supervised domain-specific pre-training, specifically using the DINO framework, results in better-performing models than any supervised pre-training on natural images. On the ResNet50 and Vision-Transformer-small architectures, self-supervised in-domain pre-training with DINO leads to an average performance boost of 1.63% and 4.62%, respectively, on the downstream datasets. This improvement is measured against the best performance achieved through pre-training on natural images within any of the evaluated frameworks.
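
As a rough illustration of the downstream step, the sketch below initializes a torchvision ResNet50 from a pre-trained checkpoint and attaches a fresh classification head for fine-tuning. The checkpoint filename, the assumption that it is a plain torchvision-style state dict, and the class count are hypothetical, not values from the paper.

```python
# Hedged sketch: initialise a ResNet50 backbone from pre-trained weights and
# fine-tune a new head on a downstream (e.g. lesion classification) task.
import torch
import torch.nn as nn
from torchvision import models

def build_finetune_model(checkpoint_path: str, num_classes: int = 2) -> nn.Module:
    model = models.resnet50(weights=None)
    # Assumes the checkpoint is a plain state dict of backbone weights.
    state = torch.load(checkpoint_path, map_location="cpu")
    missing, unexpected = model.load_state_dict(state, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    # Replace the ImageNet head with a task-specific one, trained from scratch.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_finetune_model("gastronet5m_resnet50_dino.pth")  # hypothetical filename
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```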

Moreover, the in-domain pre-trained models also exhibit increased robustness against distortion perturbations (noise, contrast, blur, etc.): the in-domain pre-trained ResNet50 and Vision-Transformer-small with DINO scored on average 1.28% and 3.55% higher on the performance metrics, compared to the best performance found for models pre-trained on natural images.
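
A robustness evaluation of this kind can be approximated by re-scoring a trained model on perturbed copies of the test set. The sketch below uses generic noise, contrast, and blur transforms as stand-ins; the exact corruption types and severities used in the paper are not reproduced here.

```python
# Illustrative robustness check under distortion perturbations.
import torch
from torchvision.transforms import functional as TF

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

PERTURBATIONS = {
    "clean":    lambda x: x,
    "noise":    add_gaussian_noise,
    "contrast": lambda x: TF.adjust_contrast(x, contrast_factor=0.5),
    "blur":     lambda x: TF.gaussian_blur(x, kernel_size=9),
}

@torch.no_grad()
def robustness_accuracy(model, loader, device="cuda"):
    """Top-1 accuracy per perturbation; the drop vs. 'clean' quantifies robustness."""
    model.eval().to(device)
    results = {}
    for name, perturb in PERTURBATIONS.items():
        correct = total = 0
        for imgs, labels in loader:
            preds = model(perturb(imgs).to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
        results[name] = correct / total
    return results
```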

Overall, this study highlights the importance of in-domain pre-training for improving the generalizability, scalability and performance of deep learning for medical image analysis. The GastroNet-5M pre-trained weights are publicly available in our repository: huggingface.co/tgwboers/GastroNet-5M_Pretrained_Weights.
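
A minimal sketch of pulling the released weights from the repository above with `huggingface_hub` and loading them into a torchvision ResNet50 is shown below. The checkpoint filename and key layout inside the repository are assumptions; consult the repository's file listing for the actual names.

```python
# Download released GastroNet-5M weights and load them into a ResNet50.
import torch
from torchvision import models
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="tgwboers/GastroNet-5M_Pretrained_Weights",
    filename="resnet50_dino_gastronet5m.pth",  # hypothetical filename
)

model = models.resnet50(weights=None)
state = torch.load(ckpt_path, map_location="cpu")
# strict=False tolerates checkpoint keys (e.g. projection-head weights) that
# have no counterpart in the plain torchvision ResNet50.
model.load_state_dict(state, strict=False)
model.eval()
```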
