Syed Rakin Ahmed, Didem Egemen, Brian Befano, Ana Cecilia Rodriguez, Jose Jeronimo, Kanan Desai, Carolina Teran, Karla Alfaro, Joel Fokom-Domgue, Kittipat Charoenkwan, Chemtai Mungo, Rebecca Luckett, Rakiya Saidu, Taina Raiol, Ana Ribeiro, Julia C Gage, Silvia de Sanjose, Jayashree Kalpathy-Cramer, Mark Schiffman
Title: Assessing generalizability of an AI-based visual test for cervical cancer screening
Journal: PLOS digital health, vol. 3, no. 10, e0000364
DOI: 10.1371/journal.pdig.0000364
Published: 2024-10-02 (eCollection 2024/10/1)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446437/pdf/
Citations: 0
Abstract
A number of challenges hinder artificial intelligence (AI) models from effective clinical translation. Foremost among these challenges is the lack of generalizability, which is defined as the ability of a model to perform well on datasets that have different characteristics from the training data. We recently investigated the development of an AI pipeline on digital images of the cervix, utilizing a multi-heterogeneous dataset of 9,462 women (17,013 images) and a multi-stage model selection and optimization approach, to generate a diagnostic classifier able to classify images of the cervix into "normal", "indeterminate" and "precancer/cancer" (denoted as "precancer+") categories. In this work, we investigate the performance of this multiclass classifier on external data not utilized in training and internal validation, to assess the generalizability of the classifier when moving to new settings. We assessed both the classification performance and repeatability of our classifier model across the two axes of heterogeneity present in our dataset: image capture device and geography, utilizing both out-of-the-box inference and retraining with external data. Our results demonstrate that device-level heterogeneity affects our model performance more than geography-level heterogeneity. Classification performance of our model is strong on images from a new geography without retraining, while incremental retraining with inclusion of images from a new device progressively improves classification performance on that device up to a point of saturation. Repeatability of our model is relatively unaffected by data heterogeneity and remains strong throughout. Our work supports the need for optimized retraining approaches that address data heterogeneity (e.g., when moving to a new device) to facilitate effective use of AI models in new settings.
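The evaluation protocol the abstract describes, out-of-the-box inference on data from a new device followed by incremental retraining with growing amounts of that device's data, can be sketched in miniature. Everything below is an assumption for illustration: a toy one-dimensional nearest-centroid "classifier" and synthetic shifted data stand in for the authors' deep-learning pipeline and cervical images; only the three class labels and the shape of the protocol come from the abstract.

```python
import random

random.seed(0)

# The three output classes named in the abstract
CLASSES = ["normal", "indeterminate", "precancer+"]

def make_data(n_per_class, device_shift):
    """Synthetic 1-D feature per class; device_shift mimics device-level heterogeneity."""
    data = []
    for label, center in zip(CLASSES, (0.0, 2.0, 4.0)):
        for _ in range(n_per_class):
            data.append((center + device_shift + random.gauss(0.0, 0.5), label))
    return data

def fit_centroids(data):
    """'Train' a nearest-centroid classifier: mean feature per class."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def accuracy(centroids, data):
    """Fraction of points whose nearest class centroid matches the true label."""
    correct = sum(
        1 for x, y in data
        if min(centroids, key=lambda c: abs(centroids[c] - x)) == y
    )
    return correct / len(data)

train_device_a = make_data(100, device_shift=0.0)    # original device
test_device_b = make_data(100, device_shift=1.5)     # new device: shifted features
retrain_device_b = make_data(100, device_shift=1.5)  # new-device data for retraining
random.shuffle(retrain_device_b)

# Out-of-the-box inference on the new device (no retraining)
baseline = accuracy(fit_centroids(train_device_a), test_device_b)

# Incremental retraining: add growing fractions of new-device data
curve = []
for frac in (0.25, 0.5, 1.0):
    k = int(len(retrain_device_b) * frac)
    model = fit_centroids(train_device_a + retrain_device_b[:k])
    curve.append(accuracy(model, test_device_b))

print(f"out-of-the-box: {baseline:.2f}, retraining curve: {[round(a, 2) for a in curve]}")
```

In this toy setup the retraining curve rises as more new-device data is included; whether and where it saturates, as the authors report for their classifier, depends entirely on the model and the magnitude of the device shift.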