Testing the generalizability and effectiveness of deep learning models among clinics: sperm detection as a pilot study.

IF 4.2 | Zone 2 (Medicine) | JCR Q1, ENDOCRINOLOGY & METABOLISM | Reproductive Biology and Endocrinology | Pub Date: 2024-05-22 | DOI: 10.1186/s12958-024-01232-8

Jiaqi Wang, Yufei Jin, Aojun Jiang, Wenyuan Chen, Guanqiao Shan, Yifan Gu, Yue Ming, Jichang Li, Chunfeng Yue, Zongjie Huang, Clifford Librach, Ge Lin, Xibu Wang, Huan Zhao, Yu Sun, Zhuoran Zhang

Reproductive Biology and Endocrinology 22(1):59. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11110326/pdf/

Citations: 0

Abstract

Background: Deep learning has been increasingly investigated for assisting clinical in vitro fertilization (IVF). The first technical step in many tasks is to visually detect and locate sperm, oocytes, and embryos in images. For clinical deployment of such deep learning models, different clinics use different image acquisition hardware and different sample preprocessing protocols, raising the concern over whether the reported accuracy of a deep learning model by one clinic could be reproduced in another clinic. Here we aim to investigate the effect of each imaging factor on the generalizability of object detection models, using sperm analysis as a pilot example.

Methods: Ablation studies were performed using state-of-the-art models for detecting human sperm to quantitatively assess how model precision (false-positive detection) and recall (missed detection) were affected by imaging magnification, imaging mode, and sample preprocessing protocols. The results led to the hypothesis that the richness of image acquisition conditions in a training dataset deterministically affects model generalizability. The hypothesis was tested by first enriching the training dataset with a wide range of imaging conditions, then validated through internal blind tests on new samples and external multi-center clinical validations.
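The two metrics named in the Methods can be illustrated with a minimal sketch. The abstract does not say how detections were matched to annotations, so the greedy intersection-over-union (IoU) matching, the 0.5 threshold, the box format, and the function names below are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], iou_threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched annotation.

    Precision falls with false-positive detections; recall falls with
    missed detections. The threshold of 0.5 is an assumed default.
    """
    unmatched = list(truths)
    true_pos = 0
    for pred in preds:
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            unmatched.remove(best)  # each annotation can be matched once
            true_pos += 1
    false_pos = len(preds) - true_pos
    false_neg = len(truths) - true_pos
    precision = true_pos / (true_pos + false_pos) if preds else 0.0
    recall = true_pos / (true_pos + false_neg) if truths else 0.0
    return precision, recall
```

In an ablation of the kind described, one would retrain with one subset of conditions removed (for example a magnification or a preprocessing protocol) and compare these two numbers against the model trained on the full dataset.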

Results: Ablation experiments revealed that removing subsets of data from the training dataset significantly reduced model precision. Removing raw sample images from the training dataset caused the largest drop in model precision, whereas removing 20x images caused the largest drop in model recall. By incorporating different imaging and sample preprocessing conditions into a rich training dataset, the model achieved an intraclass correlation coefficient (ICC) of 0.97 (95% CI: 0.94-0.99) for precision, and an ICC of 0.97 (95% CI: 0.93-0.99) for recall. Multi-center clinical validation showed no significant differences in model precision or recall across different clinics and applications.
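For readers unfamiliar with the ICC, the sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement). The abstract does not state which ICC form was used or what served as subjects and raters, so the choice of form, the function name `icc_2_1`, and the layout of rows as test samples and columns as repeated evaluations are assumptions for illustration only.

```python
from typing import Sequence


def icc_2_1(ratings: Sequence[Sequence[float]]) -> float:
    """ICC(2,1) from an n-by-k table (rows: subjects, columns: raters).

    Assumes a complete table with n >= 2 rows, k >= 2 columns,
    and some variation in the data.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Made-up precision values (not from the paper): four hypothetical
# samples, each evaluated under two hypothetical conditions.
example = [[0.91, 0.90], [0.85, 0.86], [0.97, 0.96], [0.78, 0.80]]
print(round(icc_2_1(example), 3))
```

A value near 1 means that most of the variance comes from genuine differences between samples rather than from the evaluation condition, which is the sense in which a high ICC indicates consistent precision or recall.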

Conclusions: The results validated the hypothesis that the richness of data in the training dataset is a key factor impacting model generalizability. These findings highlight the importance of diversity in a training dataset for model evaluation and suggest that future deep learning models in andrology and reproductive medicine should incorporate comprehensive feature sets for enhanced generalizability across clinics.

Source journal: Reproductive Biology and Endocrinology (Medicine: Endocrinology & Metabolism)

CiteScore: 7.90

Self-citation rate: 2.30%

Articles published: 161

Review time: 4-8 weeks

About the journal: Reproductive Biology and Endocrinology publishes and disseminates high-quality results from excellent research in the reproductive sciences. The journal publishes on topics covering gametogenesis, fertilization, early embryonic development, embryo-uterus interaction, reproductive development, pregnancy, uterine biology, endocrinology of reproduction, control of reproduction, reproductive immunology, neuroendocrinology, and veterinary and human reproductive medicine, including all vertebrate species.