利用计算压力测试评估深度学习骨龄算法对临床图像变化的鲁棒性。

IF 8.1 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Radiology-Artificial Intelligence Pub Date : 2024-05-01 DOI:10.1148/ryai.230240
Samantha M Santomartino, Kristin Putman, Elham Beheshtian, Vishwa S Parekh, Paul H Yi
{"title":"利用计算压力测试评估深度学习骨龄算法对临床图像变化的鲁棒性。","authors":"Samantha M Santomartino, Kristin Putman, Elham Beheshtian, Vishwa S Parekh, Paul H Yi","doi":"10.1148/ryai.230240","DOIUrl":null,"url":null,"abstract":"<p><p>Purpose To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance. Materials and Methods In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational \"stress tests\" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MADs) of predicted bone ages compared with radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportion of clinically significant errors (CSEs) was compared using McNemar tests. Results There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; <i>P</i> = .05), indicating good model generalization to external data. Except for the RSNA dataset images with an appended radiologic laterality marker (<i>P</i> = .86), there were significant differences in MAD for both the DHA and RSNA datasets among other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in proportion of CSEs for 57% of the image transformations (19 of 33) performed on the DHA dataset. Conclusion Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance. <b>Keywords:</b> Pediatrics, Hand, Convolutional Neural Network, Radiography <i>Supplemental material is available for this article.</i> © RSNA, 2024 See also commentary by Faghani and Erickson in this issue.</p>","PeriodicalId":29787,"journal":{"name":"Radiology-Artificial Intelligence","volume":null,"pages":null},"PeriodicalIF":8.1000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11140516/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Robustness of a Deep Learning Bone Age Algorithm to Clinical Image Variation Using Computational Stress Testing.\",\"authors\":\"Samantha M Santomartino, Kristin Putman, Elham Beheshtian, Vishwa S Parekh, Paul H Yi\",\"doi\":\"10.1148/ryai.230240\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Purpose To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance. Materials and Methods In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational \\\"stress tests\\\" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MADs) of predicted bone ages compared with radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportion of clinically significant errors (CSEs) was compared using McNemar tests. Results There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; <i>P</i> = .05), indicating good model generalization to external data. Except for the RSNA dataset images with an appended radiologic laterality marker (<i>P</i> = .86), there were significant differences in MAD for both the DHA and RSNA datasets among other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in proportion of CSEs for 57% of the image transformations (19 of 33) performed on the DHA dataset. Conclusion Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance. <b>Keywords:</b> Pediatrics, Hand, Convolutional Neural Network, Radiography <i>Supplemental material is available for this article.</i> © RSNA, 2024 See also commentary by Faghani and Erickson in this issue.</p>\",\"PeriodicalId\":29787,\"journal\":{\"name\":\"Radiology-Artificial Intelligence\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":8.1000,\"publicationDate\":\"2024-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11140516/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Radiology-Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1148/ryai.230240\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Radiology-Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1148/ryai.230240","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

"刚刚接受 "的论文经过同行评审,已被接受在《放射学》上发表:人工智能》上发表。这篇文章在以最终版本发表之前,还将经过校对、排版和校对审核。请注意,在制作最终校对稿的过程中,可能会发现影响文章内容的错误。目的 评估获奖的骨龄深度学习(DL)模型对图像外观的广泛变化的鲁棒性。材料与方法 2021 年 12 月,使用北美放射学会(RSNA)验证集(n = 1425 张小儿手部放射照片;内部测试集)和数字手图集(DHA;n = 1202 张小儿手部放射照片;外部测试集)对赢得 2017 年 RSNA 小儿骨龄挑战赛的 DL 骨龄模型进行了回顾性评估。每张测试图像都经过七种类型的转换(旋转、翻转、亮度、对比度、反转、侧位标记和分辨率),以代表一系列图像外观,其中许多是模拟真实世界的变化。通过比较模型对基线图像和转换图像的预测,进行了计算 "压力测试"。使用 Wilcoxon Signed Rank 检验比较了基线图像和转换图像上预测骨龄与放射科医生确定的基本真实值的平均绝对差值(MAD)。使用 McNemar 检验比较有临床意义的误差 (CSE) 比例。结果 在两个基线测试集(RSNA = 6.8,DHA = 6.9;P = .05)上,没有证据表明模型的 MAD 存在差异,这表明模型对外部数据具有良好的泛化能力。除了带有附加放射学侧位标记(P = .86)的 RSNA 图像外,DHA 和 RSNA 数据集的 MAD 在其他转换组(旋转、翻转、亮度、对比度、反转和分辨率)之间存在显著差异。在对 DHA 数据集进行的图像转换中,57.6%(19/33)的 CSE 比例存在明显差异。结论 尽管获奖的小儿骨龄 DL 模型对经过策划的外部图像具有良好的通用性,但它对经过简单转换的图像的预测不一致,而这些转换反映了图像外观的几种真实世界的变化。©RSNA, 2024.
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Evaluating the Robustness of a Deep Learning Bone Age Algorithm to Clinical Image Variation Using Computational Stress Testing.

Purpose To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance. Materials and Methods In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational "stress tests" were performed by comparing the model's predictions on baseline and transformed images. Mean absolute differences (MADs) of predicted bone ages compared with radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportion of clinically significant errors (CSEs) was compared using McNemar tests. Results There was no evidence of a difference in MAD of the model on the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; P = .05), indicating good model generalization to external data. Except for the RSNA dataset images with an appended radiologic laterality marker (P = .86), there were significant differences in MAD for both the DHA and RSNA datasets among other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in proportion of CSEs for 57% of the image transformations (19 of 33) performed on the DHA dataset. Conclusion Although an award-winning pediatric bone age DL model generalized well to curated external images, it had inconsistent predictions on images that had undergone simple transformations reflective of several real-world variations in image appearance. Keywords: Pediatrics, Hand, Convolutional Neural Network, Radiography Supplemental material is available for this article. © RSNA, 2024 See also commentary by Faghani and Erickson in this issue.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
16.20
自引率
1.00%
发文量
0
期刊介绍: Radiology: Artificial Intelligence is a bi-monthly publication that focuses on the emerging applications of machine learning and artificial intelligence in the field of imaging across various disciplines. This journal is available online and accepts multiple manuscript types, including Original Research, Technical Developments, Data Resources, Review articles, Editorials, Letters to the Editor and Replies, Special Reports, and AI in Brief.
期刊最新文献
AI-integrated Screening to Replace Double Reading of Mammograms: A Population-wide Accuracy and Feasibility Study. Deep Learning Segmentation of Ascites on Abdominal CT Scans for Automatic Volume Quantification. Presurgical Upgrade Prediction of DCIS to Invasive Ductal Carcinoma Using Time-dependent Deep Learning Models with DCE MRI. Artificial Intelligence Outcome Prediction in Neonates with Encephalopathy (AI-OPiNE). Deep Learning to Detect Intracranial Hemorrhage in a National Teleradiology Program and the Impact on Interpretation Time.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1