Independent evaluation of the accuracy of 5 artificial intelligence software for detecting lung nodules on chest X-rays.

IF 2.9 | CAS Zone 2 (Medicine) | JCR Q2 (Radiology, Nuclear Medicine & Medical Imaging) | Quantitative Imaging in Medicine and Surgery | Pub Date: 2024-08-01 | Epub Date: 2024-07-25 | DOI: 10.21037/qims-24-160
Kirill Arzamasov, Yuriy Vasilev, Maria Zelenova, Lev Pestrenin, Yulia Busygina, Tatiana Bobrovskaya, Sergey Chetverikov, David Shikhmuradov, Andrey Pankratov, Yury Kirpichev, Valentin Sinitsyn, Irina Son, Olga Omelyanskaya

Abstract

Background: The integration of artificial intelligence (AI) into medicine is growing, with some experts predicting its standalone use soon. However, skepticism remains due to limited positive outcomes from independent validations. This research evaluates AI software's effectiveness in analyzing chest X-rays (CXR) to identify lung nodules, a possible lung cancer indicator.

Methods: This retrospective study analyzed 7,670,212 record pairs from radiological exams conducted between 2020 and 2022 during the Moscow Computer Vision Experiment, focusing on CXR and computed tomography (CT) scans. All images were acquired during clinical routine. The final dataset comprised 100 CXR images (50 with lung nodules, 50 without), selected consecutively according to inclusion and exclusion criteria. It was used to evaluate all five AI-based solutions that participated in the Moscow Computer Vision Experiment and analyzed CXR. The evaluation was performed in three stages. In the first stage, the nodule probability returned by each AI service was compared with the ground truth (1 = nodule present, 0 = no nodule). In the second stage, three radiologists evaluated the nodule segmentation performed by the AI services (1 = nodule correctly segmented, 0 = nodule incorrectly segmented or not segmented at all). In the third stage, the same radiologists additionally evaluated the classification of the nodules (1 = nodule correctly segmented and classified, 0 = all other cases). The results of stages 2 and 3 were compared with the same ground truth used in stage 1. For each stage, diagnostic accuracy metrics were calculated for each AI service.
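
The binary scoring used in stages 2 and 3 can be illustrated with a minimal Python sketch. It assumes each image has one ground-truth label and one radiologist score, both coded 1 or 0 as described above. The function name, the dataclass, and the toy data are hypothetical and are not taken from the study; the abstract does not say which metrics or software the authors used.

```python
from dataclasses import dataclass


@dataclass
class DiagnosticMetrics:
    sensitivity: float
    specificity: float
    accuracy: float


def diagnostic_metrics(ground_truth, scores):
    """Compare binary scores (1/0) with binary ground truth (1/0)."""
    if len(ground_truth) != len(scores):
        raise ValueError("ground_truth and scores must have the same length")
    pairs = list(zip(ground_truth, scores))
    tp = sum(1 for g, s in pairs if g == 1 and s == 1)  # nodule present, scored 1
    fn = sum(1 for g, s in pairs if g == 1 and s == 0)  # nodule present, scored 0
    tn = sum(1 for g, s in pairs if g == 0 and s == 0)  # no nodule, scored 0
    fp = sum(1 for g, s in pairs if g == 0 and s == 1)  # no nodule, scored 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return DiagnosticMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        accuracy=(tp + tn) / len(pairs),
    )


# Hypothetical toy data: 4 images with a nodule, 4 without.
ground_truth = [1, 1, 1, 1, 0, 0, 0, 0]
stage2_scores = [1, 1, 0, 1, 0, 0, 0, 0]  # one nodule was not segmented correctly
result = diagnostic_metrics(ground_truth, stage2_scores)
print(result)
```

With this coding, a nodule-free image can only lower specificity if the service marks a nodule that a radiologist then scores as correct, which makes false positives unlikely by construction.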

Results: Three software solutions (Celsus, Lunit INSIGHT CXR, and qXR) demonstrated diagnostic metrics that matched or surpassed the vendor specifications, and achieved the highest area under the receiver operating characteristic curve (AUC) of 0.956 [95% confidence interval (CI): 0.918 to 0.994]. However, when evaluated by three radiologists for accurate nodule segmentation and classification, all solutions performed below the vendor-declared metrics, with the highest AUC reaching 0.812 (95% CI: 0.744 to 0.879). Meanwhile, all AI services demonstrated 100% specificity at stages 2 and 3 of the study.
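
As a rough illustration of how an AUC and its confidence interval can be obtained, the sketch below computes the AUC as the Mann-Whitney statistic (ties count one half) and a Wald-type interval from the Hanley-McNeil standard error. The abstract does not state which method the authors used, so the choice of Hanley-McNeil is an assumption, and the function names and toy data are hypothetical.

```python
import math


def auc_mann_whitney(ground_truth, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for g, s in zip(ground_truth, scores) if g == 1]
    neg = [s for g, s in zip(ground_truth, scores) if g == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate 95% CI from the Hanley-McNeil standard error (an assumed method)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    half_width = z * math.sqrt(max(variance, 0.0))
    # Clip to the valid AUC range [0, 1].
    return max(0.0, auc - half_width), min(1.0, auc + half_width)


# Hypothetical toy data, not taken from the study.
ground_truth = [1, 1, 1, 1, 0, 0, 0, 0]
probabilities = [0.9, 0.8, 0.35, 0.7, 0.1, 0.4, 0.2, 0.3]  # stage 1 style output
binary_scores = [1, 1, 0, 1, 0, 0, 0, 0]                    # stage 2/3 style output

auc_prob = auc_mann_whitney(ground_truth, probabilities)
auc_binary = auc_mann_whitney(ground_truth, binary_scores)
ci_low, ci_high = hanley_mcneil_ci(auc_prob, n_pos=4, n_neg=4)
print(auc_prob, auc_binary, (ci_low, ci_high))
```

For a binary score, the AUC reduces to the mean of sensitivity and specificity. In the toy data the binary AUC of 0.875 equals (0.75 + 1.0) / 2, so with specificity fixed at 100% the AUC is driven entirely by sensitivity.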

Conclusions: To ensure the reliability and applicability of AI-based software, it is crucial to validate performance metrics using high-quality datasets and to engage radiologists in the evaluation process. Developers are advised to improve the accuracy of the underlying models before the software is allowed to be used standalone for lung nodule detection. The dataset created during the study may be accessed at https://mosmed.ai/datasets/mosmeddatargogksnalichiemiotsutstviemlegochnihuzlovtipvii/.

Source journal: Quantitative Imaging in Medicine and Surgery (Medicine: Radiology, Nuclear Medicine and Imaging)
CiteScore: 4.20
Self-citation rate: 17.90%
Articles published: 252