Independent evaluation of the accuracy of 5 artificial intelligence software for detecting lung nodules on chest X-rays.

Quantitative Imaging in Medicine and Surgery (IF 2.3; Zone 2, Medicine; JCR Q2, Radiology, Nuclear Medicine & Medical Imaging). Pub Date: 2024-08-01. Epub Date: 2024-07-25. DOI: 10.21037/qims-24-160

Kirill Arzamasov, Yuriy Vasilev, Maria Zelenova, Lev Pestrenin, Yulia Busygina, Tatiana Bobrovskaya, Sergey Chetverikov, David Shikhmuradov, Andrey Pankratov, Yury Kirpichev, Valentin Sinitsyn, Irina Son, Olga Omelyanskaya

Quantitative Imaging in Medicine and Surgery, 14(8), pages 5288-5303. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11320553/pdf/

Abstract

Background: The integration of artificial intelligence (AI) into medicine is growing, with some experts predicting its standalone use soon. However, skepticism remains due to limited positive outcomes from independent validations. This research evaluates AI software's effectiveness in analyzing chest X-rays (CXR) to identify lung nodules, a possible lung cancer indicator.

Methods: This retrospective study analyzed 7,670,212 record pairs from radiological exams conducted between 2020 and 2022 during the Moscow Computer Vision Experiment, focusing on CXR and computed tomography (CT) scans. All images were acquired during routine clinical practice. The final dataset comprised 100 CXR images (50 with lung nodules, 50 without), selected consecutively according to inclusion and exclusion criteria, and was used to evaluate all five AI-based solutions that participated in the Moscow Computer Vision Experiment and analyzed CXR. The evaluation was performed in 3 stages. In the first stage, the probability of a lung nodule returned by each AI service was compared with the Ground Truth (1 = nodule present, 0 = no nodule). In the second stage, 3 radiologists evaluated the nodule segmentation produced by the AI services (1 = nodule correctly segmented, 0 = nodule incorrectly segmented or not segmented at all). In the third stage, the same radiologists additionally evaluated the classification of the nodules (1 = nodule correctly segmented and classified, 0 = all other cases). The results obtained in stages 2 and 3 were compared with the same Ground Truth used in stage 1. For each stage, diagnostic accuracy metrics were calculated for each AI service.
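The stage-wise comparison against the Ground Truth can be illustrated with a minimal Python sketch. It is not the authors' code: the toy labels, the probabilities, the 0.5 operating threshold, and the function names are all assumptions made for illustration, and the abstract does not say how the three radiologists' scores were combined, so a single binary score per image is assumed here.

```python
from typing import Sequence, Tuple


def auc_mann_whitney(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity and specificity from binary labels and binary predictions."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data invented for illustration only (NOT taken from the study).
ground_truth = [1, 1, 1, 1, 0, 0, 0, 0]                         # 1 = nodule, 0 = no nodule
stage1_probability = [0.9, 0.8, 0.4, 0.7, 0.1, 0.5, 0.2, 0.3]   # hypothetical AI output
stage2_score = [1, 1, 0, 1, 0, 0, 0, 0]                         # hypothetical segmentation score

THRESHOLD = 0.5  # hypothetical operating point, not a vendor setting
stage1_binary = [1 if p >= THRESHOLD else 0 for p in stage1_probability]

stage1_auc = auc_mann_whitney(ground_truth, stage1_probability)
stage1_sens, stage1_spec = sensitivity_specificity(ground_truth, stage1_binary)

stage2_auc = auc_mann_whitney(ground_truth, stage2_score)
stage2_sens, stage2_spec = sensitivity_specificity(ground_truth, stage2_score)

print(f"stage 1: AUC={stage1_auc:.4f} sens={stage1_sens:.2f} spec={stage1_spec:.2f}")
print(f"stage 2: AUC={stage2_auc:.4f} sens={stage2_sens:.2f} spec={stage2_spec:.2f}")
```

When the score is binary, as in stages 2 and 3, the AUC computed this way equals the mean of sensitivity and specificity, so a service can reach 100% specificity and still have a modest AUC if many nodules are not correctly segmented.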

Results: Three software solutions (Celsus, Lunit INSIGHT CXR, and qXR) demonstrated diagnostic metrics that matched or surpassed the vendor specifications, and achieved the highest area under the receiver operating characteristic curve (AUC) of 0.956 [95% confidence interval (CI): 0.918 to 0.994]. However, when evaluated by three radiologists for accurate nodule segmentation and classification, all solutions performed below the vendor-declared metrics, with the highest AUC reaching 0.812 (95% CI: 0.744 to 0.879). Meanwhile, all AI services demonstrated 100% specificity at stages 2 and 3 of the study.
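The abstract reports 95% confidence intervals for the AUC but does not state how they were computed. As a rough illustration of how such an interval can be obtained, the sketch below uses the Hanley-McNeil normal approximation. This choice of method is an assumption, not the authors' stated procedure, and the function name is hypothetical; the result need not reproduce the reported bounds.

```python
import math
from typing import Tuple


def hanley_mcneil_ci(auc: float, n_pos: int, n_neg: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for an AUC (Hanley-McNeil standard error).

    auc   : observed area under the ROC curve
    n_pos : number of positive (nodule) cases
    n_neg : number of negative (no nodule) cases
    z     : normal quantile, 1.96 for a 95% interval
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    se = math.sqrt(variance)
    # Clip to [0, 1] because the normal approximation can overshoot near the bounds.
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Stage 1 figures from the abstract: AUC 0.956 on 50 nodule and 50 non-nodule images.
lower, upper = hanley_mcneil_ci(0.956, n_pos=50, n_neg=50)
print(f"approximate 95% CI: {lower:.3f} to {upper:.3f}")
```

Other estimators, such as DeLong's method or bootstrapping, generally give somewhat different bounds on the same data.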

Conclusions: To ensure the reliability and applicability of AI-based software, it is crucial to validate performance metrics using high-quality datasets and to engage radiologists in the evaluation process. Developers are advised to improve the accuracy of the underlying models before the software is allowed to be used standalone for lung nodule detection. The dataset created during the study may be accessed at https://mosmed.ai/datasets/mosmeddatargogksnalichiemiotsutstviemlegochnihuzlovtipvii/.

Source journal