Evaluating ChatGPT's diagnostic potential for pathology images.

IF 3.1 · Zone 3 (Medicine) · Q1 MEDICINE, GENERAL & INTERNAL · Frontiers in Medicine · Pub Date: 2025-01-23 · eCollection Date: 2024-01-01 · DOI: 10.3389/fmed.2024.1507203

Liya Ding, Lei Fan, Miao Shen, Yawen Wang, Kaiqin Sheng, Zijuan Zou, Huimin An, Zhinong Jiang

Frontiers in Medicine, volume 11, article 1507203. Journal Article. Published 2025-01-23. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11798939/pdf/

Citations: 0

Abstract

Background: Chat Generative Pretrained Transformer (ChatGPT) is a large language model (LLM) developed by OpenAI, known for its extensive knowledge base and interactive capabilities. These attributes make it a valuable tool in the medical field, particularly for tasks such as answering medical questions, drafting clinical notes, and optimizing the generation of radiology reports. However, maintaining accuracy in medical contexts is the biggest challenge to employing GPT-4 in a clinical setting. This study aims to investigate the accuracy of GPT-4, which can process both text and image inputs, in generating diagnoses from pathological images.

Methods: This study analyzed 44 histopathological images from 16 organs and 100 colorectal biopsy photomicrographs. The initial evaluation was conducted using the standard GPT-4 model in January 2024, with a subsequent re-evaluation performed in July 2024. The diagnostic accuracy of GPT-4 was assessed by comparing its outputs to a reference standard using statistical measures. Additionally, four pathologists independently reviewed the same images to compare their diagnoses with the model's outputs. Both scanned and photographed images were tested to evaluate GPT-4's generalization ability across different image types.

Results: GPT-4 achieved an overall accuracy of 0.64 in identifying tumor imaging and tissue origins. For colon polyp classification, accuracy ranged from 0.57 to 0.75 across subtypes. The model achieved 0.88 accuracy in distinguishing low-grade from high-grade dysplasia and 0.75 in distinguishing high-grade dysplasia from adenocarcinoma, with high sensitivity for detecting adenocarcinoma. Agreement between the initial and follow-up evaluations was slight to moderate, with Kappa values ranging from 0.204 to 0.375.

Conclusion: GPT-4 demonstrates the ability to diagnose pathological images, showing improved performance over earlier versions. Its diagnostic accuracy in cancer is comparable to that of pathology residents. These findings suggest that GPT-4 holds promise as a supportive tool in pathology diagnostics, offering the potential to assist pathologists in routine diagnostic workflows.
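The accuracy, sensitivity, and Kappa figures reported in the abstract can be illustrated with their standard textbook definitions. The following is a minimal sketch only: the abstract does not say which statistical software or exact formulas the authors used, so the function names, the label codes (LGD, HGD, ADC), and the example lists are hypothetical and are not the study's data.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of images whose predicted label matches the reference label."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def sensitivity(reference, predicted, positive):
    """True positives divided by all reference-positive cases for one class."""
    positives = sum(r == positive for r in reference)
    if positives == 0:
        raise ValueError("no reference cases of the positive class")
    true_positives = sum(
        r == positive and p == positive for r, p in zip(reference, predicted)
    )
    return true_positives / positives


def cohen_kappa(first, second):
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    if not first or len(first) != len(second):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first, counts_second = Counter(first), Counter(second)
    labels = counts_first.keys() | counts_second.keys()
    # Chance agreement: product of the two marginal proportions, per label.
    expected = sum(counts_first[l] * counts_second[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both lists use a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels: LGD/HGD = low-/high-grade dysplasia, ADC = adenocarcinoma.
reference = ["LGD", "LGD", "HGD", "HGD", "ADC", "ADC", "ADC", "LGD"]
first_run = ["LGD", "HGD", "HGD", "HGD", "ADC", "ADC", "HGD", "LGD"]
second_run = ["LGD", "LGD", "HGD", "ADC", "ADC", "ADC", "ADC", "HGD"]

print("accuracy (first run):", accuracy(reference, first_run))
print("ADC sensitivity (first run):", sensitivity(reference, first_run, "ADC"))
print("kappa (first vs second run):", cohen_kappa(first_run, second_run))
```

In this sketch, accuracy and sensitivity compare one run against the reference standard, whereas Kappa compares the two runs with each other, which mirrors the abstract's distinction between diagnostic accuracy and consistency across the initial and follow-up evaluations.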


Source journal

Frontiers in Medicine (Medicine: General Medicine)

CiteScore: 5.10

Self-citation rate: 5.10%

Articles published: 3710

Review time: 12 weeks

Journal description: Frontiers in Medicine publishes rigorously peer-reviewed research linking basic research to clinical practice and patient care, as well as translating scientific advances into new therapies and diagnostic tools. Led by an outstanding Editorial Board of international experts, this multidisciplinary open-access journal is at the forefront of disseminating and communicating scientific knowledge and impactful discoveries to researchers, academics, clinicians and the public worldwide. In addition to papers that provide a link between basic research and clinical practice, a particular emphasis is given to studies that are directly relevant to patient care. In this spirit, the journal publishes the latest research results and medical knowledge that facilitate the translation of scientific advances into new therapies or diagnostic tools. The full listing of the Specialty Sections represented by Frontiers in Medicine is as listed below. As well as the established medical disciplines, Frontiers in Medicine is launching new sections that together will facilitate:
- the use of patient-reported outcomes under real world conditions
- the exploitation of big data and the use of novel information and communication tools in the assessment of new medicines
- the scientific bases for guidelines and decisions from regulatory authorities
- access to medicinal products and medical devices worldwide
- addressing the grand health challenges around the world