Andrew Y Wang, Sherman Lin, Christopher Tran, Robert J Homer, Dan Wilsdon, Joanna C Walsh, Emily A Goebel, Irene Sansano, Snehal Sonawane, Vincent Cockenpot, Sanjay Mukhopadhyay, Toros Taskin, Nusrat Zahra, Luca Cima, Orhan Semerci, Birsen Gizem Özamrak, Pallavi Mishra, Naga Sarika Vennavalli, Po-Hsuan Cameron Chen, Matthew J Cecchini
{"title":"Assessment of Pathology Domain-Specific Knowledge of ChatGPT and Comparison to Human Performance.","authors":"Andrew Y Wang, Sherman Lin, Christopher Tran, Robert J Homer, Dan Wilsdon, Joanna C Walsh, Emily A Goebel, Irene Sansano, Snehal Sonawane, Vincent Cockenpot, Sanjay Mukhopadhyay, Toros Taskin, Nusrat Zahra, Luca Cima, Orhan Semerci, Birsen Gizem Özamrak, Pallavi Mishra, Naga Sarika Vennavalli, Po-Hsuan Cameron Chen, Matthew J Cecchini","doi":"10.5858/arpa.2023-0296-OA","DOIUrl":null,"url":null,"abstract":"<p><strong>Context.—: </strong>Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine.</p><p><strong>Objectives.—: </strong>To understand the level of pathology domain-specific knowledge for ChatGPT using different underlying large language models, GPT-3.5 and the updated GPT-4.</p><p><strong>Design.—: </strong>An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist who recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.</p><p><strong>Results.—: </strong>GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. 
In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.</p><p><strong>Conclusions.—: </strong>By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.</p>","PeriodicalId":93883,"journal":{"name":"Archives of pathology & laboratory medicine","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Archives of pathology & laboratory medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5858/arpa.2023-0296-OA","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Context.—: Artificial intelligence algorithms hold the potential to fundamentally change many aspects of society. Application of these tools, including the publicly available ChatGPT, has demonstrated impressive domain-specific knowledge in many areas, including medicine.
Objectives.—: To assess the pathology domain-specific knowledge of ChatGPT using two different underlying large language models, GPT-3.5 and the updated GPT-4.
Design.—: An international group of pathologists (n = 15) was recruited to generate pathology-specific questions at a similar level to those that could be seen on licensing (board) examinations. The questions (n = 15) were answered by GPT-3.5, GPT-4, and a staff pathologist who recently passed their Canadian pathology licensing exams. Participants were instructed to score answers on a 5-point scale and to predict which answer was written by ChatGPT.
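The abstract does not include the study's analysis code. As an illustration only, the blinded-scoring protocol described above (5-point ratings per answer, plus a guess at which answer ChatGPT wrote) could be aggregated along these lines; all names and numbers here are hypothetical, not the study's data:

```python
from statistics import mean

# Hypothetical reviewer ratings on the 5-point scale described in the design,
# keyed by respondent; one score per question.
ratings = {
    "gpt35": [4, 3, 5, 4, 4],
    "gpt4": [5, 4, 5, 5, 4],
    "pathologist": [4, 4, 4, 3, 5],
}

# Hypothetical blinded guesses of which answer ChatGPT wrote (True = correct).
chatgpt_identified = [True, True, False, True, True]

# Mean score per respondent across all questions.
mean_scores = {who: mean(scores) for who, scores in ratings.items()}

# Fraction of questions where reviewers correctly spotted the ChatGPT answer.
identification_rate = sum(chatgpt_identified) / len(chatgpt_identified)

print(mean_scores)
print(f"ChatGPT identified in {identification_rate:.0%} of questions")
```

This sketch only computes the per-respondent means and the identification rate; the study's actual scoring rubric and statistics may differ.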
Results.—: GPT-3.5 performed at a similar level to the staff pathologist, while GPT-4 outperformed both. The overall score for both GPT-3.5 and GPT-4 was within the range of meeting expectations for a trainee writing licensing examinations. In all but one question, the reviewers were able to correctly identify the answers generated by GPT-3.5.
Conclusions.—: By demonstrating the ability of ChatGPT to answer pathology-specific questions at a level similar to (GPT-3.5) or exceeding (GPT-4) that of a trained pathologist, this study highlights the potential of large language models to be transformative in this space. In the future, more advanced iterations of these algorithms with increased domain-specific knowledge may have the potential to assist pathologists and enhance pathology resident training.