{"title":"Decoding the NCCN Guidelines With AI: A Comparative Evaluation of ChatGPT-4.0 and Llama 2 in the Management of Thyroid Carcinoma.","authors":"Shivam Pandya, Tamir E Bresler, Tyler Wilson, Zin Htway, Manabu Fujita","doi":"10.1177/00031348241269430","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Artificial Intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, California) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications.</p><p><strong>Objective: </strong>This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma.</p><p><strong>Participants: </strong>We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a Likert scale: 5) Correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on <i>Correctness</i> (defined as scores 1-2 vs 3-5) and <i>Accuracy</i> (scores 1-3 vs 4-5).</p><p><strong>Results: </strong>In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U-test; Mean Rank = 160.53 vs 174.47, <i>P</i> = 0.123), <i>Correctness</i> (<i>P</i> = 0.177), or <i>Accuracy</i> (<i>P</i> = 0.891).[Formula: see text].</p><p><strong>Conclusion: </strong>ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.</p>","PeriodicalId":7782,"journal":{"name":"American Surgeon","volume":" ","pages":"94-98"},"PeriodicalIF":1.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Surgeon","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/00031348241269430","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/13 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"SURGERY","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction: Artificial Intelligence (AI) has emerged as a promising tool in the delivery of health care. ChatGPT-4.0 (OpenAI, San Francisco, California) and Llama 2 (Meta, Menlo Park, CA) have each gained attention for their use in various medical applications.
Objective: This study aims to evaluate and compare the effectiveness of ChatGPT-4.0 and Llama 2 in assisting with complex clinical decision making in the diagnosis and treatment of thyroid carcinoma.
Participants: We reviewed the National Comprehensive Cancer Network® (NCCN) Clinical Practice Guidelines for the management of thyroid carcinoma and formulated up to 3 complex clinical questions for each decision-making page. ChatGPT-4.0 and Llama 2 were queried in a reproducible manner. The answers were scored on a Likert scale: 5) Correct; 4) correct, with missing information requiring clarification; 3) correct, but unable to complete answer; 2) partially incorrect; 1) absolutely incorrect. Score frequencies were compared, and subgroup analysis was conducted on Correctness (defined as scores 1-2 vs 3-5) and Accuracy (scores 1-3 vs 4-5).
Results: In total, 58 pages of the NCCN Guidelines® were analyzed, generating 167 unique questions. There was no statistically significant difference between ChatGPT-4.0 and Llama 2 in terms of overall score (Mann-Whitney U-test; Mean Rank = 160.53 vs 174.47, P = 0.123), Correctness (P = 0.177), or Accuracy (P = 0.891).[Formula: see text].
Conclusion: ChatGPT-4.0 and Llama 2 demonstrate a limited but substantial capacity to assist with complex clinical decision making relating to the management of thyroid carcinoma, with no significant difference in their effectiveness.
期刊介绍:
The American Surgeon is a monthly peer-reviewed publication published by the Southeastern Surgical Congress. Its area of concentration is clinical general surgery, as defined by the content areas of the American Board of Surgery: alimentary tract (including bariatric surgery), abdomen and its contents, breast, skin and soft tissue, endocrine system, solid organ transplantation, pediatric surgery, surgical critical care, surgical oncology (including head and neck surgery), trauma and emergency surgery, and vascular surgery.