Title: ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance.
Authors: Chung-You Tsai, Pai-Yu Cheng, Juinn-Horng Deng, Fu-Shan Jaw, Shyi-Chun Yii
Journal: DIGITAL HEALTH (Q2, Health Care Sciences & Services; impact factor 2.9)
DOI: 10.1177/20552076241269538
Publication date: 2024-08-14 (Journal Article, eCollection)
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11325467/pdf/
ChatGPT v4 outperforming v3.5 on cancer treatment recommendations in quality, clinical guideline, and expert opinion concordance.
Objectives: To assess the quality of ChatGPT's cancer treatment recommendations (RECs) and their alignment with National Comprehensive Cancer Network (NCCN) guidelines and expert opinion.
Methods: In October 2023, three urologists performed quantitative and qualitative assessments of responses from ChatGPT-4 and ChatGPT-3.5 to 108 prostate, kidney, and bladder cancer prompts, built from two zero-shot prompt templates. Performance was evaluated with five ratios: the proportions of expert-approved, expert-disagreed, and NCCN-aligned RECs out of all ChatGPT RECs, plus coverage of and adherence to the NCCN guidelines. Experts rated each response's quality on a 1-5 scale for correctness, comprehensiveness, specificity, and appropriateness.
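The five-ratio evaluation described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the data layout, field names, and the precise operational definitions of "coverage" and "adherence" are assumptions made here for clarity.

```python
# Hypothetical sketch of the five evaluation ratios described in the Methods.
# Field names and the exact definitions of "coverage" and "adherence" are
# assumptions for illustration, not taken from the paper.

def five_ratios(responses, guideline_options):
    """responses: list of ChatGPT answers, each holding its list of
    recommendations (RECs) with expert-assigned boolean flags.
    guideline_options: set of treatment options listed in the NCCN guideline."""
    all_recs = [rec for resp in responses for rec in resp["recs"]]
    n = len(all_recs)
    # Three REC-level ratios: approved / disagreed / NCCN-aligned over all RECs.
    approved = sum(rec["approved"] for rec in all_recs) / n
    disagreed = sum(rec["disagreed"] for rec in all_recs) / n
    aligned = sum(rec["nccn_aligned"] for rec in all_recs) / n
    # Coverage (assumed definition): share of guideline options mentioned.
    mentioned = {rec["option"] for rec in all_recs if rec["nccn_aligned"]}
    coverage = len(mentioned & guideline_options) / len(guideline_options)
    # Adherence (assumed definition): share of responses whose RECs all
    # align with the NCCN guideline.
    adherence = sum(
        all(rec["nccn_aligned"] for rec in resp["recs"]) for resp in responses
    ) / len(responses)
    return {"approved": approved, "disagreed": disagreed,
            "nccn_aligned": aligned, "coverage": coverage,
            "adherence": adherence}
```

Under these assumptions, the three REC-level ratios are computed over the pooled recommendations, while coverage and adherence compare against the guideline at the option and response level, respectively.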
Results: ChatGPT-4 outperformed ChatGPT-3.5 on prostate cancer inquiries, with an average word count of 317.3 versus 124.4 (p < 0.001) and 6.1 versus 3.9 RECs (p < 0.001). Its rater-approved REC ratio (96.1% vs. 89.4%) and alignment with NCCN guidelines (76.8% vs. 49.1%, p = 0.001) were superior, and it scored significantly better on all quality dimensions. Across the 108 prompts covering the three cancers, ChatGPT-4 produced an average of 6.0 RECs per case, with an 88.5% approval rate from raters, 86.7% NCCN concordance, and only a 9.5% disagreement rate. It achieved high marks in correctness (4.5), comprehensiveness (4.4), specificity (4.0), and appropriateness (4.4). Subgroup analyses across cancer types, disease statuses, and prompt templates were also reported.
Conclusions: ChatGPT-4 showed a marked improvement in providing accurate, detailed treatment recommendations for urological cancers in line with clinical guidelines and expert opinion. AI tools nevertheless remain imperfect and should be used with caution: ChatGPT can supplement, but not replace, personalized advice from healthcare professionals.