{"title":"GPT-4 对常见临床情况和疑难病例的诊断准确性","authors":"Geoffrey W. Rutledge","doi":"10.1002/lrh2.10438","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Introduction</h3>\n \n <p>Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians.</p>\n </section>\n \n <section>\n \n <h3> Conclusions</h3>\n \n <p>GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.</p>\n </section>\n </div>","PeriodicalId":43916,"journal":{"name":"Learning Health Systems","volume":"8 3","pages":""},"PeriodicalIF":2.6000,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/lrh2.10438","citationCount":"0","resultStr":"{\"title\":\"Diagnostic accuracy of GPT-4 on common clinical scenarios and challenging cases\",\"authors\":\"Geoffrey W. Rutledge\",\"doi\":\"10.1002/lrh2.10438\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <h3> Introduction</h3>\\n \\n <p>Large language models (LLMs) have a high diagnostic accuracy when they evaluate previously published clinical cases.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Methods</h3>\\n \\n <p>We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with the diagnostic accuracy for previously published cases.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Results</h3>\\n \\n <p>For a set of previously unpublished challenging clinical cases, GPT-4 achieved 61.1% correct in its top 6 diagnoses versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time versus the previously reported 84.3% for physicians.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Conclusions</h3>\\n \\n <p>GPT-4 performs at a level at least as good as, if not better than, that of experienced physicians on highly challenging cases in internal medicine. 
The extraordinary performance of GPT-4 on diagnosing common clinical scenarios could be explained in part by the fact that these cases were previously published and may have been included in the training dataset for this LLM.</p>\\n </section>\\n </div>\",\"PeriodicalId\":43916,\"journal\":{\"name\":\"Learning Health Systems\",\"volume\":\"8 3\",\"pages\":\"\"},\"PeriodicalIF\":2.6000,\"publicationDate\":\"2024-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1002/lrh2.10438\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Learning Health Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/lrh2.10438\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH POLICY & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Learning Health Systems","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/lrh2.10438","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH POLICY & SERVICES","Score":null,"Total":0}
Diagnostic accuracy of GPT-4 on common clinical scenarios and challenging cases
Introduction
Large language models (LLMs) have shown high diagnostic accuracy when evaluating previously published clinical cases.
Methods
We compared the accuracy of GPT-4's differential diagnoses for previously unpublished challenging case scenarios with its diagnostic accuracy on previously published cases.
Results
For a set of previously unpublished challenging clinical cases, GPT-4 included the correct diagnosis among its top 6 diagnoses in 61.1% of cases, versus the previously reported 49.1% for physicians. For a set of 45 clinical vignettes of more common clinical scenarios, GPT-4 included the correct diagnosis in its top 3 diagnoses 100% of the time, versus the previously reported 84.3% for physicians.
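The accuracy figures above are top-k inclusion rates: a case counts as correct when the reference diagnosis appears among the model's k highest-ranked differential diagnoses (k = 6 for the challenging cases, k = 3 for the vignettes). The following is a minimal sketch of such scoring, not taken from the paper; it assumes cases are supplied as (ranked differential, reference diagnosis) pairs and uses exact string matching, whereas real adjudication requires clinical judgment about synonymous or overlapping diagnoses.

def top_k_accuracy(cases, k):
    """Fraction of cases whose reference diagnosis appears in the top k
    entries of the model's ranked differential.

    cases: list of (ranked_differential, reference_diagnosis) pairs.
    Exact lowercase string matching is a simplifying assumption.
    """
    hits = sum(
        1 for differential, reference in cases
        if reference.lower() in (d.lower() for d in differential[:k])
    )
    return hits / len(cases)

# Hypothetical single-case example: the reference diagnosis is ranked third,
# so it falls within the top 3 and the case scores as correct.
cases = [(["pulmonary embolism", "pneumonia", "heart failure"], "heart failure")]
print(top_k_accuracy(cases, k=3))  # prints 1.0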
Conclusions
GPT-4 performed at least as well as, if not better than, experienced physicians on highly challenging cases in internal medicine. Its extraordinary performance in diagnosing common clinical scenarios could be explained in part by the fact that those cases were previously published and may have been included in the training dataset for this LLM.