Uncertainty estimation in diagnosis generation from large language models: next-word probability is not pre-test probability.

Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy Miller, Danielle S Bitterman, Guanhua Chen, Anoop Mayampurath, Matthew M Churpek, Majid Afshar

JAMIA Open, 8(1): ooae154 (2025). doi: 10.1093/jamiaopen/ooae154. Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11723528/pdf/
Objective: To evaluate large language models (LLMs) for pre-test diagnostic probability estimation and to compare their uncertainty estimation performance with that of a traditional machine learning classifier.
Materials and methods: We assessed 2 instruction-tuned LLMs, Mistral-7B-Instruct and Llama3-70B-chat-hf, on predicting binary outcomes for sepsis, arrhythmia, and congestive heart failure (CHF) using electronic health record (EHR) data from 660 patients. Three uncertainty estimation methods (Verbalized Confidence, Token Logits, and LLM Embedding+XGB) were compared against an eXtreme Gradient Boosting (XGB) classifier trained on raw EHR data. Performance metrics included the area under the receiver operating characteristic curve (AUROC) and the Pearson correlation between the predicted probabilities of the different methods.
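To make the Token Logits and LLM Embedding+XGB ideas concrete, here is a minimal sketch assuming a Hugging Face causal language model; the checkpoint name, prompt wording, answer-token choices, and mean pooling are illustrative assumptions, not the authors' exact protocol.

```python
# Sketch of the "Token Logits" and "LLM Embedding+XGB" ideas, assuming a
# Hugging Face causal LM. Prompt wording, token choices, and pooling are
# illustrative assumptions, not the paper's exact protocol.
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def token_logit_probability(ehr_note: str, condition: str) -> float:
    """'Token Logits': softmax over the Yes/No next-token logits."""
    prompt = (
        f"Patient record:\n{ehr_note}\n\n"
        f"Does this patient have {condition}? Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    # Softmax over just the two answer tokens; the "Yes" mass is then
    # treated as a pseudo pre-test probability.
    pair = torch.softmax(next_logits[[yes_id, no_id]], dim=0)
    return pair[0].item()

def llm_embedding(ehr_note: str) -> np.ndarray:
    """'LLM Embedding+XGB': mean-pooled last hidden state as features
    for a downstream XGBoost classifier."""
    inputs = tokenizer(ehr_note, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
    return hidden.mean(dim=1).squeeze(0).float().cpu().numpy()
```

For the embedding-based method, the pooled vectors would be stacked into a feature matrix and passed to xgboost.XGBClassifier, whose predict_proba output serves as that method's probability estimate.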
Results: The XGB classifier outperformed the LLM-based methods across all tasks. LLM Embedding+XGB came closest to the XGB baseline, while Verbalized Confidence and Token Logits underperformed.
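For context, the comparison reported above could be reproduced with standard libraries. The sketch below uses toy arrays in place of the real predictions and computes AUROC against the gold labels plus the Pearson correlation between an LLM-based method's probabilities and the XGB baseline's (one plausible reading of the correlation metric).

```python
# Toy evaluation sketch: AUROC per method against gold labels, and Pearson
# correlation between an LLM-based method's probabilities and the XGB
# baseline's. Arrays are illustrative stand-ins for the real predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])           # gold binary outcomes
p_xgb = np.array([0.1, 0.8, 0.7, 0.2, 0.9])  # XGB baseline probabilities
p_llm = np.array([0.4, 0.6, 0.5, 0.3, 0.7])  # e.g., Token Logits probabilities

print("AUROC (XGB):", roc_auc_score(y_true, p_xgb))
print("AUROC (LLM):", roc_auc_score(y_true, p_llm))
r, _ = pearsonr(p_llm, p_xgb)
print("Pearson r (LLM vs XGB):", r)
```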
Discussion: These findings, consistent across multiple models and demographic groups, highlight the limitations of current LLMs in providing reliable pre-test probability estimates and underscore the need for improved calibration and bias-mitigation strategies. Future work should explore hybrid approaches that integrate LLMs with numerical reasoning modules and calibrated embeddings to enhance diagnostic accuracy and ensure fairer predictions across diverse populations.
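As one example of the calibration direction the discussion points to, a standard post-hoc recalibration such as Platt scaling could be applied to LLM-derived probabilities. This is a generic sketch, not the authors' proposed method, and the arrays are toy data.

```python
# Post-hoc Platt scaling of LLM-derived probabilities: fit a one-feature
# logistic regression on held-out labeled data, then map raw scores to
# calibrated probabilities. Generic recipe, not the authors' proposal.
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_scores = np.array([0.40, 0.62, 0.55, 0.31, 0.71, 0.45])  # LLM probabilities
labels = np.array([0, 1, 1, 0, 1, 0])                        # held-out outcomes

calibrator = LogisticRegression().fit(raw_scores.reshape(-1, 1), labels)
calibrated = calibrator.predict_proba(raw_scores.reshape(-1, 1))[:, 1]
print(np.round(calibrated, 3))
```

In practice the calibrator would be fit on one held-out split and applied to unseen predictions, so the printed values here only illustrate the transformation.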
Conclusions: LLMs demonstrate potential but currently fall short in estimating diagnostic probabilities compared to traditional machine learning classifiers trained on structured EHR data. Further improvements are needed for reliable clinical use.