MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications

arXiv - CS - Computation and Language | Pub Date: 2024-09-11 | DOI: arxiv-2409.07314

Praveen K Kanithi, Clément Christophe, Marco AF Pimentel, Tathagata Raha, Nada Saadi, Hamza Javed, Svetlana Maslenkova, Nasir Hayat, Ronnie Rajan, Shadab Khan



Abstract

The rapid development of Large Language Models (LLMs) for healthcare applications has spurred calls for holistic evaluation beyond frequently cited benchmarks like USMLE, to better reflect real-world performance. While real-world assessments are valuable indicators of utility, they often lag behind the pace of LLM evolution, likely rendering findings obsolete upon deployment. This temporal disconnect necessitates a comprehensive upfront evaluation that can guide model selection for specific clinical applications. We introduce MEDIC, a framework assessing LLMs across five critical dimensions of clinical competence: medical reasoning, ethics and bias, data and language understanding, in-context learning, and clinical safety. MEDIC features a novel cross-examination framework that quantifies LLM performance across areas like coverage and hallucination detection, without requiring reference outputs. We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks. Our results show performance disparities across model sizes and between baseline and medically fine-tuned models, and have implications for model selection in applications requiring specific model strengths, such as low hallucination or lower cost of inference. MEDIC's multifaceted evaluation reveals these performance trade-offs, bridging the gap between theoretical capabilities and practical implementation in healthcare settings, and helping ensure that the most promising models are identified and adapted for diverse healthcare applications.



