Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders.

Eyal Klang, Idit Tessler, Donald U Apakama, Ethan Abbott, Benjamin S Glicksberg, Monique Arnold, Akini Moses, Ankit Sakhuja, Ali Soroush, Alexander W Charney, David L Reich, Jolion McGreevy, Nicholas Gavin, Brendan Carr, Robert Freeman, Girish N Nadkarni
{"title":"Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders.","authors":"Eyal Klang, Idit Tessler, Donald U Apakama, Ethan Abbott, Benjamin S Glicksberg, Monique Arnold, Akini Moses, Ankit Sakhuja, Ali Soroush, Alexander W Charney, David L Reich, Jolion McGreevy, Nicholas Gavin, Brendan Carr, Robert Freeman, Girish N Nadkarni","doi":"10.1101/2024.10.15.24315526","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Accurate medical coding is essential for clinical and administrative purposes but complicated, time-consuming, and biased. This study compares Retrieval-Augmented Generation (RAG)-enhanced LLMs to provider-assigned codes in producing ICD-10-CM codes from emergency department (ED) clinical records.</p><p><strong>Methods: </strong>Retrospective cohort study using 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated past 1,038,066 ED visits data (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.</p><p><strong>Findings: </strong>RAG-enhanced LLMs demonstrated superior performance to provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where discrepancies existed between GPT-4 and provider-assigned codes, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where providers' codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 cases, whereas human coders were preferred in only 181 cases (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also demonstrated substantial scalability when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes significantly improved following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.</p><p><strong>Interpretation: </strong>RAG-enhanced LLMs improve medical coding accuracy in EDs, suggesting clinical workflow applications. These findings show that generative AI can improve clinical outcomes and reduce administrative burdens.</p><p><strong>Funding: </strong>This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. 
The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.</p><p><strong>Twitter summary: </strong>A study showed AI models with retrieval-augmented generation outperformed human doctors in ED diagnostic coding accuracy and specificity. Even smaller AI models perform favorably when using RAG. This suggests potential for reducing administrative burden in healthcare, improving coding efficiency, and enhancing clinical documentation.</p>","PeriodicalId":94281,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11527068/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2024.10.15.24315526","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Accurate medical coding is essential for clinical and administrative purposes, but it is complex, time-consuming, and prone to bias. This study compares ICD-10-CM codes produced by Retrieval-Augmented Generation (RAG)-enhanced large language models (LLMs) with provider-assigned codes for emergency department (ED) clinical records.

Methods: A retrospective cohort study of 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated data from 1,038,066 past ED visits (2021-2023) into the LLMs' predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM codes and the provider-assigned codes on accuracy and specificity.
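The abstract does not describe the retrieval mechanics, but the general RAG pattern is straightforward: embed the incoming note, retrieve the most similar past visits along with their assigned codes, and prepend them to the coding prompt. A minimal Python sketch of that pattern follows; the embedding model, example notes, and prompt wording are illustrative assumptions, not the study's implementation.

```python
# Minimal sketch of a RAG pipeline for ICD-10-CM code suggestion.
# Hypothetical throughout: the encoder choice, index contents, and
# prompt template are assumptions, not the study's actual system.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Historical ED visits: (note text, provider-assigned ICD-10-CM codes).
past_visits = [
    ("Chest pain radiating to left arm, troponin elevated", ["I21.4"]),
    ("Fever and productive cough, infiltrate on chest x-ray", ["J18.9"]),
]
index = encoder.encode([note for note, _ in past_visits])

def retrieve(note: str, k: int = 1):
    """Return the k most similar past visits by cosine similarity."""
    q = encoder.encode([note])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [past_visits[i] for i in np.argsort(-sims)[:k]]

def build_prompt(note: str) -> str:
    """Augment the coding prompt with retrieved historical examples."""
    examples = "\n".join(
        f"Note: {n}\nCodes: {', '.join(c)}" for n, c in retrieve(note)
    )
    return (
        "You are an ICD-10-CM coder. Similar past ED visits:\n"
        f"{examples}\n\nAssign ICD-10-CM codes to this note:\n{note}"
    )
```

The augmented prompt produced by build_prompt would then be sent to any of the evaluated LLMs; retrieval grounds the model in codes actually assigned to similar presentations, which is the mechanism the study credits for the accuracy gains.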

Findings: RAG-enhanced LLMs demonstrated superior performance to provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where the GPT-4 and provider-assigned codes disagreed, reviewers favored GPT-4 for accuracy in 447 instances versus 277 instances for the providers' codes (p<0.001); because each case was judged by multiple reviewers, the preference counts exceed the case count. Similarly, GPT-4 was preferred for its superior specificity in 509 instances, versus only 181 for the human coders (p<0.001). Smaller open-source models, such as Llama-3.1-70B, also benefited substantially from RAG, with 218 instances of accuracy preference versus 90 for the providers' codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes improved significantly after RAG integration, with Qwen-2-7B rising from 0.8% to 17.6% and Gemma-2-9b-it from 7.2% to 26.4%.
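For concreteness, the sketch below shows how the two headline metrics can be computed under assumed definitions: the exact match rate as the share of visits whose LLM-generated code set equals the provider's, and a two-sided binomial (sign) test on the reviewer preference counts. The abstract does not name the study's statistical test, so the sign test here is an assumption.

```python
# Sketch of the two headline metrics under assumed definitions.
# The sign test on preference counts is an illustrative assumption;
# the abstract does not specify the study's actual test.
from scipy.stats import binomtest

def exact_match_rate(llm_codes, provider_codes):
    """Fraction of visits where the two code sets match exactly."""
    matches = sum(set(a) == set(b) for a, b in zip(llm_codes, provider_codes))
    return matches / len(llm_codes)

# Reviewer accuracy preferences: GPT-4 favored 447 times, providers 277.
result = binomtest(447, n=447 + 277, p=0.5)
print(f"p = {result.pvalue:.2e}")  # far below 0.001
```

Run on the reported counts, the test confirms that a 447-to-277 split is extremely unlikely under a no-preference null, consistent with the reported p<0.001.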

Interpretation: RAG-enhanced LLMs improve medical coding accuracy in the ED, suggesting applications in clinical workflows. These findings suggest that generative AI can improve clinical outcomes and reduce administrative burdens.

Funding: This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai, and by Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Twitter summary: A study showed that AI models with retrieval-augmented generation outperformed human coders in ED diagnostic coding accuracy and specificity. Even smaller AI models performed favorably when using RAG. This suggests potential for reducing administrative burden in healthcare, improving coding efficiency, and enhancing clinical documentation.
