Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection.

JMIR Formative Research (Impact Factor 2.0; JCR Q3, Health Care Sciences & Services). Publication date: 2025-02-11. DOI: 10.2196/60095
Sander Puts, Catharina M L Zegers, Andre Dekker, Iñigo Bermejo
Volume 9, article e60095. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11835781/pdf/

Abstract

Background: The International Classification of Diseases (ICD), developed by the World Health Organization, standardizes the coding of health conditions to support health care policy, research, and billing. Artificial intelligence-based automation of this coding is promising, but it still underperforms compared with human accuracy and lacks the explainability needed for adoption in medical settings.

Objective: This study explored the potential of large language models to assist medical coders with ICD-10 coding by developing a computer-assisted coding system. The aim was to augment human coding by first identifying lead terms and then applying retrieval-augmented generation (RAG)-based methods to support code assignment.

Methods: The explainability dataset from the CodiEsp challenge (CodiEsp-X) was used, featuring 1000 Spanish clinical cases annotated with ICD-10 codes. A new dataset, CodiEsp-X-lead, was generated using GPT-4 to replace the full textual evidence annotations with lead term annotations. A Robustly Optimized BERT (Bidirectional Encoder Representations from Transformers) Pretraining Approach (RoBERTa) transformer model was fine-tuned for named entity recognition to extract lead terms. GPT-4 was subsequently employed to generate code descriptions from the extracted textual evidence. Using a RAG approach, ICD codes were assigned to the lead terms by querying a vector database of ICD code descriptions with OpenAI's text-embedding-ada-002 model.
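The description-based retrieval step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the embedding function below is a toy bag-of-words stand-in for text-embedding-ada-002, the in-memory dictionary stands in for a vector database, and the three code descriptions, the function names (`embed`, `cosine`, `retrieve`), and the parameter `k` are all assumptions made for illustration. The query string plays the role of a generated code description.

```python
import math
import re
from collections import Counter

# Hypothetical miniature "vector database" of ICD-10 code descriptions.
# Entries are illustrative only; a real index would hold the full code set.
ICD_DESCRIPTIONS = {
    "I10": "essential primary hypertension",
    "E11.9": "type 2 diabetes mellitus without complications",
    "J18.9": "pneumonia unspecified organism",
}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Return the k codes whose descriptions are most similar to the query."""
    q = embed(query)
    scored = [(code, cosine(q, embed(desc)))
              for code, desc in ICD_DESCRIPTIONS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # The query stands in for a description generated from textual evidence.
    for code, score in retrieve("primary hypertension"):
        print(code, round(score, 3))
```

In a setup like this, a retrieval failure would correspond to the correct code not appearing among the top-k candidates, which is why the wording of the query description matters.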

Results: The fine-tuned RoBERTa (Robustly Optimized BERT Pretraining Approach) model achieved an overall F1-score of 0.80 for ICD lead term extraction on the new CodiEsp-X-lead dataset. GPT-4-generated code descriptions reduced retrieval failures in the RAG approach by approximately 5% for both diagnoses and procedures. However, the overall explainability F1-score for the CodiEsp-X task was limited to 0.305, substantially lower than the state-of-the-art F1-score of 0.633. The diminished performance was partly due to the reliance on code descriptions, as some ICD codes lacked descriptions, and the approach did not fully align with the medical coder's workflow.
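For readers unfamiliar with how an F1-score for term extraction is typically obtained, the following sketch decodes BIO-tagged tokens into spans and computes an exact-match, span-level F1. It is an illustration under stated assumptions, not the evaluation script used in the study: the label names `DIAG` and `PROC`, the helper names, and the exact-match criterion are all hypothetical choices.

```python
def bio_to_spans(tags: list[str]) -> set[tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, label) spans.

    An I- tag that does not continue the current span's label closes the
    span and is otherwise ignored (a simplifying assumption).
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def f1_score(gold: set, pred: set) -> float:
    """Exact-match F1: harmonic mean of precision and recall over spans."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical gold and predicted tag sequences for one five-token sentence.
    gold_tags = ["O", "B-DIAG", "I-DIAG", "O", "B-PROC"]
    pred_tags = ["O", "B-DIAG", "I-DIAG", "O", "O"]
    score = f1_score(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
    print(round(score, 3))
```

In this toy case the prediction finds one of two gold spans with no false positives, so precision is 1.0, recall is 0.5, and F1 is 2/3.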

Conclusions: While lead term extraction showed promising results, the subsequent RAG-based code assignment using GPT-4 and code descriptions was less effective. Future research should focus on refining the approach to more closely mimic the medical coder's workflow, potentially integrating the alphabetic index and official coding guidelines, rather than relying solely on code descriptions. This alignment may enhance system accuracy and better support medical coders in practice.
