Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher, Jami Pincavitch, Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Stephen Smith, Ali Soroush, Robert Freeman, Donald Apakama, Alexander Charney, Roopa Kohli-Seth, Girish Nadkarni, Ankit Sakhuja
{"title":"利用大型语言模型从临床文献中提取国际疾病分类代码。","authors":"Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher, Jami Pincavitch, Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Stephen Smith, Ali Soroush, Robert Freeman, Donald Apakama, Alexander Charney, Roopa Kohli-Seth, Girish Nadkarni, Ankit Sakhuja","doi":"10.1055/a-2491-3872","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.</p><p><strong>Objective: </strong>The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of human coder.</p><p><strong>Methods: </strong>We evaluated performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against human coder. We used deidentified inpatient notes from American Health Information Management Association Vlab authentic patient cases for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between LLMs and human coder. We then identified reasons for discrepancies in code extraction by LLMs in a 10% random subset.</p><p><strong>Results: </strong>Among 50 inpatient notes, human coder extracted 165 unique ICD-10-CM codes. LLMs extracted significantly higher number of unique ICD-10-CM codes than human coder, with Llama 2-70b extracting most (658) and Gemini Advanced the least (221). GPT-4 achieved highest percent agreement with human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When focusing on primary diagnosis, Claude 3 achieved highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in extraction of codes varied amongst LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of non-specific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite presence of more specific diagnosis (22% with Claude-2.1) and hallucinations (35% with Claude-2.1).</p><p><strong>Conclusions: </strong>Current LLMs have poor performance in extraction of ICD-10-CM codes from inpatient notes when compared against the human coder.</p>","PeriodicalId":48956,"journal":{"name":"Applied Clinical Informatics","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extracting International Classification of Diseases Codes from Clinical Documentation using Large Language Models.\",\"authors\":\"Ashley Simmons, Kullaya Takkavatakarn, Megan McDougal, Brian Dilcher, Jami Pincavitch, Lukas Meadows, Justin Kauffman, Eyal Klang, Rebecca Wig, Gordon Stephen Smith, Ali Soroush, Robert Freeman, Donald Apakama, Alexander Charney, Roopa Kohli-Seth, Girish Nadkarni, Ankit Sakhuja\",\"doi\":\"10.1055/a-2491-3872\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large language models (LLMs) have shown promise in various professional fields, including medicine and law. 
However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.</p><p><strong>Objective: </strong>The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of human coder.</p><p><strong>Methods: </strong>We evaluated performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against human coder. We used deidentified inpatient notes from American Health Information Management Association Vlab authentic patient cases for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between LLMs and human coder. We then identified reasons for discrepancies in code extraction by LLMs in a 10% random subset.</p><p><strong>Results: </strong>Among 50 inpatient notes, human coder extracted 165 unique ICD-10-CM codes. LLMs extracted significantly higher number of unique ICD-10-CM codes than human coder, with Llama 2-70b extracting most (658) and Gemini Advanced the least (221). GPT-4 achieved highest percent agreement with human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values indicated minimal to no agreement, ranging from -0.02 to 0.01. When focusing on primary diagnosis, Claude 3 achieved highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in extraction of codes varied amongst LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of non-specific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite presence of more specific diagnosis (22% with Claude-2.1) and hallucinations (35% with Claude-2.1).</p><p><strong>Conclusions: </strong>Current LLMs have poor performance in extraction of ICD-10-CM codes from inpatient notes when compared against the human coder.</p>\",\"PeriodicalId\":48956,\"journal\":{\"name\":\"Applied Clinical Informatics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-11-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Clinical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1055/a-2491-3872\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Clinical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2491-3872","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Extracting International Classification of Diseases Codes from Clinical Documentation using Large Language Models.
Background: Large language models (LLMs) have shown promise in various professional fields, including medicine and law. However, their performance in highly specialized tasks, such as extracting ICD-10-CM codes from patient notes, remains underexplored.
Objective: The primary objective was to evaluate and compare the performance of ICD-10-CM code extraction by different LLMs with that of a human coder.
Methods: We evaluated the performance of six LLMs (GPT-3.5, GPT-4, Claude 2.1, Claude 3, Gemini Advanced, and Llama 2-70b) in extracting ICD-10-CM codes against a human coder. We used deidentified inpatient notes from American Health Information Management Association VLab authentic patient cases for this study. We calculated percent agreement and Cohen's kappa values to assess the agreement between the LLMs and the human coder. We then identified reasons for discrepancies in code extraction by the LLMs in a 10% random subset.
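A minimal sketch of how such agreement metrics could be computed is shown below. The abstract does not specify the exact procedure, so this assumes each unique code observed anywhere in the study defines an item, percent agreement is scored over the codes either rater assigned to a note, and kappa is computed over per-note present/absent ratings; the function and variable names are illustrative, not the authors' implementation.

```python
# Sketch of agreement metrics for multi-label ICD-10-CM code extraction.
# Assumptions (not stated in the abstract): the item universe is every code
# either rater assigned to any note; percent agreement counts codes both
# raters assigned out of codes either assigned; kappa uses binary
# present/absent ratings per (note, code) pair.
from sklearn.metrics import cohen_kappa_score

def agreement_metrics(llm_by_note, human_by_note):
    """Percent agreement and Cohen's kappa between two raters' code sets.

    llm_by_note, human_by_note: lists of sets of ICD-10-CM codes,
    one set per inpatient note, aligned by index.
    """
    # Every code either rater assigned to any note defines the item universe.
    universe = sorted(set().union(*llm_by_note, *human_by_note))
    llm_vec, human_vec = [], []
    matched = total = 0
    for llm_codes, human_codes in zip(llm_by_note, human_by_note):
        # Percent agreement: codes both raters assigned, out of codes
        # either rater assigned to this note.
        matched += len(llm_codes & human_codes)
        total += len(llm_codes | human_codes)
        # Binary present/absent ratings for the kappa computation.
        for code in universe:
            llm_vec.append(code in llm_codes)
            human_vec.append(code in human_codes)
    return matched / total, cohen_kappa_score(llm_vec, human_vec)

# Toy usage with two notes and hypothetical codes:
llm = [{"I10", "E11.9", "R05.9"}, {"N17.9"}]
human = [{"I10", "E11.9"}, {"N17.9", "D64.9"}]
print(agreement_metrics(llm, human))
```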
Results: Among 50 inpatient notes, the human coder extracted 165 unique ICD-10-CM codes. The LLMs extracted significantly more unique ICD-10-CM codes than the human coder, with Llama 2-70b extracting the most (658) and Gemini Advanced the fewest (221). GPT-4 achieved the highest percent agreement with the human coder at 15.2%, followed by Claude 3 (12.7%) and GPT-3.5 (12.4%). Cohen's kappa values, ranging from -0.02 to 0.01, indicated minimal to no agreement. When the analysis was restricted to the primary diagnosis, Claude 3 achieved the highest percent agreement (26%) and kappa value (0.25). Reasons for discrepancies in code extraction varied among the LLMs and included extraction of codes for diagnoses not confirmed by providers (60% with GPT-4), extraction of nonspecific codes (25% with GPT-3.5), extraction of codes for signs and symptoms despite the presence of a more specific diagnosis (22% with Claude 2.1), and hallucinations (35% with Claude 2.1).
Conclusions: Current LLMs perform poorly in extracting ICD-10-CM codes from inpatient notes when compared against a human coder.
Journal introduction:
ACI is the third Schattauer journal dealing with biomedical and health informatics. It complements our other journals, Methods of Information in Medicine and the Yearbook of Medical Informatics. With the Yearbook of Medical Informatics serving as the “Milestone” or state-of-the-art journal of IMIA and Methods of Information in Medicine as its “Science and Research” journal, ACI intends to be the “Practical” journal of IMIA.