On the limitations of large language models in clinical diagnosis.

Justin T Reese, Daniel Danis, J Harry Caufield, Tudor Groza, Elena Casiraghi, Giorgio Valentini, Christopher J Mungall, Peter N Robinson

medRxiv : the preprint server for health sciences. Posted 2024-02-26. doi: 10.1101/2023.07.13.23292613. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370243/pdf/
Objective: Large language models (LLMs) such as GPT-4 have previously been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available in typical electronic health records (EHRs). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that, by design, are free of protected health information.
Materials and methods: We investigated several methods for preparing prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests, and by creating prompts programmatically.
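A minimal sketch of what such programmatic prompt construction could look like is shown below, assembling a prompt from structured term lists only so that no free-text narrative (and thus no protected health information) is included. The function name, term lists, and prompt wording are illustrative assumptions, not the authors' actual implementation.

    # Illustrative sketch (assumed names and wording, not the authors' code):
    # build a PHI-free diagnostic prompt from structured ontology term lists.
    def build_prompt(phenotypes, comorbidities, treatments, lab_tests):
        """Assemble a prompt from term lists only; no narrative text is used."""
        sections = [
            ("Phenotypic abnormalities", phenotypes),
            ("Comorbidities", comorbidities),
            ("Treatments", treatments),
            ("Laboratory tests", lab_tests),
        ]
        lines = ["Provide a ranked differential diagnosis for a patient with:"]
        for heading, terms in sections:
            if terms:  # omit empty sections
                lines.append(f"{heading}: " + ", ".join(terms))
        return "\n".join(lines)

    # Example with placeholder terms:
    prompt = build_prompt(
        phenotypes=["Hypertrophic cardiomyopathy", "Short stature"],
        comorbidities=["Type 2 diabetes mellitus"],
        treatments=["Beta blocker"],
        lab_tests=["Elevated troponin"],
    )
    print(prompt)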
Results: The performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The prompts created from structured data performed substantially worse than the original narrative texts, even when additional information was added after manual review of the term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.
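For context, the top-1 figure reported above corresponds to an evaluation of the following form: the fraction of cases in which the correct diagnosis appears first in the model's ranked differential. This is a generic sketch; the case data and exact matching logic are placeholder assumptions.

    # Generic sketch of the top-k metric (placeholder data, assumed exact-match logic).
    def top_k_accuracy(cases, k=1):
        """cases: list of (correct_diagnosis, ranked_differential) pairs."""
        hits = sum(1 for correct, ranked in cases if correct in ranked[:k])
        return hits / len(cases)

    cases = [
        ("Fabry disease", ["Fabry disease", "Amyloidosis"]),                 # hit at k=1
        ("Marfan syndrome", ["Ehlers-Danlos syndrome", "Marfan syndrome"]),  # miss at k=1
    ]
    print(f"Top-1 accuracy: {top_k_accuracy(cases, k=1):.1%}")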
Discussion: The sensitivity of performance to the form of the prompt, and the instability of results across two GPT-4 versions, represent important current limitations on the use of GPT-4 to support diagnosis in real-life clinical settings.
Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.