On the limitations of large language models in clinical diagnosis.

Justin T Reese, Daniel Danis, J Harry Caufield, Tudor Groza, Elena Casiraghi, Giorgio Valentini, Christopher J Mungall, Peter N Robinson
{"title":"大型语言模型在临床诊断中的局限性。","authors":"Justin T Reese, Daniel Danis, J Harry Caufield, Tudor Groza, Elena Casiraghi, Giorgio Valentini, Christopher J Mungall, Peter N Robinson","doi":"10.1101/2023.07.13.23292613","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.</p><p><strong>Materials and methods: </strong>We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.</p><p><strong>Results: </strong>Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.</p><p><strong>Discussion: </strong>The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.</p><p><strong>Conclusion: </strong>Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.</p>","PeriodicalId":18659,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370243/pdf/","citationCount":"0","resultStr":"{\"title\":\"On the limitations of large language models in clinical diagnosis.\",\"authors\":\"Justin T Reese, Daniel Danis, J Harry Caufield, Tudor Groza, Elena Casiraghi, Giorgio Valentini, Christopher J Mungall, Peter N Robinson\",\"doi\":\"10.1101/2023.07.13.23292613\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Large Language Models such as GPT-4 previously have been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available from typical electronic health records (EHR). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. 
We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.</p><p><strong>Materials and methods: </strong>We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.</p><p><strong>Results: </strong>Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.</p><p><strong>Discussion: </strong>The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.</p><p><strong>Conclusion: </strong>Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.</p>\",\"PeriodicalId\":18659,\"journal\":{\"name\":\"medRxiv : the preprint server for health sciences\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-02-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10370243/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"medRxiv : the preprint server for health sciences\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1101/2023.07.13.23292613\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.07.13.23292613","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: The potential of Large Language Models (LLMs) such as GPT to support complex tasks like differential diagnosis has been a matter of debate, with some attributing near-sentient capabilities to the models and others asserting that LLMs are merely "autocomplete on steroids." A recent study reported that the Generative Pre-trained Transformer 4 (GPT-4) model performed well on complex differential diagnostic reasoning: the authors assessed GPT-4's ability to identify the correct diagnosis in a series of Case Records of the New England Journal of Medicine, constructing prompts from the clinical-presentation sections of the case reports and comparing GPT-4's output with the actual diagnoses. GPT-4 returned the correct diagnosis as part of its response in 64% of cases and ranked it first in 39% of cases. However, such concise yet comprehensive narratives of the clinical course are generally not available in electronic health records (EHRs). Moreover, where they are available, EHR records contain identifying information whose transmission is prohibited under the Health Insurance Portability and Accountability Act (HIPAA).

Methods: To assess the performance GPT can be expected to achieve on comparable datasets that can be generated by text mining and that by design contain no identifiable information, we analyzed the text of the case reports, extracted Human Phenotype Ontology (HPO) terms, and from these constructed prompts that contained essentially the same clinical abnormalities but lacked the surrounding narrative.

Results: Although GPT-4 performed well on the original narrative-based texts, including the final diagnosis in its differential for 29/75 cases (38.7%; rank 1 in 17.3% of cases; mean rank 3.4), its performance on the feature-based approach, i.e., texts comprising the salient clinical abnormalities without the additional narrative, was much worse: GPT-4 included the final diagnosis for only 8/75 cases (10.7%; rank 1 in 4.0% of cases; mean rank 3.9).

Interpretation: We argue that feature-based queries are the more appropriate test of GPT-4's performance on diagnostic tasks, because the narrative approach is unlikely to be usable in real clinical practice. Future research and algorithm development are needed to determine the best approaches to leveraging LLMs for clinical diagnosis.
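The prompt-construction step described in the Methods above can be carried out programmatically once ontology terms have been extracted. The following is a minimal sketch of that idea in Python; the function name, prompt template, and term list are illustrative assumptions, not the authors' actual implementation:

```python
# Minimal sketch of programmatic, PHI-free prompt construction from
# structured phenotype terms. Template and names are hypothetical.

def build_feature_prompt(hpo_term_labels: list[str]) -> str:
    """Assemble a differential-diagnosis prompt from HPO term labels.

    Because the prompt is composed only of ontology term labels rather
    than the original narrative, it contains no free-text protected
    health information by construction.
    """
    findings = "\n".join(f"- {label}" for label in hpo_term_labels)
    return (
        "A patient presents with the following clinical findings:\n"
        f"{findings}\n"
        "List the most likely diagnoses, ranked from most to least likely."
    )

# Example with illustrative HPO term labels.
print(build_feature_prompt(["Fever", "Splenomegaly", "Thrombocytopenia"]))
```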

On the limitations of large language models in clinical diagnosis.

Objective: Large Language Models such as GPT-4 have previously been applied to differential diagnostic challenges based on published case reports. Published case reports have a sophisticated narrative style that is not readily available in typical electronic health records (EHRs). Furthermore, even if such a narrative were available in EHRs, privacy requirements would preclude sending it outside the hospital firewall. We therefore tested a method for parsing clinical texts to extract ontology terms and programmatically generating prompts that by design are free of protected health information.

Materials and methods: We investigated different methods to prepare prompts from 75 recently published case reports. We transformed the original narratives by extracting structured terms representing phenotypic abnormalities, comorbidities, treatments, and laboratory tests and creating prompts programmatically.
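A toy illustration of the term-extraction step is sketched below. The study used dedicated text-mining tooling; this naive exact-match lookup against a hypothetical three-entry lexicon only conveys the shape of the step:

```python
# Toy concept recognition: map clinical text to HPO terms by exact label
# matching against a small lexicon. Real pipelines use dedicated
# concept-recognition tools; the lexicon here is an illustrative fragment.

HPO_LEXICON = {
    "fever": "HP:0001945",
    "splenomegaly": "HP:0001744",
    "thrombocytopenia": "HP:0001873",
}

def extract_hpo_terms(clinical_text: str) -> list[tuple[str, str]]:
    """Return (label, HPO id) pairs whose label occurs in the text."""
    lowered = clinical_text.lower()
    return [(label, hpo_id) for label, hpo_id in HPO_LEXICON.items()
            if label in lowered]

note = "Exam notable for fever and splenomegaly; labs show thrombocytopenia."
print(extract_hpo_terms(note))
```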

Results: Performance of all of these approaches was modest, with the correct diagnosis ranked first in only 5.3-17.6% of cases. The performance of the prompts created from structured data was substantially worse than that of the original narrative texts, even if additional information was added following manual review of term extraction. Moreover, different versions of GPT-4 demonstrated substantially different performance on this task.
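The ranking metrics used here (percentage of cases with the correct diagnosis ranked first, mean rank) can be computed as in the sketch below. It assumes each model differential has already been matched against the gold-standard diagnosis to yield a 1-based rank per case (None when the diagnosis is absent); the example data are illustrative, not the study's per-case results:

```python
# Summarize ranked-differential results: percentage of cases in which the
# correct diagnosis appears at all, percentage in which it is ranked first,
# and the mean rank among cases where it appears. Ranks are 1-based;
# None marks a case whose differential omitted the correct diagnosis.

def summarize_ranks(ranks: list[int | None]) -> dict[str, float]:
    n_cases = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        "included_pct": 100 * len(hits) / n_cases,
        "rank1_pct": 100 * sum(r == 1 for r in hits) / n_cases,
        "mean_rank": sum(hits) / len(hits) if hits else float("nan"),
    }

# Illustrative per-case ranks for a handful of cases.
print(summarize_ranks([1, 3, None, 2, None, 1, 5]))
```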

Discussion: The sensitivity of the performance to the form of the prompt and the instability of results over two GPT-4 versions represent important current limitations to the use of GPT-4 to support diagnosis in real-life clinical settings.

Conclusion: Research is needed to identify the best methods for creating prompts from typically available clinical data to support differential diagnostics.
