Evaluation of the Diagnostic Accuracy of GPT-4 in Five Thousand Rare Disease Cases

medRxiv - Genetic and Genomic Medicine | Pub Date: 2024-07-22 | DOI: 10.1101/2024.07.22.24310816

Justin T Reese, Leonardo Chimirri, Daniel Danis, J Harry Caufield, Kyran Wissink, Elena Casiraghi, Giorgio Valentini, Melissa A Haendel, Christopher J Mungall, Peter N Robinson


Citations: 0



Abstract

Large language models (LLMs) have shown great promise in supporting differential diagnosis, but the 23 available published studies on diagnostic accuracy evaluated small cohorts (30-422 cases, mean 104), and all of them assessed LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous, large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate a prompt from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank (MRR) of the correct diagnosis was 0.24 (the reciprocal of the MRR corresponds to a rank of 4.2). The correct diagnosis was placed at rank 1 in 19.2% of the cases, within the first 3 ranks in 28.6%, and within the first 10 ranks in 32.5%. Our study is the largest reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
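The ranking metrics reported in the abstract (mean reciprocal rank and top-k accuracy) can be illustrated with a minimal Python sketch. This is not the authors' software: the function names and the toy rank list are hypothetical, and the sketch assumes the common convention that a case whose correct diagnosis is absent from the response contributes a reciprocal rank of 0. The abstract does not state how such cases were scored.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all cases.

    Assumption: a case with rank None (correct diagnosis not in the
    model's response) contributes 0 to the sum.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    total = sum(1.0 / r if r is not None else 0.0 for r in ranks)
    return total / len(ranks)


def top_k_accuracy(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the first k ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical toy data: rank of the correct diagnosis per case,
# None where the correct diagnosis was missing from the response.
toy_ranks = [1, 3, None, 2, None, 10, 1, None]

mrr = mean_reciprocal_rank(toy_ranks)
print(f"MRR: {mrr:.3f}")
print(f"Rank implied by MRR (1 / MRR): {1.0 / mrr:.1f}")
for k in (1, 3, 10):
    print(f"Top-{k} accuracy: {top_k_accuracy(toy_ranks, k):.1%}")
```

Applied to the reported figures, the same arithmetic gives 1 / 0.24 ≈ 4.2, which is the rank the abstract quotes as corresponding to the MRR.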
