Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease.

IF 8.1 · Region 1 (Biology) · Q1 GENETICS & HEREDITY · American journal of human genetics · Pub Date: 2024-09-04 · DOI: 10.1016/j.ajhg.2024.08.010

Junyoung Kim, Kai Wang, Chunhua Weng, Cong Liu

Citation count: 0

Abstract

Phenotype-driven gene prioritization is fundamental to diagnosing rare genetic disorders. While traditional approaches rely on curated knowledge graphs of phenotype-gene relations, recent advances in large language models (LLMs) promise a streamlined text-to-gene solution. In this study, we evaluated five LLMs, two from the generative pre-trained transformer (GPT) series and three from the Llama2 series, assessing their performance on task completeness, gene prediction accuracy, and adherence to required output structures. We ran experiments across various combinations of models, prompts, phenotypic input types, and task difficulty levels. Our findings revealed that the best-performing LLM, GPT-4, achieved an average accuracy of 17.0% in identifying diagnosed genes within the top 50 predictions, which still falls behind traditional tools. However, accuracy increased with model size. Results were consistent over time, as shown on the dataset curated after 2023. Advanced techniques such as retrieval-augmented generation (RAG) and few-shot learning did not improve accuracy. Sophisticated prompts were more likely to enhance task completeness, especially in smaller models. Conversely, complicated prompts tended to decrease the output-structure compliance rate. LLMs also achieved better-than-random prediction accuracy with free-text input, though performance was slightly lower than with standardized concept input. Bias analysis showed that highly cited genes, such as BRCA1, TP53, and PTEN, are more likely to be predicted. Our study provides insights into integrating LLMs with genomic analysis, contributing to the ongoing discussion of their use in clinical workflows.
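
The headline metric above, accuracy within the top 50 predictions, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the case-insensitive matching of gene symbols, the choice to count an empty prediction list as a miss, and the toy cases are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def top_k_hit(ranked_genes: Sequence[str], diagnosed_gene: str, k: int = 50) -> bool:
    """Return True if the diagnosed gene is among the first k ranked predictions.

    Assumption: gene symbols are compared case-insensitively after trimming
    whitespace; the abstract does not say how symbols were matched.
    """
    top = [gene.strip().upper() for gene in ranked_genes[:k]]
    return diagnosed_gene.strip().upper() in top


def top_k_accuracy(cases: Sequence[Tuple[Sequence[str], str]], k: int = 50) -> float:
    """Fraction of cases whose diagnosed gene appears in the top-k predictions.

    Assumption: a case with no predictions (an incomplete task) counts as a miss.
    """
    if not cases:
        return 0.0
    hits = sum(top_k_hit(ranked, gene, k) for ranked, gene in cases)
    return hits / len(cases)


# Toy cases (hypothetical, not data from the study):
# each entry is (ranked gene predictions, diagnosed gene).
cases = [
    (["BRCA1", "TP53", "PTEN"], "PTEN"),  # hit at rank 3
    (["TP53", "BRCA1"], "CFTR"),          # miss
    (["fbn1", "TP53"], "FBN1"),           # hit at rank 1 after normalization
    ([], "BRCA1"),                        # no output, counted as a miss
]

accuracy = top_k_accuracy(cases, k=50)
print(f"Top-50 accuracy: {accuracy:.1%}")
```

Treating empty outputs as misses keeps task completeness and prediction accuracy in a single denominator; reporting the two separately, as the abstract's three evaluation axes suggest, would instead filter out incomplete cases before calling `top_k_accuracy`.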

Source journal

American journal of human genetics (Biology: Genetics)

CiteScore: 14.70

Self-citation rate: 4.10%

Articles published: 185

Review time: 1 month

About the journal: The American Journal of Human Genetics (AJHG) is a monthly journal published by Cell Press, chosen by The American Society of Human Genetics (ASHG) as its premier publication starting from January 2008. AJHG represents Cell Press's first society-owned journal, and both ASHG and Cell Press anticipate significant synergies between AJHG content and that of other Cell Press titles.