Assessing the utility of large language models for phenotype-driven gene prioritization in the diagnosis of rare genetic disease.

IF 8.1 | CAS Tier 1 (Biology) | JCR Q1, GENETICS & HEREDITY | American Journal of Human Genetics | Pub Date: 2024-09-04 | DOI: 10.1016/j.ajhg.2024.08.010
Junyoung Kim, Kai Wang, Chunhua Weng, Cong Liu

Abstract

Phenotype-driven gene prioritization is fundamental to diagnosing rare genetic disorders. While traditional approaches rely on curated knowledge graphs with phenotype-gene relations, recent advancements in large language models (LLMs) promise a streamlined text-to-gene solution. In this study, we evaluated five LLMs, two from the generative pre-trained transformer (GPT) series and three from the Llama2 series, assessing their performance across task completeness, gene prediction accuracy, and adherence to required output structures. We ran experiments across various combinations of models, prompts, phenotypic input types, and task difficulty levels. The best-performing LLM, GPT-4, achieved an average accuracy of 17.0% in identifying diagnosed genes within the top 50 predictions, which still falls behind traditional tools. However, accuracy increased with model size. Results were consistent over time, as shown on a dataset curated after 2023. Advanced techniques such as retrieval-augmented generation (RAG) and few-shot learning did not improve accuracy. Sophisticated prompts were more likely to enhance task completeness, especially in smaller models. Conversely, complicated prompts tended to decrease the output structure compliance rate. LLMs also achieved better-than-random prediction accuracy with free-text input, though performance was slightly lower than with standardized concept input. Bias analysis showed that highly cited genes, such as BRCA1, TP53, and PTEN, are more likely to be predicted. Our study provides valuable insights into integrating LLMs with genomic analysis, contributing to the ongoing discussion on their utilization in clinical workflows.
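
As an illustration of the "diagnosed gene within the top 50 predictions" metric mentioned in the abstract, the following is a minimal Python sketch of a top-k accuracy computation. It is not the authors' evaluation code: the function names (`top_k_hit`, `top_k_accuracy`), the case-insensitive symbol matching, and the toy cases are all assumptions made for this example.

```python
from typing import Iterable, Sequence, Tuple


def top_k_hit(predicted: Sequence[str], diagnosed: str, k: int = 50) -> bool:
    """Return True if the diagnosed gene is among the first k predictions.

    Matching is case-insensitive on the gene symbol (an assumption of this sketch).
    """
    top = [gene.strip().upper() for gene in predicted[:k]]
    return diagnosed.strip().upper() in top


def top_k_accuracy(cases: Iterable[Tuple[Sequence[str], str]], k: int = 50) -> float:
    """Fraction of cases whose diagnosed gene appears in the top-k ranked list."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(top_k_hit(predicted, diagnosed, k) for predicted, diagnosed in cases)
    return hits / len(cases)


# Toy data (hypothetical, not from the study): (ranked predictions, diagnosed gene)
toy_cases = [
    (["BRCA1", "TP53", "FBN1"], "FBN1"),
    (["PTEN", "TP53"], "CFTR"),
    (["tp53", "NF1"], "NF1"),
    (["BRCA1"], "MECP2"),
]

print(top_k_accuracy(toy_cases, k=50))  # 0.5
print(top_k_accuracy(toy_cases, k=1))   # 0.0
```

With this definition, a model that tends to emit frequently cited genes regardless of the phenotype, as in the second and fourth toy cases, scores a miss whenever the diagnosed gene is a less prominent one.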
Source journal metrics: CiteScore 14.70; self-citation rate 4.10%; articles published: 185; review time: 1 month
About the journal: The American Journal of Human Genetics (AJHG) is a monthly journal published by Cell Press, chosen by The American Society of Human Genetics (ASHG) as its premier publication starting from January 2008. AJHG represents Cell Press's first society-owned journal, and both ASHG and Cell Press anticipate significant synergies between AJHG content and that of other Cell Press titles.
Latest articles from this journal
Primary cartilage transcriptional signatures reflect cell-type-specific molecular pathways underpinning osteoarthritis.
The PRIMED Consortium: Reducing disparities in polygenic risk assessment.
The methylomic landscape of human articular cartilage development contains epigenetic signatures of osteoarthritis risk.
Comparative analysis of predicted DNA secondary structures infers complex human centromere topology.
Toward trustable use of machine learning models of variant effects in the clinic.