Gonzalo Benegas, Gokcen Eraslan, Yun S Song
DOI: 10.1101/2025.02.11.637758 (https://doi.org/10.1101/2025.02.11.637758)
Journal: bioRxiv : the preprint server for biology
Publication date: 2025-03-04
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11844472/pdf/
Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics.

Machine learning holds immense promise in biology, particularly for the challenging task of identifying causal variants for Mendelian and complex traits. Two primary approaches have emerged for this task: supervised sequence-to-function models trained on functional genomics experimental data and self-supervised DNA language models that learn evolutionary constraints on sequences. However, the field currently lacks consistently curated datasets with accurate labels, especially for non-coding variants, that are necessary to comprehensively benchmark these models and advance the field. In this work, we present TraitGym, a curated dataset of regulatory genetic variants that are either known to be causal or are strong candidates across 113 Mendelian and 83 complex traits, along with carefully constructed control variants. We frame the causal variant prediction task as a binary classification problem and benchmark various models, including functional-genomics-supervised models, self-supervised models, models that combine machine learning predictions with curated annotation features, and ensembles of these. Our results provide insights into the capabilities and limitations of different approaches for predicting the functional consequences of non-coding genetic variants. We find that alignment-based models CADD and GPN-MSA compare favorably for Mendelian traits and complex disease traits, while functional-genomics-supervised models Enformer and Borzoi perform better for complex non-disease traits. Evo2 shows substantial performance gains with scale, but still lags somewhat behind alignment-based models, struggling particularly with enhancer variants. The benchmark, including a Google Colab notebook to evaluate a model in a few minutes, is available at https://huggingface.co/datasets/songlab/TraitGym.
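The binary classification framing described above can be illustrated with a small scoring routine. This is a minimal sketch, not the benchmark's own evaluation code: the abstract does not name the metric, so average precision (area under the precision-recall curve) is an assumption here, and the function name, labels, and scores are hypothetical toy values rather than TraitGym data.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve for binary labels.

    labels: 1 for a putative causal variant, 0 for a control variant.
    scores: model output, where a higher score means "more likely causal".
    Ties in score are broken by input order (Python's sort is stable).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank variants from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Hypothetical toy example: two causal variants among six candidates.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print("average precision:", average_precision(toy_labels, toy_scores))
print("random baseline (fraction of positives):", sum(toy_labels) / len(toy_labels))
```

Any model that assigns one score per variant, whether a sequence-to-function model, a DNA language model, or an ensemble, could be compared this way. A metric of this kind is most informative when read against the fraction of positives, which is what a random ranking would achieve on average.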
