Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics.

bioRxiv : the preprint server for biology; Pub Date: 2025-03-04; DOI: 10.1101/2025.02.11.637758

Gonzalo Benegas, Gökcen Eraslan, Yun S Song



Abstract

Machine learning holds immense promise in biology, particularly for the challenging task of identifying causal variants for Mendelian and complex traits. Two primary approaches have emerged for this task: supervised sequence-to-function models trained on functional genomics experimental data and self-supervised DNA language models that learn evolutionary constraints on sequences. However, the field currently lacks consistently curated datasets with accurate labels, especially for non-coding variants, that are necessary to comprehensively benchmark these models and advance the field. In this work, we present TraitGym, a curated dataset of regulatory genetic variants that are either known to be causal or are strong candidates across 113 Mendelian and 83 complex traits, along with carefully constructed control variants. We frame the causal variant prediction task as a binary classification problem and benchmark various models, including functional-genomics-supervised models, self-supervised models, models that combine machine learning predictions with curated annotation features, and ensembles of these. Our results provide insights into the capabilities and limitations of different approaches for predicting the functional consequences of non-coding genetic variants. We find that alignment-based models CADD and GPN-MSA compare favorably for Mendelian traits and complex disease traits, while functional-genomics-supervised models Enformer and Borzoi perform better for complex non-disease traits. Evo2 shows substantial performance gains with scale, but still lags somewhat behind alignment-based models, struggling particularly with enhancer variants. The benchmark, including a Google Colab notebook to evaluate a model in a few minutes, is available at https://huggingface.co/datasets/songlab/TraitGym.
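As a minimal sketch of the binary classification framing described above, the snippet below ranks variants by a model score and summarizes the ranking with average precision (area under the precision-recall curve). The choice of metric is an assumption, since the abstract does not name one, and the labels and scores are hypothetical toy values, not TraitGym data. The function name `average_precision` is illustrative only.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for binary labels (1 = causal, 0 = control).

    Computes the mean of precision@k over the ranks k at which a positive
    variant appears. Ties in score are broken by input order, which is a
    simplification made for this sketch.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank variants from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical toy example: two putative causal variants among six.
toy_labels = [1, 0, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
print(f"average precision: {toy_ap:.3f}")
```

A model that ranks every causal variant above every control would reach 1.0 under this metric; a model that scores at random tends toward the fraction of positives in the set.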


