Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data.

Bioinformatics (Oxford, England) | Impact factor: 5.4 | Publication date: 2024-11-01 | DOI: 10.1093/bioinformatics/btae565

Brendan O'Fallon, Ashini Bolia, Jacob Durtschi, Luobin Yang, Eric Fredrickson, Hunter Best



Abstract

Motivation: Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.

Results: We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in the same generative fashion as modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples and demonstrate that it learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that it has superior overall accuracy. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, the highest precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.
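To make the phrase "F1-maximizing quality thresholds" concrete, the following is a minimal sketch of how such a threshold could be chosen for one caller. It is an illustration only and is not taken from the Jenever code base or the paper's evaluation pipeline. The names `Call`, `metrics_at`, and `f1_maximizing_threshold` are hypothetical, and the sketch assumes each call has already been labeled as matching or not matching a truth set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Call:
    """One variant call (hypothetical structure, for illustration only)."""
    qual: float    # caller-reported quality score
    is_true: bool  # True if the call matches a truth-set variant


def metrics_at(calls, total_true, threshold):
    """Return (precision, sensitivity, F1) for calls with qual >= threshold.

    total_true is the number of variants in the truth set, so truth
    variants that were never called or were filtered out count as
    false negatives.
    """
    kept = [c for c in calls if c.qual >= threshold]
    tp = sum(c.is_true for c in kept)
    fp = len(kept) - tp
    fn = total_true - tp
    precision = tp / len(kept) if kept else 0.0
    sensitivity = tp / total_true if total_true else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, sensitivity, f1


def f1_maximizing_threshold(calls, total_true):
    """Scan every observed quality value and return (threshold, F1) at the best one.

    Ties are resolved in favor of the lowest threshold.
    """
    candidates = sorted({c.qual for c in calls})
    if not candidates:
        return 0.0, 0.0
    best = max(candidates, key=lambda t: metrics_at(calls, total_true, t)[2])
    return best, metrics_at(calls, total_true, best)[2]


if __name__ == "__main__":
    # Toy data: five calls against a truth set of four variants.
    calls = [
        Call(50, True),
        Call(40, True),
        Call(30, False),
        Call(20, True),
        Call(10, False),
    ]
    threshold, f1 = f1_maximizing_threshold(calls, total_true=4)
    print(threshold, round(f1, 3))
```

Choosing the threshold separately for each caller, as in this sketch, compares callers at their individually best operating points rather than at a single shared cutoff, which matters because quality scores from different tools are not on a common scale.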

Availability and implementation: Jenever is implemented as a Python-based command-line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.


