Brendan O'Fallon, Ashini Bolia, Jacob Durtschi, Luobin Yang, Eric Fredrickson, Hunter Best
{"title":"Generative haplotype prediction outperforms statistical methods for small variant detection in next-generation sequencing data.","authors":"Brendan O'Fallon, Ashini Bolia, Jacob Durtschi, Luobin Yang, Eric Fredrickson, Hunter Best","doi":"10.1093/bioinformatics/btae565","DOIUrl":null,"url":null,"abstract":"<p><strong>Motivation: </strong>Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.</p><p><strong>Results: </strong>We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.</p><p><strong>Availability and implementation: </strong>Jenever is implemented as a python-based command line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11549014/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics (Oxford, England)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioinformatics/btae565","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Motivation: Detection of germline variants in next-generation sequencing data is an essential component of modern genomics analysis. Variant detection tools typically rely on statistical algorithms such as de Bruijn graphs or Hidden Markov models, and are often coupled with heuristic techniques and thresholds to maximize accuracy. Despite significant progress in recent years, current methods still generate thousands of false-positive detections in a typical human whole genome, creating a significant manual review burden.
Results: We introduce a new approach that replaces the handcrafted statistical techniques of previous methods with a single deep generative model. Using a standard transformer-based encoder and double-decoder architecture, our model learns to construct diploid germline haplotypes in a generative fashion identical to modern large language models. We train our model on 37 whole genome sequences from Genome-in-a-Bottle samples, and demonstrate that our method learns to produce accurate haplotypes with correct phase and genotype for all classes of small variants. We compare our method, called Jenever, to FreeBayes, GATK HaplotypeCaller, Clair3, and DeepVariant, and demonstrate that our method has superior overall accuracy compared to other methods. At F1-maximizing quality thresholds, our model delivers the highest sensitivity, precision, and the fewest genotyping errors for insertion and deletion variants. For single nucleotide variants, our model demonstrates the highest sensitivity but at somewhat lower precision, and achieves the highest overall F1 score among all callers we tested.
Availability and implementation: Jenever is implemented as a python-based command line tool. Source code is available at https://github.com/ARUP-NGS/jenever/.