Enricherator: A Bayesian Method for Inferring Regularized Genome-wide Enrichments from Sequencing Count Data

Journal of Molecular Biology, Volume 436, Issue 17, Article 168567. Pub Date: 2024-09-01. DOI: 10.1016/j.jmb.2024.168567

Impact factor: 4.7. Zone 2 (Biology). JCR Q1 (Biochemistry & Molecular Biology). Publisher page: https://www.sciencedirect.com/science/article/pii/S0022283624001621

Abstract

A pervasive question in biological research studying gene regulation, chromatin structure, or genomics is where, and to what extent, does a signal of interest arise genome-wide? This question is addressed using a variety of methods relying on high-throughput sequencing data as their final output, including ChIP-seq for protein-DNA interactions,¹ GapR-seq for measuring supercoiling,² and HBD-seq or DRIP-seq for R-loop positioning.³,⁴ Current computational methods to calculate genome-wide enrichment of the signal of interest usually do not properly handle the count-based nature of sequencing data, they often do not make use of the local correlation structure of sequencing data, and they do not apply any regularization of enrichment estimates. This can result in unrealistic estimates of the true underlying biological enrichment of interest, unrealistically low estimates of confidence in point estimates of enrichment (or no estimates of confidence at all), unrealistic gyrations in enrichment estimates at very close (<10 bp) genomic loci due to noise inherent in sequencing data, and in a multiple-hypothesis testing problem during interpretation of genome-wide enrichment estimates. We developed a tool called Enricherator to infer genome-wide enrichments from sequencing count data. Enricherator uses the variational Bayes algorithm to fit a generalized linear model to sequencing count data and to sample from the approximate posterior distribution of enrichment estimates (https://github.com/jwschroeder3/enricherator). Enrichments inferred by Enricherator more precisely identify known binding sites in cases where low coverage between binding sites leads to false-positive peak calls in these noisy regions of the genome; these benefits extend to published datasets.
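To make the abstract's point about count-aware, regularized estimates concrete, here is a minimal sketch. It is not Enricherator's model or code: it swaps variational Bayes for a much simpler per-position maximum a posteriori (MAP) estimate, ignores local correlation between positions, and all names (`map_log_enrichment`, `prior_sd`) are hypothetical. The assumptions are that input and pulldown counts at one position are independent Poisson draws whose means differ by a factor exp(b), and that b has a zero-centered Gaussian prior. Conditioning on the total count turns the likelihood into a binomial, so the MAP value of b is the unique root of a decreasing score function, found here by bisection.

```python
import math


def sigmoid(b: float) -> float:
    """Numerically stable logistic function."""
    if b >= 0:
        return 1.0 / (1.0 + math.exp(-b))
    e = math.exp(b)
    return e / (1.0 + e)


def map_log_enrichment(chip: int, inp: int, prior_sd: float = 1.0) -> float:
    """MAP natural-log enrichment of pulldown over input at one position.

    Illustrative assumptions (not taken from the paper):
      inp  ~ Poisson(mu), chip ~ Poisson(mu * exp(b)), b ~ Normal(0, prior_sd^2).
    Given n = chip + inp, chip is Binomial(n, sigmoid(b)), so the
    log-posterior gradient is chip - n * sigmoid(b) - b / prior_sd^2.
    """
    n = chip + inp
    var = prior_sd ** 2

    def score(b: float) -> float:
        # Gradient of the log-posterior; strictly decreasing in b.
        return chip - n * sigmoid(b) - b / var

    # The score is positive at -bound and negative at +bound,
    # so the root is bracketed.
    bound = var * (n + 1) + 1.0
    lo, hi = -bound, bound
    for _ in range(200):  # bisection to well below float precision
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # High coverage, low coverage, and no coverage at all.
    for chip, inp in [(2000, 1000), (3, 0), (0, 0)]:
        print(chip, inp, round(map_log_enrichment(chip, inp), 3))
```

With high coverage the estimate stays close to the plain log ratio log(chip/inp). With few or no reads, where the plain ratio is undefined or infinite, the prior pulls the estimate toward zero, which is the kind of behavior that avoids spurious peaks in low-coverage regions.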




Source journal

Journal of Molecular Biology (Biology: Biochemistry and Molecular Biology)

CiteScore: 11.30
Self-citation rate: 1.80%
Articles published: 412
Review time: 28 days

About the journal: Journal of Molecular Biology (JMB) provides high quality, comprehensive and broad coverage in all areas of molecular biology. The journal publishes original scientific research papers that provide mechanistic and functional insights and report a significant advance to the field. The journal encourages the submission of multidisciplinary studies that use complementary experimental and computational approaches to address challenging biological questions. Research areas include but are not limited to: Biomolecular interactions, signaling networks, systems biology; Cell cycle, cell growth, cell differentiation; Cell death, autophagy; Cell signaling and regulation; Chemical biology; Computational biology, in combination with experimental studies; DNA replication, repair, and recombination; Development, regenerative biology, mechanistic and functional studies of stem cells; Epigenetics, chromatin structure and function; Gene expression; Membrane processes, cell surface proteins and cell-cell interactions; Methodological advances, both experimental and theoretical, including databases; Microbiology, virology, and interactions with the host or environment; Microbiota mechanistic and functional studies; Nuclear organization; Post-translational modifications, proteomics; Processing and function of biologically important macromolecules and complexes; Molecular basis of disease; RNA processing, structure and functions of non-coding RNAs, transcription; Sorting, spatiotemporal organization, trafficking; Structural biology; Synthetic biology; Translation, protein folding, chaperones, protein degradation and quality control.
