Filtering STARR-Seq Peaks for Enhancers with Sequence Models

Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2020-09-21. DOI: 10.1145/3388440.3414905

R. J. Nowling, Rafael Reple Geromel, B. Halligan



Abstract

STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity triggers self-transcription, and transfected into cultured cells. The resulting RNA is converted back into cDNA, sequenced, and aligned to a reference genome. "Peaks" are called by using a statistical test to compare the observed read depth at each position with the read depth expected from control DNA. Examples of peak-calling methods based on read depth include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish between real peaks and artifacts in regions where mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content. We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks. 501-bp sequences centered on the ≈11k STARR peaks from [1] were extracted from the Drosophila melanogaster dm3 genome. Randomly sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance value (α = 0.05) to create a "high-confidence" subset of ≈2.2k peaks. A logistic regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classify the remaining ≈8.8k peak sequences. The self-trained, sequence-based model identified an additional ≈3.7k candidate enhancers ("medium confidence"). The remaining ≈5k STARR peaks were considered "low confidence" peaks. We plotted histograms of the read depth log-fold change for the three sets of peaks (high, medium, and low confidence) (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped significantly. The sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone. We called peaks for the 4 D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered for mapping quality (q < 10) with samtools. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped with the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).
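
The three-tier filtering described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not state the k-mer size, the solver, any regularization, or the classification cutoff, so the values below (k = 2, plain batch gradient descent, a 0.5 probability cutoff) are assumptions, as is the choice to divide α by the number of peaks for the Bonferroni threshold. The function names (`kmer_counts`, `train_logistic`, `tier_peaks`) and the 12-bp toy sequences are hypothetical and stand in for the 501-bp sequences.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k=2):
    """Count overlapping k-mers of seq in a fixed A/C/G/T vocabulary order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.05, epochs=200):
    """Fit unregularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def tier_peaks(peaks, negatives, alpha=0.05, k=2):
    """Split (sequence, p_value) peaks into high/medium/low confidence tiers."""
    # Assumption: one statistical test per peak, so the threshold is alpha / n.
    threshold = alpha / len(peaks)
    high = [s for s, p in peaks if p <= threshold]
    rest = [s for s, p in peaks if p > threshold]

    # Train on high-confidence peaks (label 1) versus the negative set (label 0).
    X = [kmer_counts(s, k) for s in high + negatives]
    y = [1] * len(high) + [0] * len(negatives)
    w, b = train_logistic(X, y)

    # Peaks the sequence model accepts become "medium"; the others are "low".
    tiers = {"high": high, "medium": [], "low": []}
    for s in rest:
        x = kmer_counts(s, k)
        score = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
        tiers["medium" if score > 0.5 else "low"].append(s)
    return tiers


# Toy data only: made-up sequences and p-values, not values from the paper.
peaks = [
    ("GAGAGAGAGAGA", 1e-6),
    ("AGAGAGAGAGAG", 1e-5),
    ("GAGAGAGAGAGT", 0.02),
    ("TTTTTCTCTCTT", 0.03),
]
negatives = ["TTTTTTTTTTTT", "CTCTCTCTCTCT"]
tiers = tier_peaks(peaks, negatives)
print(tiers)
```

Raw counts and a fixed learning rate are adequate for sequences this short. For real 501-bp inputs, an established solver with regularization, or normalized k-mer frequencies, would likely be a more robust choice.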
