Filtering STARR-Seq Peaks for Enhancers with Sequence Models

R. J. Nowling, Rafael Reple Geromel, B. Halligan
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-09-21.
DOI: 10.1145/3388440.3414905 (https://doi.org/10.1145/3388440.3414905)

Abstract

STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity triggers self-transcription, and transfected into cultured cells. The resulting RNA is converted back into cDNA, sequenced, and aligned to a reference genome. "Peaks" are called by using a statistical test to compare the observed read depth at each point with the read depth expected from control DNA. Examples of peak-calling methods based on read depth include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish real peaks from artifacts in regions where the mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content.

We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks. We extracted 501-bp sequences centered on the ≈11k STARR peaks from [1] from the Drosophila melanogaster dm3 genome. Randomly sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance value (α = 0.05) to create a "high-confidence" subset of ≈2.2k peaks. A Logistic Regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classify the remaining ≈8.8k peak sequences. The self-trained, sequence-based model identified an additional ≈3.7k candidate enhancers ("medium confidence"). The remaining ≈5k STARR peaks were considered "low confidence" peaks.

We plotted histograms of the read depth log-fold change for the three sets of peaks (high, medium, and low confidence) (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped substantially, so the sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone. We called peaks for the 4 D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered on mapping quality (q < 10) with samtools. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped with the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).
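
The three-tier filtering described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. Several details are assumptions made for illustration: the function names (kmer_counts, train_logreg, predict_proba, filter_peaks) are hypothetical, the Bonferroni correction divides α by the number of candidate peaks, the classifier is a plain gradient-descent logistic regression without regularization, the medium/low cutoff is a probability of 0.5, and the toy sequences are far shorter than the 501-bp windows used in the study.

```python
import math
from collections import Counter
from itertools import product


def kmer_counts(seq, k=2):
    """Count overlapping k-mers in seq, returned in a fixed A/C/G/T order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[kmer] for kmer in vocab]


def predict_proba(w, b, x):
    """Logistic function of the linear score; the score is clamped to avoid overflow."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the logistic loss (assumed, unregularized)."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def filter_peaks(peaks, negatives, k=2, alpha=0.05, cutoff=0.5):
    """Assign each peak to a high, medium, or low confidence tier.

    peaks: list of (name, p_value, sequence) tuples.
    negatives: list of background sequences (randomly sampled in the study).
    """
    # Assumption: Bonferroni threshold computed over the candidate peaks.
    threshold = alpha / len(peaks)
    high = [p for p in peaks if p[1] <= threshold]
    rest = [p for p in peaks if p[1] > threshold]

    # Train on high-confidence peaks (label 1) versus negatives (label 0).
    X = [kmer_counts(seq, k) for _, _, seq in high]
    X += [kmer_counts(seq, k) for seq in negatives]
    y = [1] * len(high) + [0] * len(negatives)
    w, b = train_logreg(X, y)

    # Score the remaining peaks with the sequence model.
    tiers = {name: "high" for name, _, _ in high}
    for name, _, seq in rest:
        prob = predict_proba(w, b, kmer_counts(seq, k))
        tiers[name] = "medium" if prob >= cutoff else "low"
    return tiers


# Toy data: GA-rich "enhancer-like" sequences versus CT-rich background.
toy_peaks = [
    ("p1", 1e-9, "GAGAGAGAGA"),
    ("p2", 1e-8, "AGAGAGAGAG"),
    ("p3", 0.02, "GAGAGAGA"),
    ("p4", 0.03, "CTCTCTCT"),
]
toy_negatives = ["CTCTCTCTCT", "TCTCTCTCTC"]
print(filter_peaks(toy_peaks, toy_negatives))
```

In this toy run, peaks p3 and p4 both miss the corrected threshold (0.05 / 4 = 0.0125), so read depth alone would discard them; the sequence model is what separates the GA-rich p3 from the background-like p4. A real analysis would more likely use an established library classifier and k-mer sizes suited to 501-bp windows.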