{"title":"Filtering STARR-Seq Peaks for Enhancers with Sequence Models","authors":"R. J. Nowling, Rafael Reple Geromel, B. Halligan","doi":"10.1145/3388440.3414905","DOIUrl":null,"url":null,"abstract":"STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity trigger self-transcription, and transfected into culture cells. The resulting RNA is converted back into cDNA, sequenced, and aligned to a reference genome. \"Peaks\" are called by comparing observed read depth at each point to an expected read depth from control DNA using a statistical test. Examples of peak calling methods based on read depth include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish between real peaks and artifacts in regions where mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content. We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks. 501-bp sequences centered on the ≈11k STARR peaks from [1] were extracted from the Drosophila melanogaster dm3 genome. Randomly-sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance value (α = 0.05) to create a \"high-confidence\" subset of ≈2.2k peaks. A Logistic Regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classifying the remaining ≈8.8k peak sequences. The self-trained, sequenced-based model identified an additional ≈3.7k candidate enhancers (\"medium confidence\"). The remaining ≈5k STARR peaks were considered \"low confidence\" peaks. We plotted histograms of the read depth log-fold change for the three sets of peaks (high, medium, and low confidence) (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped significantly. The sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone. We called peaks for the 4 D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered for mapping quality (q < 10) with samtools. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped with the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).","PeriodicalId":411338,"journal":{"name":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3388440.3414905","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
STARR-Seq is a high-throughput technique for directly identifying genomic regions with enhancer activity [1]. Genomic DNA is sheared, inserted into artificial plasmids designed so that DNA with enhancer activity trigger self-transcription, and transfected into culture cells. The resulting RNA is converted back into cDNA, sequenced, and aligned to a reference genome. "Peaks" are called by comparing observed read depth at each point to an expected read depth from control DNA using a statistical test. Examples of peak calling methods based on read depth include MACS2 [4], basicSTARRSeq, and STARRPeaker [3]. It is challenging to accurately distinguish between real peaks and artifacts in regions where mean read depth is low but the variance is high. Fortunately, enhancer activity is strongly correlated with sequence content. We propose using sequence-based machine learning models in a semi-supervised framework to filter peaks. 501-bp sequences centered on the ≈11k STARR peaks from [1] were extracted from the Drosophila melanogaster dm3 genome. Randomly-sampled 501-bp sequences were used as a negative set. Peaks were filtered using a Bonferroni-corrected significance value (α = 0.05) to create a "high-confidence" subset of ≈2.2k peaks. A Logistic Regression model with k-mer count features was trained on the high-confidence peak sequences and their negatives and used to classifying the remaining ≈8.8k peak sequences. The self-trained, sequenced-based model identified an additional ≈3.7k candidate enhancers ("medium confidence"). The remaining ≈5k STARR peaks were considered "low confidence" peaks. We plotted histograms of the read depth log-fold change for the three sets of peaks (high, medium, and low confidence) (see Figure 1). The distributions for the medium- and low-confidence peaks overlapped significantly. The sequence-based model identified enhancer candidates that would otherwise be filtered out using read depth alone. We called peaks for the 4 D. melanogaster FAIRE-Seq data sets from [2]. Sequencing data were cleaned with Trimmomatic, aligned to the dm3 genome with bwa backtrack, and filtered for mapping quality (q < 10) with samtools. MACS2 called ≈61k FAIRE peaks. The STARR peaks overlapped with the FAIRE peaks with precisions of 52.7% (high-confidence peaks), 40.6% (medium-confidence peaks), and 22.5% (low-confidence peaks).