PeakMatcher

R. J. Nowling, C. R. Beal, Scott J. Emrich, S. Behura, M. Halfon, M. Duman-Scheel
Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2020-09-21. DOI: 10.1145/3388440.3414907 (https://doi.org/10.1145/3388440.3414907)

Abstract

When reference genome assemblies are updated, the peaks from DNA enrichment assays such as ChIP-Seq and FAIRE-Seq need to be called again using the new genome assembly. PeakMatcher is an open-source package that aids in validating the new peak calls by matching peaks across two genome assemblies, or within the same genome, using read alignments. PeakMatcher calculates recall and precision and outputs lists of peak-to-peak matches. To match peaks across genome assemblies, PeakMatcher first finds all reads aligned to one genome that overlap a given list of peaks. It then uses the read names to locate where those reads are aligned against a second genome. Lastly, all peaks called against the second genome that overlap the aligned reads are found and output. PeakMatcher uses the resulting peak-read-peak relationships to discover 1-to-1, 1-to-many, and many-to-many relationships. Overlap queries are performed with interval trees for efficiency. We evaluated PeakMatcher on two data sets. The first data set was FAIRE-Seq (Formaldehyde-Assisted Isolation of Regulatory Elements Sequencing) of DNA isolated from embryos of the mosquito Aedes aegypti [2, 4]. We implemented a peak-calling pipeline and validated it on the older (highly fragmented) AaegL3 assembly [5]. PeakMatcher matched 92.9% (precision) of the 121,594 previously called peaks from [2, 4] with 89.4% (recall) of the 124,959 peaks called with our new pipeline. Next, we applied the peak-calling pipeline to call FAIRE peaks using the newer, chromosome-complete AaegL5 assembly [3]. PeakMatcher found matches for 14 of the 16 experimentally validated AaegL3 FAIRE peaks from [2, 4]. We validated the matches by comparing nearby genes across the genomes. Nearby genes were consistent for 11 of the 14 peaks; inconsistencies for at least two of the remaining peaks were clearly attributable to differences between the assemblies. When applied to all of the peaks, PeakMatcher matched 78.8% (precision) of the 124,959 AaegL3 peaks with 76.7% (recall) of the 128,307 AaegL5 peaks. The second data set was STARR-Seq (Self-Transcribing Active Regulatory Region Sequencing) of Drosophila melanogaster DNA in S2 culture cells [1]. We called STARR peaks against two versions (dm3 and r5.53) of the D. melanogaster genome [6]. PeakMatcher matched 77.4% (precision) of the 4,195 dm3 peaks with 94.8% (recall) of the 3,114 r5.53 peaks. PeakMatcher and associated documentation are available on GitHub (https://github.com/rnowling/peak-matcher) under the open-source Apache License 2.0. PeakMatcher was written in Python 3 using the intervaltree library.
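
The peak-read-peak matching described in the abstract can be illustrated with a minimal sketch. This is not PeakMatcher's actual code or API: the function names, the dictionary-based data layout, and the toy coordinates are all hypothetical, and a plain linear scan stands in for the interval trees to keep the example free of third-party dependencies. The sketch reports the fraction of peaks matched in each set (which the abstract labels precision and recall) and groups matches by taking connected components of the peak-to-peak graph, which is one possible way to recover 1-to-1, 1-to-many, and many-to-many relationships.

```python
from collections import defaultdict


def overlapping(intervals, chrom, start, end):
    """Return names of intervals on `chrom` overlapping half-open [start, end).

    Linear scan for clarity only; an interval tree would answer this faster.
    """
    return [
        name
        for name, (c, s, e) in intervals.items()
        if c == chrom and s < end and start < e
    ]


def match_peaks(peaks_a, reads_a, reads_b, peaks_b):
    """Link peaks across two assemblies via shared read names."""
    edges = set()
    for peak_a, (chrom, start, end) in peaks_a.items():
        for read in overlapping(reads_a, chrom, start, end):
            if read not in reads_b:  # read did not align to assembly B
                continue
            for peak_b in overlapping(peaks_b, *reads_b[read]):
                edges.add((peak_a, peak_b))
    return edges


def matched_fractions(edges, peaks_a, peaks_b):
    """Fraction of peaks in each set with at least one match."""
    hit_a = {a for a, _ in edges}
    hit_b = {b for _, b in edges}
    return len(hit_a) / len(peaks_a), len(hit_b) / len(peaks_b)


def group_matches(edges):
    """Label each connected component of the bipartite match graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[("A", a)].add(("B", b))
        neighbors[("B", b)].add(("A", a))
    seen, groups = set(), []
    for node in sorted(neighbors):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:  # depth-first traversal of one component
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        side_a = sorted(name for side, name in component if side == "A")
        side_b = sorted(name for side, name in component if side == "B")
        if len(side_a) == 1 and len(side_b) == 1:
            kind = "1-to-1"
        elif len(side_a) == 1 or len(side_b) == 1:
            kind = "1-to-many"
        else:
            kind = "many-to-many"
        groups.append((kind, side_a, side_b))
    return groups


# Toy data (hypothetical): name -> (chromosome, start, end), half-open.
peaks_a = {"a1": ("scaffold1", 100, 200), "a2": ("scaffold2", 50, 150),
           "a3": ("scaffold3", 0, 80)}
reads_a = {"r1": ("scaffold1", 120, 170), "r2": ("scaffold2", 60, 110),
           "r3": ("scaffold2", 100, 150), "r4": ("scaffold3", 10, 60)}
reads_b = {"r1": ("chr1", 1000, 1050), "r2": ("chr2", 500, 550),
           "r3": ("chr2", 900, 950)}  # r4 is unaligned in assembly B
peaks_b = {"b1": ("chr1", 980, 1100), "b2": ("chr2", 480, 560),
           "b3": ("chr2", 880, 960), "b4": ("chr3", 0, 100)}

edges = match_peaks(peaks_a, reads_a, reads_b, peaks_b)
frac_a, frac_b = matched_fractions(edges, peaks_a, peaks_b)
groups = group_matches(edges)
```

In this toy input, peak a1 pairs with b1 alone, a2 is linked to both b2 and b3 through two different reads, and a3 goes unmatched because its only read has no alignment in the second assembly. For real data, replacing `overlapping` with per-chromosome interval-tree queries would avoid scanning every interval for each lookup.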