Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings

Jongik Kim, Chen Li, Xiaohui Xie
Published in: 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 169-180
Publication date: 2016-05-16
DOI: 10.1109/ICDE.2016.7498238 (https://doi.org/10.1109/ICDE.2016.7498238)
Citations: 19

Abstract

Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.
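The grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_qgram_index` and `signature_candidates`, the choice of which q-grams cover a signature, and the toy reference string are all assumptions made for illustration. The sketch builds a fixed-length q-gram inverted index, then derives the exact-match positions of a longer signature by intersecting the offset-shifted inverted lists of the q-grams that cover it.

```python
from collections import defaultdict


def build_qgram_index(reference, q):
    """Map every q-gram to the ascending list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - q + 1):
        index[reference[i:i + q]].append(i)
    return index


def signature_candidates(index, query, start, length, q):
    """Return reference positions where query[start:start+length] occurs,
    using only the inverted lists of fixed-length q-grams (hypothetical helper).
    """
    if length < q:
        raise ValueError("a signature must be at least q characters long")
    # Cover the signature with q-grams at offsets 0, q, 2q, ... and, if needed,
    # one final (possibly overlapping) q-gram ending at the signature's end.
    offsets = list(range(0, length - q + 1, q))
    if offsets[-1] != length - q:
        offsets.append(length - q)
    candidates = None
    for off in offsets:
        gram = query[start + off:start + off + q]
        # Shift each hit back by its offset so that all lists refer to the
        # start position of the whole signature.
        shifted = {p - off for p in index.get(gram, ())}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            break  # no position is consistent with every q-gram
    return sorted(candidates)


# Toy reference, assumed for illustration only.
reference = "ACGTACGTTACGA"
index = build_qgram_index(reference, q=3)

print(signature_candidates(index, "ACG", 0, 3, 3))    # [0, 4, 9]
print(signature_candidates(index, "ACGTT", 0, 5, 3))  # [4]
```

In this toy case the 3-gram `ACG` alone yields three candidate positions, while the 5-character signature `ACGTT` yields one, at the cost of reading and intersecting a second inverted list. That trade-off between filtering power and candidate-generation overhead is what the abstract says the dynamic programming algorithm balances; the sketch does not attempt to reproduce that algorithm.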