Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings

2016 IEEE 32nd International Conference on Data Engineering (ICDE) Pub Date : 2016-05-16 DOI:10.1109/ICDE.2016.7498238

Jongik Kim, Chen Li, Xiaohui Xie

{"title":"Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings","authors":"Jongik Kim, Chen Li, Xiaohui Xie","doi":"10.1109/ICDE.2016.7498238","DOIUrl":null,"url":null,"abstract":"Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"26 1","pages":"169-180"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2016.7498238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 19

Abstract

Recent advances in DNA sequencing have enabled a flood of sequencing-based applications for studying biology and medicine. A key requirement of these applications is to rapidly and accurately map DNA subsequences to a reference genome. This DNA subsequence mapping problem shares core technical challenges with the similarity query processing problem studied in the database research literature. To solve this problem, existing techniques first extract signatures from a query, then retrieve candidate mapping positions from an index using the extracted signatures, and finally verify the candidate positions. The efficiency of these techniques depends critically on signatures selected from queries, while signature selection relies on an indexing scheme of a reference genome. The q-gram inverted indexing, one of the most widely used indexing schemes, can discover candidate positions quickly, but has the limitation that signatures of queries are restricted to fixed-length q-grams. To address the problem, we propose a flexible way to generate variable-length signatures using a fixed-length q-gram index. The proposed technique groups a few q-grams into a variable-length signature, and generates candidate positions for the variable-length signature using the inverted lists of the q-grams. We also propose a novel dynamic programming algorithm to balance between the filtering power of signatures and the overhead of generating candidate positions for the signatures. Through extensive experiments on both simulated and real genomic data, we show that our technique substantially improves the performance of read mapping in terms of both mapping speed and accuracy.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

动态生成有效的近似子序列映射的变长签名

DNA测序的最新进展使基于测序的应用程序在生物学和医学研究中成为可能。这些应用的一个关键要求是快速准确地将DNA子序列映射到参考基因组。该DNA子序列映射问题与数据库研究文献中研究的相似度查询处理问题具有相同的核心技术挑战。为了解决这个问题，现有技术首先从查询中提取签名，然后使用提取的签名从索引中检索候选映射位置，最后验证候选位置。这些技术的效率主要取决于从查询中选择的签名，而签名选择依赖于参考基因组的索引方案。q-gram倒排索引是目前使用最广泛的索引方案之一，它可以快速发现候选位置，但其缺点是查询的签名仅限于固定长度的q-gram。为了解决这个问题，我们提出了一种灵活的方法来使用固定长度的q-gram索引生成变长签名。该技术将几个q-g分组为变长签名，并使用q-g的倒排表生成变长签名的候选位置。我们还提出了一种新的动态规划算法来平衡签名的过滤能力和为签名生成候选位置的开销。通过对模拟和真实基因组数据的大量实验，我们表明我们的技术在映射速度和精度方面都大大提高了读映射的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2016 IEEE 32nd International Conference on Data Engineering (ICDE)

自引率

0.00%

发文量

期刊最新文献

Data profiling SEED: A system for entity exploration and debugging in large-scale knowledge graphs TemProRA: Top-k temporal-probabilistic results analysis Durable graph pattern queries on historical graphs SCouT: Scalable coupled matrix-tensor factorization - algorithm and discoveries