Computing Maximal Unique Matches with the r-index

Bulletin of the Society of Sea Water Science, Japan. Pub Date: 2022-05-03. DOI: 10.48550/arXiv.2205.01576

Sara Giuliani, Giuseppe Romana, Massimiliano Rossi

Citations: 2

Abstract

In recent years, pangenomes have received increasing attention from the scientific community for their ability to incorporate population variation information and alleviate reference genome bias. Maximal Exact Matches (MEMs) and Maximal Unique Matches (MUMs) have proven useful in multiple bioinformatic contexts, for example short-read alignment and multiple-genome alignment. However, standard techniques using suffix trees and FM-indexes do not scale to a pangenomic level. Recently, Gagie et al. [JACM 20] introduced the r-index, a Burrows-Wheeler Transform (BWT)-based index able to handle hundreds of human genomes. Later, Rossi et al. [JCB 22] enabled the computation of MEMs using the r-index, and Boucher et al. [DCC 21] showed how to compute them in a streaming fashion. In this paper, we show how to augment Boucher et al.'s approach to enable the computation of MUMs on the r-index while preserving the space and time bounds. We add O(r) samples of the longest common prefix (LCP) array, where r is the number of equal-letter runs of the BWT. These samples permit the computation of the second longest match of the pattern suffix with respect to the input text, which in turn allows the computation of candidate MUMs. We implemented a proof of concept of our approach, which we call mum-phinder, and tested it on real-world datasets. We compared our approach with competing methods that are able to compute MUMs. We observe that, when the dataset is not highly repetitive, our method uses up to 8 times less memory while being up to 19 times slower; on highly repetitive data, it is up to 6.5 times slower and uses up to 25 times less memory.
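To make the notion of a MUM concrete, the following is a minimal brute-force sketch in Python. It is not the r-index method described in the abstract and does not reflect how mum-phinder is implemented; the function names `count_occurrences` and `naive_mums` are hypothetical, chosen for illustration. It assumes the usual definition: a MUM is a substring that occurs exactly once in the text and exactly once in the pattern, and that cannot be extended to the left or right while still matching.

```python
def count_occurrences(s, sub):
    """Count occurrences of sub in s, allowing overlaps."""
    count = 0
    start = s.find(sub)
    while start != -1:
        count += 1
        start = s.find(sub, start + 1)
    return count


def naive_mums(text, pattern, min_len=1):
    """Return sorted (text_pos, pattern_pos, length) triples, one per MUM.

    Brute force over all substrings of the pattern; only suitable for
    tiny inputs. Illustrative only, not the paper's algorithm.
    """
    n, m = len(text), len(pattern)
    mums = set()
    for i in range(m):
        for j in range(i + min_len, m + 1):
            sub = pattern[i:j]
            # Uniqueness: exactly one occurrence in each string.
            if count_occurrences(text, sub) != 1:
                continue
            if count_occurrences(pattern, sub) != 1:
                continue
            t = text.find(sub)
            # Maximality: the match cannot grow by one character on
            # either side. Because sub is unique in both strings, any
            # extension that still matches would also be unique.
            left_ext = i > 0 and t > 0 and pattern[i - 1] == text[t - 1]
            right_ext = (
                j < m and t + len(sub) < n and pattern[j] == text[t + len(sub)]
            )
            if not left_ext and not right_ext:
                mums.add((t, i, j - i))
    return sorted(mums)


if __name__ == "__main__":
    print(naive_mums("abcxyz", "xyzabc"))
```

On the toy input above, the two blocks "abc" and "xyz" are each unique in both strings and cannot be extended, so they are the only MUMs reported. This sketch rescans both strings for every candidate substring, which is exactly the kind of cost that index-based approaches such as the one in the abstract are designed to avoid.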
