BWT construction and search at the terabase scale
Heng Li
arXiv - QuanBio - Genomics, published 2024-09-01
DOI: arxiv-2409.00613 (https://doi.org/arxiv-2409.00613)
Citation count: 0
Abstract
Motivation: The Burrows-Wheeler Transform (BWT) is a common component of
full-text indices. Initially developed for data compression, it is particularly
powerful for encoding redundant sequences such as pangenome data. However, BWT
construction is resource intensive and hard to parallelize, and many
methods for querying large full-text indices report only exact matches or
simple extensions of them. These limitations have hampered the biological
applications of full-text indices.

Results: We developed ropebwt3 for efficient BWT construction and query.
Ropebwt3 could index 100 assembled human genomes in 21 hours and index 7.3
terabases of commonly studied bacterial assemblies in 26 days. This was
achieved using 82 gigabytes of memory at the peak without working disk space.
Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap
penalties, and can retrieve all distinct local haplotypes matching a query
sequence. It demonstrates the feasibility of full-text indexing at the terabase
scale.

Availability and implementation: https://github.com/lh3/ropebwt3
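
For readers unfamiliar with how a BWT supports exact matching, the following toy sketch may help. It is an illustration of the textbook technique only and is not how ropebwt3 builds or queries its index: the BWT here is built naively by sorting all rotations, which would be far too slow and memory hungry for real genomes, and the function names (`bwt`, `build_tables`, `count_exact`) are hypothetical names chosen for this example. The search is the standard backward search over the BWT, which counts occurrences of a pattern without scanning the text.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive BWT: sort all rotations of text+sentinel and take the last column.

    Toy construction for illustration only; it needs quadratic memory.
    The sentinel must not occur in the text and must sort before all symbols.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def build_tables(bwt_str: str):
    """Build the two tables used by backward search.

    C[c]      = number of symbols in the BWT that sort strictly before c
    occ[c][i] = number of occurrences of c in bwt_str[:i]
    """
    alphabet = sorted(set(bwt_str))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_exact(pattern: str, bwt_str: str, C, occ) -> int:
    """Count exact occurrences of pattern using backward search."""
    lo, hi = 0, len(bwt_str)  # half-open interval of sorted suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    text = "GATTACAGATTACA"
    b = bwt(text)
    C, occ = build_tables(b)
    print(count_exact("ATTA", b, C, occ))  # prints 2
```

Each step of the backward search narrows an interval of sorted suffixes by one pattern character, so the query cost depends on the pattern length rather than the text length. Finding maximal exact matches or inexact alignments, as described in the abstract, requires more machinery than this sketch shows.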