BWT construction and search at the terabase scale.

Heng Li
Bioinformatics (Oxford, England). Journal Article. Published 2024-11-28.
DOI: https://doi.org/10.1093/bioinformatics/btae717
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11646566/pdf/

Abstract

Motivation: The Burrows-Wheeler Transform (BWT) is a common component of full-text indices. Initially developed for data compression, it is particularly powerful for encoding redundant sequences such as pangenome data. However, BWT construction is resource-intensive and hard to parallelize, and many methods for querying large full-text indices report only exact matches or simple extensions of them. These limitations have hampered the biological applications of full-text indices.

Results: We developed ropebwt3 for efficient BWT construction and query. Ropebwt3 indexed 320 assembled human genomes in 65 h and 7.3 terabases of commonly studied bacterial assemblies in 26 days, using at most 170 gigabytes of memory at the peak and no working disk space. Ropebwt3 can find maximal exact matches and inexact alignments under affine-gap penalties, and can retrieve similar local haplotypes matching a query sequence. It demonstrates the feasibility of full-text indexing at the terabase scale.

Availability and implementation: https://github.com/lh3/ropebwt3.
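
For readers unfamiliar with how a BWT serves as a full-text index, the following is a minimal textbook sketch, not ropebwt3's construction or query algorithm. It builds the BWT of a short string by sorting rotations (impractical at any real scale) and counts exact occurrences of a pattern by backward search. The function names `bwt`, `build_fm`, and `count_exact` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Naive BWT via sorted rotations. Assumes '$' does not occur in text;
    it is appended as a sentinel that sorts before all letters."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def build_fm(b: str):
    """Build the two tables backward search needs:
    C[c]      = number of symbols in b that are smaller than c
    occ[c][i] = number of occurrences of c in b[:i]
    """
    counts = Counter(b)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(b) + 1) for c in counts}
    for i, ch in enumerate(b):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ


def count_exact(b: str, pattern: str) -> int:
    """Count exact occurrences of pattern in the text whose BWT is b."""
    C, occ = build_fm(b)
    lo, hi = 0, len(b)  # half-open interval of matching suffixes
    for c in reversed(pattern):  # extend the match one symbol to the left
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("GATTACAGATTACA")
    print(count_exact(b, "TTAC"))
```

The full `occ` table here costs memory proportional to text length times alphabet size; practical indices use sampled or compressed rank structures instead, which this sketch does not attempt to show.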

