An optimized FM-index library for nucleotide and amino acid search.

Pub Date: 2021-12-31 DOI: 10.1186/s13015-021-00204-6

Tim Anderson, Travis J Wheeler


Citations: 3

Abstract

Background: Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementing an FM-index is fairly complicated, so a fast and flexible FM-index library should help increase adoption.
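To make the run-time claim concrete, here is a minimal sketch of FM-index backward search in Python. It is a textbook illustration, not the AWFM-index implementation: the function names are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed. The point it shows is that a count query performs one step per pattern character, regardless of how long the indexed text is.

```python
def build_fm_index(text):
    """Return (suffix_array, c_table, occ) for text plus a '$' sentinel."""
    text += "$"  # sentinel; assumed absent from text and smaller than every symbol
    # Naive suffix array construction, fine for a small illustration only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch]: number of characters in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(set(text)):
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: occurrences of ch in bwt[:i] (uncompressed prefix counts).
    occ = {ch: [0] for ch in c_table}
    for b in bwt:
        for ch, prefix in occ.items():
            prefix.append(prefix[-1] + (b == ch))
    return sa, c_table, occ


def backward_search(pattern, c_table, occ, n):
    """Return the half-open suffix array interval [lo, hi) matching pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in c_table:
            return 0, 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return hi - lo


def locate(pattern, index):
    sa, c_table, occ = index
    lo, hi = backward_search(pattern, c_table, occ, len(sa))
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTACA")
print(count("ATTA", index))   # 2
print(locate("ATTA", index))  # [1, 8]
```

A production library replaces the uncompressed `occ` table and the full suffix array with compact structures, which is where most of the implementation complexity comes from.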

Results: We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index's suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3's FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is ~2-4x faster than SeqAn3 for nucleotide search, and ~2-6x faster for amino acid search; it is also ~4x faster with a similar memory footprint when the suffix array is stored on an SSD.
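The idea of computing the occurrence function from bit-vectors can be sketched as follows. This is a simplified illustration under stated assumptions, not the layout AWFM-index uses: it keeps one bit-vector per symbol in each block (the paper's strided encoding may differ), uses Python integers in place of 256-bit AVX2 registers, and the names `build_blocks` and `occ` are hypothetical. Each block stores the symbol counts accumulated before it, so a query needs only one block lookup, one mask, and one population count.

```python
BLOCK_BITS = 256  # positions per block, chosen to mirror one 256-bit AVX2 register


def build_blocks(bwt, alphabet, block_bits=BLOCK_BITS):
    """Split the BWT into blocks; each block holds the symbol counts seen
    before it plus one bit-vector (a Python int) per symbol."""
    blocks = []
    running = {c: 0 for c in alphabet}
    for start in range(0, len(bwt), block_bits):
        bits = {c: 0 for c in alphabet}
        for offset, ch in enumerate(bwt[start:start + block_bits]):
            if ch in bits:  # symbols outside the alphabet (e.g. a sentinel) set no bit
                bits[ch] |= 1 << offset
        blocks.append((dict(running), bits))
        for ch, vec in bits.items():
            running[ch] += bin(vec).count("1")
    return blocks


def occ(blocks, ch, i, block_bits=BLOCK_BITS):
    """Number of occurrences of ch in bwt[:i], for 0 <= i <= len(bwt)."""
    # Clamp so that i == len(bwt) on a block boundary uses the last block.
    block_idx = min(i // block_bits, len(blocks) - 1)
    offset = i - block_idx * block_bits
    base_counts, bits = blocks[block_idx]
    mask = (1 << offset) - 1  # keep only positions before i inside the block
    return base_counts[ch] + bin(bits[ch] & mask).count("1")


bwt = "ACGTTGCA" * 100
blocks = build_blocks(bwt, "ACGT")
print(occ(blocks, "G", 300) == bwt[:300].count("G"))  # True
```

In C, the mask-and-count step maps naturally onto bitwise AND and popcount instructions over fixed-width registers, which is the kind of operation the abstract attributes to AVX2.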

Conclusions: AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high level (count or locate all instances of a query string) and a low level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
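As an illustration of why step-wise control can be useful, the sketch below exposes a single backward-search step and uses it to find the longest suffix of a query that occurs in the text, stopping as soon as the interval becomes empty. The functions `bwt_of`, `step`, and `longest_suffix_match` are hypothetical names for this example and are not the AWFM-index API; the occurrence counts are computed naively for brevity.

```python
def bwt_of(text):
    """Burrows-Wheeler transform of text plus a '$' sentinel (naive construction)."""
    text += "$"  # sentinel; assumed absent from text and smaller than every symbol
    order = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in order)


def step(bwt, interval, ch):
    """Extend the current match by prepending ch; return the new [lo, hi) interval."""
    lo, hi = interval
    smaller = sum(1 for b in bwt if b < ch)  # symbols strictly smaller than ch
    return smaller + bwt[:lo].count(ch), smaller + bwt[:hi].count(ch)


def longest_suffix_match(bwt, query):
    """Return (longest suffix of query found in the text, its occurrence count)."""
    interval = (0, len(bwt))
    matched = 0
    for ch in reversed(query):
        nxt = step(bwt, interval, ch)
        if nxt[0] >= nxt[1]:  # extending further would match nothing; stop here
            break
        interval = nxt
        matched += 1
    # If nothing matched, the suffix is empty and the count is the full interval size.
    return query[len(query) - matched:], interval[1] - interval[0]


bwt = bwt_of("GATTACAGATTACA")
print(longest_suffix_match(bwt, "CCTTACA"))  # ('TTACA', 2)
```

A high-level count or locate call is the same loop run to completion over the whole query; exposing the step lets a caller stop early, branch, or inspect intermediate intervals.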


