HAlign 4: A New Strategy for Rapidly Aligning Millions of Sequences.

Bioinformatics (Oxford, England). Pub Date: 2024-11-28. DOI: 10.1093/bioinformatics/btae718

Tong Zhou, Pinglu Zhang, Quan Zou, Wu Han



Abstract

Motivation: HAlign is a high-performance multiple sequence alignment tool based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and with extremely large numbers of sequences.

Results: To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows-Wheeler Transform (BWT) and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million COVID-19 sequences in about 12 minutes with 300 GB of memory using 96 threads, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.
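For readers unfamiliar with the two techniques named above, the following is a minimal illustrative sketch in Python, not HAlign4's C++ implementation. It assumes a naive suffix-sort construction of the BWT, exact-match lookup by backward search, and a unit-cost (edit distance) variant of the wavefront idea. All function names (`bwt_index`, `find`, `wfa_edit_distance`) are hypothetical and chosen only for this example.

```python
def bwt_index(text):
    """Build a naive BWT index of text (assumes '$' does not occur in text)."""
    s = text + "$"  # sentinel, smaller than A/C/G/T in ASCII order
    # Suffix array by sorting suffixes directly: simple but slow, for illustration only.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(s))
    # C[c]: number of characters in s that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(s) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, bwt, C, occ


def find(pattern, index):
    """Return sorted start positions of pattern, using backward search."""
    sa, bwt, C, occ = index
    lo, hi = 0, len(bwt)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


def wfa_edit_distance(a, b):
    """Edit distance via wavefronts: per diagonal, track the furthest point reached."""
    n, m = len(a), len(b)

    def extend(k, j):
        # Slide along diagonal k (k = j - i) while characters match, at no cost.
        i = j - k
        while i < n and j < m and a[i] == b[j]:
            i += 1
            j += 1
        return j

    front = {0: extend(0, 0)}  # diagonal -> furthest offset j reached with `score` edits
    score = 0
    while front.get(m - n, -1) < m:
        score += 1
        nxt = {}
        for k in range(-score, score + 1):
            cands = []
            # (source diagonal, step in j): insertion, mismatch, deletion.
            for src, step in ((k - 1, 1), (k, 1), (k + 1, 0)):
                if src in front:
                    j = front[src] + step
                    i = j - k
                    if 0 <= i <= n and 0 <= j <= m:
                        cands.append(j)
            if cands:
                nxt[k] = extend(k, max(cands))
        front = nxt
    return score


index = bwt_index("ACGTACGT")
print(find("CGT", index))
print(wfa_edit_distance("ACGT", "AGT"))
```

The sketch is meant only to show why these choices can help: the BWT index answers substring queries from compact arrays rather than a pointer-heavy tree, and the wavefront loop does work proportional to the number of differences between two sequences, which is small when the inputs are highly similar.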

Availability: Source code is available at https://github.com/malabz/HAlign-4; the dataset is available at https://zenodo.org/records/13934503.

Supplementary information: Supplementary data are available at Bioinformatics online.
