AltaiR: a C toolkit for alignment-free and temporal analysis of multi-FASTA data.

GigaScience (IF 11.8; Zone 2, Biology; JCR Q1, Multidisciplinary Sciences). Pub Date: 2024-01-02. DOI: 10.1093/gigascience/giae086
Jorge M Silva, Armando J Pinho, Diogo Pratas

Abstract

Background: Most viral genome sequences generated during the latest pandemic have presented new challenges for computational analysis. Analyzing millions of viral genomes in multi-FASTA format is computationally demanding, especially when using alignment-based methods. Most existing methods are not designed to handle such large datasets, often requiring the analysis to be divided into smaller parts to obtain results using available computational resources.

Findings: We introduce AltaiR, a toolkit for analyzing multiple sequences in multi-FASTA format using exclusively alignment-free methodologies. AltaiR enables the identification of singularity and similarity patterns within sequences and computes static and temporal dynamics without restrictions on the number or size of input sequences. It automatically filters low-quality, biased, or deviant data. We demonstrate AltaiR's capabilities by analyzing more than 1.5 million full severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences, revealing interesting observations regarding viral genome characteristics over time, such as shifts in nucleotide composition, decreases in average Kolmogorov sequence complexity, and the evolution of the smallest sequences not found in the human host.
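
As an illustration of what an alignment-free, per-record statistic can look like, the following is a minimal sketch in C that scans a multi-FASTA buffer once and tallies nucleotide composition for each record. It is not taken from AltaiR: the `BaseCounts` type and the functions `gc_fraction` and `count_bases_multifasta` are hypothetical names introduced here, and the sketch assumes the whole input fits in a NUL-terminated in-memory string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical per-record summary; not an AltaiR data structure. */
typedef struct {
    size_t a, c, g, t;  /* counts of the four canonical bases */
    size_t other;       /* N, IUPAC ambiguity codes, gaps, etc. */
} BaseCounts;

/* Fraction of G+C among canonical bases; 0.0 for a record without any. */
double gc_fraction(const BaseCounts *bc)
{
    size_t total = bc->a + bc->c + bc->g + bc->t;
    return total ? (double)(bc->g + bc->c) / (double)total : 0.0;
}

/*
 * Scan a NUL-terminated multi-FASTA buffer once and store one BaseCounts
 * per record in out[]. Returns the number of records seen, which may
 * exceed max_records; only the first max_records are stored.
 */
size_t count_bases_multifasta(const char *fasta, BaseCounts *out,
                              size_t max_records)
{
    assert(fasta != NULL && out != NULL);

    size_t n = 0;        /* records seen so far */
    int in_header = 0;   /* currently inside a '>' header line */
    int line_start = 1;  /* next character begins a new line */

    for (const char *p = fasta; *p != '\0'; ++p) {
        char ch = *p;

        if (ch == '\n') {
            in_header = 0;
            line_start = 1;
            continue;
        }
        if (line_start && ch == '>') {
            /* A header line opens a new record. */
            in_header = 1;
            if (n < max_records)
                out[n] = (BaseCounts){0};
            ++n;
        }
        line_start = 0;

        /* Skip header text, data before the first header, and overflow. */
        if (in_header || n == 0 || n > max_records)
            continue;
        if (isspace((unsigned char)ch))
            continue;  /* tolerate '\r' and stray blanks */

        BaseCounts *bc = &out[n - 1];
        switch (toupper((unsigned char)ch)) {
        case 'A': ++bc->a; break;
        case 'C': ++bc->c; break;
        case 'G': ++bc->g; break;
        case 'T': ++bc->t; break;
        default:  ++bc->other; break;
        }
    }
    return n;
}
```

Because each record is summarized independently and in a single pass, statistics of this kind need no alignment step; grouping the per-record values by collection date would be one way to look at composition shifts over time, though how AltaiR itself does this is not described in the abstract.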

Conclusions: AltaiR can identify temporal characteristics and trends in large numbers of sequences, making it ideal for scenarios involving endemic or epidemic outbreaks with vast amounts of available sequence data. Implemented in C with multithreading and methodological optimizations, AltaiR is computationally efficient, flexible, and dependency-free. It accepts any sequence in FASTA format, including amino acid sequences. The complete toolkit is freely available at https://github.com/cobilab/altair.
