SCARAP: scalable cross-species comparative genomics of prokaryotes.

Bioinformatics (Oxford, England). Pub Date: 2024-12-26. DOI: 10.1093/bioinformatics/btae735

Stijn Wittouck, Tom Eilers, Vera van Noort, Sarah Lebeer



Abstract

Motivation: Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly.
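The relationship between the two tasks can be illustrated with a minimal sketch: given a finished pangenome (genes already assigned to gene families), the core genome is the set of families found in nearly all genomes. This is not SCARAP's own code or output format, and it is not how the core module works, since that module infers the core genome directly. The function name `core_families`, the three-column gene table, and the 95% prevalence threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def core_families(gene_table, min_prevalence=0.95):
    """Return gene families present in at least `min_prevalence` of genomes.

    gene_table: iterable of (gene_id, genome_id, family_id) tuples.
    This layout is hypothetical, not the format of any particular tool.
    """
    genomes = set()
    genomes_per_family = defaultdict(set)
    for _gene, genome, family in gene_table:
        genomes.add(genome)
        genomes_per_family[family].add(genome)

    n_genomes = len(genomes)
    # A family counts once per genome, regardless of how many copies it has.
    return sorted(
        family
        for family, present in genomes_per_family.items()
        if len(present) / n_genomes >= min_prevalence
    )


# Toy pangenome: three genomes (A, B, C) and three gene families.
toy = [
    ("g1", "A", "famX"), ("g2", "A", "famY"),
    ("g3", "B", "famX"), ("g4", "B", "famZ"),
    ("g5", "C", "famX"), ("g6", "C", "famY"),
]

print(core_families(toy))                      # strict threshold
print(core_families(toy, min_prevalence=0.6))  # relaxed threshold
```

In the toy data only `famX` occurs in every genome, so it is the only family that passes the strict threshold; relaxing the threshold to 0.6 also admits `famY`, which occurs in two of the three genomes.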

Results: Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions.

Availability and implementation: The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.


Supplementary information: Supplementary data are available at Bioinformatics online.

