SCARAP: scalable cross-species comparative genomics of prokaryotes.

Stijn Wittouck, Tom Eilers, Vera van Noort, Sarah Lebeer
Journal: Bioinformatics (Oxford, England)
Publication date: 2024-12-26
Publication type: Journal Article
DOI: 10.1093/bioinformatics/btae735 (https://doi.org/10.1093/bioinformatics/btae735)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11681940/pdf/

Abstract

Motivation: Much of prokaryotic comparative genomics currently relies on two critical computational tasks: pangenome inference and core genome inference. Pangenome inference involves clustering genes from a set of genomes into gene families, enabling genome-wide association studies and evolutionary history analysis. The core genome represents gene families present in nearly all genomes and is required to infer a high-quality phylogeny. For species-level datasets, fast pangenome inference tools have been developed. However, tools applicable to more diverse datasets are currently slow and scale poorly.
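
The relationship between a pangenome and a core genome described above can be illustrated with a minimal Python sketch. It assumes the pangenome is already available as (genome, gene family) pairs and treats a family as core when it occurs in at least a chosen fraction of genomes. The function name `core_families`, the input layout, and the 0.95 default threshold are illustrative assumptions; this is not SCARAP's algorithm or its interface.

```python
from collections import defaultdict


def core_families(gene_table, min_prevalence=0.95):
    """Return gene families present in at least `min_prevalence` of genomes.

    gene_table: iterable of (genome_id, family_id) pairs, one per gene.
    min_prevalence: hypothetical cutoff for "nearly all genomes".
    """
    genomes = set()
    genomes_per_family = defaultdict(set)
    for genome, family in gene_table:
        genomes.add(genome)
        # A set ignores extra copies of a family within the same genome.
        genomes_per_family[family].add(genome)

    n_genomes = len(genomes)
    return {
        family
        for family, members in genomes_per_family.items()
        if len(members) / n_genomes >= min_prevalence
    }


# Toy pangenome: three genomes, three gene families.
table = [
    ("g1", "famA"), ("g1", "famB"),
    ("g2", "famA"), ("g2", "famC"),
    ("g3", "famA"), ("g3", "famB"),
]
strict_core = core_families(table)                      # only famA is in every genome
relaxed_core = core_families(table, min_prevalence=0.6)  # famB (2 of 3 genomes) also passes
```

Extracting the core genome this way requires the full pangenome first, which is the expensive step for diverse datasets that the abstract refers to.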

Results: Here, we introduce SCARAP, a program containing three modules for comparative genomics analyses: a fast and scalable pangenome inference module, a direct core genome inference module, and a module for subsampling representative genomes. When benchmarked against existing tools, the SCARAP pan module proved up to an order of magnitude faster with comparable accuracy. The core module was validated by comparing its result against a core genome extracted from a full pangenome. The sample module demonstrated the rapid sampling of genomes with decreasing novelty. Applied to a dataset of over 31 000 Lactobacillales genomes, SCARAP showcased its ability to derive a representative pangenome. Finally, we applied the novel concept of gene fixation frequency to this pangenome, showing that Lactobacillales genes that are prevalent but rarely fixate in species often encode bacteriophage functions.
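
One way to picture the validation of the core module is as a comparison between two sets of gene families: those inferred directly and those extracted from a full pangenome. The sketch below is a hedged illustration only. It assumes each family is represented by the set of gene identifiers it contains, counts only exact matches, and reports a Jaccard index; the function name `compare_core_sets`, the gene identifiers, and the choice of metric are assumptions and may differ from what the paper used.

```python
def compare_core_sets(direct, from_pangenome):
    """Compare two collections of gene families by exact gene membership.

    Each family is an iterable of gene identifiers (hypothetical IDs here).
    """
    direct = {frozenset(family) for family in direct}
    reference = {frozenset(family) for family in from_pangenome}
    shared = direct & reference
    union = direct | reference
    return {
        "shared": len(shared),
        "only_direct": len(direct - reference),
        "only_reference": len(reference - direct),
        # Two empty sets are treated as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy example: one family agrees, one is unique to each approach.
direct_core = [{"g1_1", "g2_1"}, {"g1_2", "g2_2"}]
pangenome_core = [{"g1_1", "g2_1"}, {"g1_3", "g2_3"}]
report = compare_core_sets(direct_core, pangenome_core)
```

Exact matching is deliberately strict; a real comparison might also want to count families that overlap only partially.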

Availability and implementation: The SCARAP toolkit is publicly available at https://github.com/swittouck/scarap.

Supplementary information: Supplementary data are available at Bioinformatics online.