Cloud-enabled Scalable Analysis of Large Proteomics Cohorts

Harendra Guturu, Andrew Nichols, Lee S. Cantrell, Seth Just, János Kis, Theodore Platt, Iman Mohtashemi, Jian Wang, Serafim Batzoglou
DOI: 10.1101/2024.09.05.611509 (https://doi.org/10.1101/2024.09.05.611509)
Journal: bioRxiv - Bioinformatics
Publication date: 2024-09-10

Abstract

Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies are enabling large-scale cohort proteomic and proteogenomic analyses. As such studies scale, the data infrastructure and search engines required to process the data must also scale. This challenge is amplified in search engines that rely on library-free match-between-runs (MBR) search, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search has been able to scale to cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared to the days required for the original DIA-NN MBR procedure, and that the results are almost indistinguishable from those of DIA-NN native MBR. The method has been tested to scale to over 15,000 injections and is available for use in the Proteograph(TM) Analysis Suite.
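The abstract does not spell out how Scalable MBR works internally, so the following is only a toy sketch of the general two-pass idea behind library-free MBR: a first pass searches each run independently, the confident identifications are gathered into a shared empirical library, and a second pass re-searches each run against that library. Every name, score, and threshold here (`RUNS`, `STRICT`, `RELAXED`, `first_pass`, `second_pass`) is a hypothetical stand-in, and a thread pool substitutes for distributed cloud workers. It is not the authors' algorithm and does not call DIA-NN.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for raw files: run name -> {precursor id: match score}.
# All values are invented for illustration.
RUNS = {
    "run_a": {"PEPTIDEK/2": 0.99, "SAMPLER/2": 0.60, "NOISEK/3": 0.30},
    "run_b": {"PEPTIDEK/2": 0.70, "SAMPLER/2": 0.97, "NOISEK/3": 0.65},
    "run_c": {"PEPTIDEK/2": 0.98, "OTHERK/2": 0.96},
}

STRICT = 0.95   # hypothetical first-pass score cutoff
RELAXED = 0.55  # hypothetical second-pass cutoff for library precursors


def first_pass(scores):
    """Search one run on its own; keep only high-confidence precursors."""
    return {p for p, s in scores.items() if s >= STRICT}


def build_library(per_run_hits):
    """Gather step: union of confident precursors across all runs."""
    library = set()
    for hits in per_run_hits:
        library |= hits
    return library


def second_pass(scores, library):
    """Re-search one run, restricted to precursors in the shared library."""
    return {p for p, s in scores.items() if p in library and s >= RELAXED}


def two_pass_search(runs, workers=4):
    """Scatter the first pass, gather a library, scatter the second pass."""
    names = sorted(runs)
    inputs = [runs[name] for name in names]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        first = list(pool.map(first_pass, inputs))
        library = build_library(first)
        second = list(pool.map(lambda s: second_pass(s, library), inputs))
    return library, dict(zip(names, first)), dict(zip(names, second))


if __name__ == "__main__":
    library, first, second = two_pass_search(RUNS)
    for name in sorted(RUNS):
        print(name, len(first[name]), "->", len(second[name]))
```

In this toy data, each per-run search is independent within a pass, so the only synchronization point is the library-building step between the two passes. That is the property that makes this kind of workflow a natural fit for distributing across many workers.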