Cloud-enabled Scalable Analysis of Large Proteomics Cohorts

bioRxiv - Bioinformatics | Pub Date: 2024-09-10 | DOI: 10.1101/2024.09.05.611509

Harendra Guturu, Andrew Nichols, Lee S. Cantrell, Seth Just, János Kis, Theodore Platt, Iman Mohtashemi, Jian Wang, Serafim Batzoglou



Abstract

Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies are enabling large-scale cohort proteomic and proteogenomic analyses. As such studies scale, the data infrastructure and search engines required to process the data must also scale. This challenge is amplified in search engines that rely on library-free match-between-runs (MBR) search, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search has been able to scale to cohorts of thousands or more individuals. Here, we present a strategy to deploy search engines in a distributed cloud environment without source code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales it to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared to the days required for the original DIA-NN MBR procedure, and that the results are almost indistinguishable from those of DIA-NN native MBR. The method has been tested to scale to over 15,000 injections and is available for use in the Proteograph(TM) Analysis Suite.
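The abstract does not describe the internals of Scalable MBR, so the following is only a toy sketch of the general two-pass, scatter-gather shape that a match-between-runs search can take: an independent first pass per run, a gather step that pools confident identifications into a shared library, and a second pass per run restricted to that library. The run names, precursor labels, scores, cutoffs, and all function names are hypothetical and chosen for illustration. This is not DIA-NN's scoring and not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Toy data (hypothetical): run name -> {precursor: q-value-like score}.
RUNS = {
    "run_a": {"PEPTIDEK/2": 0.001, "SAMPLER/2": 0.030},
    "run_b": {"PEPTIDEK/2": 0.040, "SAMPLER/2": 0.002},
    "run_c": {"PEPTIDEK/2": 0.004, "OTHERK/3": 0.200},
}


def first_pass(run, strict=0.01):
    """Independent per-run search: keep only confident identifications."""
    return {p for p, q in RUNS[run].items() if q <= strict}


def second_pass(run, library, relaxed=0.05):
    """Re-search one run, restricted to the shared library, with a looser cutoff."""
    return {p for p, q in RUNS[run].items() if p in library and q <= relaxed}


def scatter(fn, runs, workers=4):
    """Apply fn to every run in parallel; each call depends on one run only."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(runs, pool.map(fn, runs)))


def two_pass_mbr(runs):
    first = scatter(first_pass, runs)            # scatter: pass 1
    library = set().union(*first.values())       # gather: pooled library
    second = scatter(partial(second_pass, library=library), runs)  # scatter: pass 2
    return first, library, second


FIRST, LIBRARY, SECOND = two_pass_mbr(sorted(RUNS))
```

In this toy, each run on its own confidently identifies only one precursor, but after the pooled library is applied, run_a and run_b each report both, which is the kind of data-completeness gain the abstract attributes to MBR. Because each pass touches one run at a time, the per-run calls could in principle be dispatched to separate cloud workers instead of local threads; only the gather step needs results from all runs.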


