Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611509
Harendra Guturu, Andrew Nichols, Lee S. Cantrell, Seth Just, János Kis, Theodore Platt, Iman Mohtashemi, Jian Wang, Serafim Batzoglou
Rapid advances in the depth and throughput of untargeted mass-spectrometry-based proteomic technologies are enabling large-scale cohort proteomic and proteogenomic analyses. As such studies scale, the data infrastructure and search engines required to process the data must also scale. This challenge is amplified in search engines that rely on library-free match-between-runs (MBR) search, which enables enhanced depth per sample and data completeness. However, to date, no MBR-based search has been able to scale to cohorts of thousands of individuals or more. Here, we present a strategy to deploy search engines in a distributed cloud environment without source-code modification, thereby enhancing resource scalability and throughput. Additionally, we present an algorithm, Scalable MBR, that replicates the MBR procedure of the popular DIA-NN software and scales to thousands of samples. We demonstrate that Scalable MBR can search thousands of MS raw files in a few hours, compared with the days required by the original DIA-NN MBR procedure, and that the results are almost indistinguishable from those of DIA-NN's native MBR. The method has been tested to scale to over 15,000 injections and is available for use in the Proteograph(TM) Analysis Suite.
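The core of the deployment strategy described above (running an unmodified search engine over independent batches of raw files in parallel, then merging the per-batch results) can be sketched as follows. The function names and batching scheme are illustrative assumptions, not the actual Proteograph Analysis Suite interface:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(files, batch_size):
    """Split the list of MS raw files into fixed-size batches so each batch
    can be searched by an independent cloud worker."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

def search_batch(batch):
    """Stand-in for invoking an unmodified search engine (e.g. in a container)
    on one batch; here it just returns a placeholder result per file."""
    return {f: f"ids_for_{f}" for f in batch}

def distributed_search(files, batch_size=4, workers=4):
    """Fan batches out to workers and merge the per-batch result maps."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = pool.map(search_batch, partition(files, batch_size))
    merged = {}
    for part in partial_results:
        merged.update(part)
    return merged

files = [f"run_{i:04d}.raw" for i in range(10)]
results = distributed_search(files)
print(len(results))  # one result entry per raw file: 10
```

In a real deployment each worker would be a separate cloud instance, but the fan-out/merge shape is the same.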
Title: Cloud-enabled Scalable Analysis of Large Proteomics Cohorts (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611521
Michael B Sohn, Kristin Scheible, Steven R Gill
High sparsity (i.e., excessive zeros) in microbiome data, which are high-dimensional and compositional, is unavoidable and can significantly alter analysis results. However, efforts to address this high sparsity have been very limited, in part because it is impossible to justify the validity of any such method, as zeros in microbiome data arise from multiple sources (e.g., true absence, the stochastic nature of sampling). The most common approach is to treat all zeros as structural zeros (i.e., true absence) or rounded zeros (i.e., undetected due to the detection limit). However, this approach can underestimate the mean abundance while overestimating its variance, because many zeros can arise from the stochastic nature of sampling and/or functional redundancy (i.e., different microbes can perform the same functions), thus losing power. In this manuscript, we argue that treating all zeros as missing values would not significantly alter analysis results if the proportion of structural zeros is similar for all taxa, and we propose a semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data. We demonstrate the merits of the proposed method and its beneficial effects on downstream analyses in extensive simulation studies. We also reanalyzed a type 2 diabetes (T2D) dataset to determine differentially abundant species between T2D patients and non-diabetic controls.
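The zeros-as-missing idea can be illustrated with a toy version of multiple imputation. This hot-deck resampling is a deliberately simplified stand-in for the paper's semi-parametric procedure, and all names here are illustrative:

```python
import random

def multiple_impute(counts, m=5, seed=0):
    """Toy multiple imputation: treat every zero as a missing value and fill
    it by resampling from that taxon's observed non-zero counts (a hot-deck
    draw). Assumes each taxon has at least one non-zero count. Returns m
    completed datasets; downstream estimates would be computed on each and
    then pooled across the m imputations."""
    rng = random.Random(seed)
    completed = []
    for _ in range(m):
        filled = []
        for taxon in counts:  # taxon = counts for one taxon across samples
            nonzero = [c for c in taxon if c > 0]
            filled.append([c if c > 0 else rng.choice(nonzero) for c in taxon])
        completed.append(filled)
    return completed

counts = [[5, 0, 7, 0],   # taxon A across four samples
          [0, 2, 2, 1]]   # taxon B
imputed = multiple_impute(counts, m=3)
print(len(imputed), all(c > 0 for d in imputed for t in d for c in t))
```

Pooling estimates across the m completed datasets (rather than imputing once) is what propagates the uncertainty of the imputed zeros into downstream analyses.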
Title: A semi-parametric multiple imputation method for high-sparse, high-dimensional, compositional data (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611486
Christopher T Boughter
Immunopeptidomics is a growing subfield of proteomics that has the potential to shed new light on a long-neglected aspect of adaptive immunology: a comprehensive understanding of the peptides presented by major histocompatibility complexes (MHC) to T cells. As the field of immunopeptidomics continues to grow and mature, a parallel expansion in the methods for extracting quantitative features of these peptides is necessary. Currently, massive experimental efforts to isolate a given immunopeptidome are summarized in tables and pie charts, or worse, entirely thrown out in favor of singular peptides of interest. Ideally, an unbiased approach would dive deeper into these large proteomic datasets, identifying sequence-level biochemical signatures inherent to each individual dataset and the given immunological niche. This chapter will outline the steps for a powerful approach to such analysis, utilizing the Automated Immune Molecule Separator (AIMS) software for the characterization of immunopeptidomic datasets. AIMS is a flexible tool for the identification of biophysical signatures in peptidomic datasets, the elucidation of nuanced differences in repertoires collected across tissues or experimental conditions, and the generation of machine learning models for future applications to classification problems. In learning to use AIMS, readers of this chapter will receive a broad introduction to the field of protein bioinformatics and its utility in the analysis of immunopeptidomic datasets and other large-scale immune repertoire datasets.
Title: Utilizing Protein Bioinformatics to Delve Deeper Into Immunopeptidomic Datasets (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611508
Thomas P Burghardt
Background: Human ventriculum myosin (βmys) powers contraction, sometimes in complex with myosin binding protein C (MYBPC3). The latter regulates βmys activity and impacts overall cardiac function. Nonsynonymous single-nucleotide variants (SNVs) change the protein sequence of βmys or MYBPC3, causing inheritable heart diseases by affecting the βmys/MYBPC3 complex. Muscle genetics encode instructions for contraction, informing native protein construction, functional integration, and inheritable disease impairment. A digital model decodes these instructions and evolves by continuously processing new information content from diverse data modalities in partnership with the human agent. Methods: A general neural-network contraction model characterizes SNV impacts on human health. It rationalizes phenotype and pathogenicity assignment given an SNV's genetic characteristics, and in this sense decodes βmys/MYBPC3 complex genetics and implicitly captures ventricular muscle functionality. When an SNV-modified domain locates to an inter-protein contact in βmys/MYBPC3, it affects complex coordination. The domains involved, one in βmys and the other in MYBPC3, form coordinated domains (co-domains). Co-domains are bilateral, implying the potential for their SNV modification probabilities to respond jointly to a common perturbation and thereby reveal their location. Human genetic diversity from the serial founder effect is the common systemic perturbation coupling co-domains, which are mapped by a methodology called 2-dimensional correlation genetics (2D-CG). Results: Interpreting the general neural-network contraction model output involves 2D-CG co-domain mapping, which provides structural insights expressed in natural language. It aligns machine-learned intelligence from the neural-network model with human-provided structural insight from the 2D-CG map, and other data from the literature, to form a neural-symbolic hybrid model integrating genetic and protein-interaction data into a nascent digital twin. This process is the template for combining new information content from diverse data modalities into a digital model that can evolve. The nascent digital twin interprets SNV implications to discover disease mechanisms, can evaluate potential remedies for efficacy, and does so without animal models.
Title: Neural-symbolic hybrid model for myosin complex in cardiac ventriculum decodes structural bases for inheritable heart disease from its genetic encoding (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611379
Alex M. Ascension, Ander Izeta
Single-cell RNA sequencing (scRNAseq) studies have unveiled large transcriptomic heterogeneity within both human and mouse dermal fibroblasts, but a consensus atlas that spans both species is lacking. Here, by studying 25 human and 9 mouse datasets through a semi-supervised procedure, we categorize 15 distinct human fibroblast populations across 5 main axes. Analysis of human fibroblast markers characteristic of each population suggested diverse functions, such as position-dependent ECM synthesis, association with immune responses or structural roles in skin appendages. Similarly, mouse fibroblasts were categorized into 17 populations across 5 axes. Comparison of mouse and human fibroblast populations highlighted similarities suggesting a degree of functional overlap, though nuanced differences were also noted: transcriptomically, human axes seem to segregate by function, while mouse axes seem to prioritize positional information over function. Importantly, addition of newer datasets did not significantly change the defined population structure. This study enhances our understanding of dermal fibroblast diversity, shedding light on species-specific distinctions as well as shared functionalities.
Title: A consensus single-cell transcriptomic atlas of dermal fibroblast heterogeneity (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.06.611604
Jan F. M. Stuke, Gerhard Hummer
In selective autophagy, cargo recruitment is mediated by LC3-interacting regions (LIRs)/Atg8-interacting motifs (AIMs) in the cargo or cargo receptor proteins. The binding of these motifs to LC3/Atg8 proteins at the phagophore membrane is often modulated by post-translational modifications, especially phosphorylation. As a challenge for computational LIR predictions, sequences may contain the short canonical (W/F/Y)XX(L/I/V) motif without being functional. Conversely, LIRs may be formed by non-canonical but functional sequence motifs. AlphaFold2 has proven to be useful for LIR predictions, even if some LIRs are missed and proteins with thousands of residues reach the limits of computational feasibility. We present a fragment-based approach to address these limitations. We find that fragment length and phosphomimetic mutations modulate the interactions predicted by AlphaFold2. Systematic fragment screening for a range of target proteins yields structural models for interactions that AlphaFold2 and AlphaFold3 fail to predict for full-length targets. We provide guidance on fragment choice, sequence tuning, and LC3 isoform effects for optimal LIR screens. Finally, we also test the transferability of this general framework to SUMO-SIM interactions, another type of protein-protein interaction involving short linear motifs (SLiMs).
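As a first-pass filter before any structure prediction, the canonical (W/F/Y)XX(L/I/V) core named in the abstract translates directly into a regular-expression scan. This sketch is not part of the authors' pipeline; the example sequence is an illustrative fragment built around the well-characterized p62 LIR core (DDDWTHL), with flanking residues chosen for demonstration:

```python
import re

# Canonical LIR core from the text: (W/F/Y) X X (L/I/V).
LIR_CORE = re.compile(r"[WFY]..[LIV]")

def find_candidate_lirs(seq):
    """Return (position, match) for each canonical core in the sequence.
    As the abstract notes, a sequence match alone does not imply a
    functional LIR, and functional LIRs may lack the canonical core."""
    return [(m.start(), m.group()) for m in LIR_CORE.finditer(seq)]

print(find_candidate_lirs("SGGDDDWTHLSSKEVDPSTG"))  # [(6, 'WTHL')]
```

Candidates from such a scan are exactly the kind of short fragments the paper feeds to AlphaFold2, where phosphomimetic mutations and fragment length can then be varied.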
Title: AlphaFold2 SLiM screen for LC3-LIR interactions in autophagy (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.05.611403
Wei Huang, Xinda Ren, Yinpu Bai, Hui Liu
Tumor heterogeneity often leads to substantial differences in responses to the same drug treatment. Pre-existing or acquired drug-resistant cell subpopulations within a tumor survive and proliferate, ultimately resulting in tumor relapse and metastasis. Drug resistance is the leading cause of failure in clinical tumor therapy. Therefore, accurate identification of drug-resistant tumor cell subpopulations could greatly facilitate precision medicine and novel drug development. However, the scarcity of single-cell drug response data significantly hinders the exploration of tumor cell resistance mechanisms and the development of computational predictive methods. In this paper, we propose scDrugAtlas, a comprehensive database devoted to integrating drug response data at the single-cell level. We manually compiled more than 100 datasets containing single-cell drug responses from various public resources. The current version comprises large-scale single-cell transcriptional profiles and drug response labels from more than 1,000 samples (cell lines, mice, PDX models, patients, and bacteria), across 66 unique drugs and 13 major cancer types. In particular, we assigned a confidence level to each response label based on the tissue source (primary or relapse/metastasis), drug exposure time, and drug-induced cell phenotype. We believe scDrugAtlas could greatly help the bioinformatics community develop computational models and help biologists identify drug-resistant tumor cells and their underlying molecular mechanisms. The scDrugAtlas database is available at: http://drug.hliulab.tech/scDrugAtlas/.
Title: scDrugAtlas: an integrative single-cell drug response atlas for unraveling tumor heterogeneity in therapeutic efficacy (bioRxiv - Bioinformatics)
Pub Date : 2024-09-10  DOI: 10.1101/2024.09.06.611505
Daphne Wijnbergen, Rajaram Kaliyaperumal, Kees Burger, Luiz Olavo Bonino da Silva Santos, Barend Mons, Marco Roos, Eleni Mina
Background: Use of the FAIR principles (Findable, Accessible, Interoperable and Reusable) allows the rapidly growing number of biomedical datasets to be optimally (re)used. An important aspect of the FAIR principles is metadata. The FAIR Data Point specifications and reference implementation were designed as an example of how to publish metadata according to the FAIR principles. Various tools for creating metadata exist, but many have limitations, such as unintuitive interfaces, metadata that does not adhere to a common metadata schema, limited scalability, and inefficient collaboration. We aim to address these limitations with the FAIR Data Point Populator. Results: The FAIR Data Point Populator consists of a GitHub workflow together with Excel templates that have tooltips, validation, and documentation. The Excel templates are targeted at non-technical users and can be used collaboratively in online spreadsheet software. A more technical user then uses the GitHub workflow to read multiple entries from the Excel sheets and transform them into machine-readable metadata. This metadata is then automatically uploaded to a connected FAIR Data Point. We applied the FAIR Data Point Populator to the metadata of two datasets and a patient registry. We were then able to run a query on the FAIR Data Point Index to retrieve one of the datasets. Conclusion: The FAIR Data Point Populator addresses several limitations of other tools. It makes creating metadata easier, ensures adherence to a common metadata schema, allows bulk creation of metadata entries, and increases collaboration. As a result, the barrier to entry for FAIRification is lower, which enables the creation of FAIR data by more people.
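The workflow step that turns spreadsheet entries into machine-readable metadata might look roughly like the sketch below. The field names, base URI, and the DCAT-style Turtle rendering are assumptions for illustration, not the Populator's actual schema or output format:

```python
def row_to_turtle(row):
    """Render one spreadsheet row as a minimal DCAT-style dataset
    description in Turtle (prefix declarations omitted for brevity)."""
    return (
        f"<{row['uri']}> a dcat:Dataset ;\n"
        f'    dct:title "{row["title"]}" ;\n'
        f'    dct:publisher "{row["publisher"]}" .\n'
    )

# Hypothetical rows as they might be read from an Excel sheet.
rows = [
    {"uri": "https://example.org/fdp/dataset/1",
     "title": "Patient registry",
     "publisher": "Example Institute"},
]
for row in rows:
    print(row_to_turtle(row))
```

In the real pipeline this conversion runs inside the GitHub workflow, and the resulting RDF is pushed to the connected FAIR Data Point rather than printed.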
Title: The FAIR Data Point Populator: collaborative FAIRification and population of FAIR Data Points (bioRxiv - Bioinformatics)
Pub Date : 2024-09-09DOI: 10.1101/2024.09.05.611300
Shaine Chenxin Bao, Dalia Mizikovsky, Kathleen Pishas, Qiongyi Zhao, Karla J Cowley, Evanny Marinovic, Mark Carey, Ian Campbell, Kaylene J Simpson, Dane Cheasley, Nathan Palpant
High-throughput analysis methods have emerged as central technologies to accelerate discovery through scalable generation of large-scale data. Analysis of these datasets remains challenging due to limitations in computational approaches for dimensionality reduction. Here, we present UnTANGLeD, a versatile computational pipeline that prioritises biologically robust and meaningful information to guide actionable strategies from input screening data, which we demonstrate using results from image-based drug screening. By providing a robust framework for analysing high-dimensional biological data, UnTANGLeD offers a powerful tool for the analysis of theoretically any data type from any screening platform.
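The abstract does not spell out UnTANGLeD's algorithm, but the general idea of unsupervised clustering of high-dimensional morphological profiles can be illustrated with a plain k-means sketch over hypothetical per-compound feature vectors. This is a generic stand-in, not the published UnTANGLeD method:

```python
# Generic illustration of unsupervised clustering of drug-induced
# morphological profiles. Plain k-means on made-up 2-D feature vectors;
# NOT the UnTANGLeD algorithm itself.
import math
import random

def kmeans(profiles, k, iters=50, seed=0):
    """Cluster feature vectors into k groups by iterative refinement."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(iters):
        # Assign each profile to its nearest centre (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in profiles]
        # Recompute each centre as the mean of its assigned profiles.
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Two hypothetical drug-induced morphology clusters in feature space.
profiles = [(0.1, 0.2), (0.0, 0.3), (5.1, 4.9), (4.8, 5.2)]
labels = kmeans(profiles, k=2)
```

Compounds whose profiles land in the same cluster would share a morphological signature, which is the kind of grouping the paper reports.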
Title: A robust unsupervised clustering approach for high-dimensional biological imaging data reveals shared drug-induced morphological signatures
Pub Date : 2024-09-09DOI: 10.1101/2024.09.08.611582
Martin Steinegger, Eli Levy Karin, Rachel Seongeun Kim
The AlphaFold Protein Structure Database (AFDB) is the largest repository of accurately predicted structures with taxonomic labels. Despite providing predictions for over 214 million UniProt entries, the AFDB does not cover viral sequences, severely limiting their study. To bridge this gap, we created the Big Fantastic Virus Database (BFVD), a repository of 351,242 protein structures predicted by applying ColabFold to the viral sequence representatives of the UniRef30 clusters. BFVD holds a unique repertoire of protein structures, as over 63% of its entries show no or low structural similarity to existing repositories. We demonstrate how BFVD substantially enhances the fraction of annotated bacteriophage proteins compared to sequence-based annotation using Bakta. In this task, BFVD is on par with the AFDB while holding nearly three orders of magnitude fewer structures. BFVD is an important virus-specific expansion to protein structure repositories, offering new opportunities to advance viral research. BFVD is freely available at https://bfvd.steineggerlab.workers.dev
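Annotation by structure search, as described above, typically ends with picking the best-scoring hit per query from a search tool's tabular output. As an illustration of that downstream step, a sketch that parses the standard 12-column tabular (.m8) format emitted by BLAST-like tools such as Foldseek; the hit lines themselves are made up:

```python
# Sketch of a downstream annotation step: from structure-search hits in the
# standard 12-column tabular (.m8) format (query, target, pident, alnlen,
# mismatch, gapopen, qstart, qend, tstart, tend, evalue, bitscore), keep the
# best-scoring hit per query below an E-value cutoff.
# The hit lines are fabricated for illustration.

def best_hits(m8_lines, max_evalue=1e-3):
    """Map each query protein to the target of its highest-bitscore hit."""
    best = {}
    for line in m8_lines:
        f = line.rstrip("\n").split("\t")
        query, target = f[0], f[1]
        evalue, bits = float(f[10]), float(f[11])
        if evalue > max_evalue:
            continue  # discard weak hits
        if query not in best or bits > best[query][1]:
            best[query] = (target, bits)
    return {q: t for q, (t, _) in best.items()}

hits = [
    "phageA_1\tQ9XYZ1\t35.2\t210\t120\t4\t1\t205\t3\t212\t1e-20\t150.0",
    "phageA_1\tQ8ABC2\t30.1\t190\t110\t5\t1\t180\t1\t190\t1e-8\t80.0",
    "phageA_2\tQ7DEF3\t28.0\t150\t100\t3\t5\t150\t2\t155\t0.5\t40.0",
]
annotations = best_hits(hits)  # phageA_2 is filtered out by the E-value cutoff
```

Searching query structures against a database like BFVD and transferring the best hit's annotation is the pattern the paper evaluates against sequence-based annotation.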
Title: BFVD - a large repository of predicted viral protein structures