Mass spectrometry-based proteomics enables the identification and quantification of thousands of proteins but struggles to measure human antibodies because of their vast diversity. Bottom-up proteomics methods rely primarily on database searches, which match experimentally measured peptide spectra against theoretical peptides derived from database sequences. While the human body can produce millions of distinct antibodies, current databases such as UniProtKB/Swiss-Prot contain only 1095 antibody sequences (as of January 2024), which may hinder antibody identification via mass spectrometry. Expanding the database is therefore crucial for discovering new antibodies. Recent genomic studies have amassed millions of human antibody sequences in the Observed Antibody Space (OAS) database, yet these data remain underutilized. Leveraging this vast collection, we conduct efficient database searches in publicly available proteomics data, focusing on SARS-CoV-2. In our study, thirty million antibody heavy-chain sequences from 146 SARS-CoV-2 patients in the OAS database were digested in silico to obtain 18 million unique peptides. These peptides form the basis for new bottom-up proteomics databases. We used those databases to search for new antibody peptides in publicly available SARS-CoV-2 human plasma samples in the Proteomics Identification Database (PRIDE). This approach avoids false positives in antibody peptide identification, as confirmed by searching against negative controls (brain samples) and by employing different database sizes. We show that new antibody peptides can be found in previously published plasma samples and expect that the newly discovered antibody peptides can be further employed to develop therapeutic antibodies. The method should be broadly applicable to finding characteristic antibodies for other diseases.
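The in silico digestion step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes trypsin-like cleavage (after K or R, but not before P), and the function names, the peptide length limits, and the `missed_cleavages` parameter are hypothetical choices that the abstract does not specify.

```python
import re


def tryptic_digest(sequence, min_len=7, max_len=30, missed_cleavages=0):
    """Return the set of peptides from a trypsin-like in silico digest.

    Assumed rule: cleave after K or R unless the next residue is P.
    The length limits are illustrative defaults, not values from the study.
    """
    # Zero-width split (Python 3.7+): positions preceded by K/R and not followed by P.
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", sequence) if f]
    peptides = set()
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptide = "".join(fragments[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def build_peptide_database(sequences, **digest_options):
    """Digest every sequence and keep each peptide only once."""
    unique_peptides = set()
    for sequence in sequences:
        unique_peptides |= tryptic_digest(sequence, **digest_options)
    return unique_peptides


if __name__ == "__main__":
    # Toy sequences, not real antibody heavy chains.
    toy_sequences = ["AAKGGRPCCK", "AAKTTTK"]
    print(sorted(build_peptide_database(toy_sequences, min_len=1)))
```

Collecting peptides in a set is what reduces many overlapping antibody sequences to a smaller list of unique peptides, since conserved framework regions yield the same peptides repeatedly.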
The CD8 T cell immune response operates at multiple temporal and spatial scales, from the early, complex biochemical and biomechanical processes to long-term cell population behavior.
To model this response, we devised a multiscale agent-based approach using the Simuscale software. Within each agent (cell) of our model, we introduced a gene regulatory network (GRN) based on a piecewise deterministic Markov process formalism. Cell fate (differentiation, proliferation, or death) was coupled to the state of the GRN through rule-based mechanisms. Cells interact in a 3D computational domain and signal to each other via cell–cell contacts, which influence the GRN behavior.
Results show that the model correctly captures both population behavior and time-dependent molecular dynamics. We examined the impact of several parameters on molecular and population dynamics, and demonstrated the added value of a multiscale approach by showing how molecular parameters, particularly protein degradation rates, influence the outcome of the response, such as effector and memory cell counts.
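To illustrate what a piecewise deterministic Markov process looks like for gene expression, here is a toy one-gene sketch. It is not the Simuscale model or the authors' GRN: the two-state promoter, the function name `simulate_pdmp`, and all rate parameters are assumptions chosen for illustration. The promoter switches on and off at random times (the Markov jumps), and between jumps the protein level follows the deterministic equation dP/dt = synth * state - deg * P, which is solved exactly.

```python
import math
import random


def simulate_pdmp(k_on, k_off, synth, deg, t_end, seed=0):
    """Simulate a toy one-gene PDMP and return a list of (time, state, protein).

    state: promoter state (0 = off, 1 = on), switching at rates k_on / k_off.
    protein: relaxes exponentially towards synth * state / deg between jumps.
    All parameter names and values are hypothetical.
    """
    rng = random.Random(seed)
    t, state, protein = 0.0, 0, 0.0
    trajectory = [(t, state, protein)]
    while True:
        rate = k_on if state == 0 else k_off
        wait = rng.expovariate(rate)      # exponential waiting time to next jump
        jump = t + wait < t_end           # does the jump occur before t_end?
        dt = wait if jump else t_end - t
        # Exact solution of the linear ODE over the interval of length dt.
        target = synth * state / deg
        protein = target + (protein - target) * math.exp(-deg * dt)
        t = t + wait if jump else t_end
        if jump:
            state = 1 - state             # promoter flips at the jump time
        trajectory.append((t, state, protein))
        if not jump:
            return trajectory


if __name__ == "__main__":
    for deg in (0.1, 1.0):
        final_time, final_state, final_protein = simulate_pdmp(
            k_on=0.5, k_off=0.5, synth=10.0, deg=deg, t_end=100.0, seed=1
        )[-1]
        print(f"deg={deg}: protein at t={final_time} is {final_protein:.2f}")
```

In this toy setting the degradation rate sets both the ceiling of the protein level (synth / deg) and how quickly it tracks promoter switching, which gives some intuition for why such a parameter could propagate to cell-fate rules that read the GRN state.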
Recent advancements in immune sequencing and experimental techniques are generating extensive T cell receptor (TCR) repertoire data, enabling the development of models to predict TCR binding specificity. Despite the computational challenges posed by the vast diversity of TCRs and epitopes, significant progress has been made. This review explores the evolution of computational models designed for this task, emphasizing machine learning efforts, including early unsupervised clustering approaches, supervised models, and recent applications of Protein Language Models (PLMs), deep learning models pretrained on extensive collections of unlabeled protein sequences that capture crucial biological properties.
We survey the most prominent models in each category and offer a critical discussion of recurrent challenges, including the lack of generalization to new epitopes, dataset biases, and shortcomings in model validation designs. Focusing on PLMs, we discuss the transformative impact of Transformer-based protein models in bioinformatics, particularly in TCR specificity analysis. We discuss recent studies that exploit PLMs to deliver competitive performance in TCR-related tasks, while also examining current limitations and future directions. Lastly, we address the pressing need for improved interpretability in these often opaque models, and examine current efforts to extract biological insights from large black-box models.
With the application of spatial biology, the diverse cell types present in the tumor microenvironment, including specific immune subsets, can be detected and identified at single-cell resolution. Because spatial biology analysis of tumor tissue measures multiple biological parameters, including cell type, cell number, and cell state, as well as the precise location of every cell and its spatial relationship to other cells and to histopathological hallmarks, a vast amount of data is generated. The power of this approach is realized when the spatial biology data are correlated with the clinical data of each patient from whom the tissue was collected during a biopsy or surgery conducted as part of diagnosis and treatment. Aside from the enormous leap in chemistry and molecular biology technology required to develop the analytical tools for spatial biology, the collection and analysis of data on cells in the tumor microenvironment has been possible only with the development of computational tools capable of deciphering tumor tissue complexity to predict tumor evolution, response to treatment, and the role of immune cells in regulating tumor biology. Here we describe how spatial biology analysis, combined with computational analysis, has been used to deconstruct the complexity of the brain tumor microenvironment and to shed light on why brain tumors exhibit extreme immunosuppression. We also discuss how the understanding gained from spatial biology indicates ways in which tumor immunosuppression can be overcome.
Proteins have an arsenal of medical applications that include disrupting protein interactions, acting as potent vaccines, and replacing genetically deficient proteins. While therapeutics must avoid triggering unwanted immune responses, vaccines should support a robust immune reaction targeting a broad range of pathogen variants. Therefore, computational methods that modify proteins’ immunogenicity without disrupting function are needed. While many components of the immune system can be involved in a reaction, we focus on cytotoxic T lymphocytes (CTLs). These target short peptides presented via the MHC Class I (MHC-I) pathway. To explore the limits of modifying the visibility of those peptides to CTLs within the distribution of naturally occurring sequences, we developed a novel machine learning technique, CAPE-XVAE. It combines a language model with reinforcement learning to modify a protein’s immune-visibility. Our results show that CAPE-XVAE effectively modifies the visibility of the HIV Nef protein to CTLs. We contrast CAPE-XVAE with CAPE-Packer, a physics-based method we also developed. Compared to CAPE-Packer, the machine learning approach suggests sequences that draw upon local sequence similarities in the training set. This is beneficial for vaccine development, where the sequence should be representative of the real viral population. Additionally, the language model approach holds promise for preserving both known and unknown functional constraints, which is essential for the immune-modulation of therapeutic proteins. In contrast, CAPE-Packer emphasizes preserving the protein’s overall fold and can reach greater extremes of immune-visibility, but falls short of capturing the sequence diversity of viral variants available to learn from. Source code: https://github.com/hcgasser/CAPE (Tag: v1.1)
Proper differentiation of CD8 T cells during antiviral responses relies on metabolic adaptations. Herein, we investigated global metabolic activity in single CD8 T cells over the course of an in vivo response by estimating metabolic fluxes from single-cell RNA-sequencing data. The approach was validated by recovering metabolic variations known from experimental studies on bulk cell populations, while adding temporally detailed information and uncovering previously undescribed parts of CD8 T cell metabolism that are affected by cellular differentiation. Furthermore, inter-cellular variability in gene expression levels, highlighted by single-cell data, and heterogeneity of metabolic activity 4 days post-infection revealed a new transition stage accompanied by a metabolic switch in activated cells differentiating into full-blown effectors.
Coronaviruses are known to infect a wide range of mammals. In humans, some coronaviruses cause the common cold. The immune response against common cold coronaviruses appears to confer cross-protection against SARS-CoV-2. This study identified protein regions in the proteomes of mammalian coronaviruses that are identical to those of SARS-CoV-2. Using bioinformatics analysis, the study predicted the involvement of these SARS-CoV-2-identical protein regions in antigen-presenting processes and their ability to elicit immune responses. The SARS-CoV-2-identical protein regions were found predominantly in the proteomes of betacoronaviruses and were less prevalent in alphacoronaviruses. Alphacoronaviruses, such as FCoV in domestic felines and MCoV in minks, are known to infect species highly susceptible to SARS-CoV-2. In contrast, betacoronaviruses infect mammals with lower susceptibility to SARS-CoV-2, including dogs, mice, and farmed animals. Furthermore, betacoronaviruses exhibited a higher number of peptides with an increased potential for efficient presentation during the antigen-presenting process, indicating greater immunogenicity. Conversely, the SW1 gammacoronavirus showed a lower count of SARS-CoV-2-identical protein regions and a reduced potential for efficient antigen presentation. The results suggest that the elevated number of SARS-CoV-2-identical stretches found in betacoronaviruses may provide cross-protection between SARS-CoV-2 and mammalian betacoronaviruses. This cross-protection could be similar to that observed between human coronaviruses causing the common cold and SARS-CoV-2. The limited numbers observed in the proteomes of FCoV, MCoV, and SW1-CoV may offer an explanation for the susceptibility of cats and minks to SARS-CoV-2, as well as a potential vulnerability in cetaceans.
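One simple way to find protein regions that are identical between two proteomes is to compare their fixed-length subsequences (k-mers). The sketch below is an assumption about how such a comparison could be done, not the study's actual method: the function names are hypothetical, and the default window length of 9 residues is chosen only because it is a typical length for MHC class I peptides.

```python
def kmers(sequence, k):
    """Return the set of all length-k windows in a protein sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(reference_proteome, other_proteome, k=9):
    """Map each protein in other_proteome to the k-mers it shares with the reference.

    Both proteomes are dicts of {protein_name: amino_acid_sequence}.
    Proteins with no identical k-mer are omitted from the result.
    """
    reference_kmers = set()
    for sequence in reference_proteome.values():
        reference_kmers |= kmers(sequence, k)

    shared = {}
    for name, sequence in other_proteome.items():
        hits = kmers(sequence, k) & reference_kmers
        if hits:
            shared[name] = sorted(hits)
    return shared


if __name__ == "__main__":
    # Toy sequences for illustration only, not real coronavirus proteins.
    reference = {"ref_protein": "MKTAYIAKQR"}
    other = {"protein_1": "GGTAYIAKWW", "protein_2": "LLLLLLLLLL"}
    print(shared_kmers(reference, other, k=5))
```

Counting the shared k-mers per virus would give a crude measure of how many SARS-CoV-2-identical stretches each proteome contains; predicting antigen presentation for those stretches would require a dedicated tool and is outside this sketch.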
Deciphering the antigen recognition capabilities of T-cell and B-cell receptors (antibodies) is essential for advancing our understanding of adaptive immune system responses. In recent years, protein language models (PLMs) have facilitated the development of bioinformatic pipelines in which complex amino acid sequences are transformed into vectorized embeddings, which are then applied to a range of downstream analytical tasks. With their success, we have witnessed the emergence of domain-specific PLMs tailored to specific proteins, such as immune receptors. Domain-specific models are often assumed to possess enhanced representation capabilities for targeted applications; however, this assumption has not been thoroughly evaluated. In this manuscript, we assess the efficacy of both generalist and domain-specific transformer-based embeddings in characterizing B- and T-cell receptors. Specifically, we assess the accuracy of models that leverage these embeddings to predict antigen specificity and to elucidate the evolutionary changes that B cells undergo during an immune response. We demonstrate that the prevailing notion that domain-specific models outperform general models requires a more nuanced examination. We also observe remarkable differences between generalist and domain-specific PLMs, not only in performance but also in the manner in which they encode information. Finally, we observe that model size and the choice of embedding layer are essential hyperparameters whose best values differ across tasks. Overall, our analyses reveal the promising potential of PLMs in modeling protein function while providing insights into their information-handling capabilities. We also discuss the crucial factors that should be taken into account when selecting a PLM for a particular task.

