Immunoinformatics (Amsterdam, Netherlands): Latest Articles, Page 3

Scifer: An R/Bioconductor package for large-scale integration of Sanger sequencing and flow cytometry data of index-sorted single cells

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-12-01 Epub Date: 2024-10-29 DOI: 10.1016/j.immuno.2024.100046

Rodrigo Arcoverde Cerveira , Klara Lenart , Marcel Martin , Matthew James Hinchcliff , Fredrika Hellgren , Kewei Ye , Juliana Assis Geraldo , Taras Kreslavsky , Sebastian Ols , Karin Loré

Sanger sequencing remains widely used in various experimental contexts, often in combination with flow cytometry for indexing specific cell populations. However, existing software lacks the capability to automate quality control (QC) of raw Sanger sequencing data and integrate it with flow cytometry information on a large scale. Here, we introduce scifer, an R package available in the latest release of Bioconductor (3.20), and demonstrate its effectiveness in integrating these types of data through analyses of B cell and T cell receptor sequences. Scifer preprocesses raw data from index sorts and immune receptor Sanger sequencing. It identifies high-quality sequences based on selected parameters, such as length, Phred scores, and heavy-chain complementarity-determining region 3 (HCDR3) quality. As a result, the quality of germline assignments is significantly increased and spurious variable gene mutations are reduced. Scifer is automated and can process thousands of sequences in less than an hour. Its output includes quality control reports, FASTA files, summarized tables, and electropherograms for manual inspection. In summary, scifer is user-friendly software that speeds up the analysis of immune receptor repertoire sequences, offering wide applicability.
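
The length and Phred-score filtering described in the abstract can be illustrated with a minimal Python sketch. This is not scifer's API (scifer is an R package); the `SangerRead` class, the `passes_qc` function, and the thresholds of 300 bases and mean Phred 30 are hypothetical choices made only for this example. The one fixed fact used is the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SangerRead:
    """Hypothetical container for one read (not a scifer data structure)."""
    name: str
    sequence: str
    phred: List[int]  # one Phred quality score per base


def error_probability(q: float) -> float:
    """Convert a Phred score Q to a base-call error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def passes_qc(read: SangerRead, min_length: int = 300,
              min_mean_phred: float = 30.0) -> bool:
    """Keep a read only if it is long enough and its mean Phred score is high enough.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if len(read.sequence) < min_length or not read.phred:
        return False
    mean_phred = sum(read.phred) / len(read.phred)
    return mean_phred >= min_mean_phred


reads = [
    SangerRead("A01", "ACGT" * 100, [40] * 400),  # long, high quality
    SangerRead("A02", "ACGT" * 100, [12] * 400),  # long, low quality
    SangerRead("A03", "ACGT" * 20, [40] * 80),    # high quality, too short
]
kept = [r.name for r in reads if passes_qc(r)]
print(kept)  # only A01 passes both filters
```

A real pipeline would also assess the quality of the HCDR3 region specifically, which this sketch leaves out because the abstract does not say how that is scored.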

Volume: 16 Article: 100046

Citations: 0

Lessons learned from the IMMREP23 TCR-epitope prediction challenge

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-12-01 Epub Date: 2024-09-28 DOI: 10.1016/j.immuno.2024.100045

Morten Nielsen , Anne Eugster , Mathias Fynbo Jensen , Manisha Goel , Andreas Tiffeau-Mayer , Aurelien Pelissier , Sebastiaan Valkiers , María Rodríguez Martínez , Barthélémy Meynard-Piganeeau , Victor Greiff , Thierry Mora , Aleksandra M. Walczak , Giancarlo Croce , Dana L Moreno , David Gfeller , Pieter Meysman , Justin Barton

Here, we present the findings from IMMREP23, the second benchmark competition focused on predicting the specificity of TCR-pMHC interactions.

The interaction of T cell receptors (TCRs) with their peptide-MHC (pMHC) targets is a cornerstone of the cellular immune system. Over the last decade, substantial progress has been made within the field of TCR specificity prediction, providing proof of concept for predicting TCR-pMHC interactions in a narrow space of “seen” pMHC targets where substantial training data is available. However, a significant challenge persists in extending the predictive capability to novel “unseen” pMHC targets. Furthermore, the performance of proposed methods is often challenged when evaluated outside the initial publication and data sets.

To address these issues, the IMMREP23 challenge invited participants to predict, for a given test set of TCR-pMHC pairs, the likelihood that each pair would bind. A total of 53 teams participated, providing 398 submissions.

The benchmark confirms that current methods achieve reasonable performance in the “seen” pMHC setting. However, most participating methods performed close to random on the subset of “unseen” peptides, underlining that this prediction challenge remains essentially unsolved.
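
To make "close to random performance" concrete, here is a minimal Python sketch of one common way to score binding-likelihood predictions: the area under the ROC curve (AUC), where 0.5 corresponds to random ranking and 1.0 to perfect ranking. The abstract does not state which metric the challenge used, so the choice of AUC, the `roc_auc` function, and the toy labels and scores are all assumptions for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC computed as the probability that a randomly chosen binder (label 1)
    receives a higher score than a randomly chosen non-binder (label 0).
    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]                            # toy TCR-pMHC pairs
perfect = roc_auc(labels, [0.9, 0.8, 0.3, 0.1])  # every binder outranks every non-binder
partial = roc_auc(labels, [0.9, 0.2, 0.3, 0.1])  # one binder is ranked below a non-binder
uninformative = roc_auc(labels, [0.5] * 4)       # constant score carries no information
print(perfect, partial, uninformative)
```

The pairwise formulation is quadratic in the number of pairs, which is fine for a toy example; a rank-based implementation would be preferable for a full test set.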

Finally, another key lesson from the benchmark is the critical issue of data leakage. Specifically, the data set construction procedure employed in IMMREP23 led to biases in the negative test data set. Several participating teams identified these biases, which complicated the interpretation of the benchmark results. Based on these results, we put forward suggestions on how future competitions could avoid such data leakage and biases.

Volume: 16 Article: 100045

Citations: 0

Data mining antibody sequences for database searching in bottom-up proteomics

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-09-01 Epub Date: 2024-08-22 DOI: 10.1016/j.immuno.2024.100042

Xuan-Tung Trinh , Rebecca Freitag , Konrad Krawczyk , Veit Schwämmle

Mass spectrometry-based proteomics facilitates the identification and quantification of thousands of proteins but encounters challenges in measuring human antibodies due to their vast diversity. Bottom-up proteomics methods primarily rely on database searches, comparing experimental peptide values to theoretical database sequences. While the human body can produce millions of distinct antibodies, current databases, such as UniProtKB/Swiss-Prot, contain only 1095 sequences (as of January 2024), potentially hindering antibody identification via mass spectrometry. Therefore, expanding the database is crucial for discovering new antibodies. Recent genomic studies have amassed millions of human antibody sequences in the Observed Antibody Space (OAS) database, yet this data remains underutilized. Leveraging this vast collection, we conduct efficient database searches in publicly available proteomics data, focusing on SARS-CoV-2. In our study, thirty million antibody heavy-chain sequences from 146 SARS-CoV-2 patients in the OAS database were digested in silico to obtain 18 million unique peptides. These peptides form the basis for new bottom-up proteomics databases. We used those databases to search for new antibody peptides in publicly available SARS-CoV-2 human plasma samples in the Proteomics Identification Database (PRIDE). This approach avoids false positives in antibody peptide identification, as confirmed by searching against negative controls (brain samples) and employing different database sizes. We show that new antibody peptides were found in previous plasma samples and expect that the newly discovered antibody peptides can be further employed to develop therapeutic antibodies. The method should be broadly applicable to finding characteristic antibodies for other diseases.
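
The in silico digestion step can be sketched in a few lines of Python. The abstract does not name the protease or any length limits, so this example assumes the conventional trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a hypothetical peptide length window of 7 to 30 residues. The function names and toy sequences are likewise illustrative and are not real antibody sequences.

```python
import re
from typing import Iterable, List, Set


def tryptic_digest(protein: str, min_len: int = 7, max_len: int = 30) -> List[str]:
    """Cleave after K or R unless followed by P, then keep peptides in a length window.

    The trypsin rule and the length window are assumptions for this sketch.
    """
    # Zero-width split: the position must be preceded by K/R and not followed by P.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def unique_peptides(proteins: Iterable[str]) -> Set[str]:
    """Digest every sequence and pool the peptides into one de-duplicated set."""
    pooled: Set[str] = set()
    for protein in proteins:
        pooled.update(tryptic_digest(protein))
    return pooled


toy_sequences = [
    "AAAAAAAKCCCCCCCR",     # two peptides
    "AAAAAAAKDDDDDDRPDDK",  # shares the first peptide; R before P is not cleaved
]
peptides = unique_peptides(toy_sequences)
print(sorted(peptides))
```

Pooling into a set is what reduces a very large number of input sequences to a smaller number of unique peptides, since related antibodies share many framework-region peptides.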

Volume: 15 Article: 100042

Citations: 0

In silico modelling of CD8 T cell immune response links genetic regulation to population dynamics

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-09-01 Epub Date: 2024-09-07 DOI: 10.1016/j.immuno.2024.100043

Thi Nhu Thao Nguyen , Madge Martin , Christophe Arpin , Samuel Bernard , Olivier Gandrillon , Fabien Crauste

The CD8 T cell immune response operates at multiple temporal and spatial scales, from early, complex biochemical and biomechanical processes up to long-term cell population behavior.

In order to model this response, we devised a multiscale agent-based approach using Simuscale software. Within each agent (cell) of our model, we introduced a gene regulatory network (GRN) based upon a piecewise deterministic Markov process formalism. Cell fate – differentiation, proliferation, death – was coupled to the state of the GRN through rule-based mechanisms. Cells interact in a 3D computational domain and signal to each other via cell–cell contacts, influencing the GRN behavior.

Results show the ability of the model to correctly capture both population behavior and molecular time-dependent evolution. We examined the impact of several parameters on molecular and population dynamics, and demonstrated the added value of using a multiscale approach by showing the influence of molecular parameters, particularly protein degradation rates, on the outcome of the response, such as effector and memory cell counts.
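
The piecewise deterministic Markov process (PDMP) formalism mentioned above can be illustrated with a minimal single-gene Python sketch. It is not the Simuscale model: it assumes one promoter that switches on and off at constant rates `k_on` and `k_off` (the random jumps), with protein level following dP/dt = s·E − d·P between jumps (the deterministic flow), where E is 1 when the promoter is on and 0 otherwise. In a gene regulatory network the switching rates would depend on other proteins; all names and parameter values here are hypothetical.

```python
import math
import random
from typing import List, Tuple


def simulate_pdmp(k_on: float, k_off: float, s: float, d: float,
                  t_end: float, seed: int = 0) -> List[Tuple[float, bool, float]]:
    """Simulate one gene as a PDMP and return (time, promoter_on, protein) at each jump."""
    rng = random.Random(seed)
    t, promoter_on, protein = 0.0, False, 0.0
    trajectory = [(t, promoter_on, protein)]
    while t < t_end:
        # Random part: exponential waiting time until the promoter switches.
        rate = k_off if promoter_on else k_on
        next_t = min(t + rng.expovariate(rate), t_end)
        # Deterministic part: exact solution of dP/dt = s*E - d*P over the dwell time,
        # relaxing towards s/d when on and towards 0 when off.
        target = s / d if promoter_on else 0.0
        protein = target + (protein - target) * math.exp(-d * (next_t - t))
        t = next_t
        if t < t_end:
            promoter_on = not promoter_on
        trajectory.append((t, promoter_on, protein))
    return trajectory


slow_decay = simulate_pdmp(k_on=0.5, k_off=0.5, s=10.0, d=1.0, t_end=50.0, seed=1)
fast_decay = simulate_pdmp(k_on=0.5, k_off=0.5, s=10.0, d=5.0, t_end=50.0, seed=1)
print(max(p for _, _, p in slow_decay), max(p for _, _, p in fast_decay))
```

In this toy model the protein level can never exceed s/d, so raising the degradation rate d lowers the attainable protein level, which is one simple way a molecular parameter could feed into rule-based cell fate decisions.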

Volume: 15 Article: 100043

Citations: 0

T-cell receptor binding prediction: A machine learning revolution

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-09-01 Epub Date: 2024-07-22 DOI: 10.1016/j.immuno.2024.100040

Anna Weber , Aurélien Pélissier , María Rodríguez Martínez

Recent advancements in immune sequencing and experimental techniques are generating extensive T cell receptor (TCR) repertoire data, enabling the development of models to predict TCR binding specificity. Despite the computational challenges posed by the vast diversity of TCRs and epitopes, significant progress has been made. This review explores the evolution of computational models designed for this task, emphasizing machine learning efforts, including early unsupervised clustering approaches, supervised models, and recent applications of Protein Language Models (PLMs), deep learning models pretrained on extensive collections of unlabeled protein sequences that capture crucial biological properties.

We survey the most prominent models in each category and offer a critical discussion of recurrent challenges, including the lack of generalization to new epitopes, dataset biases, and shortcomings in model validation designs. Focusing on PLMs, we discuss the transformative impact of Transformer-based protein models in bioinformatics, particularly in TCR specificity analysis. We discuss recent studies that exploit PLMs to deliver notably competitive performance in TCR-related tasks, while also examining current limitations and future directions. Lastly, we address the pressing need for improved interpretability in these often opaque models, and examine current efforts to extract biological insights from large black-box models.

Volume: 15 Article: 100040

Citations: 0

Navigating the immunosuppressive brain tumor microenvironment using spatial biology

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-09-01 Epub Date: 2024-08-13 DOI: 10.1016/j.immuno.2024.100041

Samuel S. Widodo , Marija Dinevska , Stanley S. Stylli , Adriano L. Martinelli , Marianna Rapsomaniki , Theo Mantamadiotis

With the application of spatial biology, the detection and identification of the diverse cell types present in the tumor microenvironment, including specific immune subsets, is possible at single-cell resolution. Since spatial biology analysis of tumor tissue allows multiple biological parameters to be measured, including cell type, cell number, cell state, as well as the precise location and the spatial relationship of every cell to other cells and histopathological hallmarks, a vast amount of data is generated. The power of this is realized when correlating the spatial biology data with clinical data for each patient from whom the tissue was collected during biopsy or surgery, conducted as part of the patient's diagnosis and treatment. Aside from the enormous leap in chemistry and molecular biology technology required to develop the analytical tools for spatial biology, the collection and analysis of cells in the tumor microenvironment has been possible only with the development of computational tools capable of deciphering tumor tissue complexity to predict tumor evolution, response to treatment, and the role of immune cells in regulating tumor biology. Here we describe how spatial biology analysis, combined with computational analysis, has been used to deconstruct the complexity of the brain tumor microenvironment and shed light on why brain tumors exhibit extreme immunosuppression. We also discuss how the understanding gained using spatial biology has shed light on how tumor immunosuppression can be overcome.

Volume: 15 Article: 100041

Citations: 0

Guiding a language-model based protein design method towards MHC Class-I immune-visibility targets in vaccines and therapeutics

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-06-01 Epub Date: 2024-05-07 DOI: 10.1016/j.immuno.2024.100035

Hans-Christof Gasser , Diego A. Oyarzún , Ajitha Rajan , Javier Antonio Alfaro

Proteins have an arsenal of medical applications that include disrupting protein interactions, acting as potent vaccines, and replacing genetically deficient proteins. While therapeutics must avoid triggering unwanted immune-responses, vaccines should support a robust immune-reaction targeting a broad range of pathogen variants. Therefore, computational methods that modify proteins’ immunogenicity without disrupting function are needed. While many components of the immune-system can be involved in a reaction, we focus on Cytotoxic T-lymphocytes (CTLs). These target short peptides presented via the MHC Class I (MHC-I) pathway. To explore the limits of modifying the visibility of those peptides to CTLs within the distribution of naturally occurring sequences, we developed a novel machine learning technique, CAPE-XVAE. It combines a language model with reinforcement learning to modify a protein’s immune-visibility. Our results show that CAPE-XVAE effectively modifies the visibility of the HIV Nef protein to CTLs. We contrast CAPE-XVAE with CAPE-Packer, a physics-based method we also developed. Compared to CAPE-Packer, the machine learning approach suggests sequences that draw upon local sequence similarities in the training set. This is beneficial for vaccine development, where the sequence should be representative of the real viral population. Additionally, the language model approach holds promise for preserving both known and unknown functional constraints, which is essential for the immune-modulation of therapeutic proteins. In contrast, CAPE-Packer emphasizes preserving the protein’s overall fold and can reach greater extremes of immune-visibility, but falls short of capturing the sequence diversity of viral variants available to learn from. Source code: https://github.com/hcgasser/CAPE (Tag: v1.1)

Citations: 0

In silico single-cell metabolism analysis unravels a new transition stage of CD8 T cells 4 days post-infection

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-06-01 Epub Date: 2024-05-30 DOI: 10.1016/j.immuno.2024.100038

Christophe Arpin , Franck Picard , Olivier Gandrillon

Proper differentiation of CD8 T cells during antiviral responses relies on metabolic adaptations. Herein, we investigated global metabolic activity in single CD8 T cells along an in vivo response by estimating metabolic fluxes from single-cell RNA-sequencing data. The approach was validated by the observation of metabolic variations known from experimental studies on bulk cell populations, while adding temporally detailed information and unravelling as yet undescribed sections of CD8 T cell metabolism that are affected by cellular differentiation. Furthermore, inter-cellular variability in gene expression levels, highlighted by single-cell data, and heterogeneity of metabolic activity 4 days post-infection revealed a new transition stage accompanied by a metabolic switch in activated cells differentiating into full-blown effectors.

Citations: 0

SARS-CoV-2-identical protein regions found in mammalian coronaviruses have immunogenic potential and can imply cross-protection

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-06-01 Epub Date: 2024-04-03 DOI: 10.1016/j.immuno.2024.100034

Luciano Rodrigo Lopes

Coronaviruses are known to infect a wide range of mammals. In humans, coronaviruses have been responsible for causing the common cold. The immune response against common cold coronaviruses appears to elicit a cross-protective response to SARS-CoV-2. This study identified protein regions in the proteomes of mammalian coronaviruses that are identical to those of SARS-CoV-2. Using bioinformatics analysis, the study predicted the involvement of SARS-CoV-2-identical protein regions, identified in mammalian coronaviruses, in antigen-presenting processes and their ability to elicit immune responses. The SARS-CoV-2-identical protein regions were predominantly found in the proteomes of betacoronaviruses and were less prevalent in alphacoronaviruses. Alphacoronaviruses, such as FCoV in domestic felines and MCoV in minks, are known to infect species highly susceptible to SARS-CoV-2. In contrast, betacoronaviruses infect mammals with lower susceptibility to SARS-CoV-2, including dogs, mice, and farmed animals. Furthermore, betacoronaviruses exhibited a higher number of peptides with an increased potential for efficient presentation during the antigen-presenting process, indicating their greater immunogenicity. Conversely, the SW1 gammacoronavirus showed a lower count of SARS-CoV-2-identical protein regions and a reduced potential for efficient antigen presentation. The results suggested that the elevated number of SARS-CoV-2-identical stretches found in betacoronaviruses may provide potential cross-protection between SARS-CoV-2 and mammalian betacoronaviruses. This cross-protection could be similar to that observed between human coronaviruses causing the common cold and SARS-CoV-2. The limited numbers of identical stretches observed in the proteomes of FCoV, MCoV, and SW1-CoV may offer an explanation for the susceptibility of cats and minks to SARS-CoV-2, as well as a potential vulnerability in cetaceans.
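The search for identical protein regions described above can be illustrated with a minimal sketch. It is not the study's pipeline: it simply reports every fixed-length window of a reference sequence that also occurs verbatim in another sequence. The window length `k = 6`, the function names, and both toy sequences are assumptions made for illustration; the abstract does not state the stretch length that was used.

```python
from typing import List, Set, Tuple

def kmer_set(sequence: str, k: int) -> Set[str]:
    """Collect all contiguous k-residue windows of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def identical_regions(reference: str, other: str, k: int = 6) -> List[Tuple[int, str]]:
    """Return (start, peptide) for each reference window found verbatim in `other`.

    Start positions are 0-based offsets into `reference`.
    """
    other_kmers = kmer_set(other, k)
    return [
        (i, reference[i:i + k])
        for i in range(len(reference) - k + 1)
        if reference[i:i + k] in other_kmers
    ]

# Toy sequences (made up, not real coronavirus proteins).
reference_protein = "MKTLLVAGGHWQRS"
other_protein = "PPTLLVAGGAAWQRS"

shared = identical_regions(reference_protein, other_protein, k=6)
print(shared)
```

Counting such shared windows per virus proteome would give a simple proxy for the "number of identical stretches" that the abstract compares across alpha-, beta-, and gammacoronaviruses.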

Citations: 0

Do domain-specific protein language models outperform general models on immunology-related tasks?

Immunoinformatics (Amsterdam, Netherlands)

Pub Date : 2024-06-01 Epub Date: 2024-05-18 DOI: 10.1016/j.immuno.2024.100036

Nicolas Deutschmann , Aurelien Pelissier , Anna Weber , Shuaijun Gao , Jasmina Bogojeska , María Rodríguez Martínez

Deciphering the antigen recognition capabilities of T-cell and B-cell receptors (antibodies) is essential for advancing our understanding of adaptive immune system responses. In recent years, protein language models (PLMs) have facilitated the development of bioinformatic pipelines in which complex amino acid sequences are transformed into vectorized embeddings, which are then applied to a range of downstream analytical tasks. With their success, we have witnessed the emergence of domain-specific PLMs tailored to specific proteins, such as immune receptors. Domain-specific models are often assumed to possess enhanced representation capabilities for targeted applications; however, this assumption has not been thoroughly evaluated. In this manuscript, we assess the efficacy of both generalist and domain-specific transformer-based embeddings in characterizing B and T-cell receptors. Specifically, we assess the accuracy of models that leverage these embeddings to predict antigen specificity and elucidate the evolutionary changes that B cells undergo during an immune response. We demonstrate that the prevailing notion of domain-specific models outperforming general models requires a more nuanced examination. We also observe remarkable differences between generalist and domain-specific PLMs, not only in terms of performance but also in the manner in which they encode information. Finally, we observe that the model size and the choice of embedding layer are essential hyperparameters whose best values differ between tasks. Overall, our analyses reveal the promising potential of PLMs in modeling protein function while providing insights into their information-handling capabilities. We also discuss the crucial factors that should be taken into account when selecting a PLM tailored to a particular task.
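The embedding pipeline described above (sequence to per-residue vectors, to one pooled vector, to a downstream task) can be sketched as follows. This is an illustration under loud assumptions, not the authors' benchmark: `toy_residue_embeddings` is a hypothetical stand-in that returns one-hot vectors where a real PLM would return learned per-residue embeddings from a chosen layer, the downstream task is reduced to a nearest-reference lookup by cosine similarity, and the receptor strings and labels are made up.

```python
import math
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_residue_embeddings(sequence: str) -> List[List[float]]:
    """HYPOTHETICAL stand-in for a PLM: one one-hot vector per residue."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS] for residue in sequence]

def mean_pool(residue_vectors: List[List[float]]) -> List[float]:
    """Average per-residue vectors into a single fixed-length sequence embedding."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    n = len(residue_vectors)
    return [sum(column) / n for column in zip(*residue_vectors)]

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_label(query: str, references: Dict[str, str]) -> str:
    """Assign the label of the reference sequence whose embedding is most similar."""
    q = mean_pool(toy_residue_embeddings(query))
    return max(
        references,
        key=lambda label: cosine(q, mean_pool(toy_residue_embeddings(references[label]))),
    )

# Made-up CDR3-like strings with placeholder specificity labels.
references = {"antigen_A": "CASSLGQAYEQYF", "antigen_B": "CAWSVGGNTIYF"}
predicted = nearest_label("CASSLGQGYEQYF", references)
print(predicted)
```

Swapping `toy_residue_embeddings` for a generalist or a domain-specific PLM, and varying which layer the vectors come from, is the kind of comparison the abstract describes; the pooling and downstream steps would stay the same.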

Citations: 0