Pub Date: 2025-09-17; Epub Date: 2025-09-10; DOI: 10.1016/j.cels.2025.101387
Francesca-Zhoufan Li, Jason Yang, Kadina E Johnston, Emre Gürsoy, Yisong Yue, Frances H Arnold
Various machine learning-assisted directed evolution (MLDE) strategies have been shown to identify high-fitness protein variants more efficiently than typical directed evolution approaches. However, limited understanding of the factors influencing MLDE performance across diverse proteins has hindered optimal strategy selection for wet-lab campaigns. To address this, we systematically analyzed multiple MLDE strategies, including active learning and focused training using six distinct zero-shot predictors, across 16 diverse protein fitness landscapes. By quantifying landscape navigability with six attributes, we found that MLDE offers a greater advantage on landscapes that are more challenging for directed evolution, especially when focused training is combined with active learning. Despite varying levels of advantage across landscapes, focused training with zero-shot predictors leveraging distinct evolutionary, structural, and stability knowledge sources consistently outperforms random sampling for both binding interactions and enzyme activities. Our findings provide practical guidelines for selecting MLDE strategies for protein engineering. A record of this paper's transparent peer review process is included in the supplemental information.
Title: Evaluation of machine learning-assisted directed evolution across diverse combinatorial landscapes. Journal: Cell Systems, article 101387.
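The workflow summarized in this abstract (focused training guided by a zero-shot predictor, followed by active-learning rounds) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic 4-site landscape, the noisy `zero_shot` stand-in, the ridge-regression surrogate, and the names `one_hot`, `ridge_fit_predict`, and `mlde_campaign` are all hypothetical.

```python
import itertools
import numpy as np


def one_hot(variants, n_letters):
    """Flattened one-hot encoding of integer-coded variants (rows = variants)."""
    n, n_sites = variants.shape
    x = np.zeros((n, n_sites * n_letters))
    rows = np.repeat(np.arange(n), n_sites)
    cols = (np.arange(n_sites) * n_letters + variants).ravel()
    x[rows, cols] = 1.0
    return x


def ridge_fit_predict(x_train, y_train, x_all, alpha=1.0):
    """Closed-form ridge regression; returns predictions for every row of x_all."""
    d = x_train.shape[1]
    w = np.linalg.solve(x_train.T @ x_train + alpha * np.eye(d), x_train.T @ y_train)
    return x_all @ w


def mlde_campaign(x, fitness, zero_shot, rng, n_init=24, n_rounds=3,
                  batch=24, focus_frac=0.2, focused=True):
    """Return indices of all variants 'measured' during a simulated campaign."""
    n = len(fitness)
    if focused:
        # Focused training: draw the initial set from the top zero-shot fraction.
        pool = np.argsort(zero_shot)[::-1][: int(focus_frac * n)]
    else:
        # Baseline: draw the initial set uniformly from the whole library.
        pool = np.arange(n)
    labeled = list(rng.choice(pool, size=n_init, replace=False))
    for _ in range(n_rounds):
        # Active learning (greedy): refit, then measure the top-predicted unlabeled variants.
        pred = ridge_fit_predict(x[labeled], fitness[labeled], x)
        pred[labeled] = -np.inf  # never re-measure a labeled variant
        labeled.extend(np.argsort(pred)[::-1][:batch])
    return np.array(labeled)


rng = np.random.default_rng(0)
n_sites, n_letters = 4, 5  # toy library of 5**4 = 625 variants
variants = np.array(list(itertools.product(range(n_letters), repeat=n_sites)))
x = one_hot(variants, n_letters)

# Synthetic fitness: additive effects plus one pairwise epistatic term (assumption).
additive = x @ rng.normal(size=x.shape[1])
epistasis = rng.normal(size=(n_letters, n_letters))[variants[:, 0], variants[:, 1]]
fitness = additive + epistasis

# Hypothetical zero-shot score: a noisy proxy of fitness, standing in for a real predictor.
zero_shot = fitness + rng.normal(scale=2.0, size=len(fitness))

picked = mlde_campaign(x, fitness, zero_shot, rng)
print("best measured fitness:", fitness[picked].max(), "global best:", fitness.max())
```

Calling `mlde_campaign(..., focused=False)` gives a random-sampling baseline on the same toy landscape. The greedy acquisition rule is the simplest choice; an uncertainty-aware rule would be a natural substitute.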
Deciphering cell state dynamics is crucial for understanding biological processes. Single-cell lineage-tracing technologies provide an effective way to track single-cell lineages with heritable DNA barcodes, but the high missing rates of lineage barcodes and intra-clonal heterogeneity pose major challenges for dissecting the mechanisms of cell fate decisions. Here, we systematically evaluate the features of single-cell lineage-tracing data and then develop an algorithm, scTrace+, to enhance cell dynamic traces by incorporating multi-faceted transcriptomic similarities into lineage relationships via a kernelized probabilistic matrix factorization model. We assess its feasibility and performance by conducting ablation and benchmarking experiments on multiple real datasets and show that scTrace+ can accurately predict the fates of cells. Further, scTrace+ effectively identifies important driver genes implicated in cellular fate decisions of diverse biological processes, such as cell differentiation or tumor drug responses. A record of this paper's transparent peer review process is included in the supplemental information.
Title: scTrace+: Enhancing cell fate inference by integrating the lineage-tracing and multi-faceted transcriptomic similarity information. Journal: Cell Systems, article 101398.
Authors: Wenbo Guo, Zeyu Chen, Xinqi Li, Jingmin Huang, Qifan Hu, Jin Gu
Pub Date: 2025-09-17; DOI: 10.1016/j.cels.2025.101398
Pub Date: 2025-09-17; Epub Date: 2025-08-22; DOI: 10.1016/j.cels.2025.101371
Ruoxi Zhang, Ben Ma, Gang Xu, Jianpeng Ma
Protein language models (PLMs), such as the highly successful ESM-2, have proven particularly effective. However, language models designed for RNA continue to face challenges. A key question is as follows: can the information derived from PLMs be harnessed and transferred to RNA? To investigate this, a model termed ProtRNA has been developed by a cross-modality transfer learning strategy for addressing the challenges posed by RNA's limited and less conserved sequences. By leveraging the evolutionary and physicochemical information encoded in protein sequences, the ESM-2 model is adapted to processing "low-resource" RNA sequence data. The results show comparable or superior performance in various RNA downstream tasks, with only 1/8 the trainable parameters and 1/6 the training data employed by the primary reference baseline RNA language model. This approach highlights the potential of cross-modality transfer learning in biological language models.
Title: ProtRNA: A protein-derived RNA language model by cross-modality transfer learning. Journal: Cell Systems, article 101371.
Pub Date: 2025-09-17; Epub Date: 2025-08-13; DOI: 10.1016/j.cels.2025.101368
Milind Jagota, Chloe Hsu, Thomas Mazumder, Kevin Sung, William S DeWitt, Jennifer Listgarten, Frederick A Matsen IV, Chun Jimmie Ye, Yun S Song
Although antibody sequences are highly diverse, they are constrained by requirements for expression and limited off-target reactivity. Describing which sequences violate such constraints has proven to be difficult. Here, we introduce a machine-learning framework to leverage a previously underutilized source of data for this problem. We use human single-cell sequencing data to find instances of allelic inclusion, a rare event where B cells express two different antibody light chains as mRNA. Previous studies suggest that one of these chains is either autoreactive or non-expressing as protein. We train machine-learning models to identify abnormal sequences associated with allelic inclusion. The resulting models generalize to predict antibody properties including polyreactivity, surface expression, and mutation usage, outperforming methods that do not use allelic inclusion data. We also investigate similar selection forces on the heavy chain in mice and observe that surrogate light-chain pairing has a large impact on heavy-chain diversity.
Title: Learning antibody sequence constraints from allelic inclusion. Journal: Cell Systems, article 101368.
Pub Date: 2025-09-17; Epub Date: 2025-09-10; DOI: 10.1016/j.cels.2025.101397
Daniel Martinez-Martinez, Tanara V Peres, Kristin Gehling, Leonor Quintaneiro, Cecilia Cabrera, Maksym Cherevatenko, Stephen J Cutty, Lena Best, Georgios Marinos, Johannes Zimmerman, Ayesha Safoor, Despoina Chrysostomou, Joao B Mokochinski, Alex Montoya, Susanne Brodesser, Michalina Zatorska, Timothy Scott, Ivan Andrew, Holger Kramer, Masuma Begum, Bian Zhang, Bernard T Golding, Julian R Marchesi, Susumu Hirabayashi, Christoph Kaleta, Alexis R Barr, Christian Frezza, Helena M Cochemé, Filipe Cabreiro
Understanding how the microbiota produces regulatory metabolites is of significance for cancer and cancer therapy. Using a host-microbe-drug-nutrient 4-way screening approach, we evaluated the role of nutrition at the molecular level in the context of 5-fluorouracil toxicity. Notably, our screens identified the metabolite 2-methylisocitrate, which was found to be produced and enriched in human tumor-associated microbiota. 2-methylisocitrate exhibits anti-proliferative properties across genetically and tissue-diverse cancer cell lines, three-dimensional (3D) spheroids, and an in vivo Drosophila gut tumor model, where it reduced tumor dissemination and increased survival. Chemical landscape interaction screens identified drug-metabolite signatures and highlighted the synergy between 5-fluorouracil and 2-methylisocitrate. Multi-omic analyses revealed that 2-methylisocitrate acts via multiple cellular pathways linking metabolism and DNA damage to regulate chemotherapy. Finally, we converted 2-methylisocitrate into its trimethyl ester, thereby enhancing its potency. This work highlights the great impact of microbiome-derived metabolites on tumor proliferation and their potential as promising co-adjuvants for cancer treatment.
Title: Chemotherapy modulation by a cancer-associated microbiota metabolite. Journal: Cell Systems, article 101397.
Spatial transcriptomics allows for the measurement of gene expression within the native tissue context. However, despite technological advancements, computational methods to link cell states with their microenvironment and compare these relationships across samples and conditions remain limited. To address this, we introduce Tissue Motif-Based Spatial Inference across Conditions (TissueMosaic), a self-supervised convolutional neural network designed to discover and represent tissue architectural motifs from multi-sample spatial transcriptomic datasets. TissueMosaic further links these motifs to gene expression, enabling the study of how changes in tissue structure impact cell-intrinsic function. TissueMosaic increases the signal-to-noise ratio of spatial differential expression analysis through a motif enrichment strategy, resulting in more reliable detection of genes that covary with tissue structure changes. Here, we demonstrate that TissueMosaic learns representations that outperform neighborhood cell-type composition baselines and existing methods on downstream tasks. These findings underscore the potential of self-supervised learning to advance spatial transcriptomics discovery.
Title: TissueMosaic: Self-supervised learning of tissue representations enables differential spatial transcriptomics across samples. Journal: Cell Systems, article 101394.
Authors: Sandeep Kambhampati, Luca D'Alessio, Fedor Grab, Stephen Fleming, Sophia Liu, Ruth Raichur, Fei Chen, Mehrtash Babadi
Pub Date: 2025-09-17; DOI: 10.1016/j.cels.2025.101394
Pub Date: 2025-09-17; Epub Date: 2025-09-08; DOI: 10.1016/j.cels.2025.101374
Huangqingbo Sun, Shiqiu Yu, Anna Martinez Casals, Anna Bäckström, Yuxin Lu, Cecilia Lindskog, Matthew Ruffalo, Emma Lundberg, Robert F Murphy
Identifying cell types in highly multiplexed images is essential for understanding tissue spatial organization. Current cell-type annotation methods often rely on extensive reference images and manual adjustments. In this work, we present a tool, the Robust Image-Based Cell Annotator (RIBCA), that enables accurate, automated, unbiased, and fine-grained cell-type annotation for images with a wide range of antibody panels without requiring additional model training or human intervention. Our tool has successfully annotated over 3 million cells, revealing the spatial organization of various cell types across more than 40 different human tissues. It is open source and features a modular design, allowing for easy extension to additional cell types.
Title: Flexible and robust cell-type annotation for highly multiplexed tissue images. Journal: Cell Systems, article 101374. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12728825/pdf/
Pub Date: 2025-08-20; DOI: 10.1016/j.cels.2025.101370
Connor Tiffany, Joseph P Zackular
Microbial colonization is shaped by a complex network of interactions that influence both commensals and pathogens, including the public-health threat Clostridioides difficile. In this issue of Cell Systems, Carr et al. present a community-scale modeling framework for predicting colonization and metabolism, offering new insights into C. difficile infection.
Title: Microbial bellwether: Community-scale metabolic modeling to predict infection. Journal: Cell Systems, volume 16, issue 8, article 101370. URL: https://doi.org/10.1016/j.cels.2025.101370
Pub Date: 2025-08-20; Epub Date: 2025-07-29; DOI: 10.1016/j.cels.2025.101347
Tianyu Lu, Melissa Liu, Yilin Chen, Jinho Kim, Po-Ssu Huang
Recent advances in generative modeling enable efficient sampling of protein structures, but their tendency to optimize for designability imposes a bias toward idealized structures at the expense of loops and other complex structural motifs that are critical for function. We introduce SHAPES (structural and hierarchical assessment of proteins with embedding similarity) to evaluate five state-of-the-art generative models of protein structures. Using structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures, we reveal substantial undersampling of the observed protein structure space by these models. We use Fréchet protein distance (FPD) to quantify distributional coverage. Different models are distinct in their coverage behavior across different sampling noise scales and temperatures. The frequency of tertiary motifs (TERMs) further supports the observations. More robust sequence design and structure prediction methods are likely crucial in guiding the development of models with improved coverage of the designable protein space. A record of this paper's transparent peer review process is included in the supplemental information.
Title: Assessing generative model coverage of protein structures with SHAPES. Journal: Cell Systems, article 101347. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12321228/pdf/
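The SHAPES abstract names the Fréchet protein distance (FPD) without defining it. A minimal sketch follows, assuming FPD uses the standard Fréchet distance between Gaussians fitted to two sets of embeddings (as in the Fréchet inception distance): d² = ‖μ_a − μ_b‖² + Tr(Σ_a + Σ_b − 2(Σ_a Σ_b)^{1/2}). The function name `frechet_distance` and the random stand-in embeddings are hypothetical; the paper's actual embedding models and estimator are not described here.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Squared Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, n_features).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (clipped to absorb round-off).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Stand-in embeddings (assumption): random vectors instead of real structural embeddings.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 8))
generated = reference + 3.0  # same spread, shifted mean

print(frechet_distance(reference, reference))
print(frechet_distance(reference, generated))
```

Because the two sets in the second call share a covariance and differ only by a mean shift of 3 in each of 8 dimensions, the distance reduces to the squared mean difference, 8 × 3² = 72, which makes this a convenient sanity check.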
Pub Date: 2025-08-20; Epub Date: 2025-07-25; DOI: 10.1016/j.cels.2025.101345
Amitava Banerjee, David J Pattinson, Cornelia L Wincek, Paul Bunk, Armend Axhemi, Sarah R Chapin, Saket Navlakha, Hannah V Meyer
Comprehensively mapping all targets of a T cell receptor (TCR) is important for predicting pathogenic escape and off-target effects of TCR therapies. However, this mapping has been challenging due to lack of unbiased benchmarking datasets and computational methods sensitive to small-peptide mutations. To address this, we curated the benchmark for activation of T cells with cross-reactive avidity for epitopes (BATCAVE) database, encompassing near-complete single-amino-acid mutational assays, centered around 25 immunogenic epitopes, across both major histocompatibility complex classes, against 151 human and mouse TCRs, containing 22,000+ TCR-peptide pairs in total. We then introduce Bayesian inference of activation of TCR by mutant antigens (BATMAN), an interpretable Bayesian model, trained on BATCAVE, for predicting the peptides that activate a TCR, and an active learning extension, which efficiently maps targets of a novel TCR by selecting a few peptides to assay. We show that BATMAN outperforms existing methods, reveals structural and biochemical predictors of TCR-peptide interactions, and can predict polyclonal T cell responses and TCR targets with high sequence dissimilarity. A record of this paper's transparent peer review process is included in the supplemental information.
Title: T cell receptor cross-reactivity prediction improved by a comprehensive mutational scan database. Journal: Cell Systems, article 101345.