Pub Date : 2025-09-08 DOI: 10.1038/s42256-025-01103-w
Haoran Sun, Liang He, Pan Deng, Guoqing Liu, Zhiyu Zhao, Yuliang Jiang, Chuan Cao, Fusong Ju, Lijun Wu, Haiguang Liu, Tao Qin, Tie-Yan Liu
Protein engineering holds substantial promise for designing proteins with customized functions, yet the vast landscape of potential mutations versus limited laboratory capacity constrains the discovery of optimal sequences. Here, to address this, we present the μProtein framework, which accelerates protein engineering by combining μFormer, a deep learning model for accurate mutational effect prediction, with μSearch, a reinforcement learning algorithm designed to efficiently navigate the protein fitness landscape using μFormer as an oracle. μProtein leverages single-mutation data to predict optimal sequences with complex, multi-amino-acid mutations through its modelling of epistatic interactions and a multi-step search strategy. In addition to strong performance on benchmark datasets, μProtein identified high-gain-of-function multi-point mutants for the enzyme β-lactamase, surpassing one of the highest-known activity levels, in wet laboratory, trained solely on single-mutation data. These results demonstrate μProtein’s capability to discover impactful mutations across the vast protein sequence space, offering a robust, efficient approach for protein optimization. μProtein, combining deep learning and reinforcement learning, is developed to design high-function proteins. This framework, trained only on single-mutation data, discovers multi-site β-lactamase mutants with up to 2,000× growth rates.
Title: Accelerating protein engineering with fitness landscape modelling and reinforcement learning. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1446-1460.
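The idea of reaching multi-site mutants by repeatedly querying a fitness predictor can be illustrated with a small sketch. This is not μSearch, which the abstract describes as a reinforcement learning algorithm; a plain greedy search is swapped in here for brevity. `toy_oracle`, its adjacency bonus (a crude stand-in for epistasis) and the hidden target sequence are all hypothetical and play the role that μFormer plays in the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_oracle(seq, target):
    """Hypothetical fitness predictor (NOT muFormer).

    Scores one point per position matching a hidden target, plus 0.5 for
    each pair of adjacent matching positions, so that the gain from a
    mutation depends on its neighbours (a crude epistasis term).
    """
    matches = [a == b for a, b in zip(seq, target)]
    pair_bonus = sum(
        1 for i in range(len(matches) - 1) if matches[i] and matches[i + 1]
    )
    return float(sum(matches)) + 0.5 * pair_bonus


def greedy_search(start, oracle, max_steps):
    """Accumulate one point mutation per step, keeping the best-scoring one."""
    current, current_score = start, oracle(start)
    for _ in range(max_steps):
        best, best_score = current, current_score
        for pos in range(len(current)):
            for aa in AMINO_ACIDS:
                if aa == current[pos]:
                    continue
                candidate = current[:pos] + aa + current[pos + 1:]
                score = oracle(candidate)
                if score > best_score:
                    best, best_score = candidate, score
        if best == current:
            break  # no single mutation improves the predicted fitness
        current, current_score = best, best_score
    return current, current_score


# Three search steps yield a triple mutant of the starting sequence.
mutant, fitness = greedy_search(
    "MAAAA", lambda s: toy_oracle(s, "MKVLA"), max_steps=3
)
```

A greedy search stops at the first local optimum, which is one reason a learned search policy can be preferable on a rugged landscape.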
Pub Date : 2025-09-04 DOI: 10.1038/s42256-025-01108-5
Dickson Aruhomukama
Title: Applying genomic AI to combat antibiotic resistance in low-income countries. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1369-1370.
Pub Date : 2025-09-02 DOI: 10.1038/s42256-025-01117-4
Janosh Riebesell, Rhys E. A. Goodall, Philipp Benner, Yuan Chiang, Bowen Deng, Gerbrand Ceder, Mark Asta, Alpha A. Lee, Anubhav Jain, Kristin A. Persson
Title: Author Correction: A framework to evaluate machine learning crystal stability predictions. Journal: Nature Machine Intelligence, volume 7, issue 9, page 1586. Open-access PDF: https://www.nature.com/articles/s42256-025-01117-4.pdf
Pub Date : 2025-09-01 DOI: 10.1038/s42256-025-01090-y
Johannes Y. Lee, Sangjoon Lee, Abhishek Mishra, Xu Yan, Brandon McMahan, Brent Gaisford, Charles Kobashigawa, Mike Qu, Chang Xie, Jonathan C. Kao
Motor brain–computer interfaces (BCIs) decode neural signals to help people with paralysis move and communicate. Even with important advances in the past two decades, BCIs face a key obstacle to clinical viability: BCI performance should strongly outweigh costs and risks. To significantly increase the BCI performance, we use shared autonomy, where artificial intelligence (AI) copilots collaborate with BCI users to achieve task goals. We demonstrate this AI-BCI in a non-invasive BCI system decoding electroencephalography signals. We first contribute a hybrid adaptive decoding approach using a convolutional neural network and ReFIT-like Kalman filter, enabling healthy users and a participant with paralysis to control computer cursors and robotic arms via decoded electroencephalography signals. We then design two AI copilots to aid BCI users in a cursor control task and a robotic arm pick-and-place task. We demonstrate AI-BCIs that enable a participant with paralysis to achieve 3.9-times-higher performance in target hit rate during cursor control and control a robotic arm to sequentially move random blocks to random locations, a task they could not do without an AI copilot. As AI copilots improve, BCIs designed with shared autonomy may achieve higher performance. AI copilots are integrated into brain–computer interfaces, enabling a paralysed participant to achieve improved control of computer cursors and robotic arms. This shared autonomy approach offers a promising path to increase BCI performance and clinical viability.
Title: Brain–computer interface control with artificial intelligence copilots. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1510-1523.
Pub Date : 2025-08-26 DOI: 10.1038/s42256-025-01099-3
Shuchen Zhu, Heyang Hua, Shengquan Chen
Single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) deciphers genome-wide chromatin accessibility, providing profound insights into gene regulation mechanisms. With the rapid advance of sequencing technologies, scATAC-seq data typically encompass numerous samples from various conditions, resulting in complex batch effects, thus necessitating reliable integration tools. While numerous batch integration tools exist for single-cell RNA sequencing data, inherent data characteristic differences limit their effectiveness on scATAC-seq data. Existing integration methods for scATAC-seq data suffer from several fundamental limitations, such as disrupting the biological heterogeneity and focusing solely on low-dimensional correction, which may distort data and hinder downstream analysis. Here we propose Fountain, a deep learning framework for scATAC-seq data integration via rigorous barycentric mapping. Barycentric mapping transforms one data distribution to another in a principled and effective manner through optimal transport. By regularizing barycentric mapping with geometric data information, Fountain achieves accurate batch alignment while preserving biological heterogeneity. Comprehensive experiments across diverse real-world datasets demonstrate the advantages of Fountain over existing methods in batch correction and biological conservation. In addition, the trained Fountain model can integrate data from new batches alongside already integrated data without retraining, enabling continuous online data integration. Moreover, Fountain’s reconstruction strategy generates batch-corrected ATAC profiles, improving the capture of cellular heterogeneity and revealing cell-type-specific implications such as expression enrichment analysis and partitioned heritability analysis. Zhu, Hua and Chen propose Fountain, a deep learning framework for batch integration of scATAC-seq data that utilizes regularized barycentric mapping. 
It preserves biological heterogeneity, enabling online and original dimensionality integration.
Title: Rigorous integration of single-cell ATAC-seq data using regularized barycentric mapping. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1461-1477.
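For readers unfamiliar with barycentric mapping, the sketch below shows the generic construction on toy data: compute an optimal-transport plan between two point clouds, then move each source point to the plan-weighted average of the target points. It is not Fountain's implementation. The entropic (Sinkhorn) solver, the squared Euclidean cost, the uniform marginals and the `epsilon` value are assumptions chosen for brevity, and the geometric regularization described in the abstract is omitted.

```python
import math


def sinkhorn_plan(source, target, epsilon=0.5, n_iters=200):
    """Entropic optimal-transport plan between two uniformly weighted point clouds."""
    n, m = len(source), len(target)
    # Gibbs kernel built from squared Euclidean costs.
    kernel = [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / epsilon) for y in target]
        for x in source
    ]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns towards the uniform marginals.
        u = [(1.0 / n) / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_map(plan, target):
    """Send each source point to the plan-weighted mean of the target points."""
    dim = len(target[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append(
            [sum(w * y[d] for w, y in zip(row, target)) / mass for d in range(dim)]
        )
    return mapped


# Two one-dimensional "batches": the source batch is shifted onto the target batch.
source = [[0.0], [1.0]]
target = [[2.0], [3.0]]
mapped = barycentric_map(sinkhorn_plan(source, target), target)
```

With a small `epsilon` the plan approaches a hard matching and the mapped points approach individual target points; larger values pull every mapped point towards the target mean.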
Pub Date : 2025-08-25 DOI: 10.1038/s42256-025-01101-y
Melissa S. Cantú, Michael R. King
Title: LLMs as all-in-one tools to easily generate publication-ready citation diversity reports. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1371-1372.
Self-supervised learning (SSL) has emerged as a powerful approach for learning meaningful representations from large-scale unlabelled datasets in single-cell genomics. Richter et al. evaluated SSL pretext tasks on modelling single-cell RNA sequencing (scRNA-seq) data, demonstrating the effective use of SSL models. However, the transferability of these pretrained SSL models to the spatial transcriptomics domain remains unexplored. Here we assess the performance of three SSL models (random mask, gene programme mask and Barlow Twins) pretrained on scRNA-seq data with spatial transcriptomics datasets, focusing on cell-type prediction and spatial clustering. Our experiments demonstrate that the SSL model with random mask strategy exhibits the best overall performance among evaluated SSL models. Moreover, the models trained from scratch on spatial transcriptomics data outperform the fine-tuned SSL models on cell-type prediction, highlighting a domain gap between scRNA-seq and spatial transcriptomics data whose underlying causes remain an open question. Through expanded analyses of multiple imputation methods and data degradation scenarios, we demonstrate that gene imputation would degrade SSL model performance on cell-type prediction, an effect that is exacerbated by increasing data sparsity. Finally, integrating zero-shot random mask embeddings into chosen spatial clustering methods significantly enhanced their accuracy. Overall, our findings provide valuable insights into the limitations and potential of transferring SSL models to spatial transcriptomics and offer practical guidance for researchers leveraging pretrained models for spatial transcriptomics data analysis. Self-supervised learning models for single-cell RNA sequencing data exhibit poor transferability to spatial transcriptomics for cell-type prediction, although their learned features may enhance spatial analysis.
Title: Reusability report: Exploring the transferability of self-supervised learning models from single-cell to spatial transcriptomics. Authors: Chuangyi Han, Senlin Lin, Zhikang Wang, Yan Cui, Qi Zou, Zhiyuan Yuan. Pub Date : 2025-08-21 DOI: 10.1038/s42256-025-01097-5. Journal: Nature Machine Intelligence, volume 7, issue 9, pages 1414-1428.
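As a rough illustration of what a random-mask pretext task involves, the sketch below hides a random subset of genes in one expression profile and scores a reconstruction only on the hidden positions. It is not the code evaluated in the report or by Richter et al.; the 50% mask fraction, the zero fill value and the mean-squared-error objective are assumptions made for the example.

```python
import random


def random_mask(expression, mask_fraction=0.5, seed=0):
    """Zero out a random subset of genes; return the corrupted profile and masked indices."""
    rng = random.Random(seed)
    n_genes = len(expression)
    n_masked = max(1, round(mask_fraction * n_genes))
    masked_idx = sorted(rng.sample(range(n_genes), n_masked))
    masked = set(masked_idx)
    corrupted = [0.0 if i in masked else x for i, x in enumerate(expression)]
    return corrupted, masked_idx


def masked_mse(prediction, expression, masked_idx):
    """Reconstruction error computed only on the masked genes."""
    return sum((prediction[i] - expression[i]) ** 2 for i in masked_idx) / len(masked_idx)


# One toy expression profile over six genes (values are illustrative).
profile = [1.0, 0.0, 3.0, 2.0, 5.0, 0.0]
corrupted, masked_idx = random_mask(profile, mask_fraction=0.5, seed=0)
# A model would be trained to predict `profile` from `corrupted`;
# a perfect reconstruction gives zero loss on the masked genes.
perfect_loss = masked_mse(profile, profile, masked_idx)
```

A gene programme mask would differ only in how `masked_idx` is chosen, hiding genes that belong to the same programme rather than a uniform random subset.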
Pub Date : 2025-08-20 DOI: 10.1038/s42256-025-01106-7
Recent years have seen a surge in geospatial artificial intelligence models, with promising applications in ecological and environmental monitoring tasks. Further work should also focus on the sustainable development of such models.
Title: Towards responsible geospatial foundation models. Journal: Nature Machine Intelligence, volume 7, issue 8, page 1189. Open-access PDF: https://www.nature.com/articles/s42256-025-01106-7.pdf
Generative drug design opens avenues for discovering novel compounds within the vast chemical space rather than conventional screening against limited libraries. However, the practical utility of the generated molecules is frequently constrained, as many designs prioritize a narrow range of pharmacological properties and neglect physical reliability, which hinders the success rate of subsequent wet-laboratory evaluations. Here, to address this, we propose ED2Mol, a deep learning-based approach that leverages fundamental electron density information to improve de novo molecular generation and optimization. The extensive evaluations across multiple benchmarks demonstrate that ED2Mol surpasses existing methods in terms of the generation success rate and >97% physical reliability. It also facilitates automated hit optimization that is not fully implemented by other methods using fragment-based strategies. Furthermore, ED2Mol exhibits generalizability to more challenging, unseen allosteric pocket benchmarks, attaining consistent performance. More importantly, ED2Mol has been applied to various real-world essential targets, successfully identifying wet-laboratory-validated bioactive compounds, ranging from FGFR3 orthosteric inhibitors to CDC42 allosteric inhibitors, GCK and GPRC5A allosteric activators. The directly generated binding modes of these compounds are close to predictions through molecular docking and further validated via the X-ray co-crystal structure. All these results highlight ED2Mol’s potential as a useful tool in drug design with enhanced effectiveness, physical reliability and practical applicability. A deep generative model is developed for de novo molecular design and optimization by leveraging electron density. Wet-laboratory assays validated its reliability to generate diverse bioactive molecules—orthosteric and allosteric, inhibitors and activators.
Title: Electron-density-informed effective and reliable de novo molecular design and optimization with ED2Mol. Authors: Mingyu Li, Kun Song, Jixiao He, Mingzhu Zhao, Gengshu You, Jie Zhong, Mengxi Zhao, Arong Li, Yu Chen, Guobin Li, Ying Kong, Jiacheng Wei, Zhaofu Wang, Jiamin Zhou, Hongbing Yang, Shichao Ma, Hailong Zhang, Irakoze Loïca Mélita, Weidong Lin, Yuhang Lu, Zhengtian Yu, Xun Lu, Yujun Zhao, Jian Zhang. Pub Date : 2025-08-20 DOI: 10.1038/s42256-025-01095-7. Journal: Nature Machine Intelligence, volume 7, issue 8, pages 1355-1368.
Pub Date : 2025-08-20 DOI: 10.1038/s42256-025-01089-5
Eugen Ursu, Aygul Minnegalieva, Puneet Rawat, Maria Chernigovskaya, Robi Tacutu, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Supervised machine learning models depend on training datasets containing positive and negative examples: dataset composition directly impacts model performance and bias. Given the importance of machine learning for immunotherapeutic design, we examined how different negative class definitions affect model generalization and rule discovery for antibody–antigen binding. Using synthetic-structure-based binding data, we evaluated models trained with various definitions of negative sets. Our findings reveal that high out-of-distribution performance can be achieved when the negative dataset contains more similar samples to the positive dataset, despite lower in-distribution performance. Furthermore, by leveraging ground-truth information, we show that binding rules associated with positive data change based on the negative data used. Validation on experimental data supported simulation-based observations. This work underscores the role of dataset composition in creating robust, generalizable and biology-aware sequence-based ML models. Negative data composition critically shapes machine learning robustness in sequence-based biological tasks. Training data composition and its implications are investigated on biological rule discoveries.
Title: Training data composition determines machine learning generalization and biological rule discovery. Journal: Nature Machine Intelligence, volume 7, issue 8, pages 1206-1219.