Nature Machine Intelligence: latest articles

LLMs as all-in-one tools to easily generate publication-ready citation diversity reports
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-25 DOI: 10.1038/s42256-025-01101-y
Melissa S. Cantú, Michael R. King
Nature Machine Intelligence, volume 7, issue 9, pages 1371-1372.
Citations: 0
Reusability report: Exploring the transferability of self-supervised learning models from single-cell to spatial transcriptomics
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-21 DOI: 10.1038/s42256-025-01097-5
Chuangyi Han, Senlin Lin, Zhikang Wang, Yan Cui, Qi Zou, Zhiyuan Yuan
Self-supervised learning (SSL) has emerged as a powerful approach for learning meaningful representations from large-scale unlabelled datasets in single-cell genomics. Richter et al. evaluated SSL pretext tasks on modelling single-cell RNA sequencing (scRNA-seq) data, demonstrating the effective use of SSL models. However, the transferability of these pretrained SSL models to the spatial transcriptomics domain remains unexplored. Here we assess the performance of three SSL models (random mask, gene programme mask and Barlow Twins) pretrained on scRNA-seq data with spatial transcriptomics datasets, focusing on cell-type prediction and spatial clustering. Our experiments demonstrate that the SSL model with random mask strategy exhibits the best overall performance among evaluated SSL models. Moreover, the models trained from scratch on spatial transcriptomics data outperform the fine-tuned SSL models on cell-type prediction, highlighting a domain gap between scRNA-seq and spatial transcriptomics data whose underlying causes remain an open question. Through expanded analyses of multiple imputation methods and data degradation scenarios, we demonstrate that gene imputation would degrade SSL model performance on cell-type prediction, an effect that is exacerbated by increasing data sparsity. Finally, integrating zero-shot random mask embeddings into chosen spatial clustering methods significantly enhanced their accuracy. Overall, our findings provide valuable insights into the limitations and potential of transferring SSL models to spatial transcriptomics and offer practical guidance for researchers leveraging pretrained models for spatial transcriptomics data analysis. Self-supervised learning models for single-cell RNA sequencing data exhibit poor transferability to spatial transcriptomics for cell-type prediction, although their learned features may enhance spatial analysis.
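The abstract names a "random mask" pretext task without describing it. A minimal sketch of one common reading follows, under the assumption that a random subset of genes is hidden and the model is scored only on reconstructing those hidden values. The function names, the `mask_fraction` parameter and the masked-position loss are hypothetical illustrations, not the authors' implementation.

```python
import random


def random_mask(expression, mask_fraction=0.5, seed=0):
    """Zero out a random subset of genes (assumed masking scheme).

    Returns the masked copy and the sorted indices that were hidden.
    """
    rng = random.Random(seed)
    n_genes = len(expression)
    n_masked = int(round(mask_fraction * n_genes))
    masked_idx = sorted(rng.sample(range(n_genes), n_masked))
    masked = list(expression)
    for i in masked_idx:
        masked[i] = 0.0
    return masked, masked_idx


def reconstruction_loss(predicted, target, masked_idx):
    """Mean squared error restricted to the masked positions (an assumption)."""
    if not masked_idx:
        return 0.0
    total = sum((predicted[i] - target[i]) ** 2 for i in masked_idx)
    return total / len(masked_idx)


# Toy expression vector for one cell (made-up values, illustration only).
cell = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
masked_cell, hidden = random_mask(cell, mask_fraction=0.5, seed=1)
```

In a pretraining loop, `masked_cell` would be the model input and `cell` the reconstruction target; a gene-programme mask would presumably differ only in how `hidden` is chosen.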
Nature Machine Intelligence, volume 7, issue 9, pages 1414-1428.
Citations: 0
Towards responsible geospatial foundation models
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-20 DOI: 10.1038/s42256-025-01106-7
Recent years have seen a surge in geospatial artificial intelligence models, with promising applications in ecological and environmental monitoring tasks. Further work should also focus on the sustainable development of such models.
Nature Machine Intelligence, volume 7, issue 8, page 1189. PDF: https://www.nature.com/articles/s42256-025-01106-7.pdf
Citations: 0
Electron-density-informed effective and reliable de novo molecular design and optimization with ED2Mol
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-20 DOI: 10.1038/s42256-025-01095-7
Mingyu Li, Kun Song, Jixiao He, Mingzhu Zhao, Gengshu You, Jie Zhong, Mengxi Zhao, Arong Li, Yu Chen, Guobin Li, Ying Kong, Jiacheng Wei, Zhaofu Wang, Jiamin Zhou, Hongbing Yang, Shichao Ma, Hailong Zhang, Irakoze Loïca Mélita, Weidong Lin, Yuhang Lu, Zhengtian Yu, Xun Lu, Yujun Zhao, Jian Zhang
Generative drug design opens avenues for discovering novel compounds within the vast chemical space rather than conventional screening against limited libraries. However, the practical utility of the generated molecules is frequently constrained, as many designs prioritize a narrow range of pharmacological properties and neglect physical reliability, which hinders the success rate of subsequent wet-laboratory evaluations. Here, to address this, we propose ED2Mol, a deep learning-based approach that leverages fundamental electron density information to improve de novo molecular generation and optimization. The extensive evaluations across multiple benchmarks demonstrate that ED2Mol surpasses existing methods in terms of the generation success rate and >97% physical reliability. It also facilitates automated hit optimization that is not fully implemented by other methods using fragment-based strategies. Furthermore, ED2Mol exhibits generalizability to more challenging, unseen allosteric pocket benchmarks, attaining consistent performance. More importantly, ED2Mol has been applied to various real-world essential targets, successfully identifying wet-laboratory-validated bioactive compounds, ranging from FGFR3 orthosteric inhibitors to CDC42 allosteric inhibitors, GCK and GPRC5A allosteric activators. The directly generated binding modes of these compounds are close to predictions through molecular docking and further validated via the X-ray co-crystal structure. All these results highlight ED2Mol’s potential as a useful tool in drug design with enhanced effectiveness, physical reliability and practical applicability. A deep generative model is developed for de novo molecular design and optimization by leveraging electron density. Wet-laboratory assays validated its reliability to generate diverse bioactive molecules—orthosteric and allosteric, inhibitors and activators.
Nature Machine Intelligence, volume 7, issue 8, pages 1355-1368.
Citations: 0
Training data composition determines machine learning generalization and biological rule discovery
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-20 DOI: 10.1038/s42256-025-01089-5
Eugen Ursu, Aygul Minnegalieva, Puneet Rawat, Maria Chernigovskaya, Robi Tacutu, Geir Kjetil Sandve, Philippe A. Robert, Victor Greiff
Supervised machine learning models depend on training datasets containing positive and negative examples: dataset composition directly impacts model performance and bias. Given the importance of machine learning for immunotherapeutic design, we examined how different negative class definitions affect model generalization and rule discovery for antibody–antigen binding. Using synthetic-structure-based binding data, we evaluated models trained with various definitions of negative sets. Our findings reveal that high out-of-distribution performance can be achieved when the negative dataset contains more similar samples to the positive dataset, despite lower in-distribution performance. Furthermore, by leveraging ground-truth information, we show that binding rules associated with positive data change based on the negative data used. Validation on experimental data supported simulation-based observations. This work underscores the role of dataset composition in creating robust, generalizable and biology-aware sequence-based ML models. Negative data composition critically shapes machine learning robustness in sequence-based biological tasks. Training data composition and its implications are investigated on biological rule discoveries.
Nature Machine Intelligence, volume 7, issue 8, pages 1206-1219.
Citations: 0
The importance of negative training data for robust antibody binding prediction
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-20 DOI: 10.1038/s42256-025-01080-0
Wesley Ta, Jonathan M. Stokes
Thoughtfully designed negative training datasets may hold the key to more robust machine learning models. Ursu et al. reveal how negative training data composition shapes antibody prediction models and their generalizability. Sometimes, the best way to get better is to train harder.
Nature Machine Intelligence, volume 7, issue 8, pages 1192-1194.
Citations: 0
A unified pre-trained deep learning framework for cross-task reaction performance prediction and synthesis planning
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-19 DOI: 10.1038/s42256-025-01098-4
Li-Cheng Xu, Miao-Jiong Tang, Junyi An, Fenglei Cao, Yuan Qi
Artificial intelligence has transformed the field of precise organic synthesis. Data-driven methods, including machine learning and deep learning, have shown great promise in predicting reaction performance and synthesis planning. However, the inherent methodological divergence between numerical regression-driven reaction performance prediction and sequence generation-based synthesis planning creates formidable challenges in constructing a unified deep learning architecture. Here we present RXNGraphormer, a framework to jointly address these tasks through a unified pre-training approach. By synergizing graph neural networks for intramolecular pattern recognition with Transformer-based models for intermolecular interaction modelling, and training on 13 million reactions via a carefully designed strategy, RXNGraphormer achieves state-of-the-art performance across eight benchmark datasets for reactivity or selectivity prediction and forward-synthesis or retrosynthesis planning, as well as three external realistic datasets for reactivity and selectivity prediction. Notably, the model generates chemically meaningful embeddings that spontaneously cluster reactions by type without explicit supervision. This work bridges the critical gap between performance prediction and synthesis planning tasks in chemical AI, offering a versatile tool for accurate reaction prediction and synthesis design. Xu et al. present RXNGraphormer, a pre-trained model that learns bond transformation patterns from over 13 million reactions, achieving state-of-the-art accuracy in reaction performance prediction and synthesis planning.
Nature Machine Intelligence, volume 7, issue 9, pages 1561-1571.
Citations: 0
Boosting the predictive power of protein representations with a corpus of text annotations
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-18 DOI: 10.1038/s42256-025-01088-6
Haonan Duan, Marta Skreta, Leonardo Cotta, Ella Miray Rajaonson, Nikita Dhawan, Alán Aspuru-Guzik, Chris J. Maddison
Protein language models are trained to predict amino acid sequences from vast protein databases and learn to represent proteins as feature vectors. These vector representations have enabled impressive applications, from predicting mutation effects to protein folding. One of the reasons offered for the success of these models is that conserved sequence motifs tend to be important for protein fitness. Yet, the relationship between sequence conservation and fitness can be confounded by the evolutionary and environmental context. Should we, therefore, look to other data sources that may contain more direct functional information? In this work, we conduct a comprehensive study examining the effects of training protein models to predict 19 types of text annotation from UniProt. Our results show that fine-tuning protein models on a subset of these annotations enhances the models’ predictive capabilities on a variety of function prediction tasks. In particular, when evaluated on our tasks, our model outperforms the basic local alignment search tool, which none of the pretrained protein models accomplished. Our results suggest that a much wider array of data modalities, such as text annotations, may be tapped to improve protein language models. Although protein language models have enabled major advances, they often rely on indirect signals that may not fully capture functional relevance. Fine-tuning these models on textual annotations is shown to improve their performance on function prediction tasks.
Nature Machine Intelligence, volume 7, issue 9, pages 1403-1413.
Citations: 0
Quantifying artificial intelligence through algorithmic generalization
IF 23.9 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-18 DOI: 10.1038/s42256-025-01092-w
Takuya Ito, Murray Campbell, Lior Horesh, Tim Klinger, Parikshit Ram
The rapid development of artificial intelligence (AI) systems has created an urgent need for their scientific quantification. While their fluency across a variety of domains is impressive, AI systems fall short on tests requiring algorithmic reasoning—a glaring limitation, given the necessity for interpretable and reliable technology. Despite a surge in reasoning benchmarks emerging from the academic community, no theoretical framework exists to quantify algorithmic reasoning in AI systems. Here we adopt a framework from computational complexity theory to quantify algorithmic generalization using algebraic expressions: algebraic circuit complexity. Algebraic circuit complexity theory—the study of algebraic expressions as circuit models—is a natural framework for studying the complexity of algorithmic computation. Algebraic circuit complexity enables the study of generalization by defining benchmarks in terms of the computational requirements for solving a problem. Moreover, algebraic circuits are generic mathematical objects; an arbitrarily large number of samples can be generated for a specified circuit, making it an ideal experimental sandbox for the data-hungry models that are used today. In this Perspective, we adopt tools from algebraic circuit complexity, apply them to formalize a science of algorithmic generalization, and address key challenges for its successful application to AI science. Despite impressive performances of current large AI models, symbolic and abstract reasoning tasks often elicit failure modes in these systems. In this Perspective, Ito et al. propose to make use of computational complexity theory, formulating algebraic problems as computable circuits to address the challenge of mathematical and symbolic reasoning in AI systems.
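The claim that arbitrarily many samples can be generated for a specified circuit can be illustrated with a small sketch. It assumes a circuit is a tree of addition and multiplication gates over variables and integer constants. The class names, `generate_samples`, the input range and the example circuit are hypothetical and are not taken from the authors' benchmark.

```python
import random
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Gate:
    op: str  # "+" or "*"
    left: "Node"
    right: "Node"


Node = Union[Var, Const, Gate]


def evaluate(node, env):
    """Evaluate the circuit bottom-up for one assignment of the variables."""
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Const):
        return node.value
    left, right = evaluate(node.left, env), evaluate(node.right, env)
    if node.op == "+":
        return left + right
    if node.op == "*":
        return left * right
    raise ValueError(f"unknown gate: {node.op}")


def size(node):
    """Number of gates, counting the circuit as a tree."""
    if isinstance(node, Gate):
        return 1 + size(node.left) + size(node.right)
    return 0


def depth(node):
    """Longest chain of gates from the output down to an input."""
    if isinstance(node, Gate):
        return 1 + max(depth(node.left), depth(node.right))
    return 0


def generate_samples(circuit, variables, n, seed=0, low=-5, high=5):
    """Draw n random integer inputs and label each with the circuit output."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        env = {v: rng.randint(low, high) for v in variables}
        samples.append((env, evaluate(circuit, env)))
    return samples


# Example circuit: (x + y) * (x + 2), with 3 gates and depth 2.
x, y = Var("x"), Var("y")
circuit = Gate("*", Gate("+", x, y), Gate("+", x, Const(2)))
dataset = generate_samples(circuit, ["x", "y"], n=1000)
```

Because `size` and `depth` are properties of the circuit rather than of any dataset, a benchmark could train on circuits below some size or depth and test on larger ones, which is one way to read the abstract's remark about defining benchmarks by computational requirements.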
Takuya Ito, Murray Campbell, Lior Horesh, Tim Klinger, Parikshit Ram. "Quantifying artificial intelligence through algorithmic generalization." Nature Machine Intelligence, volume 7, issue 8, pages 1195-1205. Published 2025-08-18. DOI: 10.1038/s42256-025-01092-w
Citations: 0
Informed protein–ligand docking via geodesic guidance in translational, rotational and torsional spaces
IF 23.9 Zone 1 Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2025-08-15 DOI: 10.1038/s42256-025-01091-x
Raúl Miñán, Javier Gallardo, Álvaro Ciudad, Alexis Molina
Molecular docking plays a crucial role in structure-based drug discovery, enabling the prediction of how small molecules interact with protein targets. Traditional docking methods rely on scoring functions and search heuristics, whereas recent generative approaches, such as DiffDock, leverage deep learning for pose prediction. However, blind-diffusion-based docking often struggles with binding site localization and pose accuracy, particularly in complex protein–ligand systems. This work introduces GeoDirDock (GDD), a guided diffusion approach to molecular docking that enhances the accuracy and physical plausibility of ligand docking predictions. GDD guides the denoising process of a diffusion model along geodesic paths within multiple spaces representing translational, rotational and torsional degrees of freedom. Our method leverages expert knowledge to direct the generative modelling process, specifically targeting desired protein–ligand interaction regions. We demonstrate that GDD outperforms existing blind docking methods in terms of root mean squared distance accuracy and physicochemical pose realism. Our results indicate that incorporating domain expertise into the diffusion process leads to more biologically relevant docking predictions. Additionally, we explore the potential of GDD as a template-based modelling tool for lead optimization in drug discovery through angle transfer in maximum common substructure docking, showcasing its capability to accurately predict ligand orientations for chemically similar compounds. Future applications in real-world drug discovery campaigns will naturally continue to refine and extend the utility of prior-informed diffusion docking methods. GeoDirDock is a framework that guides the denoising process of a generative diffusion docking model along geodesic paths within multiple spaces representing translational, rotational and torsional degrees of freedom. This approach enhances the accuracy and physical plausibility of ligand docking predictions.
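To make "geodesic paths" in the torsional space concrete, here is a minimal sketch for a single torsion angle, where the geodesic between two angles is the shorter arc on the circle. It is not the GeoDirDock implementation: the function names `wrap_angle` and `geodesic_step` and the fixed `fraction` parameter are assumptions for illustration, and the translational (Euclidean) and rotational (3D rotation) cases are not shown.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def geodesic_step(current, target, fraction):
    """Move `fraction` of the way from `current` to `target` along the shorter arc.

    Illustrative guidance step only; a real guided diffusion model would combine
    a term like this with its learned denoising update.
    """
    delta = wrap_angle(target - current)  # signed shortest angular difference
    return wrap_angle(current + fraction * delta)


if __name__ == "__main__":
    # 3.0 rad and -3.0 rad lie close together across the +/-pi boundary,
    # so the shortest path crosses that boundary instead of passing through 0.
    angle = 3.0
    for _ in range(3):
        angle = geodesic_step(angle, -3.0, 0.5)
        print(round(abs(wrap_angle(-3.0 - angle)), 4))
```

The wrapped difference is what distinguishes this from a plain linear interpolation of angle values, which would take the long way around through zero in the example.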
Published in Nature Machine Intelligence, volume 7, issue 9, pages 1555-1560. Published 2025-08-15.
Citations: 0