
Journal of Cheminformatics: Latest Articles

SLICE (SMARTS and Logic In ChEmistry): fast generation of molecules using advanced chemical synthesis logic and modern coding style
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-09 DOI: 10.1186/s13321-025-01119-9
Stefi Nouleho Ilemo, Victorien Delannée, Olga Grushin, Philip Judson, Hitesh Patel, Marc C. Nicklaus, Nadya I. Tarasova

While virtual libraries of synthetically accessible compounds have exploded in size to many billions, our capacity to extract valuable drug leads from these vast databases remains limited by computational resources. To overcome this, we developed SLICE (SMARTS and Logic In ChEmistry), a powerful new tool designed for the agile exploration of massive chemical spaces. SLICE enables the fast, “à la carte” generation of virtual compound libraries through chemist-defined reaction chemistries and readily available building blocks. Its user-friendly, no-code graphical interface, the SLICE Designer, allows chemists to easily define SMARTS patterns, configure atom and bond properties, and establish chemical constraints and logic. The resulting XML files are then fed into the SLICE Engine, which generates diverse virtual libraries from specified building blocks at speeds of 0.6–2.5 million compounds per hour. SLICE provides the agility and performance needed to support efficient lead generation within discovery workflows.
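To put the quoted throughput in perspective, the following minimal sketch estimates how long an enumeration might take. It is not part of SLICE: the function names and the building-block counts are hypothetical, and it assumes the full combinatorial product is generated with no constraint filtering.

```python
from math import prod

# Throughput range quoted in the abstract, in compounds per hour.
RATE_LOW = 0.6e6
RATE_HIGH = 2.5e6


def library_size(block_counts):
    """Upper bound on products: one building block chosen per reactant slot."""
    return prod(block_counts)


def enumeration_hours(n_compounds):
    """Return (best_case, worst_case) wall-clock hours at the quoted rates."""
    return n_compounds / RATE_HIGH, n_compounds / RATE_LOW


# Hypothetical two-component reaction: 2,000 amines x 1,500 acids.
size = library_size([2000, 1500])
fast, slow = enumeration_hours(size)
print(f"{size:,} products: {fast:.1f} to {slow:.1f} hours")
```

Under these assumptions a three-million-compound library falls in the range of roughly one to five hours, which is the scale at which on-demand library generation becomes practical.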

Citations: 0
Uncovering molecular determinants of potency and binding affinity in hit compounds targeting FGF14/Nav1.6 complex
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-09 DOI: 10.1186/s13321-025-01122-0
Hamid Teimouri, Zahra Haghighijoo, Timothy J. Baumgartner, Aditya K. Singh, Paul A. Wadsworth, Cun Zhang, Haiying Chen, Jia Zhou, Fernanda Laezza

Identifying molecular mechanisms that regulate neuronal excitability is essential for developing targeted therapies for neuropsychiatric disorders. The protein–protein interaction (PPI) between fibroblast growth factor 14 (FGF14) and the voltage-gated Na+ channel Nav1.6 is critical in regulating neuronal excitability and has emerged as a promising drug target. However, the physicochemical features that drive small-molecule modulation of this interface remain elusive. Here, we apply a descriptor-based chemoinformatics approach to analyze 15 hit compounds identified via high-throughput screening, aiming to elucidate structure–activity relationships influencing their potency and binding affinity. The analysis revealed distinct subsets of physicochemical features strongly associated with either potency or binding affinity values, suggesting that these parameters are governed by largely independent molecular determinants. This independence implies that optimizing a compound for improved affinity need not compromise potency, and vice versa. Together, these findings may guide the rational optimization of first-in-class compounds aimed at controlling neuronal excitability through targeted PPI interface modulation.

Citations: 0
SMARTS-RX: a SMARTS-based representation of chemical functions for reactivity analysis
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-08 DOI: 10.1186/s13321-025-01136-8
Thierry Kogej, Christos Kannas, Samuel Genheden, Eike Caldeweyher, Mikhail Kabeshov

Chemical functional group annotation provides a mechanistically meaningful framework to interpret model outcomes and guide synthetic strategies. Here, we present SMARTS-RX, a curated, hierarchical ontology of 406 SMARTS-based functional group descriptors designed to characterize chemically relevant and reactive functionalities in small molecules. SMARTS-RX achieves a balance between granularity and computational efficiency by focusing on functional groups central to pharmaceutical synthesis and medicinal chemistry. We describe the development of SMARTS-RX, including its systematic nomenclature and SMARTS encoding, which enable precise tracking of chemical environments. The utility of SMARTS-RX for mapping chemical reactivity is demonstrated through analyses of functional group distributions across major reaction types, using large-scale datasets from AstraZeneca’s Electronic Lab Notebooks and Reaxys. Finally, we illustrate how this SMARTS library can be applied to guide building-block selection from commercial catalogues. A public GitHub repository has been created to support the continuous improvement of SMARTS-RX.

Scientific Contribution: SMARTS-RX introduces a curated, hierarchical ontology of 406 SMARTS-based descriptors prioritizing pharmaceutical relevance and mechanistic interpretability. Distinct from prior efforts, SMARTS-RX encodes detailed chemical environments to improve reactivity mapping and feature extraction for both expert analysis and computational modelling. This resource advances functional group annotation by balancing chemical specificity and computational performance, supporting reproducible and scalable cheminformatics research.
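One practical benefit of a hierarchical ontology is that leaf-level matches can be rolled up to broader classes when comparing functional group distributions. The sketch below illustrates the idea only: the dotted names are hypothetical and do not reflect the actual SMARTS-RX nomenclature, and the counts are toy values rather than results from the paper.

```python
from collections import Counter


def roll_up(leaf_counts, sep="."):
    """Aggregate leaf-level functional group counts to every ancestor level."""
    totals = Counter()
    for name, count in leaf_counts.items():
        parts = name.split(sep)
        # Credit the count to each prefix: "amine", "amine.primary", ...
        for depth in range(1, len(parts) + 1):
            totals[sep.join(parts[:depth])] += count
    return totals


# Hypothetical annotations for a small molecule set (illustrative names only).
leaf_counts = {
    "amine.primary.aliphatic": 3,
    "amine.primary.aromatic": 1,
    "amine.secondary.aliphatic": 2,
    "halide.aryl.bromide": 4,
}
totals = roll_up(leaf_counts)
print(totals["amine"], totals["amine.primary"], totals["halide"])
```

Keeping the hierarchy in the descriptor name lets the same annotation serve both coarse analyses (all amines) and fine ones (primary aromatic amines) without re-running the substructure search.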

Citations: 0
SGEDiff: a subgraph-enriched diffusion model for structure-based 3D molecular generation
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-08 DOI: 10.1186/s13321-025-01123-z
Changda Gong, Jiaojiao Fang, Yan Tang, Guixia Liu, Yun Tang, Weihua Li

Structure-based molecular generation is an emerging approach in computer-aided drug discovery, enabling the design of compounds that complement the three-dimensional structure of target proteins. However, most diffusion-based 3D molecular generative models still face several limitations, such as imbalanced protein–ligand representations or reliance on predefined binding pockets. To address these limitations, we propose SGEDiff, a novel subgraph-enriched generative framework for 3D molecule generation. Our model hierarchically fuses subgraph and global graph representations to capture both local binding patterns and key structural features of protein pockets. Furthermore, an integrated pocket prediction module identifies binding regions in unseen proteins, eliminating reliance on predefined pocket coordinates. Experimental results show that SGEDiff outperforms baseline diffusion-based methods in generating high-affinity molecules across diverse targets. Moreover, practical applications in de novo drug design demonstrate improved success rates in generating compounds for novel protein targets, underscoring its potential to advance structure-based drug discovery.

Citations: 0
Multi-MoleScale: a multi-scale approach for molecular property prediction with graph contrastive and sequence learning
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-06 DOI: 10.1186/s13321-025-01126-w
Xinpo Lou, Jianxiu Cai, Shirley W. I. Siu

In recent years, machine learning models have shown substantial progress in predicting molecular properties. However, integrating molecular graph structures with sequence information continues to present a significant challenge. In this paper, we introduce Multi-MoleScale, a novel multi-scale framework designed to address this challenge. By combining Graph Contrastive Learning (GCL) with sequence-based models like BERT, Multi-MoleScale enhances the prediction of molecular properties by capturing both structural and contextual representations of molecules. Specifically, the model leverages GCL to effectively capture the intrinsic graph-based features of molecules while utilizing BERT’s pretraining capabilities to learn the contextual relationships within molecular sequences. The contrastive learning component enables Multi-MoleScale to distinguish between relevant and irrelevant molecular features, thereby enhancing its predictive accuracy across diverse molecular types. To assess the performance of our method, we conducted experiments on several widely used public datasets, including 12 molecular property datasets, the ADMET dataset, and 14 breast cancer cell line datasets. The results show that Multi-MoleScale consistently outperforms existing deep learning and self-supervised learning approaches. Notably, the model does not require handcrafted features, making it highly adaptable and versatile for a variety of molecular discovery tasks. This makes Multi-MoleScale a promising tool for applications in drug discovery, materials science, and other molecular research fields. Our data and code are available at https://github.com/pdssunny/Multi-MoleScale.
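The abstract does not say which contrastive objective Multi-MoleScale uses. As an illustration of how contrastive learning separates relevant from irrelevant features, the sketch below implements InfoNCE, a commonly used loss, for a single anchor embedding. The choice of loss, the temperature, and the toy vectors are all assumptions made here, and the code is not taken from the authors' repository.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor: -log softmax of the positive's similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy embeddings: the positive is a second view of the same molecule.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)
```

The loss is small when the anchor sits closer to its positive than to any negative, so minimizing it pulls matching views together and pushes unrelated molecules apart.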

Citations: 0
Ionization efficiency prediction of electrospray ionization mass spectrometry analytes based on molecular fingerprints and cumulative neutral losses
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-06 DOI: 10.1186/s13321-025-01129-7
Alexandros Nikolopoulos, Denice van Herwerden, Viktoriia Turkina, Anneli Kruve, Melissa Baerenfaenger, Saer Samanipour

Quantification is a challenge for non-targeted analysis (NTA) with liquid chromatography–high resolution mass spectrometry (LC–HRMS), due to the lack of analytical standards. Quantification via structure-based predicted ionization efficiency (IE) has been shown to provide the highest accuracy in estimating concentration. However, achieving confident analyte identification is a challenging task, as multiple candidate structures may be likely. This uncertainty in identification limits the reliability of structure-based IE prediction models, since quantification can be severely compromised in cases of wrongly (tentatively) identified chemicals or lack of candidate structures. Here we investigate the possibility of using cumulative neutral losses from fragmentation spectra (i.e. MS2) to predict the logIE. The first model was based on molecular fingerprints and was applied to structurally identified analytes. PubChem fingerprints performed best, with a root-mean-square error (RMSE) of 0.72 logIE for the test set. The second model was based on the MS2 spectrum, expressed as cumulative neutral losses. This approach is applicable to analytes with unknown structures and showed promising results, with an RMSE of 0.79 logIE for the test set and 0.62 logIE for chromatographic features extracted from LC–HRMS data of tea extracts spiked with pesticides. The prediction models were compiled in a Julia package, which is publicly available on GitHub, and may be used as part of a quantification workflow to estimate concentrations of identified and unidentified compounds in NTA.
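For readers unfamiliar with the metric, the following sketch shows how an RMSE in logIE units is computed and what it implies on a linear scale. It assumes logIE is a base-10 logarithm, which the abstract does not state explicitly, and the predicted and observed values are toy numbers rather than data from the study.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def fold_error(log_rmse):
    """Typical multiplicative error implied by an RMSE on a log10 scale."""
    return 10 ** log_rmse


# Toy logIE values (not from the paper): every prediction is off by 0.5.
toy_rmse = rmse([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(toy_rmse)

# Fold errors for the RMSE values reported in the abstract.
for reported in (0.72, 0.79, 0.62):
    print(reported, round(fold_error(reported), 1))
```

Under the base-10 assumption, an RMSE of 0.72 logIE corresponds to predictions that are typically within a factor of about five of the measured ionization efficiency.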

Scientific contribution: This study expands the possibilities of standard-free quantification for HRMS. It aims to provide reliable IE prediction for known substances by robust fingerprint calculation, and more importantly IE prediction for unknown substances using their MS2 fragmentation pattern. These workflows employ minimal method-specific variables, highlighting the tool's generalizability.

Citations: 0
NOCTIS: open-source toolkit that turns reaction data into actionable graph networks
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-04 DOI: 10.1186/s13321-025-01118-w
Nataliya Lopanitsyna, Marta Pasquini, Marco Stenta

Background

Chemical reactions form densely connected networks, and exploring these networks is essential for designing efficient and sustainable synthetic routes. As reaction data from literature, patents, and high-throughput experimentation continue to grow, so does the need for tools that can navigate and mine these large-scale datasets. Graph-based representations capture the topology of reaction space, yet few open-source tools exist for building and querying such networks. To address this, we developed NOCTIS, an open-source toolkit for constructing and analyzing reaction data as graphs.

Results

NOCTIS is an open-source Python package for building Networks of Organic Chemistry (NOCs) from reaction strings. It supports graph-based analysis, parallel processing of large datasets, and export to common Python formats (e.g., NetworkX, pandas). Built on Neo4j technology, it features a modular, extensible architecture with open-source dependencies. We also provide a companion plugin for exhaustive route enumeration. It traverses graph-encoded reactions to assemble all valid synthetic routes, helping prevent redundant exploration and supporting knowledge reuse in synthesis planning. The underlying algorithm is documented in detail along with its current limitations. Using the MIT USPTO-480k dataset (Adv Neural Inf Process Syst 30, 2017), we demonstrate the plugin’s route mining capabilities, analyze network connectivity, and assess synthetic trees.
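The idea of turning reaction strings into a network and enumerating routes can be sketched in a few lines of plain Python. This is not the NOCTIS API or its documented algorithm: the function names are hypothetical, molecules are compared as raw strings without canonicalization, and any molecule with no producing reaction is assumed to be a starting material.

```python
from collections import defaultdict
from itertools import product


def build_network(reaction_strings):
    """Map each product molecule to the reactant sets that can make it."""
    producers = defaultdict(list)
    for rxn in reaction_strings:
        # Reaction SMILES layout: reactants>agents>products (agents may be empty).
        reactants, _agents, products = rxn.split(">")
        reactant_set = tuple(sorted(reactants.split(".")))
        for prod_mol in products.split("."):
            producers[prod_mol].append(reactant_set)
    return producers


def enumerate_routes(target, producers, seen=frozenset()):
    """Return every route to target as a nested (molecule, sub_routes) tuple."""
    if target in seen:
        return []  # cycle: this branch never reaches starting materials
    if target not in producers:
        return [(target, ())]  # leaf: treated as a starting material
    routes = []
    for reactant_set in producers[target]:
        options = [enumerate_routes(r, producers, seen | {target}) for r in reactant_set]
        # One route per combination of sub-routes, one sub-route per reactant.
        for combo in product(*options):
            routes.append((target, combo))
    return routes


reactions = [
    "CC(=O)O.CCO>>CC(=O)OCC",   # acid + alcohol -> ester
    "CC=O>>CC(=O)O",            # aldehyde -> acid
    "CC(=O)Cl.CCO>>CC(=O)OCC",  # acyl chloride + alcohol -> ester
]
network = build_network(reactions)
routes = enumerate_routes("CC(=O)OCC", network)
print(len(routes))
```

Exhaustive enumeration of this kind grows combinatorially with network size, which is one reason a graph database backend and careful traversal limits matter at the scale of patent-derived datasets.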

Conclusion

Built on LinChemIn (J Chem Inf Model 64(6):1765–1771, 2024), NOCTIS serves as an open and extensible toolkit for network-based reaction analysis and route mining, laying the groundwork for data-driven route design at scale. Future work will extend query capabilities and improve the efficiency of route extraction.

Citations: 0
The consolidation of open-source computer-assisted chemical synthesis data into a comprehensive database
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-04 DOI: 10.1186/s13321-025-01130-0
Haris Hasic, Takashi Ishida

Over the past decade, computer-assisted chemical synthesis has resurfaced as a prominent research subject. Even though the idea of utilizing computers to assist chemical synthesis has existed for nearly as long as computers themselves, the inherent complexity repeatedly exceeded the available resources. However, recent machine learning approaches have exhibited the potential to break this tendency. The performance of such approaches is heavily dependent on data that suffers from limited quantity, quality, visibility, and accessibility, posing significant challenges to potential scientific breakthroughs. This research addresses these issues by consolidating all relevant open-source computer-assisted chemical synthesis data into a comprehensive database, providing a practical overview of the state of data in the process. The computer-assisted chemical synthesis or CaCS database is designed to be a central repository for storing and analyzing data, with the primary objective being easy integration and utilization within existing research projects. It provides the users with a programmatic interface to retrieve the data required for various tasks like predicting the outcomes of chemical synthesis and retrosynthetic analysis or retrosynthesis, estimating the synthesizability of chemical compounds, and planning and optimizing the chemical synthesis routes. The database archives the original data to ensure reusability and traceability in downstream tasks and stores the processed data in a more efficient manner. The advantages and disadvantages are highlighted through a realistic case study of how such a database would be utilized within a computer-assisted chemical synthesis research project today. The code and documentation relevant to the CaCS database are available on GitHub under the MIT license at https://github.com/neo-chem-synth-wave/ncsw-data.

Scientific contribution: The primary scientific contribution of this research is the consolidation of all relevant open-source computer-assisted chemical synthesis data into a comprehensive database. The database archives the original data to ensure reusability and traceability in downstream tasks, efficiently stores the processed data, and provides the users with a programmatic interface to manage and query the stored data. Rather than improving the existing or introducing new data, such a database provides a systematic overview of the existing open data sources and an easily reproducible environment for transparent processing and benchmarking purposes.

Citations: 0
DeepRNA-DTI: a deep learning approach for RNA-compound interaction prediction with binding site interpretability
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-02 DOI: 10.1186/s13321-025-01132-y
Haelee Bae, Hojung Nam

RNA-targeted therapeutics represent a promising frontier for expanding the druggable genome beyond conventional protein targets. However, computational prediction of RNA-compound interactions remains challenging due to limited experimental data and the inherent complexity of RNA structures. Here, we present DeepRNA-DTI, a novel sequence-based deep learning approach for RNA-compound interaction prediction with binding site interpretability. Our model leverages transfer learning from pretrained embeddings, RNA-FM for RNA sequences and Mole-BERT for compounds, and employs a multitask learning framework that simultaneously predicts both presence of interactions and nucleotide-level binding sites. This dual prediction strategy provides mechanistic insights into RNA-compound recognition patterns. Trained on a comprehensive dataset integrating resources from the Protein Data Bank and literature sources, DeepRNA-DTI demonstrates superior performance compared to existing methods. The model shows consistent effectiveness across diverse RNA subtypes, highlighting its robust generalization capabilities. Application to high-throughput virtual screening of over 48 million compounds against oncogenic pre-miR-21 successfully identified known binders and novel chemical scaffolds with RNA-specific physicochemical properties. By combining sequence-based predictions with binding site interpretability, DeepRNA-DTI advances our ability to identify promising RNA-targeting compounds and offers new opportunities for RNA-directed drug discovery. The codes and data are publicly available at https://github.com/GIST-CSBL/DeepRNA-DTI/.
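
The multitask setup described above (one interaction label per RNA-compound pair plus one binding label per nucleotide) can be illustrated with a combined loss. This is only a minimal sketch: the abstract does not state the loss that DeepRNA-DTI uses, so the binary cross-entropy terms, the function names, and the `site_weight` parameter are all assumptions made for illustration.

```python
import math


def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def multitask_loss(p_interact, y_interact, p_sites, y_sites, site_weight=1.0):
    """Hypothetical combined loss: pair-level term plus mean nucleotide-level term.

    p_interact / y_interact: predicted probability and label for the whole pair.
    p_sites / y_sites: per-nucleotide binding probabilities and labels.
    site_weight: assumed balancing factor between the two tasks.
    """
    if not p_sites or len(p_sites) != len(y_sites):
        raise ValueError("site predictions and labels must be non-empty and aligned")
    interaction_term = bce(p_interact, y_interact)
    site_term = sum(bce(p, y) for p, y in zip(p_sites, y_sites)) / len(p_sites)
    return interaction_term + site_weight * site_term


# Usage: a pair predicted to interact, with four nucleotides scored.
loss = multitask_loss(0.9, 1, [0.8, 0.1, 0.7, 0.2], [1, 0, 1, 0])
```

Sharing one sequence encoder across both terms is what lets the nucleotide-level output act as an explanation for the pair-level prediction; the relative weighting of the two terms is a tuning choice that the abstract does not specify.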

Citations: 0
All-atom protein sequence design using discrete diffusion models
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-01 DOI: 10.1186/s13321-025-01121-1
Amelia Villegas-Morcillo, Gijs J. Admiraal, Marcel J. T. Reinders, Jana M. Weber

Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches for protein sequence representation often rely solely on the 20 canonical amino acids, limiting the representation of non-canonical amino acids and residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence of two noise schedules within the diffusion process—uniform (random replacement of tokens) and absorbing (progressive masking)—on generation performance. While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the most effective for both representations. Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.
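
The two noise schedules named above differ only in what a corrupted token becomes. The sketch below shows a single forward corruption step for each, assuming a per-token corruption probability `rate`; the token list, the `MASK` symbol, and the function names are illustrative and are not taken from the authors' code.

```python
import random

MASK = "[MASK]"  # hypothetical absorbing-state token


def corrupt_uniform(tokens, vocab, rate, rng):
    """Uniform noise: with probability `rate`, resample a token uniformly from vocab."""
    return [rng.choice(vocab) if rng.random() < rate else t for t in tokens]


def corrupt_absorbing(tokens, rate, rng):
    """Absorbing noise: with probability `rate`, replace a token by MASK.

    A token that is already MASK stays MASK, so repeated steps mask progressively.
    """
    return [MASK if rng.random() < rate else t for t in tokens]


# Usage on a short, illustrative sequence of SELFIES-style tokens.
rng = random.Random(0)
vocab = ["[C]", "[N]", "[O]", "[=O]"]
tokens = ["[N]", "[C]", "[C]", "[=O]", "[O]"]
noisy_uniform = corrupt_uniform(tokens, vocab, 0.3, rng)
noisy_absorbing = corrupt_absorbing(tokens, 0.3, rng)
```

Under uniform noise a corrupted position still looks like an ordinary token, so the denoiser has to infer which positions were changed. Under absorbing noise the corrupted positions are explicitly marked.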

Citations: 0