
Journal of Cheminformatics: Latest Articles

Summarizing relationships between chemicals, genes, proteins, and diseases in PubChem using analysis of their co-occurrences in patents
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-17 | DOI: 10.1186/s13321-025-01134-w
Leonid Zaslavsky, Tiejun Cheng, Asta Gindulyte, Sunghwan Kim, Paul A. Thiessen, Evan E. Bolton

The knowledge panels in PubChem allow users to quickly identify and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing the co-occurrences of those entities in a collection of text documents. In the present study, the analysis and summarization techniques used to develop the literature knowledge panels in PubChem were extended to patent documents from the Google Patent Research Data (GPRD) set. The annotations of the patent documents in the GPRD set were mapped to NCBI database records to create the patent co-occurrence data. The annotations were not only from the titles and abstracts of patents but also from other parts such as claims and descriptions, greatly improving the coverage of the co-occurrence-based entity relationships in PubChem. Informativeness weights of entities were introduced in the co-occurrence and relevance score computations to account for a significant variation in the number of matched annotations per patent section. This narrows the focus to the co-occurrences that are more relevant to the subject matter of the patent. The resulting co-occurrence data was used to generate the patent knowledge panels, enabling users to identify entities co-mentioned in patents alongside a specific chemical or gene. The patent co-occurrence data can be downloaded interactively or accessed programmatically. Overall, the patent knowledge panels described in this study provide users with quick access to essential biomedical entities associated with a given PubChem record. Users can delve into relevant patent documents related to these entities or download the underlying co-occurrence data for further exploration and analysis.
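
The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea of informativeness-weighted co-occurrence. It assumes a hypothetical weight of 1/n for an entity matched in a section that contains n distinct matched annotations, so that mentions in crowded sections (such as long descriptions) count less than mentions in short ones (such as titles). The function name, the input layout, and the weight are illustrative assumptions, not the PubChem implementation.

```python
from collections import defaultdict
from itertools import combinations


def weighted_cooccurrence(patents):
    """Sum weighted co-occurrence scores over a list of patents.

    patents: list of dicts mapping a section name to the entity IDs matched there.
    Returns a dict {(a, b): score} with a < b in sorted order.
    """
    scores = defaultdict(float)
    for sections in patents:
        # Per-patent weight of each entity: the largest weight over all sections.
        weight = {}
        for entities in sections.values():
            unique = set(entities)
            if not unique:
                continue
            # ASSUMPTION (not from the paper): each of n distinct entities gets 1/n.
            w = 1.0 / len(unique)
            for entity in unique:
                weight[entity] = max(weight.get(entity, 0.0), w)
        # A pair contributes the product of its two entity weights for this patent.
        for a, b in combinations(sorted(weight), 2):
            scores[(a, b)] += weight[a] * weight[b]
    return dict(scores)


# Toy input with made-up annotations, for illustration only.
patents = [
    {"title": ["aspirin", "ptgs2"],
     "description": ["aspirin", "ptgs2", "tp53", "egfr"]},
    {"claims": ["aspirin", "tp53"]},
]
scores = weighted_cooccurrence(patents)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, score)
```

In this toy input, the pair mentioned only in a crowded description section receives the smallest score, which mirrors the stated goal of narrowing the focus to co-occurrences closer to the subject matter of the patent.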

Citations: 0
VN-EGNN: E(3)- and SE(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-15 | DOI: 10.1186/s13321-025-01127-9
Florian Sestak, Lisa Schneckenreiter, Johannes Brandstetter, Sepp Hochreiter, Andreas Mayr, Günter Klambauer

We present VN-EGNN, a novel approach to binding site identification that significantly advances predictive performance. By integrating virtual nodes into E(n)- and SE(n)-equivariant graph neural networks (EGNNs) and extending the message-passing scheme, we address limitations of traditional GNNs in modeling complex geometric entities such as binding pockets and, at the same time, obtain neural representations of binding sites. Our extensive experiments demonstrate that VN-EGNN sets a new state of the art in locating binding site centers on the COACH420, HOLO4K, and PDBbind2020 datasets, with a marked improvement in the DCC/DCA success rates over existing methods. These results underscore the potential of VN-EGNN in drug discovery and protein-ligand interaction studies.
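
The DCC and DCA criteria mentioned above can be illustrated with a short sketch. It follows the definitions commonly used in the binding site prediction literature: DCC is the distance from the predicted site center to the ligand center, and DCA is the distance from the predicted center to the closest ligand atom. The 4 Å success threshold and the function names are assumptions for illustration, since the abstract does not specify them.

```python
import math


def dcc(pred_center, ligand_atoms):
    """Distance from the predicted center to the centroid of the ligand atoms."""
    n = len(ligand_atoms)
    centroid = tuple(sum(atom[i] for atom in ligand_atoms) / n for i in range(3))
    return math.dist(pred_center, centroid)


def dca(pred_center, ligand_atoms):
    """Distance from the predicted center to the closest ligand atom."""
    return min(math.dist(pred_center, atom) for atom in ligand_atoms)


def success_rate(distances, threshold=4.0):
    """Fraction of predictions within the threshold (4 Angstrom is an assumed default)."""
    return sum(d <= threshold for d in distances) / len(distances)


# Toy coordinates in Angstrom, made up for illustration.
ligand = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
near = (2.0, 3.0, 0.0)
far = (9.0, 0.0, 0.0)

dcc_values = [dcc(near, ligand), dcc(far, ligand)]
dca_values = [dca(near, ligand), dca(far, ligand)]
print("DCC success rate:", success_rate(dcc_values))
print("DCA success rate:", success_rate(dca_values))
```

Because DCA measures distance to the nearest atom rather than to the center, it is never larger than the distance to the farthest atom and is typically the more lenient of the two criteria for elongated ligands.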

Citations: 0
Machine learning to predict food effects during drug development: a comprehensive review
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-10 | DOI: 10.1186/s13321-025-01131-z
Alam Shah, Fulin Bi, Jin Yang

Food consumption can alter drug absorption, affecting the efficacy and safety of the administered drug, and predicting food effects (FE) can be complex. Traditional methods, including in vitro and in vivo models, fail to predict the full range of food-drug interactions owing to the biological variability of the gastrointestinal system. This review evaluates the predictive ability and accuracy of machine learning (ML) in predicting FE in comparison to conventional methods. We consider how ML models use food dataset information and help improve the formulation and dosing of drugs. We discuss recent trends in FE prediction, its mechanisms, and effects on drug bioavailability. Supervised, unsupervised, and reinforcement learning are analyzed in the context of absorption, distribution, metabolism, and elimination (ADME) forecasting and drug development. ML is useful in addressing the limitations of traditional methods; however, data quality, model generalizability, and integration into the drug development process remain obstacles that must be overcome. This review explains how other emerging technologies, for example, physiologically based pharmacokinetic (PBPK) modeling, can be combined with ML to enhance its prospects in drug development. We also examine prospects of deep learning, explainable artificial intelligence (AI), and ethical and legal aspects of applying ML in pharmacokinetics, as well as the interdisciplinary approaches required to improve patient care outcomes.


Citations: 0
SLICE (SMARTS and Logic In ChEmistry): fast generation of molecules using advanced chemical synthesis logic and modern coding style
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-09 | DOI: 10.1186/s13321-025-01119-9
Stefi Nouleho Ilemo, Victorien Delannée, Olga Grushin, Philip Judson, Hitesh Patel, Marc C. Nicklaus, Nadya I. Tarasova

While virtual libraries of synthetically accessible compounds have exploded in size to many billions, our capacity to extract valuable drug leads from these vast databases remains limited by computational resources. To overcome this, we developed SLICE (SMARTS and Logic In ChEmistry), a powerful new tool designed for the agile exploration of massive chemical spaces. SLICE enables the fast, “à la carte” generation of virtual compound libraries through chemist-defined reaction chemistries and readily available building blocks. Its user-friendly, no-code graphical interface, the SLICE Designer, allows chemists to easily define SMARTS patterns, configure atom and bond properties, and establish chemical constraints and logic. The resulting XML files are then fed into the SLICE Engine, which generates diverse virtual libraries from specified building blocks at speeds of 0.6–2.5 million compounds per hour. SLICE provides the agility and performance needed to support efficient lead generation within discovery workflows.
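
To put the reported throughput of 0.6–2.5 million compounds per hour in perspective, the following back-of-the-envelope sketch estimates how long a full enumeration might take. The building-block counts are hypothetical, and the calculation assumes that every pair of building blocks yields one product and that the rate stays constant, neither of which the abstract states.

```python
def enumeration_hours(n_products, rate_per_hour):
    """Hours needed to generate n_products at a constant rate."""
    return n_products / rate_per_hour


# Hypothetical two-component reaction: 10,000 x 5,000 building blocks.
n_products = 10_000 * 5_000

slow_hours = enumeration_hours(n_products, 0.6e6)  # lower reported rate
fast_hours = enumeration_hours(n_products, 2.5e6)  # upper reported rate

print(f"{n_products:,} products: {fast_hours:.1f} to {slow_hours:.1f} hours")
```

Under these assumptions, a library of 50 million products would take roughly one to three and a half days, which suggests why generating focused libraries on demand is more practical than exhaustively enumerating billions of compounds.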

Citations: 0
Uncovering molecular determinants of potency and binding affinity in hit compounds targeting FGF14/Nav1.6 complex
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-09 | DOI: 10.1186/s13321-025-01122-0
Hamid Teimouri, Zahra Haghighijoo, Timothy J. Baumgartner, Aditya K. Singh, Paul A. Wadsworth, Cun Zhang, Haiying Chen, Jia Zhou, Fernanda Laezza

Identifying molecular mechanisms that regulate neuronal excitability is essential for developing targeted therapies for neuropsychiatric disorders. The protein–protein interaction (PPI) between fibroblast growth factor 14 (FGF14) and the voltage-gated Na+ channel Nav1.6 is critical in regulating neuronal excitability and has emerged as a promising drug target. However, the physicochemical features that drive small-molecule modulation of this interface remain elusive. Here, we apply a descriptor-based chemoinformatics approach to analyze 15 hit compounds identified via high-throughput screening, aiming to elucidate structure–activity relationships influencing their potency and binding affinity. The analysis revealed distinct subsets of physicochemical features strongly associated with either potency or binding affinity values, suggesting that these parameters are governed by largely independent molecular determinants. This independence implies that optimizing a compound for improved affinity need not compromise potency, and vice versa. Together, these findings may guide the rational optimization of first-in-class compounds aimed at controlling neuronal excitability through targeted PPI interface modulation.


Citations: 0
SMARTS-RX: a SMARTS-based representation of chemical functions for reactivity analysis
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-08 | DOI: 10.1186/s13321-025-01136-8
Thierry Kogej, Christos Kannas, Samuel Genheden, Eike Caldeweyher, Mikhail Kabeshov

Chemical functional group annotation provides a mechanistically meaningful framework to interpret model outcomes and guide synthetic strategies. Here, we present SMARTS-RX, a curated, hierarchical ontology of 406 SMARTS-based functional group descriptors designed to characterize chemically relevant and reactive functionalities in small molecules. SMARTS-RX achieves a balance between granularity and computational efficiency by focusing on functional groups central to pharmaceutical synthesis and medicinal chemistry. We describe the development of SMARTS-RX, including its systematic nomenclature and SMARTS encoding, which enable precise tracking of chemical environments. The utility of SMARTS-RX for mapping chemical reactivity is demonstrated through analyses of functional group distributions across major reaction types, using large-scale datasets from AstraZeneca’s Electronic Lab Notebooks and Reaxys. Finally, we illustrate how this SMARTS library can be applied to guide building-block selection from commercial catalogues. A public GitHub repository has been created to support continuous improvement of SMARTS-RX.

Scientific Contribution: SMARTS-RX introduces a curated, hierarchical ontology of 406 SMARTS-based descriptors prioritizing pharmaceutical relevance and mechanistic interpretability. Distinct from prior efforts, SMARTS-RX encodes detailed chemical environments to improve reactivity mapping and feature extraction for both expert analysis and computational modelling. This resource advances functional group annotation by balancing chemical specificity and computational performance, supporting reproducible and scalable cheminformatics research.

Citations: 0
SGEDiff: a subgraph-enriched diffusion model for structure-based 3D molecular generation
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-08 | DOI: 10.1186/s13321-025-01123-z
Changda Gong, Jiaojiao Fang, Yan Tang, Guixia Liu, Yun Tang, Weihua Li

Structure-based molecular generation is an emerging approach in computer-aided drug discovery, enabling the design of compounds that complement the three-dimensional structure of target proteins. However, most diffusion-based 3D molecular generative models still face several limitations, such as imbalanced protein–ligand representations or reliance on predefined binding pockets. To address these limitations, we propose SGEDiff, a novel subgraph-enriched generative framework for 3D molecule generation. Our model hierarchically fuses subgraph and global graph representations to capture both local binding patterns and key structural features of protein pockets. Furthermore, an integrated pocket prediction module identifies binding regions in unseen proteins, eliminating reliance on predefined pocket coordinates. Experimental results show that SGEDiff outperforms baseline diffusion-based methods in generating high-affinity molecules across diverse targets. Moreover, practical applications in de novo drug design demonstrate improved success rates in generating compounds for novel protein targets, underscoring its potential to advance structure-based drug discovery.

Citations: 0
Multi-MoleScale: a multi-scale approach for molecular property prediction with graph contrastive and sequence learning
IF 5.7 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-06 | DOI: 10.1186/s13321-025-01126-w
Xinpo Lou, Jianxiu Cai, Shirley W. I. Siu

In recent years, machine learning models have shown substantial progress in predicting molecular properties. However, integrating molecular graph structures with sequence information continues to present a significant challenge. In this paper, we introduce Multi-MoleScale, a novel multi-scale framework designed to address this challenge. By combining Graph Contrastive Learning (GCL) with sequence-based models like BERT, Multi-MoleScale enhances the prediction of molecular properties by capturing both structural and contextual representations of molecules. Specifically, the model leverages GCL to effectively capture the intrinsic graph-based features of molecules while utilizing BERT’s pretraining capabilities to learn the contextual relationships within molecular sequences. The contrastive learning component enables Multi-MoleScale to distinguish between relevant and irrelevant molecular features, thereby enhancing its predictive accuracy across diverse molecular types. To assess the performance of our method, we conducted experiments on several widely used public datasets, including 12 molecular property datasets, the ADMET dataset, and 14 breast cancer cell line datasets. The results show that Multi-MoleScale consistently outperforms existing deep learning and self-supervised learning approaches. Notably, the model does not require handcrafted features, making it highly adaptable and versatile for a variety of molecular discovery tasks. This makes Multi-MoleScale a promising tool for applications in drug discovery, materials science, and other molecular research fields. Our data and code are available at https://github.com/pdssunny/Multi-MoleScale.
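
The abstract does not specify which contrastive objective Multi-MoleScale uses. As a rough illustration of how contrastive learning separates matching from non-matching representations, the sketch below implements a generic InfoNCE-style loss in plain Python. The function names, the temperature value, and the toy embeddings are assumptions for illustration and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.5):
    """Mean InfoNCE loss over a batch.

    positives[i] is the matching view of anchors[i]; every other
    positives[j] serves as a negative for anchors[i].
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings: two molecules, each with two views (made up for illustration).
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(view_a, [[1.0, 0.0], [0.0, 1.0]])   # views agree
shuffled = info_nce(view_a, [[0.0, 1.0], [1.0, 0.0]])  # views mismatched
print(aligned, shuffled)
```

The loss is low when each anchor is most similar to its own positive and high when it is more similar to another molecule's view, which is the signal that pushes an encoder to keep relevant features and discard irrelevant ones.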

Citations: 0
Ionization efficiency prediction of electrospray ionization mass spectrometry analytes based on molecular fingerprints and cumulative neutral losses
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-06 DOI: 10.1186/s13321-025-01129-7
Alexandros Nikolopoulos, Denice van Herwerden, Viktoriia Turkina, Anneli Kruve, Melissa Baerenfaenger, Saer Samanipour

Quantification is a challenge for non-targeted analysis (NTA) with liquid chromatography–high resolution mass spectrometry (LC–HRMS), due to the lack of analytical standards. Quantification via structure-based predicted ionization efficiency (IE) has been shown to provide the highest accuracy in estimating concentration. However, achieving confident analyte identification is a challenging task, as multiple candidate structures may be plausible. This uncertainty in identification limits the reliability of structure-based IE prediction models, since quantification can be severely compromised in cases of wrongly (tentatively) identified chemicals or lack of candidate structures. Here we investigate the possibility of using cumulative neutral losses from fragmentation spectra (i.e. MS2) to predict the logIE. The first model was based on molecular fingerprints and was applied to structurally identified analytes. PubChem fingerprints performed the best, with a root-mean-square error (RMSE) of 0.72 logIE for the test set. The second model was based on the MS2 spectrum, expressed as cumulative neutral losses. This approach is applicable to analytes with unknown structures and showed promising results, with an RMSE of 0.79 logIE for the test set and 0.62 logIE for chromatographic features extracted from LC–HRMS data of tea extracts spiked with pesticides. The prediction models were compiled in a Julia package, which is publicly available on GitHub, and may be used as part of a quantification workflow to estimate concentrations of identified and unidentified compounds in NTA.

Scientific contribution: This study expands the possibilities of standard-free quantification for HRMS. It aims to provide reliable IE prediction for known substances by robust fingerprint calculation and, more importantly, IE prediction for unknown substances using their MS2 fragmentation pattern. These workflows employ minimal method-specific variables, highlighting the tool's generalizability.
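The two quantities the abstract relies on can be sketched in a few lines. The abstract does not define "cumulative neutral loss"; the sketch assumes it is the difference between the precursor m/z and each fragment m/z, and it computes RMSE in the usual way. The published package is written in Julia, so this Python snippet, its function names, and its numbers are purely illustrative.

```python
import math


def cumulative_neutral_losses(precursor_mz, fragment_mzs):
    """Assumed definition: precursor m/z minus each fragment m/z.

    Fragments at or above the precursor m/z are skipped, since they
    do not correspond to a loss.
    """
    return sorted(
        round(precursor_mz - mz, 4) for mz in fragment_mzs if mz < precursor_mz
    )


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


# Hypothetical spectrum: precursor at m/z 200 with two fragments below it.
losses = cumulative_neutral_losses(200.0, [182.0, 154.0, 200.0])
print(losses)

# Hypothetical predicted vs. measured logIE values.
error = rmse([2.0, 3.0], [1.0, 3.0])
print(error)
```

Because the losses depend only on the spectrum, a model built on them can be applied to a feature whose structure is unknown, which is the situation the second model in the abstract targets.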

Citations: 0
NOCTIS: open-source toolkit that turns reaction data into actionable graph networks
IF 5.7 Zone 2 Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-04 DOI: 10.1186/s13321-025-01118-w
Nataliya Lopanitsyna, Marta Pasquini, Marco Stenta

Background

Chemical reactions form densely connected networks, and exploring these networks is essential for designing efficient and sustainable synthetic routes. As reaction data from literature, patents, and high-throughput experimentation continue to grow, so does the need for tools that can navigate and mine these large-scale datasets. Graph-based representations capture the topology of reaction space, yet few open-source tools exist for building and querying such networks. To address this, we developed NOCTIS, an open-source toolkit for constructing and analyzing reaction data as graphs.

Results

NOCTIS is an open-source Python package for building Networks of Organic Chemistry (NOCs) from reaction strings. It supports graph-based analysis, parallel processing of large datasets, and export to common Python formats (e.g., NetworkX, pandas). Built on Neo4j technology, it features a modular, extensible architecture with open-source dependencies. We also provide a companion plugin for exhaustive route enumeration. It traverses graph-encoded reactions to assemble all valid synthetic routes, helping prevent redundant exploration and supporting knowledge reuse in synthesis planning. The underlying algorithm is documented in detail along with its current limitations. Using the MIT USPTO-480k dataset (Adv Neural Inf Process Syst 30, 2017), we demonstrate the plugin’s route mining capabilities, analyze network connectivity, and assess synthetic trees.

Conclusion

Built on LinChemIn (J Chem Inf Model 64(6):1765–1771, 2024), NOCTIS serves as an open and extensible toolkit for network-based reaction analysis and route mining, laying the groundwork for data-driven route design at scale. Future work will extend query capabilities and improve the efficiency of route extraction.
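To make the idea of a reaction network and route enumeration concrete, here is a minimal sketch in plain Python. It does not use the NOCTIS API or Neo4j; the function names and the three toy reactions are hypothetical. It assumes reaction SMILES of the form reactants>agents>products, matches molecules by exact string (so real data would need canonical SMILES first), and treats any molecule with no producing reaction as a starting material.

```python
from collections import defaultdict
from itertools import product


def build_network(reaction_smiles):
    """Index reactions by the molecules they produce."""
    producers = defaultdict(list)  # molecule -> indices of reactions making it
    reactants_of = {}              # reaction index -> tuple of reactant strings
    for idx, rxn in enumerate(reaction_smiles):
        left, _agents, right = rxn.split(">")
        reactants_of[idx] = tuple(left.split("."))
        for mol in right.split("."):
            producers[mol].append(idx)
    return producers, reactants_of


def enumerate_routes(target, producers, reactants_of, visited=frozenset()):
    """Return every route to `target` as a frozenset of reaction indices."""
    if target in visited:
        return []              # discard branches that loop back on themselves
    if target not in producers:
        return [frozenset()]   # starting material: nothing left to make
    routes = []
    for idx in producers[target]:
        sub_routes = [
            enumerate_routes(r, producers, reactants_of, visited | {target})
            for r in reactants_of[idx]
        ]
        # Combine one sub-route per reactant into a full route.
        for combo in product(*sub_routes):
            routes.append(frozenset({idx}).union(*combo))
    return routes


# Hypothetical toy data: two ways to make ethyl acetate.
reactions = [
    "CC(=O)Cl.OCC>>CC(=O)OCC",        # 0: acetyl chloride + ethanol
    "CC(=O)O.OCC>>CC(=O)OCC",         # 1: acetic acid + ethanol
    "CC(=O)O.ClS(Cl)=O>>CC(=O)Cl",    # 2: acetic acid + thionyl chloride
]
producers, reactants_of = build_network(reactions)
routes = enumerate_routes("CC(=O)OCC", producers, reactants_of)
print(sorted(sorted(r) for r in routes))
```

Exhaustive enumeration of this kind grows combinatorially with network size, which is consistent with the abstract naming route-extraction efficiency as future work.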

Citations: 0