
Journal of Cheminformatics: Latest Articles

Multi-MoleScale: a multi-scale approach for molecular property prediction with graph contrastive and sequence learning.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-06 DOI: 10.1186/s13321-025-01126-w
Xinpo Lou, Jianxiu Cai, Shirley W I Siu
In recent years, machine learning models have shown substantial progress in predicting molecular properties. However, integrating molecular graph structures with sequence information continues to present a significant challenge. In this paper, we introduce Multi-MoleScale, a novel multi-scale framework designed to address this challenge. By combining Graph Contrastive Learning (GCL) with sequence-based models like BERT, Multi-MoleScale enhances the prediction of molecular properties by capturing both structural and contextual representations of molecules. Specifically, the model leverages GCL to effectively capture the intrinsic graph-based features of molecules while utilizing BERT's pretraining capabilities to learn the contextual relationships within molecular sequences. The contrastive learning component enables Multi-MoleScale to distinguish between relevant and irrelevant molecular features, thereby enhancing its predictive accuracy across diverse molecular types. To assess the performance of our method, we conducted experiments on several widely used public datasets, including 12 molecular property datasets, the ADMET dataset, and 14 breast cancer cell line datasets. The results show that Multi-MoleScale consistently outperforms existing deep learning and self-supervised learning approaches. Notably, the model does not require handcrafted features, making it highly adaptable and versatile for a variety of molecular discovery tasks. This makes Multi-MoleScale a promising tool for applications in drug discovery, materials science, and other molecular research fields. Our data and code are available at https://github.com/pdssunny/Multi-MoleScale.
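The abstract does not state which contrastive objective the model uses. As an illustration only, the sketch below shows an InfoNCE-style loss, a common choice in graph contrastive learning: two embeddings of the same molecule are treated as a positive pair and all other molecules in the batch as negatives. The function names, the temperature value, and the toy embeddings are assumptions and are not taken from the Multi-MoleScale code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(view_a, view_b, temperature=0.5):
    """Mean InfoNCE loss; view_a[i] and view_b[i] embed the same molecule.

    The temperature is a hypothetical default, not a value from the paper.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true partner (index i).
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two molecules with 2-dimensional embeddings.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matching views point in the same direction (`aligned`) and large when each anchor is closer to another molecule's view (`shuffled`), which is the behaviour a contrastive pretraining objective is meant to reward.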
Citations: 0
Ionization efficiency prediction of electrospray ionization mass spectrometry analytes based on molecular fingerprints and cumulative neutral losses.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-06 DOI: 10.1186/s13321-025-01129-7
Alexandros Nikolopoulos, Denice van Herwerden, Viktoriia Turkina, Anneli Kruve, Melissa Baerenfaenger, Saer Samanipour

Quantification is a challenge for non-targeted analysis (NTA) with liquid chromatography-high resolution mass spectrometry (LC-HRMS), due to the lack of analytical standards. Quantification via structure-based predicted ionization efficiency (IE) has been shown to provide the highest accuracy in estimating concentration. However, achieving confident analyte identification is a challenging task, as multiple candidate structures may be likely. This uncertainty in identification limits the reliability of structure-based IE prediction models, since quantification can be severely compromised in cases of wrongly (tentatively) identified chemicals or lack of candidate structures. Here we investigate the possibility of using cumulative neutral losses from fragmentation spectra (i.e. MS2) to predict the logIE. The first model was based on molecular fingerprints and was applied to structurally identified analytes. PubChem fingerprints performed best, with a root-mean-square error (RMSE) of 0.72 logIE for the test set. The second model was based on the MS2 spectrum, expressed as cumulative neutral losses. This approach is applicable to analytes with unknown structures and showed promising results, with an RMSE of 0.79 logIE for the test set and 0.62 logIE for chromatographic features extracted from LC-HRMS data of tea extracts spiked with pesticides. The prediction models were compiled in a Julia package, which is publicly available on GitHub, and may be used as part of a quantification workflow to estimate concentrations of identified and unidentified compounds in NTA. Scientific contribution: This study expands the possibilities of standard-free quantification for HRMS. It aims to provide reliable IE prediction for known substances by robust fingerprint calculation, and more importantly IE prediction for unknown substances using their MS2 fragmentation pattern. These workflows employ minimal method-specific variables, highlighting the tool's generalizability.
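To make the two quantities in this abstract concrete, here is a minimal sketch of how an MS2 spectrum could be turned into a neutral-loss bit vector and how RMSE in logIE units is computed. It assumes that a cumulative neutral loss is the difference between the precursor m/z and each fragment m/z; the bin width, vector length, and all numbers are hypothetical, and the code is written in Python for illustration rather than taken from the authors' Julia package.

```python
import math

def cumulative_neutral_losses(precursor_mz, fragment_mzs):
    """Losses measured from the precursor ion (an assumed definition)."""
    return sorted(round(precursor_mz - mz, 4)
                  for mz in fragment_mzs if mz < precursor_mz)

def to_bit_vector(losses, bin_width=1.0, n_bins=16):
    """Mark each mass bin that contains at least one neutral loss."""
    bits = [0] * n_bins
    for loss in losses:
        index = int(loss // bin_width)
        if 0 <= index < n_bins:
            bits[index] = 1
    return bits

def rmse(predicted, observed):
    """Root-mean-square error between predicted and observed values."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))

# Made-up spectrum: precursor at m/z 200 with three fragments.
losses = cumulative_neutral_losses(200.0, [182.0, 155.0, 120.5])
bits = to_bit_vector(losses, bin_width=10.0, n_bins=10)

# Made-up logIE predictions versus measurements.
error = rmse([2.0, 3.5, 1.0], [2.5, 3.0, 1.0])
```

A bit vector of this kind can be fed to any regression model in place of a structural fingerprint, which is why the approach does not require a confirmed structure.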

Citations: 0
NOCTIS: open-source toolkit that turns reaction data into actionable graph networks.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-04 DOI: 10.1186/s13321-025-01118-w
Nataliya Lopanitsyna, Marta Pasquini, Marco Stenta

Background: Chemical reactions form densely connected networks, and exploring these networks is essential for designing efficient and sustainable synthetic routes. As reaction data from literature, patents, and high-throughput experimentation continue to grow, so does the need for tools that can navigate and mine these large-scale datasets. Graph-based representations capture the topology of reaction space, yet few open-source tools exist for building and querying such networks. To address this, we developed NOCTIS, an open-source toolkit for constructing and analyzing reaction data as graphs.

Results: NOCTIS is an open-source Python package for building Networks of Organic Chemistry (NOCs) from reaction strings. It supports graph-based analysis, parallel processing of large datasets, and export to common Python formats (e.g., NetworkX, pandas). Built on Neo4j technology, it features a modular, extensible architecture with open-source dependencies. We also provide a companion plugin for exhaustive route enumeration. It traverses graph-encoded reactions to assemble all valid synthetic routes, helping prevent redundant exploration and supporting knowledge reuse in synthesis planning. The underlying algorithm is documented in detail along with its current limitations. Using the MIT USPTO-480k dataset (Adv Neural Inf Process Syst 30, 2017), we demonstrate the plugin's route mining capabilities, analyze network connectivity, and assess synthetic trees.

Conclusion: Built on LinChemIn (J Chem Inf Model 64(6):1765-1771, 2024), NOCTIS serves as an open and extensible toolkit for network-based reaction analysis and route mining, laying the groundwork for data-driven route design at scale. Future work will extend query capabilities and improve the efficiency of route extraction.
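The idea of building a reaction network from reaction strings and enumerating routes through it can be sketched with the Python standard library alone. This is not the NOCTIS API and not its documented route-enumeration algorithm (NOCTIS is built on Neo4j); the function names, the recursive strategy, and the three toy reactions are assumptions chosen to keep the example self-contained.

```python
from collections import defaultdict

def build_network(reaction_smiles):
    """Map each product to the reactant sets of the reactions that make it."""
    producers = defaultdict(list)
    for rxn in reaction_smiles:
        # Reaction SMILES have the form reactants>agents>products.
        reactants, _agents, products = rxn.split(">")
        reactant_set = tuple(sorted(reactants.split(".")))
        for product in products.split("."):
            producers[product].append(reactant_set)
    return producers

def enumerate_routes(target, producers, seen=frozenset()):
    """Return every route to target as a list of (reactants, product) steps."""
    if target in seen:
        return []        # cycle: this branch cannot yield a valid route
    if target not in producers:
        return [[]]      # no known reaction: treat as a starting material
    routes = []
    for reactants in producers[target]:
        partial = [[(reactants, target)]]
        for reactant in reactants:
            sub_routes = enumerate_routes(reactant, producers, seen | {target})
            # Combine every partial route with every way to make this reactant.
            partial = [p + s for p in partial for s in sub_routes]
        routes.extend(partial)
    return routes

# Toy data: one esterification and two alternative ways to make ethanol.
reactions = ["CCO.CC(=O)O>>CC(=O)OCC", "C=C.O>>CCO", "CC=O>>CCO"]
network = build_network(reactions)
routes = enumerate_routes("CC(=O)OCC", network)
```

Exhaustive enumeration like this grows combinatorially with network size, which is consistent with the abstract listing route-extraction efficiency as future work.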

Citations: 0
The consolidation of open-source computer-assisted chemical synthesis data into a comprehensive database.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-04 DOI: 10.1186/s13321-025-01130-0
Haris Hasic, Takashi Ishida
Over the past decade, computer-assisted chemical synthesis has resurfaced as a prominent research subject. Even though the idea of utilizing computers to assist chemical synthesis has existed for nearly as long as computers themselves, the inherent complexity repeatedly exceeded the available resources. However, recent machine learning approaches have exhibited the potential to break this tendency. The performance of such approaches is heavily dependent on data that suffers from limited quantity, quality, visibility, and accessibility, posing significant challenges to potential scientific breakthroughs. This research addresses these issues by consolidating all relevant open-source computer-assisted chemical synthesis data into a comprehensive database, providing a practical overview of the state of data in the process. The computer-assisted chemical synthesis or CaCS database is designed to be a central repository for storing and analyzing data, with the primary objective being easy integration and utilization within existing research projects. It provides the users with a programmatic interface to retrieve the data required for various tasks like predicting the outcomes of chemical synthesis and retrosynthetic analysis or retrosynthesis, estimating the synthesizability of chemical compounds, and planning and optimizing the chemical synthesis routes. The database archives the original data to ensure reusability and traceability in downstream tasks and stores the processed data in a more efficient manner. The advantages and disadvantages are highlighted through a realistic case study of how such a database would be utilized within a computer-assisted chemical synthesis research project today. 
The code and documentation relevant to the CaCS database are available on GitHub under the MIT license at https://github.com/neo-chem-synth-wave/ncsw-data. Scientific contribution: The primary scientific contribution of this research is the consolidation of all relevant open-source computer-assisted chemical synthesis data into a comprehensive database. The database archives the original data to ensure reusability and traceability in downstream tasks, efficiently stores the processed data, and provides the users with a programmatic interface to manage and query the stored data. Rather than improving the existing or introducing new data, such a database provides a systematic overview of the existing open data sources and an easily reproducible environment for transparent processing and benchmarking purposes.
Citations: 0
DeepRNA-DTI: a deep learning approach for RNA-compound interaction prediction with binding site interpretability.
IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-02 DOI: 10.1186/s13321-025-01132-y
Haelee Bae, Hojung Nam
RNA-targeted therapeutics represent a promising frontier for expanding the druggable genome beyond conventional protein targets. However, computational prediction of RNA-compound interactions remains challenging due to limited experimental data and the inherent complexity of RNA structures. Here, we present DeepRNA-DTI, a novel sequence-based deep learning approach for RNA-compound interaction prediction with binding site interpretability. Our model leverages transfer learning from pretrained embeddings, RNA-FM for RNA sequences and Mole-BERT for compounds, and employs a multitask learning framework that simultaneously predicts both presence of interactions and nucleotide-level binding sites. This dual prediction strategy provides mechanistic insights into RNA-compound recognition patterns. Trained on a comprehensive dataset integrating resources from the Protein Data Bank and literature sources, DeepRNA-DTI demonstrates superior performance compared to existing methods. The model shows consistent effectiveness across diverse RNA subtypes, highlighting its robust generalization capabilities. Application to high-throughput virtual screening of over 48 million compounds against oncogenic pre-miR-21 successfully identified known binders and novel chemical scaffolds with RNA-specific physicochemical properties. By combining sequence-based predictions with binding site interpretability, DeepRNA-DTI advances our ability to identify promising RNA-targeting compounds and offers new opportunities for RNA-directed drug discovery. The codes and data are publicly available at https://github.com/GIST-CSBL/DeepRNA-DTI/.
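The multitask setup described here, one interaction-level prediction plus one prediction per nucleotide, is often trained with a weighted sum of two binary cross-entropy terms. The sketch below illustrates that general pattern under assumptions: the abstract does not give the actual loss, so the weighting scheme, the `site_weight` parameter, and the toy probabilities are hypothetical.

```python
import math

def bce(probability, label, eps=1e-7):
    """Binary cross-entropy for one prediction, clipped for stability."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def multitask_loss(interaction_prob, interaction_label,
                   site_probs, site_labels, site_weight=1.0):
    """Interaction loss plus weighted mean per-nucleotide binding-site loss."""
    interaction_term = bce(interaction_prob, interaction_label)
    site_term = sum(bce(p, y) for p, y in zip(site_probs, site_labels))
    site_term /= len(site_labels)
    return interaction_term + site_weight * site_term

# One RNA-compound pair with a three-nucleotide RNA (made-up numbers).
good = multitask_loss(0.9, 1, [0.1, 0.8, 0.9], [0, 1, 1])
bad = multitask_loss(0.2, 1, [0.7, 0.3, 0.2], [0, 1, 1])
interaction_only = multitask_loss(0.5, 1, [0.5], [1], site_weight=0.0)
```

Sharing one encoder across both terms is what lets the binding-site output act as an interpretation of the interaction score rather than a separate model.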
Citations: 0
All-atom protein sequence design using discrete diffusion models.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-01 DOI: 10.1186/s13321-025-01121-1
Amelia Villegas-Morcillo, Gijs J Admiraal, Marcel J T Reinders, Jana M Weber

Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches for protein sequence representation often rely solely on the 20 canonical amino acids, limiting the representation of non-canonical amino acids and residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence of two noise schedules within the diffusion process, uniform (random replacement of tokens) and absorbing (progressive masking), on generation performance. While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the most effective for both representations. Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.
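The difference between the two noise schedules can be shown with a toy forward (corruption) process on a token sequence. This is a simplified sketch, not the authors' implementation: it uses a constant per-step corruption probability `beta` instead of a time-dependent schedule, and the token list is only an illustrative SELFIES-style sequence.

```python
import random

MASK = "[MASK]"

def uniform_step(tokens, vocab, beta, rng):
    """With probability beta, resample a token uniformly from the vocabulary."""
    return [rng.choice(vocab) if rng.random() < beta else t for t in tokens]

def absorbing_step(tokens, beta, rng):
    """With probability beta, mask a token; masked tokens stay masked."""
    return [MASK if t == MASK or rng.random() < beta else t for t in tokens]

def corrupt(tokens, steps, step_fn):
    """Apply the same corruption step repeatedly (constant-beta simplification)."""
    for _ in range(steps):
        tokens = step_fn(tokens)
    return tokens

rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = ["[N]", "[C]", "[C]", "[=Branch1]", "[C]", "[=O]", "[O]"]
vocab = sorted(set(tokens))

noisy_uniform = corrupt(tokens, 10, lambda seq: uniform_step(seq, vocab, 0.1, rng))
noisy_absorbing = corrupt(tokens, 10, lambda seq: absorbing_step(seq, 0.1, rng))
```

Under the absorbing process every surviving token is still correct and the corrupted positions are explicitly marked, whereas under the uniform process a corrupted token is indistinguishable from a real one. That asymmetry is one plausible reason a denoiser might find the absorbing schedule easier, though the abstract itself reports only the empirical result.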

Citations: 0
Novel molecule design with POWGAN, a policy-optimized Wasserstein generative adversarial network.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-01 DOI: 10.1186/s13321-025-01114-0
Bruno Macedo, Inês Ribeiro Vaz, Tiago Taveira Gomes
Generative artificial intelligence has the potential to open vast new chemical search spaces, yet existing reinforcement-guided generative adversarial networks (GANs) struggle to produce non-fragmented, property-oriented molecules at scale without compromising other properties. To overcome these limitations, we present Policy-Optimised Wasserstein GAN (POWGAN), a graph-based generator that incorporates a dynamically scaled reward into adversarial training. The scaling factor increases when progress stalls, keeping gradients informative while steadily steering the generator towards user-defined objectives. When POWGAN replaces the loss function in a previous MedGAN architecture, with graph connectivity (non-fragmentation) as the target property, the fraction of fully connected quinoline-like molecules reaches 1.00, compared with 0.62 previously, while novelty (0.93) and uniqueness (0.95) are maintained. The resulting model, R-MedGAN, produces more than 12,000 novel quinoline-like molecules, a significant increase over its predecessor under identical experimental conditions. Chemical space visualizations demonstrate that these molecules populate regions not present in the training dataset or MedGAN, confirming genuine scaffold innovation. With an architecture capable of orienting the generative process towards a reward, our study also shows that this strategy can progress towards drug-likeness properties. The proportion of molecules with a Synthetic Accessibility Score (SAS, measured by the Ertl algorithm) between 1 and 6 increased from 8% to 65%, and the proportion with lipophilicity (LogP) between 1.35 and 1.80 increased from 17% to 45%, compared to baseline. Our study shows that the R-MedGAN architecture, incorporating the POWGAN loss, also generalizes to models trained on molecular scaffolds other than the quinoline originally tested in MedGAN (R-MedGAN-QNL). For indole (R-MedGAN-IND) and imidazole (R-MedGAN-IMZ) datasets, connectivity increased from 0.38 and 0.50 up to 1.00 during training.
This study provides evidence that an adaptive reward-scaling policy in a Wasserstein GAN can simultaneously guide generative training towards a reward by enhancing molecular connectivity, expand generative throughput, preserve diversity, and improve drug-likeness properties. By eliminating the trade-off between property optimisation and sample diversity, POWGAN and its R-MedGAN implementation advance the state of the art in molecule-generating GANs and deploy a robust, scalable platform for high-throughput, goal-directed chemical exploration in early-stage drug discovery. These findings underscore the effectiveness of adaptive reinforcement-driven strategies in reward-oriented generative adversarial networks for molecular discovery.
SCIENTIFIC CONTRIBUTION: In this work we introduce POWGAN, a policy-optimized Wasserstein GAN that uses adaptive reward scaling to improve goal-directed molecule generation. Integrated into MedGAN (R-MedGAN), it increases the number of valid, connected, and novel molecules under identical settings while maintaining diversity and drug-likeness. This shows that adaptive reward strategies can jointly enhance molecular topology and property optimization at scale.
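The abstract states only that the reward scaling factor increases when progress stalls. One way such a rule could look is sketched below. Every name and constant here (`update_scale`, `patience`, `tol`, `growth`, `max_scale`, and the form of `scaled_loss`) is a hypothetical choice for illustration, not the POWGAN formulation.

```python
def scaled_loss(wasserstein_loss, reward, scale):
    """Hypothetical generator loss: adversarial term minus a scaled reward (reward assumed in [0, 1])."""
    return wasserstein_loss - scale * reward


def update_scale(scale, reward_history, patience=3, tol=1e-3, growth=1.5, max_scale=100.0):
    """Grow `scale` when the best reward of the last `patience` epochs
    does not beat the best earlier reward by at least `tol`."""
    if len(reward_history) <= patience:
        return scale  # not enough history to judge progress
    recent_best = max(reward_history[-patience:])
    earlier_best = max(reward_history[:-patience])
    if recent_best - earlier_best < tol:
        return min(scale * growth, max_scale)  # stalled: push harder on the reward
    return scale  # still improving: leave the scale unchanged


if __name__ == "__main__":
    print(update_scale(1.0, [0.1, 0.2, 0.3, 0.4]))  # improving, scale stays at 1.0
    print(update_scale(1.0, [0.5, 0.5, 0.5, 0.5]))  # stalled, scale grows to 1.5
```

Capping the factor at `max_scale` is a design choice in this sketch to keep the reward term from overwhelming the adversarial term; the paper may handle this differently.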
Citations: 0
How to build machine learning models able to extrapolate from standard to modified peptides.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-27 DOI: 10.1186/s13321-025-01115-z
Raúl Fernández-Díaz, Rodrigo Ochoa, Thanh Lam Hoang, Vanessa Lopez, Denis C Shields

Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) is more abundant than for modified ones. Thus, we set out to identify whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, choice of similarity function for guiding dataset partitioning and choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.
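Similarity-based partitioning, as described above, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' method: the k-mer Jaccard similarity is a stand-in for whichever similarity function is chosen, and `similarity_split`, `threshold`, and `test_fraction` are hypothetical names. Items are linked whenever their similarity reaches the threshold, and whole linked clusters are assigned to one subset.

```python
from itertools import combinations


def kmers(seq, k=2):
    """Set of overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k=2):
    """Jaccard similarity between the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def similarity_split(items, threshold=0.5, test_fraction=0.2, sim=jaccard):
    """Split items so that no test item has similarity >= threshold to any train item."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Link every pair at or above the threshold (single-linkage clusters).
    for i, j in combinations(range(len(items)), 2):
        if sim(items[i], items[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(items[i])

    # Fill the test set with the smallest clusters first, never splitting a cluster.
    train, test = [], []
    target = test_fraction * len(items)
    for members in sorted(clusters.values(), key=len):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test


if __name__ == "__main__":
    peptides = ["ACDEFG", "ACDEFH", "ACDEFK", "KLMNPQ", "KLMNPR", "WYWYWY"]
    print(similarity_split(peptides))
```

Because clusters are never split, the test set can end up smaller than `test_fraction` suggests. That is the price of guaranteeing that every test item is dissimilar to the training data.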

Citations: 0
Nipah Virus Inhibitor Knowledgebase (NVIK): a combined evidence approach to prioritise small molecule inhibitors
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-24 DOI: 10.1186/s13321-025-01049-6
Bhupender Singh, Nishi Kumari, Ayush Upadhyay, Bhavini Pahuja, Eugenia Covernton, Kishan Kalia, Kanika Tuteja, Priyanka Rani Paul, Rakesh Kumar, Mayur Sudhakar Zarkar, Anshu Bhardwaj

Nipah Virus (NiV) came into the limelight due to an outbreak in Kerala, India. NiV infection can cause severe respiratory and neurological problems, with a fatality rate of 40–70%. It is a public health concern and has the potential to become a global pandemic. Lack of treatment has restricted containment methods to isolation and surveillance. WHO’s ‘R&D Blueprint list of priority diseases’ (2018) indicates that there is an urgent need for accelerated research & development to address NiV. In the quest for druglike NiV inhibitors (NVIs), a thorough literature search followed by systematic data curation was conducted. Rigorous data analysis was carried out on the curated NVIs to prioritise compounds. Our efforts led to the creation of the Nipah Virus Inhibitor Knowledgebase (NVIK), a well-curated structured knowledgebase of 220 NVIs with 142 unique small molecule inhibitors. The reported IC50/EC50 values for some of these inhibitors are in the nanomolar range, as low as 0.47 nM. Of the 142 unique small-molecule inhibitors, 124 (87.32%) cleared the PAINS filter. The clustering analysis identified more than 90% of the NVIs as singletons, signifying their diverse structural features. This diverse chemical space can be utilized in numerous ways to develop druglike anti-Nipah molecules. Further, we prioritised the top 10 NVIs based on robustness of assays, physicochemical properties, and toxicity profiles. All NVI-related information, including structures, physicochemical properties, similarity analysis with FDA-approved drugs and other chemical libraries, and predicted ADMET profiles, is freely accessible at https://datascience.imtech.res.in/anshu/nipah/. The NVIK has a provision for submitting new inhibitors as and when they are reported by the community, to further enhance the NVI landscape.

Scientific contribution

The NVIK is a dedicated resource for NiV drug discovery containing manually curated NVIs. The NVIs are structurally mapped with known chemical space to identify their structural diversity and recommend strategies for chemical library expansion. Also, in NVIK a combined evidence-based strategy is used to prioritise these inhibitors.

Citations: 0
Beyond performance: how design choices shape chemical language models
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-18 DOI: 10.1186/s13321-025-01099-w
Inken Fender, Jannik Adrian Gut, Thomas Lemmin

Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We evaluated models through fine-tuning on downstream tasks and probing the structure of their latent spaces using probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, substantial differences were observed in the structure and interpretability of internal representations, highlighting that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. These results provide guidance for the development of more chemically grounded and interpretable CLMs.
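To make the tokenization choice concrete, the sketch below splits a SMILES string into atom-level tokens with a regular expression. The pattern is a simplified assumption covering common organic-subset SMILES, not the tokenizer used in the study, and `tokenize_atomwise` is a hypothetical name.

```python
import re

# Simplified atom-level pattern (an assumption; it does not cover every SMILES feature).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"           # bracket atoms, e.g. [NH3+], [C@@H]
    r"|Br|Cl"               # two-letter halogens, tried before single letters
    r"|[BCNOPSFI]"          # one-letter aliphatic atoms
    r"|[bcnops]"            # aromatic atoms
    r"|%\d{2}"              # two-digit ring closures
    r"|\d"                  # single-digit ring closures
    r"|[()=#\-+\\/:.~@*$]"  # bonds, branches, and other symbols
)


def tokenize_atomwise(smiles):
    """Split a SMILES string into atom-level tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize_atomwise("ClCCBr"))            # atom-level tokens
    print(list("ClCCBr"))                         # character-level split, for contrast
    print(tokenize_atomwise("C[C@@H](N)C(=O)O"))  # bracket atom kept as one token
```

Unlike a character-level split, this keeps `Cl`, `Br`, and bracket atoms such as `[C@@H]` as single tokens, so each token maps to one chemically meaningful unit.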

Scientific contribution

This study provides a systematic evaluation of how core design choices shape chemical language models. Although different configurations generally yield similar downstream performance, they produce substantial differences in the structure and interpretability of internal representations. For standard prediction tasks, a RoBERTa-based model with atomwise-tokenized SMILES input offers a practical and reliable setup. By clarifying the effects of molecular representation and tokenization strategy, our findings offer actionable guidance for developing more interpretable and chemically informed CLMs.
Citations: 0