Journal of Cheminformatics: Latest Articles

xBitterT5: an explainable transformer-based framework with multimodal inputs for identifying bitter-taste peptides
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-20 DOI: 10.1186/s13321-025-01078-1
Nguyen Doan Hieu Nguyen, Nhat Truong Pham, Duong Thanh Tran, Leyi Wei, Adeel Malik, Balachandran Manavalan

Bitter peptides (BPs), derived from the hydrolysis of proteins in food, play a crucial role in both food science and biomedicine by influencing taste perception and participating in various physiological processes. Accurate identification of BPs is essential for understanding food quality and potential health impacts. Traditional machine learning approaches for BP identification have relied on conventional feature descriptors, achieving moderate success but struggling with the complexities of biological sequence data. Recent advances utilizing protein language model embeddings and meta-learning approaches have improved accuracy, but frequently neglect the molecular representations of peptides and lack interpretability. In this study, we propose xBitterT5, a novel multimodal and interpretable framework for BP identification that integrates pretrained transformer-based embeddings from BioT5+ with the combination of a peptide sequence and its SELFIES molecular representation. By incorporating both peptide sequences and their molecular strings, xBitterT5 demonstrates superior performance compared to previous methods on the same benchmark datasets. Importantly, the model provides residue-level interpretability, highlighting chemically meaningful substructures that contribute significantly to a peptide's bitterness, thus offering mechanistic insights beyond black-box predictions. A user-friendly web server (https://balalab-skku.org/xBitterT5/) and a standalone version (https://github.com/cbbl-skku-org/xBitterT5/) are freely available to support both computational biologists and experimental researchers in peptide-based food and biomedicine.

Citations: 0
ReactionT5: a pre-trained transformer model for accurate chemical reaction prediction with limited data
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-19 DOI: 10.1186/s13321-025-01075-4
Tatsuya Sagawa, Ryosuke Kojima

Accurate chemical reaction prediction is critical for reducing both cost and time in drug development. This study introduces ReactionT5, a transformer-based chemical reaction foundation model pre-trained on the Open Reaction Database—a large publicly available reaction dataset. In benchmarks for product prediction, retrosynthesis, and yield prediction, ReactionT5 outperformed existing models. Specifically, ReactionT5 achieved 97.5% accuracy in product prediction, 71.0% in retrosynthesis, and a coefficient of determination of 0.947 in yield prediction. Remarkably, ReactionT5, when fine-tuned with only a limited dataset of reactions, achieved performance on par with models fine-tuned on the complete dataset. Additionally, the visualization of ReactionT5 embeddings illustrates that the model successfully captures and represents the chemical reaction space, indicating effective learning of reaction properties.
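The yield-prediction result above is reported as a coefficient of determination. As a reminder of what that number measures, here is a minimal sketch of the standard formula R² = 1 − SS_res / SS_tot. The function name and the yield values are invented for illustration and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical reaction yields in percent (illustration only).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
score = r_squared(measured, predicted)
```

A value near 1 means the predictions explain almost all of the variance in the measured yields; a value of 0 means they do no better than predicting the mean.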

Citations: 0
Improving drug-induced liver injury prediction using graph neural networks with augmented graph features from molecular optimisation
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-18 DOI: 10.1186/s13321-025-01068-3
Taeyeub Lee, Joram M. Posma

Purpose

Drug-induced liver injury (DILI) is a significant concern in drug development, often leading to the discontinuation of clinical trials and the withdrawal of drugs from the market. This study explores the application of graph neural networks (GNNs) for DILI prediction, using molecular graph representations as the primary input.

Methods

We evaluated several GNN architectures, including Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), Graph Sample and Aggregation (GraphSAGE), and Graph Isomorphism Networks (GINs), using the latest FDA DILI dataset and other molecular property prediction datasets. We introduce a novel approach that creates a custom graph dataset, driven by molecular optimisation, that incorporates detailed and realistic chemical features such as bond lengths and partial charges as input into the GNN models. We have named our model approach DILIGeNN.

Results

DILIGeNN achieved an AUC of 0.897 on the DILI dataset, surpassing the current state-of-the-art model in the DILI prediction task. Furthermore, DILIGeNN outperformed the state-of-the-art in other graph-based molecular prediction tasks, achieving an AUC of 0.918 on the Clintox dataset, 0.993 on the BBBP dataset, and 0.953 on the BACE dataset, indicating strong generalisation and performance across different datasets.

Conclusion

DILIGeNN, utilising a single graph representation as input, outperforms the state-of-the-art methods in DILI prediction that incorporate both molecular fingerprint and graph-structured data. These findings highlight the effectiveness of our molecular graph generation and the GNN training approach as a powerful tool for early-stage drug development and drug repurposing pipelines.

Scientific Contribution: DILIGeNN is a GNN framework that extracts graph features from 3D optimised molecular structures, as is done in target-based drug discovery and molecular docking simulation. Our method is the first to encode spatial and electrostatic information into a single graph representation, as opposed to other work that requires multiple graphs or additional chemical descriptors for feature representation. Our approach, using warm starts following repeated early stopping during training, outperforms the current state-of-the-art methods in liver toxicity (DILI), permeability (BBBP) and activity (BACE) prediction tasks.
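The results above are stated as AUC values. Assuming this means the area under the ROC curve (the abstract does not spell it out), the metric equals the probability that a randomly chosen positive compound is scored higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels; scores: predicted scores.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toxicity labels (1 = toxic) and model scores.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(example_labels, example_scores)
```

The pairwise form is quadratic in the number of examples, which is fine for a worked illustration; rank-based implementations give the same value more efficiently on large datasets.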

Citations: 0
Cycle-configuration descriptors: a novel graph-theoretic approach to enhancing molecular inference
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-18 DOI: 10.1186/s13321-025-01042-z
Bowen Song, Jianshen Zhu, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

Inference of molecules with desired activities/properties is one of the key and challenging issues in cheminformatics and bioinformatics. For that purpose, our research group has recently developed a state-of-the-art framework, mol-infer, for molecular inference. This framework first constructs a prediction function for a fixed property using machine learning models, which is then simulated by mixed-integer linear programming to infer desired molecules. The accuracy of the framework heavily relies on the representation power of the descriptors. In this study, we highlight a typical class of non-isomorphic chemical graphs with reasonably different property values that cannot be distinguished by the standard "two-layered (2L) model" of mol-infer. To address this distinguishability problem of the 2L model, we propose a novel family of descriptors, named cycle-configuration (CC), which captures the notion of ortho/meta/para patterns that appear in aromatic rings, something that was impossible in the framework so far. Extensive computational experiments show that with the new descriptors, we can construct prediction functions with similar or better performance, compared with our previous studies, for all 44 tested chemical properties, comprising 27 regression datasets and 17 classification datasets, confirming the effectiveness of the CC descriptors. For inference, we also provide a system of linear constraints that formulates the CC descriptors. We demonstrate that a chemical graph with up to 50 non-hydrogen vertices can be inferred within a practical time frame.
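To make the ortho/meta/para notion concrete: for two substituents on a six-membered ring, the pattern is determined by the shortest distance between their positions around the cycle (1, 2, or 3 bonds). The sketch below shows only that textbook relationship. It is not the CC descriptor defined in the paper, and the function name and 0-5 position numbering are assumptions made for illustration.

```python
RELATION_BY_DISTANCE = {1: "ortho", 2: "meta", 3: "para"}


def substituent_relation(pos_a, pos_b, ring_size=6):
    """Classify two ring positions (numbered 0..5) as ortho, meta, or para."""
    if ring_size != 6:
        raise ValueError("this sketch handles six-membered rings only")
    if not (0 <= pos_a < ring_size and 0 <= pos_b < ring_size):
        raise ValueError("positions must lie in 0..5")
    if pos_a == pos_b:
        raise ValueError("the two substituents must sit on different atoms")
    # Shortest path around the cycle, going either direction.
    clockwise = (pos_b - pos_a) % ring_size
    distance = min(clockwise, ring_size - clockwise)
    return RELATION_BY_DISTANCE[distance]


relations = [substituent_relation(0, k) for k in range(1, 6)]
```

Three disubstituted isomers of this kind contain the same atoms and the same local bond environments around each substituent, which is why a descriptor has to look at positions along the cycle to tell them apart.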

Citations: 0
Multi-fidelity graph neural networks for predicting toluene/water partition coefficients
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-08 DOI: 10.1186/s13321-025-01057-6
Thomas Nevolianis, Jan G. Rittig, Alexander Mitsos, Kai Leonhard

Accurate prediction of toluene/water partition coefficients of neutral species is crucial in drug discovery and separation processes; however, data-driven modeling of these coefficients remains challenging due to limited available experimental data. To address this limitation, we apply multi-fidelity learning approaches leveraging a quantum chemical dataset (low fidelity) of approximately 9000 entries generated by COSMO-RS and an experimental dataset (high fidelity) of about 250 entries collected from the literature. We explore transfer learning, feature-augmented learning, and multi-target learning approaches in combination with graph neural networks, validating them on two external datasets: one with molecules similar to the training data (EXT-Zamora) and one with more challenging molecules (EXT-SAMPL9). Our results show that multi-target learning significantly improves predictive accuracy, achieving a root-mean-square error of 0.44 $\log P$ units on the EXT-Zamora dataset, compared to 0.63 $\log P$ units for single-task models. For the EXT-SAMPL9 dataset, multi-target learning achieves a root-mean-square error of 1.02 $\log P$ units, indicating reasonable performance even for more complex molecular structures. These findings highlight the potential of multi-fidelity learning approaches that leverage quantum chemical data to improve toluene/water partition coefficient predictions and address challenges posed by limited experimental data. We expect the methods to be applicable beyond toluene/water partition coefficients.
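With roughly 9000 low-fidelity entries and only about 250 experimental ones, most molecules in a multi-target setup have no high-fidelity label. One common way to train on both targets at once is to skip missing labels when computing the loss. The sketch below shows such a masked mean-squared error. It is an assumption about how this kind of training can be set up, not the authors' implementation, and all numbers are invented.

```python
def masked_multitarget_mse(predictions, targets):
    """Mean squared error over all (molecule, target) pairs that have a label.

    predictions: list of (low_fidelity_pred, high_fidelity_pred) tuples.
    targets: list of (low_fidelity_label, high_fidelity_label) tuples,
             where a missing label is given as None and is skipped.
    """
    total = 0.0
    count = 0
    for pred_row, target_row in zip(predictions, targets):
        for pred, target in zip(pred_row, target_row):
            if target is None:  # no measurement available for this target
                continue
            total += (pred - target) ** 2
            count += 1
    if count == 0:
        raise ValueError("no labelled targets to compute a loss from")
    return total / count


# Hypothetical log P values: only the second molecule has an experimental label.
preds = [(2.0, 2.5), (1.0, 1.2), (3.0, 2.8)]
labels = [(2.2, None), (1.0, 1.0), (2.5, None)]
loss = masked_multitarget_mse(preds, labels)
```

Because both outputs share the same network body in a multi-target model, the plentiful low-fidelity labels shape the learned representation even for molecules that contribute nothing to the high-fidelity term.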

We investigated the benefits of transfer learning, feature-augmented learning, and multi-target learning in combination with graph neural networks for predicting toluene/water partition coefficients. We show how a large amount of inexpensive data from the semi-empirical COSMO-RS model can be combined effectively with a small amount of high-fidelity experimental data through multi-target learning, yielding machine learning models with broad applicability and low uncertainty, with root-mean-square errors of 0.44 to 1.02 log units depending on the test set.
Citations: 0
Advanced machine learning for innovative drug discovery
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-08 DOI: 10.1186/s13321-025-01061-w
Igor V. Tetko, Djork-Arné Clevert

This editorial presents an analysis of the articles published in the Journal of Cheminformatics Special Issue “AI in Drug Discovery”. We review how novel machine learning developments are enhancing structure-based drug discovery, providing better forecasts of molecular properties, and improving various elements of chemical reaction prediction. The Special Issue explored methodological developments focused on increasing the accuracy of models via pre-training, estimating the accuracy of predictions, and tuning model hyperparameters while avoiding overfitting. It also covered a diverse range of other novel methodological aspects, from the incorporation of human expert knowledge to the analysis of the susceptibility of models to adversarial attacks. In summary, the Special Issue brought together an excellent collection of articles that collectively demonstrate how machine learning methods have become an essential asset in modern drug discovery, with the potential to advance autonomous chemistry labs in the near future.

Citations: 0
Nanodesigner: resolving the complex-CDR interdependency with iterative refinement
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-07 DOI: 10.1186/s13321-025-01069-2
Melissa Maria Rios Zertuche, Şenay Kafkas, Dominik Renn, Magnus Rueping, Robert Hoehndorf

Camelid heavy-chain only antibodies consist of two heavy chains and single variable domains (VHHs), which retain antigen-binding functionality even when isolated. The term “nanobody” is now more generally used for describing small, single-domain antibodies. Several antibody generative models have been developed for the sequence and structure co-design of the complementarity-determining regions (CDRs) based on the binding interface with a target antigen. However, these models are not tailored for nanobodies and are often constrained by their reliance on experimentally determined antigen–antibody structures, which are labor-intensive to obtain. Here, we introduce NanoDesigner, a tool for nanobody design and optimization based on generative AI methods. NanoDesigner integrates key stages—structure prediction, docking, CDR generation, and side-chain packing—into an iterative framework based on an expectation maximization (EM) algorithm. The algorithm effectively tackles an interdependency challenge where accurate docking presupposes a priori knowledge of the CDR conformation, while effective CDR generation relies on accurate docking outputs to guide its design. NanoDesigner approximately doubles the success rate of de novo nanobody designs through continuous refinement of docking and CDR generation.
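The interdependency described above (docking needs the CDRs, CDR generation needs the docked complex) has the shape of a fixed-point problem that alternating updates can resolve. The toy sketch below illustrates only that loop structure: two scalar quantities stand in for the docking pose and the CDR design, and the update rules are invented. It is not NanoDesigner's EM algorithm and uses no structural biology tooling.

```python
def alternate_until_stable(update_a, update_b, a, b, max_iters=100, tol=1e-9):
    """Alternate two interdependent updates until both estimates stop changing.

    update_a computes a new `a` from the current `b`; update_b computes a new
    `b` from the freshly updated `a`. Returns (a, b, iterations_used).
    """
    iterations = 0
    for iterations in range(1, max_iters + 1):
        new_a = update_a(b)
        new_b = update_b(new_a)
        change = abs(new_a - a) + abs(new_b - b)
        a, b = new_a, new_b
        if change < tol:
            break
    return a, b, iterations


# Hypothetical stand-ins: each quantity depends on the other one.
def toy_dock(cdr):
    return 0.5 * cdr + 1.0


def toy_design(pose):
    return 0.5 * pose + 1.0


pose, cdr, n_iters = alternate_until_stable(toy_dock, toy_design, 0.0, 0.0)
```

In this toy case the two updates are contractions, so the loop converges to the unique point where both conditions hold at once (pose = cdr = 2). Real docking and generation steps carry no such guarantee, which is one reason an iteration cap is kept alongside the tolerance check.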

We developed a new method that uses generative AI to design and optimize nanobodies. We use an iterative approach to address the problem that CDR design depends on knowledge of the complex formed by the nanobody and its protein target, while accurate prediction of the complex depends on knowledge of the CDRs. Through direct comparison, we show that our method improves on the current state of the art.
Citations: 0
From molecules to data: the emerging impact of chemoinformatics in chemistry
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-07 DOI: 10.1186/s13321-025-00978-6
Anup Basnet Chetry, Keisuke Ohto

Chemoinformatics is a rapidly advancing field that integrates chemistry, computer science, and data analysis to enhance the study and application of chemical systems. This interdisciplinary approach leverages computational tools and large datasets to drive innovation in various chemical disciplines, including drug discovery, materials science, and environmental chemistry. Recent advancements in artificial intelligence (AI) and machine learning (ML) have significantly improved the ability to analyze complex datasets, predict molecular properties, and design new compounds. Additionally, the expansion of open-access databases and collaborative platforms has facilitated broader access to chemical data and fostered global research collaboration. Sophisticated molecular modeling techniques, such as multi-scale modeling and free energy calculations, have enhanced the accuracy of predictions, while big data analytics has enabled the extraction of valuable insights from vast datasets. Emerging technologies, including quantum computing, hold promise for further revolutionizing the field by offering new capabilities for simulating and optimizing chemical processes. Despite these advancements, chemoinformatics faces challenges related to data integrity, computational demands, and interdisciplinary collaboration. Addressing these challenges is crucial for the continued growth and effectiveness of chemoinformatics. Overall, the field is poised to play a pivotal role in advancing chemical research and developing innovative solutions to address global challenges.

Scientific contribution This article highlights the growing impact of chemoinformatics in modern chemistry by integrating computational tools with molecular science to enhance data-driven discovery. It explores advancements in machine learning, artificial intelligence, and big data analytics, which improve molecular property predictions and accelerate chemical innovations. The study also discusses key applications in drug design and materials science, demonstrating how chemoinformatics drives efficiency and sustainability in research. Additionally, it outlines future challenges and opportunities, emphasizing the need for improved algorithms, data standardization, and interdisciplinary collaboration. This work contributes to the evolving role of chemoinformatics as a crucial pillar of modern chemical research.

Citations: 0
VitroBert: modeling DILI by pretraining BERT on in vitro data
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-06 DOI: 10.1186/s13321-025-01048-7
Muhammad Arslan Masood, Anamya Ajjolli Nagaraja, Katia Belaid, Natalie Mesens, Hugo Ceulemans, Samuel Kaski, Dorota Herman, Markus Heinonen

Drug-induced liver injury (DILI) presents a significant challenge due to its complexity, small datasets, and severe class imbalance. While unsupervised pretraining is a common approach to learn molecular representations for downstream tasks, it often lacks insights into how molecules interact with biological systems. We therefore introduce VitroBERT, a bidirectional encoder representations from transformers (BERT) model pretrained on large-scale in vitro assay profiles to generate biologically informed molecular embeddings. When leveraged to predict in vivo DILI endpoints, these embeddings delivered up to a 29% improvement in biochemistry-related tasks and a 16% gain in histopathology endpoints compared to unsupervised pretraining (MolBERT). However, no significant improvement was observed in clinical tasks. Furthermore, to address the critical issue of class imbalance, we evaluated multiple loss functions, including BCE, weighted BCE, Focal loss, and weighted Focal loss, and identified weighted Focal loss as the most effective. Our findings demonstrate the potential of integrating biological context into molecular models and highlight the importance of selecting appropriate loss functions in improving model performance on highly imbalanced DILI-related tasks.

We introduce VitroBERT, a general framework that extends conventional molecular pretraining by incorporating biological supervision. Our results show that molecular embeddings enriched with in vitro interactions outperform embeddings based on chemical data alone. These findings reinforce the value of in vitro data for capturing biologically relevant information and highlight the model's ability to use these features to model DILI risk.
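The abstract compares BCE, weighted BCE, Focal loss, and weighted Focal loss for imbalanced DILI labels. As a minimal sketch of the last of these, the function below follows the commonly used binary focal loss form with a class weight, written in plain Python. The defaults `alpha=0.75` and `gamma=2.0` are illustrative assumptions, not values reported by the authors, and the paper's exact weighting scheme may differ.

```python
import math


def weighted_focal_loss(probs, labels, alpha=0.75, gamma=2.0, eps=1e-7):
    """Mean weighted focal loss for binary labels (0 or 1).

    alpha weights the positive class and (1 - alpha) the negative class;
    gamma down-weights examples the model already classifies well.
    Both defaults are assumptions made for illustration.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)             # clamp to avoid log(0)
        p_t = p if y == 1 else 1.0 - p              # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha  # class weight for this label
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)
```

Setting `gamma=0.0` removes the focusing term and leaves a class-weighted BCE, so one function covers two of the four losses named in the abstract. Raising `alpha` above 0.5 shifts weight toward the rarer positive class.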
Citations: 0
Prediction model for chemical explosion consequences via multimodal feature fusion
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-08-05 DOI: 10.1186/s13321-025-01060-x
Yilin Wang, Beibei Wang, Yichen Zhang, Jiquan Zhang, Yijie Song, Shuang-Hua Yang

Chemical explosion accidents represent a significant threat to both human safety and environmental integrity. The accurate prediction of such incidents plays a pivotal role in risk mitigation and safety enhancement within the chemical industry. This study proposes an innovative Bayes-Transformer-SVM model based on multimodal feature fusion, integrating Quantitative Structure–Property Relationship (QSPR) and Quantitative Property-Consequence Relationship (QPCR) principles. The model utilizes molecular descriptors derived from the Simplified Molecular Input Line Entry System (SMILES) and Gaussian16 software, combined with leakage condition parameters, as input features to investigate the quantitative relationship between these factors and explosion consequences. A comprehensive validation and evaluation of the constructed model were performed. Results demonstrate that the optimized Bayes-Transformer-SVM model achieves superior performance, with test set metrics reaching an R² of 0.9475 and RMSE of 0.1139, outperforming alternative prediction models. The developed model offers a novel and effective approach for assessing explosion risks associated with both existing and newly developed chemical substances. The model enables rapid explosion consequence assessment for chemical storage or transport scenarios, supporting safety-by-design frameworks.

This study constructed a Bayes-Transformer-SVM model for predicting the consequences of hazardous chemical explosions. The model utilized SMILES encoding and Gaussian16 quantum chemical descriptors, combined with leakage condition scenario parameters, achieving excellent performance. Its core lies in the establishment of a multimodal fusion theoretical framework, breaking through the limitations of traditional cross-modal correlation analysis; the development of an optimized architecture that combines Transformer feature extraction and SVM regression; highlighting the potential application of the model in chemoinformatics; and enabling the prospective assessment of the explosion risks of unknown chemicals, supporting a safety-oriented design concept.
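The abstract reports test-set performance as R² and RMSE. For readers less familiar with these two regression metrics, the sketch below computes both from their standard definitions in plain Python. The function names and any inputs passed to them are illustrative assumptions; nothing here reproduces the authors' data or evaluation code.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation
    return 1.0 - ss_res / ss_tot
```

A perfect predictor gives an RMSE of 0 and an R² of 1, while a predictor that always returns the mean of the observed values gives an R² of 0. RMSE is expressed in the units of the target variable, so the reported value is only interpretable alongside the scale of the predicted consequence measure.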

Citations: 0