Journal of Cheminformatics: Latest Articles

DeepTGIN: a novel hybrid multimodal approach using transformers and graph isomorphism networks for protein-ligand binding affinity prediction
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-29 | DOI: 10.1186/s13321-024-00938-6
Guishen Wang, Hangchen Zhang, Mengting Shao, Yuncong Feng, Chen Cao, Xiaowen Hu

Predicting protein-ligand binding affinity is essential for understanding protein-ligand interactions and advancing drug discovery. Recent research has demonstrated the advantages of sequence-based models and graph-based models. In this study, we present a novel hybrid multimodal approach, DeepTGIN, which integrates transformers and graph isomorphism networks to predict protein-ligand binding affinity. DeepTGIN is designed to learn sequence and graph features efficiently. The DeepTGIN model comprises three modules: the data representation module, the encoder module, and the prediction module. The transformer encoder learns sequential features from proteins and protein pockets separately, while the graph isomorphism network extracts graph features from the ligands. To evaluate the performance of DeepTGIN, we compared it with state-of-the-art models using the PDBbind 2016 core set and PDBbind 2013 core set. DeepTGIN outperforms these models in terms of R, RMSE, MAE, SD, and CI metrics. Ablation studies further demonstrate the effectiveness of the ligand features and the encoder module. The code is available at: https://github.com/zhc-moushang/DeepTGIN.

DeepTGIN is a novel hybrid multimodal deep learning model for predicting protein-ligand binding affinity. The model uses a Transformer encoder to extract sequence features from the protein and the protein pocket, and graph isomorphism networks to capture features from the ligand. This model addresses the limitations of existing methods in exploring protein pocket and ligand features.
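
The five metrics named in the abstract (R, RMSE, MAE, SD, CI) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The abstract does not define the metrics, so the sketch assumes R is the Pearson correlation, SD is the standard deviation of the residuals after a linear fit of measured on predicted values (with an N-1 denominator, a convention often used with PDBbind core sets), and CI is the concordance index over pairs with different measured affinities. All function names are hypothetical.

```python
import math
from itertools import combinations


def _centered_sums(y_true, y_pred):
    """Return the means plus covariance-like and variance-like sums used below."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return mt, mp, cov, vt, vp


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    _, _, cov, vt, vp = _centered_sums(y_true, y_pred)
    return cov / math.sqrt(vt * vp)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def regression_sd(y_true, y_pred):
    """SD of residuals after fitting y_true ~ a * y_pred + b (assumed definition)."""
    mt, mp, cov, _, vp = _centered_sums(y_true, y_pred)
    a = cov / vp            # least-squares slope
    b = mt - a * mp         # least-squares intercept
    sse = sum((t - (a * p + b)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sse / (len(y_true) - 1))


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs ranked in the same order; prediction ties count 0.5."""
    concordant = 0.0
    comparable = 0
    for (t1, p1), (t2, p2) in combinations(zip(y_true, y_pred), 2):
        if t1 == t2:
            continue  # pairs with equal measured affinity are not comparable
        comparable += 1
        if p1 == p2:
            concordant += 0.5
        elif (p1 - p2) * (t1 - t2) > 0:
            concordant += 1.0
    return concordant / comparable
```

Higher R and CI are better, while lower RMSE, MAE and SD are better, so a model that "outperforms" on all five improves in both directions at once.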

Citations: 0
STOUT V2.0: SMILES to IUPAC name conversion using transformer models
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-27 | DOI: 10.1186/s13321-024-00941-x
Kohulan Rajan, Achim Zielesny, Christoph Steinbeck

Naming chemical compounds systematically is a complex task governed by a set of rules established by the International Union of Pure and Applied Chemistry (IUPAC). These rules are universal and widely accepted by chemists worldwide, but their complexity makes it challenging for individuals to consistently apply them accurately. A translation method can be employed to address this challenge. Accurate translation of chemical compounds from SMILES notation into their corresponding IUPAC names is crucial, as it can significantly streamline the laborious process of naming chemical structures. Here, we present STOUT (SMILES-TO-IUPAC-name translator) V2, which addresses this challenge by introducing a transformer-based model that translates string representations of chemical structures into IUPAC names. Trained on a dataset of nearly 1 billion SMILES strings and their corresponding IUPAC names, STOUT V2 demonstrates exceptional accuracy in generating IUPAC names, even for complex chemical structures. The model's ability to capture intricate patterns and relationships within chemical structures enables it to generate precise and standardised IUPAC names. While established deterministic algorithms remain the gold standard for systematic chemical naming, our work, enabled by access to OpenEye’s Lexichem software through an academic license, demonstrates the potential of neural approaches to complement existing tools in chemical nomenclature.

Scientific contribution

STOUT V2, built upon transformer-based models, is a significant advancement from our previous work. The web application enhances its accessibility and utility. By making the model and source code fully open and well-documented, we aim to promote unrestricted use and encourage further development.
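
A sequence-to-sequence model of this kind needs its input string split into tokens before translation. The sketch below shows one common way to tokenize SMILES with a regular expression. It is an illustrative assumption only: the abstract does not describe how STOUT V2 tokenizes its input, and the pattern and the function name `tokenize_smiles` are hypothetical.

```python
import re

# Assumed token pattern: bracket atoms first, then two-letter halogens,
# then single-letter organic-subset atoms, bonds, branches and ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


print(tokenize_smiles("CC(=O)Cl"))
```

Trying `Br` and `Cl` before the single-letter atoms keeps chlorine from being read as carbon followed by an unknown character, and the round-trip check rejects strings the pattern cannot fully cover.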

Citations: 0
Comprehensive benchmarking of computational tools for predicting toxicokinetic and physicochemical properties of chemicals
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-26 | DOI: 10.1186/s13321-024-00931-z
Domenico Gadaleta, Eva Serrano-Candelas, Rita Ortega-Vallbona, Erika Colombo, Marina Garcia de Lomana, Giada Biava, Pablo Aparicio-Sánchez, Alessandra Roncaglioni, Rafael Gozalbes, Emilio Benfenati

Ensuring the safety of chemicals for environmental and human health involves assessing physicochemical (PC) and toxicokinetic (TK) properties, which are crucial for absorption, distribution, metabolism, excretion, and toxicity (ADMET). Computational methods play a vital role in predicting these properties, given the current trends in reducing experimental approaches, especially those that involve animal experimentation. In the present manuscript, twelve software tools implementing Quantitative Structure–Activity Relationship (QSAR) models were selected for the prediction of 17 relevant PC and TK properties. A total of 41 validation datasets were collected from the literature, curated and used for assessing the models’ external predictivity, emphasizing the performance of the models inside the applicability domain. Overall, the results confirmed the adequate predictive performance of the majority of the selected tools, with models for PC properties (average R² = 0.717) generally outperforming those for TK properties (average R² = 0.639 for regression, average balanced accuracy = 0.780 for classification). Notably, several of the tools evaluated exhibited good predictivity across different properties and were identified as recurring optimal choices. Moreover, a systematic analysis of the chemical space covered by the external validation datasets confirmed the validity of the collected results for relevant chemical categories (e.g., drugs and industrial chemicals), further increasing the confidence in the overall evaluation. The best performing models were ultimately suggested for each investigated property and proposed as robust computational tools for high-throughput assessment of highly relevant chemical properties.

The present manuscript provides an overview of the state-of-the-art available computational tools for predicting the PC and TK properties of chemicals. The results here offer valuable guidance to researchers, regulatory authorities, and the industry in identifying robust computational tools suitable for predicting relevant chemical properties in the context of chemical design, toxicity and environmental fate assessment.
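
The two headline statistics in the abstract, R² for regression models and balanced accuracy for classifiers, can be computed as in the minimal sketch below. It assumes R² is the coefficient of determination and that classification is binary with labels 0 and 1. The abstract does not state either detail, and the function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0
```

Balanced accuracy weights both classes equally, which matters for toxicokinetic endpoints where one class is often much rarer than the other.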

Citations: 0
AttenhERG: a reliable and interpretable graph neural network framework for predicting hERG channel blockers
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-23 | DOI: 10.1186/s13321-024-00940-y
Tianbiao Yang, Xiaoyu Ding, Elizabeth McMichael, Frank W. Pun, Alex Aliper, Feng Ren, Alex Zhavoronkov, Xiao Ding

Cardiotoxicity, particularly drug-induced arrhythmias, poses a significant challenge in drug development, highlighting the importance of early-stage prediction of human ether-a-go-go-related gene (hERG) toxicity. hERG encodes the pore-forming subunit of the cardiac potassium channel. Traditional methods are both costly and time-intensive, necessitating the development of computational approaches. In this study, we introduce AttenhERG, a novel graph neural network framework designed to predict hERG channel blockers reliably and interpretably. AttenhERG demonstrates improved performance compared to existing methods with an AUROC of 0.835, showcasing its efficacy in accurately predicting hERG activity across diverse datasets. Additionally, uncertainty evaluation analysis reveals the model's reliability, enhancing its utility in drug discovery and safety assessment. Case studies illustrate the practical application of AttenhERG in optimizing compounds for hERG toxicity, highlighting its potential in rational drug design.

Scientific contribution

AttenhERG is a breakthrough framework that significantly improves the interpretability and accuracy of predicting hERG channel blockers. By integrating uncertainty estimation, AttenhERG demonstrates superior reliability compared to benchmark models. Two case studies, involving APH1A and NMT1 inhibitors, further emphasize AttenhERG's practical application in compound optimization.
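
The AUROC figure quoted in the abstract has a simple rank interpretation: it is the probability that a randomly chosen blocker receives a higher score than a randomly chosen non-blocker. The sketch below computes it directly from that definition. It is a generic illustration, not AttenhERG's evaluation code, and the function name and the toy data are hypothetical.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = hERG blocker, 0 = non-blocker (made-up scores)
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auroc(labels, scores))
```

The pairwise loop is quadratic in the number of compounds, which is fine for illustration; a sort-based rank formula gives the same value for large datasets.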

Citations: 0
Correction: StreaMD: the toolkit for high-throughput molecular dynamics simulations
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-23 | DOI: 10.1186/s13321-024-00942-w
Aleksandra Ivanova, Olena Mokshyna, Pavel Polishchuk
Citations: 0
Interface-aware molecular generative framework for protein–protein interaction modulators
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-20 | DOI: 10.1186/s13321-024-00930-0
Jianmin Wang, Jiashun Mao, Chunyan Li, Hongxin Xiang, Xun Wang, Shuang Wang, Zixu Wang, Yangyang Chen, Yuquan Li, Kyoung Tai No, Tao Song, Xiangxiang Zeng

Protein–protein interactions (PPIs) play a crucial role in numerous biochemical and biological processes. Although several structure-based molecular generative models have been developed, PPI interfaces and compounds targeting PPIs exhibit distinct physicochemical properties compared to traditional binding pockets and small-molecule drugs. As a result, generating compounds that effectively target PPIs, particularly by considering PPI complexes or interface hotspot residues, remains a significant challenge. In this work, we constructed a comprehensive dataset of PPI interfaces with active and inactive compound pairs. Based on this, we propose a novel molecular generative framework tailored to PPI interfaces, named GENiPPI. Our evaluation demonstrates that GENiPPI captures the implicit relationships between the PPI interfaces and the active molecules, and can generate novel compounds that target these interfaces. Moreover, GENiPPI can generate structurally diverse novel compounds with limited PPI interface modulators. To the best of our knowledge, this is the first exploration of a structure-based molecular generative model focused on PPI interfaces, which could facilitate the design of PPI modulators. The PPI interface-based molecular generative model enriches the existing landscape of structure-based (pocket/interface) molecular generative models.

This study introduces GENiPPI, a protein-protein interaction (PPI) interface-aware molecular generative framework. The framework first employs Graph Attention Networks to capture atomic-level interaction features at the protein complex interface. Subsequently, Convolutional Neural Networks extract compound representations in voxel and electron density spaces. These features are integrated into a Conditional Wasserstein Generative Adversarial Network, which trains the model to generate compound representations targeting PPI interfaces. GENiPPI effectively captures the relationship between PPI interfaces and active/inactive compounds. Furthermore, in few-shot molecular generation, GENiPPI successfully generates compounds comparable to known disruptors. GENiPPI provides an efficient tool for structure-based design of PPI modulators.
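
To make the idea of a voxel-space compound representation concrete, the sketch below bins atom coordinates into a cubic occupancy grid that a 3D convolutional network could consume. It is only an assumed, simplified illustration: the abstract does not specify GENiPPI's grid size, resolution, channels, or how electron density is represented, and the function name, the default parameters and the example coordinates are all hypothetical.

```python
import math


def voxelize(coords, box_size=8.0, resolution=1.0):
    """Count atoms per voxel in a cube of edge `box_size` centered at the origin.

    coords: iterable of (x, y, z) positions in the same length unit as box_size.
    Atoms that fall outside the box are skipped.
    """
    n = int(round(box_size / resolution))  # voxels per edge
    half = box_size / 2.0
    grid = [[[0.0 for _ in range(n)] for _ in range(n)] for _ in range(n)]
    for x, y, z in coords:
        idx = [math.floor((c + half) / resolution) for c in (x, y, z)]
        if all(0 <= i < n for i in idx):
            i, j, k = idx
            grid[i][j][k] += 1.0
    return grid


# Two atoms share the central voxel; the third lies outside the box.
atoms = [(0.0, 0.0, 0.0), (0.2, 0.1, 0.4), (10.0, 0.0, 0.0)]
grid = voxelize(atoms)
print(grid[4][4][4])
```

A practical representation would typically use one channel per atom type or property and smooth each atom over neighboring voxels rather than counting hard occupancy.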

Citations: 0
MolNexTR: a generalized deep learning model for molecular image recognition
IF 7.1 | Zone 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-12-18 | DOI: 10.1186/s13321-024-00926-w
Yufan Chen, Ching Ting Leung, Yong Huang, Jianwei Sun, Hao Chen, Hanyu Gao

In the field of chemical structure recognition, converting molecular images into machine-readable data formats such as SMILES strings is a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we propose MolNexTR, a novel image-to-graph deep learning model that fuses the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more detailed extraction of both local and global features from molecular images. MolNexTR can predict atoms and bonds simultaneously and understand their layout rules. It also excels at flexibly integrating symbolic chemistry principles to discern chirality and decipher abbreviated structures. We further incorporate a series of advanced algorithms, including an improved data augmentation module, an image contamination module, and a post-processing module for getting the final SMILES output. These modules cooperate to enhance the model’s robustness to diverse styles of molecular images found in real literature. In our test sets, MolNexTR has demonstrated superior performance, achieving an accuracy rate of 81–97%, marking a significant advancement in the domain of molecular structure recognition.

Scientific contribution

MolNexTR is a novel image-to-graph model that incorporates a unique dual-stream encoder to extract complex molecular image features, and combines chemical rules to predict atoms and bonds while understanding atom and bond layout rules. In addition, it employs a series of novel augmentation algorithms to significantly enhance the robustness and performance of the model.

Citations: 0
FlavorMiner: a machine learning platform for extracting molecular flavor profiles from structural data
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-12-10 DOI: 10.1186/s13321-024-00935-9
Fabio Herrera-Rocha, Miguel Fernández-Niño, Jorge Duitama, Mónica P. Cala, María José Chica, Ludger A. Wessjohann, Mehdi D. Davari, Andrés Fernando González Barrios

Flavor is the main factor driving consumer acceptance of food products. However, tracking the biochemistry of flavor is a formidable challenge due to the complexity of food composition. Current methodologies for linking individual molecules to flavor in foods and beverages are expensive and time-consuming. Predictive models based on machine learning (ML) are emerging as an alternative to speed up this process. Nonetheless, the optimal approach to predicting the flavor features of molecules remains elusive. In this work we present FlavorMiner, an ML-based multilabel flavor predictor. FlavorMiner integrates different combinations of algorithms and mathematical representations, augmented with class balance strategies to address the inherent class imbalance of the input dataset. Notably, Random Forest and K-Nearest Neighbors combined with Extended Connectivity Fingerprint and RDKit molecular descriptors outperform other combinations in most cases. Resampling strategies surpass weight balance methods in mitigating bias associated with class imbalance. FlavorMiner achieves an average ROC AUC score of 0.88. The algorithm was used to analyze cocoa metabolomics data, showing its potential to help extract valuable insights from intricate food metabolomics data. FlavorMiner can be used for flavor mining in any food product, drawing on a diverse training dataset that spans over 934 distinct food products.

Scientific Contribution FlavorMiner is a machine learning (ML)-based tool designed to predict molecular flavor features with high accuracy and efficiency, addressing the complexity of food metabolomics. By pairing robust algorithmic combinations with mathematical representations, FlavorMiner achieves high predictive performance. Applied to cocoa metabolomics, FlavorMiner demonstrated its capacity to extract meaningful insights, showcasing its versatility for flavor analysis across diverse food products. This study underscores the potential of ML to accelerate flavor biochemistry research, offering a scalable solution for the food and beverage industry.
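
The "average ROC AUC" reported for a multilabel predictor can be read as a per-label ROC AUC averaged over the flavor labels. The sketch below illustrates that reading in plain Python. It assumes unweighted (macro) averaging, which the abstract does not specify, and the label names and scores in `toy` are hypothetical values chosen only for illustration, not FlavorMiner output.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    # Count pairwise wins of positives over negatives; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(labels: dict[str, tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-label ROC AUC values (assumed averaging scheme)."""
    return sum(roc_auc(t, s) for t, s in labels.values()) / len(labels)


# Hypothetical predictions for two flavor labels (illustrative values only).
toy = {
    "sweet": ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
    "bitter": ([0, 0, 1, 1], [0.1, 0.4, 0.8, 0.9]),
}
print(macro_roc_auc(toy))
```

Macro averaging gives every label equal weight, so rare flavor classes count as much as common ones. That is one reason class imbalance matters when such a score is interpreted.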

Citations: 0
Human-in-the-loop active learning for goal-oriented molecule generation
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-12-09 DOI: 10.1186/s13321-024-00924-y
Yasmine Nahal, Janosch Menke, Julien Martinelli, Markus Heinonen, Mikhail Kabeshov, Jon Paul Janet, Eva Nittinger, Ola Engkvist, Samuel Kaski

Machine learning (ML) systems have enabled the modelling of quantitative structure–property relationships (QSPR) and structure-activity relationships (QSAR) using existing experimental data to predict target properties for new molecules. These property predictors hold significant potential in accelerating drug discovery by guiding generative artificial intelligence (AI) agents to explore desired chemical spaces. However, they often struggle to generalize due to the limited scope of the training data. When optimized by generative agents, this limitation can result in the generation of molecules with artificially high predicted probabilities of satisfying target properties, which subsequently fail experimental validation. To address this challenge, we propose an adaptive approach that integrates active learning (AL) and iterative feedback to refine property predictors, thereby improving the outcomes of their optimization by generative AI agents. Our method leverages the Expected Predictive Information Gain (EPIG) criterion to select additional molecules for evaluation by an oracle. This process aims to provide the greatest reduction in predictive uncertainty, enabling more accurate model evaluations of subsequently generated molecules. Recognizing the impracticality of immediate wet-lab or physics-based experiments due to time and logistical constraints, we propose leveraging human experts for their cost-effectiveness and domain knowledge to effectively augment property predictors, bridging gaps in the limited training data. Empirical evaluations through both simulated and real human-in-the-loop experiments demonstrate that our approach refines property predictors to better align with oracle assessments. Additionally, we observe improved accuracy of predicted properties as well as improved drug-likeness among the top-ranking generated molecules.

We present an adaptable framework that integrates AL and human expertise to refine property predictors for goal-oriented molecule generation. This approach is robust to noise in human feedback and ensures that navigating chemical space with human-refined predictors leverages human insights to identify molecules that not only satisfy predicted property profiles but also score highly on oracle models. Additionally, it prioritizes practical characteristics such as drug-likeness, synthetic accessibility, and a favorable balance between exploring diverse chemical space and exploiting similarity to existing training data.
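
The EPIG criterion mentioned above scores a candidate molecule by the mutual information between its unknown label and the labels of molecules drawn from a target distribution. The following is a minimal sketch of a Monte Carlo estimate for a binary property, assuming the predictor's uncertainty is represented by K sampled parameter sets (for example an ensemble). The function names, the binary setting, and the ensemble representation are assumptions made for illustration and are not taken from the authors' code.

```python
import math
from typing import Sequence


def _bernoulli(p: float) -> tuple[float, float]:
    """Return (P(y=0), P(y=1)) for a success probability p."""
    return (1.0 - p, p)


def epig(cand_probs: Sequence[float], target_probs: Sequence[Sequence[float]]) -> float:
    """Estimate EPIG for one candidate.

    cand_probs[k]      = P(y=1 | candidate, theta_k)
    target_probs[j][k] = P(y*=1 | target molecule j, theta_k)
    """
    k = len(cand_probs)
    total = 0.0
    for tgt in target_probs:
        # Joint predictive over (y, y*), averaged over the K parameter samples.
        joint = [[0.0, 0.0], [0.0, 0.0]]
        for pc, pt in zip(cand_probs, tgt):
            bc, bt = _bernoulli(pc), _bernoulli(pt)
            for y in (0, 1):
                for ys in (0, 1):
                    joint[y][ys] += bc[y] * bt[ys] / k
        marg_c = [joint[y][0] + joint[y][1] for y in (0, 1)]
        marg_t = [joint[0][ys] + joint[1][ys] for ys in (0, 1)]
        # Mutual information between y and y* under the joint predictive.
        for y in (0, 1):
            for ys in (0, 1):
                if joint[y][ys] > 0.0:
                    total += joint[y][ys] * math.log(joint[y][ys] / (marg_c[y] * marg_t[ys]))
    return total / len(target_probs)


def select_query(candidates: dict[str, Sequence[float]],
                 target_probs: Sequence[Sequence[float]]) -> str:
    """Pick the candidate whose label is expected to be most informative."""
    return max(candidates, key=lambda name: epig(candidates[name], target_probs))
```

When all parameter samples agree on the candidate, the estimate is zero, because labeling that molecule would not change predictions elsewhere. When the samples disagree in a way that correlates with their disagreement on the target molecules, the score rises, which is what makes the candidate worth sending to the oracle.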

Citations: 0
Be aware of overfitting by hyperparameter optimization!
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-12-09 DOI: 10.1186/s13321-024-00934-w
Igor V. Tetko, Ruud van Deursen, Guillaume Godin

Hyperparameter optimization is very frequently employed in machine learning. However, optimizing over a large parameter space can result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we show that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be obtained using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of SMILES, called Transformer CNN. We show that, across all analyzed sets and using exactly the same protocol, Transformer CNN provided better results than graph-based methods in 26 out of 28 pairwise comparisons while using only a tiny fraction of the time required by other methods. Last but not least, we stress the importance of comparing calculation results using exactly the same statistical measures.

Scientific Contribution We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performance while being about four orders of magnitude faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.
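
One way hyperparameter search can overfit is selection bias: when many configurations are scored on the same validation data, the best score is partly luck. The simulation below is a hypothetical illustration of that effect and not a reproduction of the study. It assumes every configuration has the same true RMSE and that validation and test estimates carry independent Gaussian noise; the function name and all numeric settings are invented for the example.

```python
import random


def best_of_n_validation(n_configs: int, true_rmse: float, noise_sd: float,
                         rng: random.Random) -> tuple[float, float]:
    """Return (validation RMSE, fresh-test RMSE) of the configuration picked by validation."""
    # Each configuration gets a noisy validation estimate and an independent test estimate.
    scores = [(true_rmse + rng.gauss(0.0, noise_sd), true_rmse + rng.gauss(0.0, noise_sd))
              for _ in range(n_configs)]
    # Hyperparameter search keeps the configuration with the lowest validation RMSE.
    return min(scores, key=lambda pair: pair[0])


rng = random.Random(0)
runs = [best_of_n_validation(200, true_rmse=1.0, noise_sd=0.05, rng=rng) for _ in range(100)]
mean_val = sum(v for v, _ in runs) / len(runs)
mean_test = sum(t for _, t in runs) / len(runs)
print(f"selected validation RMSE: {mean_val:.3f}, fresh test RMSE: {mean_test:.3f}")
```

The selected validation score looks better than the true value even though no configuration is actually better, while the fresh test estimate stays near the true RMSE. A larger search space widens this gap, which is consistent with the caution in the abstract about optimizing over many parameters.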

Citations: 0