Pub Date : 2024-12-23DOI: 10.1186/s13321-024-00940-y
Tianbiao Yang, Xiaoyu Ding, Elizabeth McMichael, Frank W. Pun, Alex Aliper, Feng Ren, Alex Zhavoronkov, Xiao Ding
Cardiotoxicity, particularly drug-induced arrhythmias, poses a significant challenge in drug development, highlighting the importance of early-stage prediction of human ether-a-go-go-related gene (hERG) toxicity. hERG encodes the pore-forming subunit of the cardiac potassium channel. Traditional methods are both costly and time-intensive, necessitating the development of computational approaches. In this study, we introduce AttenhERG, a novel graph neural network framework designed to predict hERG channel blockers reliably and interpretably. AttenhERG demonstrates improved performance compared to existing methods with an AUROC of 0.835, showcasing its efficacy in accurately predicting hERG activity across diverse datasets. Additionally, uncertainty evaluation analysis reveals the model's reliability, enhancing its utility in drug discovery and safety assessment. Case studies illustrate the practical application of AttenhERG in optimizing compounds for hERG toxicity, highlighting its potential in rational drug design.
Scientific contribution
AttenhERG is a breakthrough framework that significantly improves the interpretability and accuracy of predicting hERG channel blockers. By integrating uncertainty estimation, AttenhERG demonstrates superior reliability compared to benchmark models. Two case studies, involving APH1A and NMT1 inhibitors, further emphasize AttenhERG's practical application in compound optimization.
{"title":"AttenhERG: a reliable and interpretable graph neural network framework for predicting hERG channel blockers","authors":"Tianbiao Yang, Xiaoyu Ding, Elizabeth McMichael, Frank W. Pun, Alex Aliper, Feng Ren, Alex Zhavoronkov, Xiao Ding","doi":"10.1186/s13321-024-00940-y","DOIUrl":"10.1186/s13321-024-00940-y","url":null,"abstract":"<div><p>Cardiotoxicity, particularly drug-induced arrhythmias, poses a significant challenge in drug development, highlighting the importance of early-stage prediction of human ether-a-go-go-related gene (hERG) toxicity. hERG encodes the pore-forming subunit of the cardiac potassium channel. Traditional methods are both costly and time-intensive, necessitating the development of computational approaches. In this study, we introduce AttenhERG, a novel graph neural network framework designed to predict hERG channel blockers reliably and interpretably. AttenhERG demonstrates improved performance compared to existing methods with an AUROC of 0.835, showcasing its efficacy in accurately predicting hERG activity across diverse datasets. Additionally, uncertainty evaluation analysis reveals the model's reliability, enhancing its utility in drug discovery and safety assessment. Case studies illustrate the practical application of AttenhERG in optimizing compounds for hERG toxicity, highlighting its potential in rational drug design.</p><p><b>Scientific contribution</b></p><p>AttenhERG is a breakthrough framework that significantly improves the interpretability and accuracy of predicting hERG channel blockers. By integrating uncertainty estimation, AttenhERG demonstrates superior reliability compared to benchmark models. Two case studies, involving APH1A and NMT1 inhibitors, further emphasize AttenhERG's practical application in compound optimization.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00940-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142874098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-20DOI: 10.1186/s13321-024-00930-0
Jianmin Wang, Jiashun Mao, Chunyan Li, Hongxin Xiang, Xun Wang, Shuang Wang, Zixu Wang, Yangyang Chen, Yuquan Li, Kyoung Tai No, Tao Song, Xiangxiang Zeng
Protein–protein interactions (PPIs) play a crucial role in numerous biochemical and biological processes. Although several structure-based molecular generative models have been developed, PPI interfaces and compounds targeting PPIs exhibit distinct physicochemical properties compared to traditional binding pockets and small-molecule drugs. As a result, generating compounds that effectively target PPIs, particularly by considering PPI complexes or interface hotspot residues, remains a significant challenge. In this work, we constructed a comprehensive dataset of PPI interfaces with active and inactive compound pairs. Based on this, we propose a novel molecular generative framework tailored to PPI interfaces, named GENiPPI. Our evaluation demonstrates that GENiPPI captures the implicit relationships between the PPI interfaces and the active molecules, and can generate novel compounds that target these interfaces. Moreover, GENiPPI can generate structurally diverse novel compounds with limited PPI interface modulators. To the best of our knowledge, this is the first exploration of a structure-based molecular generative model focused on PPI interfaces, which could facilitate the design of PPI modulators. The PPI interface-based molecular generative model enriches the existing landscape of structure-based (pocket/interface) molecular generative model.
This study introduces GENiPPI, a protein-protein interaction (PPI) interface-aware molecular generative framework. The framework first employs Graph Attention Networks to capture atomic-level interaction features at the protein complex interface. Subsequently, Convolutional Neural Networks extract compound representations in voxel and electron density spaces. These features are integrated into a Conditional Wasserstein Generative Adversarial Network, which trains the model to generate compound representations targeting PPI interfaces. GENiPPI effectively captures the relationship between PPI interfaces and active/inactive compounds. Furthermore, in fewshot molecular generation, GENiPPI successfully generates compounds comparable to known disruptors. GENiPPI provides an efficient tool for structure-based design of PPI modulators.
{"title":"Interface-aware molecular generative framework for protein–protein interaction modulators","authors":"Jianmin Wang, Jiashun Mao, Chunyan Li, Hongxin Xiang, Xun Wang, Shuang Wang, Zixu Wang, Yangyang Chen, Yuquan Li, Kyoung Tai No, Tao Song, Xiangxiang Zeng","doi":"10.1186/s13321-024-00930-0","DOIUrl":"10.1186/s13321-024-00930-0","url":null,"abstract":"<p>Protein–protein interactions (PPIs) play a crucial role in numerous biochemical and biological processes. Although several structure-based molecular generative models have been developed, PPI interfaces and compounds targeting PPIs exhibit distinct physicochemical properties compared to traditional binding pockets and small-molecule drugs. As a result, generating compounds that effectively target PPIs, particularly by considering PPI complexes or interface hotspot residues, remains a significant challenge. In this work, we constructed a comprehensive dataset of PPI interfaces with active and inactive compound pairs. Based on this, we propose a novel molecular generative framework tailored to PPI interfaces, named GENiPPI. Our evaluation demonstrates that GENiPPI captures the implicit relationships between the PPI interfaces and the active molecules, and can generate novel compounds that target these interfaces. Moreover, GENiPPI can generate structurally diverse novel compounds with limited PPI interface modulators. To the best of our knowledge, this is the first exploration of a structure-based molecular generative model focused on PPI interfaces, which could facilitate the design of PPI modulators. The PPI interface-based molecular generative model enriches the existing landscape of structure-based (pocket/interface) molecular generative model.</p><p>This study introduces GENiPPI, a protein-protein interaction (PPI) interface-aware molecular generative framework. The framework first employs Graph Attention Networks to capture atomic-level interaction features at the protein complex interface. Subsequently, Convolutional Neural Networks extract compound representations in voxel and electron density spaces. These features are integrated into a Conditional Wasserstein Generative Adversarial\u0000Network, which trains the model to generate compound representations targeting PPI interfaces. GENiPPI effectively captures the relationship between PPI interfaces and active/inactive compounds. Furthermore, in fewshot molecular generation, GENiPPI successfully generates compounds comparable to known disruptors. GENiPPI provides an efficient tool for structure-based design of PPI modulators.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00930-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142858556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of chemical structure recognition, the task of converting molecular images into machine-readable data formats such as SMILES string stands as a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we proposed MolNexTR, a novel image-to-graph deep learning model that collaborates to fuse the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more detailed extraction of both local and global features from molecular images. MolNexTR can predict atoms and bonds simultaneously and understand their layout rules. It also excels at flexibly integrating symbolic chemistry principles to discern chirality and decipher abbreviated structures. We further incorporate a series of advanced algorithms, including an improved data augmentation module, an image contamination module, and a post-processing module for getting the final SMILES output. These modules cooperate to enhance the model’s robustness to diverse styles of molecular images found in real literature. In our test sets, MolNexTR has demonstrated superior performance, achieving an accuracy rate of 81–97%, marking a significant advancement in the domain of molecular structure recognition.
Scientific contribution
MolNexTR is a novel image-to-graph model that incorporates a unique dual-stream encoder to extract complex molecular image features, and combines chemical rules to predict atoms and bonds while understanding atom and bond layout rules. In addition, it employs a series of novel augmentation algorithms to significantly enhance the robustness and performance of the model.
{"title":"MolNexTR: a generalized deep learning model for molecular image recognition","authors":"Yufan Chen, Ching Ting Leung, Yong Huang, Jianwei Sun, Hao Chen, Hanyu Gao","doi":"10.1186/s13321-024-00926-w","DOIUrl":"10.1186/s13321-024-00926-w","url":null,"abstract":"<div><p>In the field of chemical structure recognition, the task of converting molecular images into machine-readable data formats such as SMILES string stands as a significant challenge, primarily due to the varied drawing styles and conventions prevalent in chemical literature. To bridge this gap, we proposed MolNexTR, a novel image-to-graph deep learning model that collaborates to fuse the strengths of ConvNext, a powerful Convolutional Neural Network variant, and Vision-TRansformer. This integration facilitates a more detailed extraction of both local and global features from molecular images. MolNexTR can predict atoms and bonds simultaneously and understand their layout rules. It also excels at flexibly integrating symbolic chemistry principles to discern chirality and decipher abbreviated structures. We further incorporate a series of advanced algorithms, including an improved data augmentation module, an image contamination module, and a post-processing module for getting the final SMILES output. These modules cooperate to enhance the model’s robustness to diverse styles of molecular images found in real literature. In our test sets, MolNexTR has demonstrated superior performance, achieving an accuracy rate of 81–97%, marking a significant advancement in the domain of molecular structure recognition.</p><p><b>Scientific contribution</b></p><p>MolNexTR is a novel image-to-graph model that incorporates a unique dual-stream encoder to extract complex molecular image features, and combines chemical rules to predict atoms and bonds while understanding atom and bond layout rules. In addition, it employs a series of novel augmentation algorithms to significantly enhance the robustness and performance of the model.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00926-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142841267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-10DOI: 10.1186/s13321-024-00935-9
Fabio Herrera-Rocha, Miguel Fernández-Niño, Jorge Duitama, Mónica P. Cala, María José Chica, Ludger A. Wessjohann, Mehdi D. Davari, Andrés Fernando González Barrios
Flavor is the main factor driving consumers acceptance of food products. However, tracking the biochemistry of flavor is a formidable challenge due to the complexity of food composition. Current methodologies for linking individual molecules to flavor in foods and beverages are expensive and time-consuming. Predictive models based on machine learning (ML) are emerging as an alternative to speed up this process. Nonetheless, the optimal approach to predict flavor features of molecules remains elusive. In this work we present FlavorMiner, an ML-based multilabel flavor predictor. FlavorMiner seamlessly integrates different combinations of algorithms and mathematical representations, augmented with class balance strategies to address the inherent class of the input dataset. Notably, Random Forest and K-Nearest Neighbors combined with Extended Connectivity Fingerprint and RDKit molecular descriptors consistently outperform other combinations in most cases. Resampling strategies surpass weight balance methods in mitigating bias associated with class imbalance. FlavorMiner exhibits remarkable accuracy, with an average ROC AUC score of 0.88. This algorithm was used to analyze cocoa metabolomics data, unveiling its profound potential to help extract valuable insights from intricate food metabolomics data. FlavorMiner can be used for flavor mining in any food product, drawing from a diverse training dataset that spans over 934 distinct food products.
Scientific Contribution FlavorMiner is an advanced machine learning (ML)-based tool designed to predict molecular flavor features with high accuracy and efficiency, addressing the complexity of food metabolomics. By leveraging robust algorithmic combinations paired with mathematical representations FlavorMiner achieves high predictive performance. Applied to cocoa metabolomics, FlavorMiner demonstrated its capacity to extract meaningful insights, showcasing its versatility for flavor analysis across diverse food products. This study underscores the transformative potential of ML in accelerating flavor biochemistry research, offering a scalable solution for the food and beverage industry.
{"title":"FlavorMiner: a machine learning platform for extracting molecular flavor profiles from structural data","authors":"Fabio Herrera-Rocha, Miguel Fernández-Niño, Jorge Duitama, Mónica P. Cala, María José Chica, Ludger A. Wessjohann, Mehdi D. Davari, Andrés Fernando González Barrios","doi":"10.1186/s13321-024-00935-9","DOIUrl":"10.1186/s13321-024-00935-9","url":null,"abstract":"<div><p>Flavor is the main factor driving consumers acceptance of food products. However, tracking the biochemistry of flavor is a formidable challenge due to the complexity of food composition. Current methodologies for linking individual molecules to flavor in foods and beverages are expensive and time-consuming. Predictive models based on machine learning (ML) are emerging as an alternative to speed up this process. Nonetheless, the optimal approach to predict flavor features of molecules remains elusive. In this work we present FlavorMiner, an ML-based multilabel flavor predictor. FlavorMiner seamlessly integrates different combinations of algorithms and mathematical representations, augmented with class balance strategies to address the inherent class of the input dataset. Notably, Random Forest and K-Nearest Neighbors combined with Extended Connectivity Fingerprint and RDKit molecular descriptors consistently outperform other combinations in most cases. Resampling strategies surpass weight balance methods in mitigating bias associated with class imbalance. FlavorMiner exhibits remarkable accuracy, with an average ROC AUC score of 0.88. This algorithm was used to analyze cocoa metabolomics data, unveiling its profound potential to help extract valuable insights from intricate food metabolomics data. FlavorMiner can be used for flavor mining in any food product, drawing from a diverse training dataset that spans over 934 distinct food products.</p><p><b>Scientific Contribution</b> FlavorMiner is an advanced machine learning (ML)-based tool designed to predict molecular flavor features with high accuracy and efficiency, addressing the complexity of food metabolomics. By leveraging robust algorithmic combinations paired with mathematical representations FlavorMiner achieves high predictive performance. Applied to cocoa metabolomics, FlavorMiner demonstrated its capacity to extract meaningful insights, showcasing its versatility for flavor analysis across diverse food products. This study underscores the transformative potential of ML in accelerating flavor biochemistry research, offering a scalable solution for the food and beverage industry.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00935-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142796783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-09DOI: 10.1186/s13321-024-00924-y
Yasmine Nahal, Janosch Menke, Julien Martinelli, Markus Heinonen, Mikhail Kabeshov, Jon Paul Janet, Eva Nittinger, Ola Engkvist, Samuel Kaski
Machine learning (ML) systems have enabled the modelling of quantitative structure–property relationships (QSPR) and structure-activity relationships (QSAR) using existing experimental data to predict target properties for new molecules. These property predictors hold significant potential in accelerating drug discovery by guiding generative artificial intelligence (AI) agents to explore desired chemical spaces. However, they often struggle to generalize due to the limited scope of the training data. When optimized by generative agents, this limitation can result in the generation of molecules with artificially high predicted probabilities of satisfying target properties, which subsequently fail experimental validation. To address this challenge, we propose an adaptive approach that integrates active learning (AL) and iterative feedback to refine property predictors, thereby improving the outcomes of their optimization by generative AI agents. Our method leverages the Expected Predictive Information Gain (EPIG) criterion to select additional molecules for evaluation by an oracle. This process aims to provide the greatest reduction in predictive uncertainty, enabling more accurate model evaluations of subsequently generated molecules. Recognizing the impracticality of immediate wet-lab or physics-based experiments due to time and logistical constraints, we propose leveraging human experts for their cost-effectiveness and domain knowledge to effectively augment property predictors, bridging gaps in the limited training data. Empirical evaluations through both simulated and real human-in-the-loop experiments demonstrate that our approach refines property predictors to better align with oracle assessments. Additionally, we observe improved accuracy of predicted properties as well as improved drug-likeness among the top-ranking generated molecules.
We present an adaptable framework that integrates AL and human expertise to refine property predictors for goal-oriented molecule generation. This approach is robust to noise in human feedback and ensures that navigating chemical space with human-refined predictors leverages human insights to identify molecules that not only satisfy predicted property profiles but also score highly on oracle models. Additionally, it prioritizes practical characteristics such as drug-likeness, synthetic accessibility, and a favorable balance between exploring diverse chemical space and exploiting similarity to existing training data.
{"title":"Human-in-the-loop active learning for goal-oriented molecule generation","authors":"Yasmine Nahal, Janosch Menke, Julien Martinelli, Markus Heinonen, Mikhail Kabeshov, Jon Paul Janet, Eva Nittinger, Ola Engkvist, Samuel Kaski","doi":"10.1186/s13321-024-00924-y","DOIUrl":"10.1186/s13321-024-00924-y","url":null,"abstract":"<p>Machine learning (ML) systems have enabled the modelling of quantitative structure–property relationships (QSPR) and structure-activity relationships (QSAR) using existing experimental data to predict target properties for new molecules. These property predictors hold significant potential in accelerating drug discovery by guiding generative artificial intelligence (AI) agents to explore desired chemical spaces. However, they often struggle to generalize due to the limited scope of the training data. When optimized by generative agents, this limitation can result in the generation of molecules with artificially high predicted probabilities of satisfying target properties, which subsequently fail experimental validation. To address this challenge, we propose an adaptive approach that integrates active learning (AL) and iterative feedback to refine property predictors, thereby improving the outcomes of their optimization by generative AI agents. Our method leverages the Expected Predictive Information Gain (EPIG) criterion to select additional molecules for evaluation by an oracle. This process aims to provide the greatest reduction in predictive uncertainty, enabling more accurate model evaluations of subsequently generated molecules. Recognizing the impracticality of immediate wet-lab or physics-based experiments due to time and logistical constraints, we propose leveraging human experts for their cost-effectiveness and domain knowledge to effectively augment property predictors, bridging gaps in the limited training data. Empirical evaluations through both simulated and real human-in-the-loop experiments demonstrate that our approach refines property predictors to better align with oracle assessments. Additionally, we observe improved accuracy of predicted properties as well as improved drug-likeness among the top-ranking generated molecules.</p><p>We present an adaptable framework that integrates AL and human expertise to refine property predictors for goal-oriented molecule generation. This approach is robust to noise in human feedback and ensures that navigating chemical space with human-refined predictors leverages human insights to identify molecules that not only satisfy predicted property profiles but also score highly on oracle models. Additionally, it prioritizes practical characteristics such as drug-likeness, synthetic accessibility, and a favorable balance between exploring diverse chemical space and exploiting similarity to existing training data.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00924-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142796789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-09DOI: 10.1186/s13321-024-00934-w
Igor V. Tetko, Ruud van Deursen, Guillaume Godin
Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of smiles called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons by using only a tiny fraction of time as compared to other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures.
Scientific Contribution We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performances but four orders faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.
{"title":"Be aware of overfitting by hyperparameter optimization!","authors":"Igor V. Tetko, Ruud van Deursen, Guillaume Godin","doi":"10.1186/s13321-024-00934-w","DOIUrl":"10.1186/s13321-024-00934-w","url":null,"abstract":"<div><p>Hyperparameter optimization is very frequently employed in machine learning. However, an optimization of a large space of parameters could result in overfitting of models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of smiles called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons by using only a tiny fraction of time as compared to other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures.</p><p><b>Scientific Contribution</b> We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performances but four orders faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00934-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142796786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-05DOI: 10.1186/s13321-024-00936-8
Hakjean Kim, Seongok Ryu, Nuri Jung, Jinsol Yang, Chaok Seok
The two key components of computational molecular design are virtually generating molecules and predicting the properties of these generated molecules. This study focuses on an effective method for molecular generation through virtual synthesis and global optimization of a given objective function. Using a pre-trained graph neural network (GNN) objective function to approximate the docking energies of compounds for four target receptors, we generated highly optimized compounds with 300–400 times less computational effort compared to virtual compound library screening. These optimized compounds exhibit similar synthesizability and diversity to known binders with high potency and are notably novel compared to library chemicals or known ligands. This method, called CSearch, can be effectively utilized to generate chemicals optimized for a given objective function. With the GNN function approximating docking energies, CSearch generated molecules with predicted binding poses to the target receptors similar to known inhibitors, demonstrating its effectiveness in producing drug-like binders.
Scientific Contribution We have developed a method for effectively exploring the chemical space of drug-like molecules using a global optimization algorithm with fragment-based virtual synthesis. The compounds generated using this method optimize the given objective function efficiently and are synthesizable like commercial library compounds. Furthermore, they are diverse, novel drug-like molecules with properties similar to known inhibitors for target receptors.
{"title":"CSearch: chemical space search via virtual synthesis and global optimization","authors":"Hakjean Kim, Seongok Ryu, Nuri Jung, Jinsol Yang, Chaok Seok","doi":"10.1186/s13321-024-00936-8","DOIUrl":"10.1186/s13321-024-00936-8","url":null,"abstract":"<div><p>The two key components of computational molecular design are virtually generating molecules and predicting the properties of these generated molecules. This study focuses on an effective method for molecular generation through virtual synthesis and global optimization of a given objective function. Using a pre-trained graph neural network (GNN) objective function to approximate the docking energies of compounds for four target receptors, we generated highly optimized compounds with 300–400 times less computational effort compared to virtual compound library screening. These optimized compounds exhibit similar synthesizability and diversity to known binders with high potency and are notably novel compared to library chemicals or known ligands. This method, called CSearch, can be effectively utilized to generate chemicals optimized for a given objective function. With the GNN function approximating docking energies, CSearch generated molecules with predicted binding poses to the target receptors similar to known inhibitors, demonstrating its effectiveness in producing drug-like binders.</p><p><b>Scientific Contribution</b> We have developed a method for effectively exploring the chemical space of drug-like molecules using a global optimization algorithm with fragment-based virtual synthesis. The compounds generated using this method optimize the given objective function efficiently and are synthesizable like commercial library compounds. Furthermore, they are diverse, novel drug-like molecules with properties similar to known inhibitors for target receptors.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00936-8","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142776776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-05DOI: 10.1186/s13321-024-00937-7
João Correia, João Capela, Miguel Rocha
The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite its potential to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, DeepMol stands out as an Automated ML (AutoML) tool by automating critical steps of the ML pipeline. DeepMol rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, DeepMol obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, DeepMol stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as the pioneering state-of-the-art tool in the field.
Scientific contribution
DeepMol aims to provide an integrated framework of AutoML for computational chemistry. DeepMol provides a more robust alternative to other tools with its integrated pipeline serialization, enabling seamless deployment using the fit, transform, and predict paradigms. It uniquely supports both conventional and deep learning models for regression, classification and multi-task, offering unmatched flexibility compared to other AutoML tools. DeepMol's predefined configurations and customizable objective functions make it accessible to users at all skill levels while enabling efficient and reproducible workflows. Benchmarking on diverse datasets demonstrated its ability to deliver optimized pipelines and superior performance across various molecular machine-learning tasks.
{"title":"Deepmol: an automated machine and deep learning framework for computational chemistry","authors":"João Correia, João Capela, Miguel Rocha","doi":"10.1186/s13321-024-00937-7","DOIUrl":"10.1186/s13321-024-00937-7","url":null,"abstract":"<div><p>The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite its potential to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, <i>DeepMol</i> stands out as an Automated ML (AutoML) tool by automating critical steps of the ML pipeline. <i>DeepMol</i> rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, <i>DeepMol</i> obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, <i>DeepMol</i> stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as the pioneering state-of-the-art tool in the field.</p><p><b>Scientific contribution</b></p><p><i>DeepMol</i> aims to provide an integrated framework of AutoML for computational chemistry. <i>DeepMol</i> provides a more robust alternative to other tools with its integrated pipeline serialization, enabling seamless deployment using the <i>fit</i>, <i>transform</i>, and <i>predict</i> paradigms. It uniquely supports both conventional and deep learning models for regression, classification and multi-task, offering unmatched flexibility compared to other AutoML tools. <i>DeepMol's</i> predefined configurations and customizable objective functions make it accessible to users at all skill levels while enabling efficient and reproducible workflows. Benchmarking on diverse datasets demonstrated its ability to deliver optimized pipelines and superior performance across various molecular machine-learning tasks.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00937-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142776778","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-12-03DOI: 10.1186/s13321-024-00932-y
Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris
Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
Scientific contribution
A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
{"title":"Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints","authors":"Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris","doi":"10.1186/s13321-024-00932-y","DOIUrl":"10.1186/s13321-024-00932-y","url":null,"abstract":"<div><p>Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe <i>Sort & Slice</i>, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the <i>L</i> most frequent substructures which are subsequently used to generate a binary fingerprint of desired length, <i>L</i>. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning. </p><br><p><b>Scientific contribution</b> </p><p>A general mathematical framework for the vectorisation of structural fingerprints called <i>substructure pooling</i>; and the technical description and computational evaluation of <i>Sort & Slice</i>, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00932-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142760674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}