In chemoinformatics, chemical databases have great importance since their main objective is to store and organize the chemical structures of molecules and their properties, from basic information such as chemical structure to more complex like molecular fingerprints or other types of calculated or experimental descriptors and biological activity. However, this data can only be utilized in projects to identify novel therapeutic molecules or other fields through their correct characterization and analysis. In this Application Note, we compiled five workflows within the open-source data analytics and visualization platform KNIME that can be implemented for the chemoinformatic characterization of databases. To illustrate the application of the workflows, we used BIOFACQUIM, a compound database of natural products isolated and characterized in Mexico [1].
{"title":"KNIME Workflows for Chemoinformatic Characterization of Chemical Databases.","authors":"Carlos D Ramírez-Márquez, José L Medina-Franco","doi":"10.1002/minf.202400337","DOIUrl":"10.1002/minf.202400337","url":null,"abstract":"<p><p>In chemoinformatics, chemical databases have great importance since their main objective is to store and organize the chemical structures of molecules and their properties, from basic information such as chemical structure to more complex like molecular fingerprints or other types of calculated or experimental descriptors and biological activity. However, this data can only be utilized in projects to identify novel therapeutic molecules or other fields through their correct characterization and analysis. In this Application Note, we compiled five workflows within the open-source data analytics and visualization platform KNIME that can be implemented for the chemoinformatic characterization of databases. To illustrate the application of the workflows, we used BIOFACQUIM, a compound database of natural products isolated and characterized in Mexico [1].</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400337"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143365158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional molecular geometry searches on a potential energy surface (PES) utilize energy gradients from quantum chemical calculations. However, replacing energy calculations with noisy quantum computer measurements generates errors in the energies, which makes geometry optimization using the energy gradient difficult. One gradient-free optimization method that can potentially solve this problem is Bayesian optimization (BO). To use BO in geometry search, an acquisition function (AF), which involves an objective variable, must be defined suitably. In this study, we propose a strategy for geometry searches using BO and examine the appropriate AFs to explore two critical structures: the global minimum (GM) on the singlet ground state (S0) and the most stable conical intersection (CI) point between S0 and the singlet excited state. We applied our strategy to two molecules and located the GM and the most stable CI geometries with high accuracy for both molecules. We also succeeded in the geometry searches even when artificial random noises were added to the energies to simulate geometry optimization using noisy quantum computer measurements.
{"title":"Exploration of the Global Minimum and Conical Intersection with Bayesian Optimization.","authors":"Riho Somaki, Taichi Inagaki, Miho Hatanaka","doi":"10.1002/minf.202400041","DOIUrl":"10.1002/minf.202400041","url":null,"abstract":"<p><p>Conventional molecular geometry searches on a potential energy surface (PES) utilize energy gradients from quantum chemical calculations. However, replacing energy calculations with noisy quantum computer measurements generates errors in the energies, which makes geometry optimization using the energy gradient difficult. One gradient-free optimization method that can potentially solve this problem is Bayesian optimization (BO). To use BO in geometry search, an acquisition function (AF), which involves an objective variable, must be defined suitably. In this study, we propose a strategy for geometry searches using BO and examine the appropriate AFs to explore two critical structures: the global minimum (GM) on the singlet ground state (S<sub>0</sub>) and the most stable conical intersection (CI) point between S<sub>0</sub> and the singlet excited state. We applied our strategy to two molecules and located the GM and the most stable CI geometries with high accuracy for both molecules. We also succeeded in the geometry searches even when artificial random noises were added to the energies to simulate geometry optimization using noisy quantum computer measurements.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400041"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11781018/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143066818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01Epub Date: 2024-12-05DOI: 10.1002/minf.202400305
Gabriel Corrêa Veríssimo, Rafaela Salgado Ferreira, Vinícius Gonçalves Maltarollo
Virtual screening (VS) in drug design employs computational methodologies to systematically rank molecules from a virtual compound library based on predicted features related to their biological activities or chemical properties. The recent expansion in commercially accessible compound libraries and the advancements in artificial intelligence (AI) and computational power - including enhanced central processing units (CPUs), graphics processing units (GPUs), high-performance computing (HPC), and cloud computing - have significantly expanded our capacity to screen libraries containing over 109 molecules. Herein, we review the concept of ultra-large virtual screening (ULVS), focusing on the various algorithms and methodologies employed for virtual screening at this scale. In this context, we present the software utilized, applications, and results of different approaches, such as brute force docking, reaction-based docking approaches, machine learning (ML) strategies applied to docking or other VS methods, and similarity/pharmacophore search-based techniques. These examples represent a paradigm shift in the drug discovery process, demonstrating not only the feasibility of billion-scale compound screening but also their potential to identify hit candidates and increase the structural diversity of novel compounds with biological activities.
{"title":"Ultra-Large Virtual Screening: Definition, Recent Advances, and Challenges in Drug Design.","authors":"Gabriel Corrêa Veríssimo, Rafaela Salgado Ferreira, Vinícius Gonçalves Maltarollo","doi":"10.1002/minf.202400305","DOIUrl":"10.1002/minf.202400305","url":null,"abstract":"<p><p>Virtual screening (VS) in drug design employs computational methodologies to systematically rank molecules from a virtual compound library based on predicted features related to their biological activities or chemical properties. The recent expansion in commercially accessible compound libraries and the advancements in artificial intelligence (AI) and computational power - including enhanced central processing units (CPUs), graphics processing units (GPUs), high-performance computing (HPC), and cloud computing - have significantly expanded our capacity to screen libraries containing over 10<sup>9</sup> molecules. Herein, we review the concept of ultra-large virtual screening (ULVS), focusing on the various algorithms and methodologies employed for virtual screening at this scale. In this context, we present the software utilized, applications, and results of different approaches, such as brute force docking, reaction-based docking approaches, machine learning (ML) strategies applied to docking or other VS methods, and similarity/pharmacophore search-based techniques. These examples represent a paradigm shift in the drug discovery process, demonstrating not only the feasibility of billion-scale compound screening but also their potential to identify hit candidates and increase the structural diversity of novel compounds with biological activities.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400305"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142780630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David F Nippa, Alex T Müller, Kenneth Atz, David B Konrad, Uwe Grether, Rainer E Martin, Gisbert Schneider
Utilizing the growing wealth of chemical reaction data can boost synthesis planning and increase success rates. Yet, the effectiveness of machine learning tools for retrosynthesis planning and forward reaction prediction relies on accessible, well-curated data presented in a structured format. Although some public and licensed reaction databases exist, they often lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is understandable by both humans and machines, which facilitates seamless sharing and integration directly into machine learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.
{"title":"Simple User-Friendly Reaction Format.","authors":"David F Nippa, Alex T Müller, Kenneth Atz, David B Konrad, Uwe Grether, Rainer E Martin, Gisbert Schneider","doi":"10.1002/minf.202400361","DOIUrl":"10.1002/minf.202400361","url":null,"abstract":"<p><p>Utilizing the growing wealth of chemical reaction data can boost synthesis planning and increase success rates. Yet, the effectiveness of machine learning tools for retrosynthesis planning and forward reaction prediction relies on accessible, well-curated data presented in a structured format. Although some public and licensed reaction databases exist, they often lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is understandable by both humans and machines, which facilitates seamless sharing and integration directly into machine learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 1","pages":"e202400361"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11755691/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143024131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01Epub Date: 2024-10-10DOI: 10.1002/minf.202400186
Markus Orsi, Jean-Louis Reymond
Herein we report a virtual library of 1E+60 members, a common estimate for the total size of the drug-like chemical space. The library is obtained from 100 commercially available peptide and peptoid building blocks assembled into linear or cyclic oligomers of up to 30 units, forming molecules within the size range of peptide drugs and potentially accessible by solid-phase synthesis. We demonstrate ligand-based virtual screening (LBVS) using the peptide design genetic algorithm (PDGA), which evolves a population of 50 members to resemble a given target molecule using molecular fingerprint similarity as fitness function. Target molecules are reached in less than 10,000 generations. Like in many journeys, the value of the chemical space journey using PDGA lies not in reaching the target but in the journey itself, here by encountering non-obvious analogs. We also show that PDGA can be used to generate median molecules and analogs of non-peptide target molecules.
{"title":"Navigating a 1E+60 Chemical Space of Peptide/Peptoid Oligomers.","authors":"Markus Orsi, Jean-Louis Reymond","doi":"10.1002/minf.202400186","DOIUrl":"10.1002/minf.202400186","url":null,"abstract":"<p><p>Herein we report a virtual library of 1E+60 members, a common estimate for the total size of the drug-like chemical space. The library is obtained from 100 commercially available peptide and peptoid building blocks assembled into linear or cyclic oligomers of up to 30 units, forming molecules within the size range of peptide drugs and potentially accessible by solid-phase synthesis. We demonstrate ligand-based virtual screening (LBVS) using the peptide design genetic algorithm (PDGA), which evolves a population of 50 members to resemble a given target molecule using molecular fingerprint similarity as fitness function. Target molecules are reached in less than 10,000 generations. Like in many journeys, the value of the chemical space journey using PDGA lies not in reaching the target but in the journey itself, here by encountering non-obvious analogs. We also show that PDGA can be used to generate median molecules and analogs of non-peptide target molecules.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400186"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11733718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142400782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01Epub Date: 2024-08-06DOI: 10.1002/minf.202400154
I M Kashafutdinova, A Poyezzhayeva, T Gimadiev, T Madzhidov
During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed the simulated AL workflow based on "virtual" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed the hybrid selection strategy that unified both exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that popular minimal margin and maximal variance selection approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and 2 respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies didn't succeed at ensuring high model performance, however, they were successful in selecting molecules with desired properties using minimal number of tests. In analogous experiments in classification tasks, exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation, and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) demonstrated efficiency identifying molecules in fewer iterations compared to random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.
在药物设计的早期阶段,确定具有合适生物活性的化合物至关重要。鉴于潜在药物数据库数量庞大,因此只能对有限的候选化合物进行检测。选择候选化合物的最佳方法是主动学习(AL)方法,目的是最大限度地减少化验的总次数。在这项工作中,我们对一系列主动学习策略进行了基准测试,主要目的有两个:(1)确定一种能确保高模型性能的策略;(2)使用最少的化验选择具有所需特性的分子。为了评估不同的 AL 策略,我们采用了基于 "虚拟 "实验的模拟 AL 工作流程。这些实验利用了 ChEMBL 数据集,其中包含了已知分子的生物活性值。此外,针对分类任务,我们提出了混合选择策略,将探索和利用 AL 策略统一为一个单一的获取函数,该函数由参数 n 和 c 定义。我们还证明,用于探索选择的流行最小边际和最大方差选择方法分别对应于混合获取函数的最小化(n=1 和 2)。探索策略和开发策略之间的平衡可以通过系数(c)进行调整,从而使最优策略选择变得简单明了。混合选择方法的主要优势在于其适应性;它可以根据具体任务,通过修改贡献系数的值来灵活调整分子选择的标准。我们的分析表明,在回归任务中,AL 策略并不能成功地确保高模型性能,但却能用最少的测试次数成功地选择出具有所需特性的分子。在分类任务的类似实验中,探索策略和具有常数 c
{"title":"Active learning approaches in molecule pKi prediction.","authors":"I M Kashafutdinova, A Poyezzhayeva, T Gimadiev, T Madzhidov","doi":"10.1002/minf.202400154","DOIUrl":"10.1002/minf.202400154","url":null,"abstract":"<p><p>During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed the simulated AL workflow based on \"virtual\" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed the hybrid selection strategy that unified both exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that popular minimal margin and maximal variance selection approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and 2 respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. Our analysis revealed that, in regression tasks, AL strategies didn't succeed at ensuring high model performance, however, they were successful in selecting molecules with desired properties using minimal number of tests. In analogous experiments in classification tasks, exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation, and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) demonstrated efficiency identifying molecules in fewer iterations compared to random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400154"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141893849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent times, graph representation learning has been becoming a hot research topic which has attracted a lot of attention from researchers. Graph embeddings have diverse applications across fields such as information and social network analysis, bioinformatics and cheminformatics, natural language processing (NLP), and recommendation systems. Among the advanced deep learning (DL) based architectures used in graph representation learning, graph neural networks (GNNs) have emerged as the dominant and highly effective framework. The recent GNN-based methods have demonstrated state-of-the-art performance on complex supervised and unsupervised tasks at both the node and graph levels. In recent years, to enhance multi-view and structured graph representations, contrastive learning-based techniques have been developed, introducing models known as graph contrastive learning (GCL) models. These GCL approaches leverage unsupervised contrastive methods to capture multi-view graph representations by comparing node and graph embeddings, yielding significant improvements in both graph-level representations and task-specific applications, such as molecular embedding and classification. However, as most GCL techniques are primarily designed to focus on the explicit graph structure through GNN-based encoders, they often overlook critical topological insights that could be provided through topological data analysis (TDA). Given the promising research indicating that topological features can greatly benefit various graph learning tasks, we propose a novel topology-enhanced, multi-view graph contrastive learning model called TMGCL. Our TMGCL model is designed to capture and utilize both comprehensive multi-scale topological and global structural information from graphs. This enhanced representation capability positions TMGCL to directly support a range of applications, such as molecular classification, with improved accuracy and robustness. Extensive experiments within two real-world datasets proved the effectiveness and outperformance of our proposed TMGCL in comparing with state-of-the-art GNN/GCL-based baselines.
{"title":"A Topology-Enhanced Multi-Viewed Contrastive Approach for Molecular Graph Representation Learning and Classification.","authors":"Phu Pham","doi":"10.1002/minf.202400252","DOIUrl":"https://doi.org/10.1002/minf.202400252","url":null,"abstract":"<p><p>In recent times, graph representation learning has been becoming a hot research topic which has attracted a lot of attention from researchers. Graph embeddings have diverse applications across fields such as information and social network analysis, bioinformatics and cheminformatics, natural language processing (NLP), and recommendation systems. Among the advanced deep learning (DL) based architectures used in graph representation learning, graph neural networks (GNNs) have emerged as the dominant and highly effective framework. The recent GNN-based methods have demonstrated state-of-the-art performance on complex supervised and unsupervised tasks at both the node and graph levels. In recent years, to enhance multi-view and structured graph representations, contrastive learning-based techniques have been developed, introducing models known as graph contrastive learning (GCL) models. These GCL approaches leverage unsupervised contrastive methods to capture multi-view graph representations by comparing node and graph embeddings, yielding significant improvements in both graph-level representations and task-specific applications, such as molecular embedding and classification. However, as most GCL techniques are primarily designed to focus on the explicit graph structure through GNN-based encoders, they often overlook critical topological insights that could be provided through topological data analysis (TDA). Given the promising research indicating that topological features can greatly benefit various graph learning tasks, we propose a novel topology-enhanced, multi-view graph contrastive learning model called TMGCL. Our TMGCL model is designed to capture and utilize both comprehensive multi-scale topological and global structural information from graphs. This enhanced representation capability positions TMGCL to directly support a range of applications, such as molecular classification, with improved accuracy and robustness. Extensive experiments within two real-world datasets proved the effectiveness and outperformance of our proposed TMGCL in comparing with state-of-the-art GNN/GCL-based baselines.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 1","pages":"e202400252"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142951853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01Epub Date: 2024-11-23DOI: 10.1002/minf.202400293
Jude Y Betow, Gemma Turon, Clovis S Metuge, Simeon Akame, Vanessa A Shu, Oyere T Ebob, Miquel Duran-Frigola, Fidele Ntie-Kang
Diseases caused by viruses are challenging to contain, as their outbreak and spread could be very sudden, compounded by rapid mutations, making the development of drugs and vaccines a continued endeavour that requires fast discovery and preparedness. Targeting viral infections with small molecules remains one of the treatment options to reduce transmission and the disease burden. A lesson learned from the recent coronavirus disease (COVID-19) is to collect ready-to-screen small molecule libraries in preparation for the next viral outbreak, and potentially find a clinical candidate before it becomes a pandemic. Public availability of diverse compound libraries, well annotated in terms of chemical structures and scaffolds, modes of action, and bioactivities are, therefore, crucial to ensure the participation of academic laboratories in these screening efforts, especially in resource-limited settings where synthesis, testing and computing capacity are scarce. Here, we demonstrate a low-resource approach to populate the chemical space of naturally occurring and synthetic small molecules that have shown in vitro and/or in vivo activities against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its target proteins. We have manually curated two datasets of small molecules (naturally occurring and synthetically derived) by reading and collecting (hand-curating) the published literature. Information from the literature reveals that a majority of the reported SARS-CoV-2 compounds act by inhibiting the main protease, while 25% of the compounds currently have no known target. Scaffold analysis and principal component analysis revealed that the most common scaffolds in the datasets are quite distinct. We then expanded the initially manually curated dataset of over 1200 compounds via an ultra-large scale 2D and 3D similarity search, obtaining an expanded collection of over 150 k purchasable compounds. The spanned chemical space significantly extends beyond that of a commercially available coronavirus library of more than 20 k small molecules and constitutes a good starting collection for virtual screening campaigns given its manageable size and proximity to hand-curated compounds.
{"title":"The Chemical Space Spanned by Manually Curated Datasets of Natural and Synthetic Compounds with Activities against SARS-CoV-2.","authors":"Jude Y Betow, Gemma Turon, Clovis S Metuge, Simeon Akame, Vanessa A Shu, Oyere T Ebob, Miquel Duran-Frigola, Fidele Ntie-Kang","doi":"10.1002/minf.202400293","DOIUrl":"10.1002/minf.202400293","url":null,"abstract":"<p><p>Diseases caused by viruses are challenging to contain, as their outbreak and spread could be very sudden, compounded by rapid mutations, making the development of drugs and vaccines a continued endeavour that requires fast discovery and preparedness. Targeting viral infections with small molecules remains one of the treatment options to reduce transmission and the disease burden. A lesson learned from the recent coronavirus disease (COVID-19) is to collect ready-to-screen small molecule libraries in preparation for the next viral outbreak, and potentially find a clinical candidate before it becomes a pandemic. Public availability of diverse compound libraries, well annotated in terms of chemical structures and scaffolds, modes of action, and bioactivities are, therefore, crucial to ensure the participation of academic laboratories in these screening efforts, especially in resource-limited settings where synthesis, testing and computing capacity are scarce. Here, we demonstrate a low-resource approach to populate the chemical space of naturally occurring and synthetic small molecules that have shown in vitro and/or in vivo activities against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its target proteins. We have manually curated two datasets of small molecules (naturally occurring and synthetically derived) by reading and collecting (hand-curating) the published literature. Information from the literature reveals that a majority of the reported SARS-CoV-2 compounds act by inhibiting the main protease, while 25% of the compounds currently have no known target. Scaffold analysis and principal component analysis revealed that the most common scaffolds in the datasets are quite distinct. We then expanded the initially manually curated dataset of over 1200 compounds via an ultra-large scale 2D and 3D similarity search, obtaining an expanded collection of over 150 k purchasable compounds. The spanned chemical space significantly extends beyond that of a commercially available coronavirus library of more than 20 k small molecules and constitutes a good starting collection for virtual screening campaigns given its manageable size and proximity to hand-curated compounds.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400293"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142693295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-01-01Epub Date: 2024-10-18DOI: 10.1002/minf.202400169
Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu
In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, due to low accessibility of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has gathered growing attentions, current models neglect the quality of negative data set. Here, we proposed a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.
{"title":"GCLmf: A Novel Molecular Graph Contrastive Learning Framework Based on Hard Negatives and Application in Toxicity Prediction.","authors":"Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu","doi":"10.1002/minf.202400169","DOIUrl":"10.1002/minf.202400169","url":null,"abstract":"<p><p>In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, due to low accessibility of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has gathered growing attentions, current models neglect the quality of negative data set. Here, we proposed a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400169"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142470301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Primary carnitine deficiency (PCD) is a rare autosomal recessive genetic disorder caused by missense mutations in the SLC22A5 gene encoding the organic carnitine transporter novel type 2 (OCTN2). This study investigates the structural consequences of PCD-causing mutations, focusing on the N32S variant. Using an alpha-fold model, molecular dynamics simulations reveal altered interactions and dynamics suggesting potential mechanistic changes in carnitine transport. In addition, we identify mutation hotspots (R169, E452) across the SLC family with the major facilitator superfamily (MFS) fold. Our data demonstrates the applicability of structural modeling for linking genetic information and clinical observations and providing a rationale for the influence of disease-causing mutations on protein dynamics.
{"title":"Structural and Dynamic Assessment of Disease-Causing Mutations for the Carnitine Transporter OCTN2.","authors":"Johannes Jokiel, Marcel Bermudez","doi":"10.1002/minf.202400002","DOIUrl":"10.1002/minf.202400002","url":null,"abstract":"<p><p>Primary carnitine deficiency (PCD) is a rare autosomal recessive genetic disorder caused by missense mutations in the SLC22A5 gene encoding the organic carnitine transporter novel type 2 (OCTN2). This study investigates the structural consequences of PCD-causing mutations, focusing on the N32S variant. Using an alpha-fold model, molecular dynamics simulations reveal altered interactions and dynamics suggesting potential mechanistic changes in carnitine transport. In addition, we identify mutation hotspots (R169, E452) across the SLC family with the major facilitator superfamily (MFS) fold. Our data demonstrates the applicability of structural modeling for linking genetic information and clinical observations and providing a rationale for the influence of disease-causing mutations on protein dynamics.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 1","pages":"e202400002"},"PeriodicalIF":3.1,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11733719/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143056046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}