Anastasia Rudik, Leonid Stolbov, Alexey Lagunin, Dmitry Filimonov, Vladimir Poroikov
The pharmacokinetic profile of a potential drug is largely determined by its metabolic stability, which reflects its susceptibility to biotransformation. Metabolic stability data allow one to assess the therapeutic value of a compound and its toxicological risk. This assessment relies primarily on pharmacokinetic parameters, particularly half-life (t1/2) and clearance (CL), which are typically determined using in vitro systems including hepatocytes and liver microsomal fractions. Using the publicly available ChEMBL v. 35 and PubChem databases, we collected over 8000 chemical compounds with experimental intrinsic CL and/or half-life data from liver microsome assays obtained in mice, rats, and humans. Different thresholds were applied to differentiate the stable and unstable molecules. The Naive Bayesian classifier with MNA (Multilevel Neighborhoods of Atoms) descriptors and Self-Consistent Extreme Classifier (SCEC) with QNA (Quantitative Neighborhoods of Atoms) descriptors were used for creating classification models. The accuracy (AUC) of most classification models exceeded 0.85. Self-Consistent Regression was used to create quantitative models. The coefficient of determination of the regression models varied from 0.35 (rat, t1/2) to 0.7 (human, CLint). These models were integrated into the freely available web application MetaStab-Analyzer, which provides a unique combination of qualitative (stable/unstable/moderate) and quantitative predictions for three species. A key feature of the application is that it provides numerical metrics for each prediction, which increases interpretability. This combination of innovative algorithms (SCR and SCEC), dual qualitative-quantitative assessment, and a user-friendly interface is not available in any existing tool. MetaStab-Analyzer is freely available at https://www.way2drug.com/metastab/.
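To make the relation between the two measured quantities concrete, the sketch below converts a microsomal half-life into an intrinsic clearance using the standard in vitro first-order relation, CLint = (ln 2 / t1/2) × (incubation volume / microsomal protein), and then bins a compound as stable, moderate, or unstable. The abstract does not state the thresholds or the assay conditions, so the protein concentration and both cutoffs here are purely hypothetical placeholders, and the function names are illustrative only.

```python
import math


def clint_from_half_life(t_half_min, protein_mg_per_ml=0.5):
    """Intrinsic clearance (uL/min/mg protein) from a half-life in minutes.

    protein_mg_per_ml is an assumed incubation condition, not a value
    taken from the abstract.
    """
    if t_half_min <= 0 or protein_mg_per_ml <= 0:
        raise ValueError("half-life and protein concentration must be positive")
    k = math.log(2) / t_half_min  # first-order depletion rate constant, 1/min
    # 1000 uL of incubation per mL, divided by mg protein per mL
    return k * 1000.0 / protein_mg_per_ml


def classify_stability(t_half_min, unstable_below=30.0, stable_above=60.0):
    """Three-way label from half-life; both cutoffs are hypothetical."""
    if t_half_min < unstable_below:
        return "unstable"
    if t_half_min > stable_above:
        return "stable"
    return "moderate"


if __name__ == "__main__":
    for t_half in (10.0, 45.0, 120.0):
        print(t_half, round(clint_from_half_life(t_half), 1),
              classify_stability(t_half))
```

Because half-life and CLint are inversely related under this model, a threshold on one implies a threshold on the other only when the incubation conditions are fixed, which is one reason why data pooled from different assays may need separate cutoffs.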
{"title":"MetaStab-Analyzer: Classification and Regression Models for Metabolic Stability Prediction.","authors":"Anastasia Rudik, Leonid Stolbov, Alexey Lagunin, Dmitry Filimonov, Vladimir Poroikov","doi":"10.1002/minf.70018","DOIUrl":"https://doi.org/10.1002/minf.70018","url":null,"abstract":"<p><p>The pharmacokinetic profile of a potential drug is largely determined by its metabolic stability, which reflects its susceptibility to biotransformation. Metabolic stability data allow one to assess the therapeutic value of a compound and its toxicological risk. This assessment relies primarily on pharmacokinetic parameters, particularly half-life (t<sub>1/2</sub>) and clearance (CL), which are typically determined using in vitro systems including hepatocytes and liver microsomal fractions. Using the publicly available ChEMBL v. 35 and PubChem databases, we collected over 8000 chemical compounds with experimental intrinsic CL and/or half-life data from liver microsome assays obtained in mice, rats, and humans. Different thresholds were applied to differentiate the stable and unstable molecules. The Naive Bayesian classifier with MNA (Multilevel Neighborhoods of Atoms) descriptors and Self-Consistent Extreme Classifier (SCEC) with QNA (Quantitative Neighborhoods of Atoms) descriptors were used for creating classification models. The accuracy (AUC) of most classification models exceeded 0.85. Self-Consistent Regression was used to create quantitative models. The coefficient of determination of the regression models varied from 0.35 (rat, t<sub>1/2</sub>) to 0.7 (human, CL<sub>int</sub>). These models were integrated into the freely available web application MetaStab-Analyzer, which provides a unique combination of qualitative (stable/unstable/moderate) and quantitative predictions for three species. A key feature of the application is that it provides numerical metrics for each prediction, which increases interpretability. 
This combination of innovative algorithms (SCR and SCEC), dual qualitative-quantitative assessment, and a user-friendly interface is not available in any existing tool. MetaStab-Analyzer is freely available at https://www.way2drug.com/metastab/.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"45 2","pages":"e70018"},"PeriodicalIF":3.1,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146119507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The French Chemoinformatics Society (SFCi) is a learned society that unites French academics, students and industrial scientists in chemoinformatics, promoting this discipline at the interface of chemistry, computer science, data science and AI through conferences, training initiatives, networking and community-building.
{"title":"11<sup>th</sup> National Conference of the French Chemoinformatics Society (SFCi).","authors":"Alban Lepailleur, Ronan Bureau","doi":"10.1002/minf.70017","DOIUrl":"10.1002/minf.70017","url":null,"abstract":"<p><p>The French Chemoinformatics Society (SFCi) is a learned society that unites French academics, students and industrial scientists in chemoinformatics, promoting this discipline at the interface of chemistry, computer science, data science and AI through conferences, training initiatives, networking and community-building.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"45 1","pages":"e70017"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146065224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chemical language models (CLMs), particularly encoder-decoder transformers, have advanced generative molecular design. Transformer CLMs are able to learn a variety of molecular mappings for compound design that can be conditioned using context-dependent rules. However, their black-box nature complicates the interpretation of predictions. Current analysis methods mostly focus on attention weights of token relationships or attention flow in encoder and decoder modules and cannot explain predictions at the molecular level. Sequence-based compound design was used as a model system to investigate transformer learning characteristics through systematic control calculations involving modifications of protein sequences and sequence-compound pairs. The analysis revealed that compound reproducibility depended on similarity relationships between training and test data and on compound memorization, while specific sequence information was not learned. These findings indicate that predictions of transformer CLMs are driven by memorization effects and statistical correlations rather than by learning specific chemical or biological information. Understanding this learning behavior aids in avoiding over-interpretation of model outputs and informs the appropriate application of transformer-based CLMs in molecular design.
{"title":"Transformer Learning in Sequence-Based Drug Design Depends on Compound Memorization and Similarity of Sequence-Compound Pairs.","authors":"Jürgen Bajorath","doi":"10.1002/minf.70016","DOIUrl":"10.1002/minf.70016","url":null,"abstract":"<p><p>Chemical language models (CLMs), particularly encoder-decoder transformers, have advanced generative molecular design. Transformer CLMs are able to learn a variety of molecular mappings for compound design that can be conditioned using context-dependent rules. However, their black-box nature complicates the interpretation of predictions. Current analysis methods mostly focus on attention weights of token relationships or attention flow in encoder and decoder modules and cannot explain predictions at the molecular level. Sequence-based compound design was used as a model system to investigate transformer learning characteristics through systematic control calculations involving modifications of protein sequences and sequence-compound pairs. The analysis revealed that compound reproducibility depended on similarity relationships between training and test data and on compound memorization, while specific sequence information was not learned. These findings indicate that predictions of transformer CLMs are driven by memorization effects and statistical correlations rather than by learning specific chemical or biological information. 
Understanding this learning behavior aids in avoiding over-interpretation of model outputs and informs the appropriate application of transformer-based CLMs in molecular design.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"45 1","pages":"e70016"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12782052/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145934112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Staphylococcus aureus is a bacterium classified among the ESKAPE pathogens, which are anticipated to pose a significant global health emergency in the coming decades. The FabI enzyme, present in both Gram-positive and Gram-negative bacteria, is a key enzyme involved in fatty acid synthesis II (FAS-II). In this study, we utilized transformation rules to expand the chemical space from the most potent S. aureus FabI inhibitors. Three newly generated focused libraries, named INDDS, DIADS, and PYRDS, encompassed 172,026 compounds. These compounds were ranked based on structural similarity and predicted pIC50 values obtained from machine learning models. This approach allowed us to prioritize compounds in each focused library targeting S. aureus FabI. We analyzed the pharmacological properties and chemical space diversity of the S. aureus FabI inhibitors to gather relevant insights and support the prioritization of compounds for further study. The three newly generated libraries are freely available at https://github.com/DIFACQUIM/S.aureus_inhibitors.
{"title":"Structure-Activity Relationships and Design of Focused Libraries Tailored for Staphylococcus Aureus Inhibition.","authors":"Alberto Marbán-González, José L Medina-Franco","doi":"10.1002/minf.70015","DOIUrl":"10.1002/minf.70015","url":null,"abstract":"<p><p>Staphylococcus aureus is a bacterium classified among the ESKAPE pathogens, which are anticipated to pose a significant global health emergency in the coming decades. The FabI enzyme, present in both Gram-positive and Gram-negative bacteria, is a key enzyme involved in fatty acid synthesis II (FAS-II). In this study, we utilized transformation rules to expand the chemical space from the most potent S. aureus FabI inhibitors. Three newly generated focused libraries, named INDDS, DIADS, and PYRDS, encompassed 172,026 compounds. These compounds were ranked based on structural similarity and predicted pIC<sub>50</sub> values obtained from machine learning models. This approach allowed us to prioritize compounds in each focused library targeting S. aureus FabI. We analyzed the pharmacological properties and chemical space diversity of the S. aureus FabI inhibitors to gather relevant insights and support the prioritization of compounds for further study. The three newly generated libraries are freely available at https://github.com/DIFACQUIM/S.aureus_inhibitors.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 11-12","pages":"e70015"},"PeriodicalIF":3.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694758/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145724727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alejandro Gómez-García, Martin J Lavecchia, Dionisio A Olmedo, Pablo N Solís, José L Medina-Franco
For more than 5 years, several countries in Latin America have been developing and updating compound databases of natural products (NPs) isolated and characterized in those countries. In parallel, multiple research groups have been collaborating and assembling a unified Latin American Natural Product Database (LANaPDB), an open-access compound collection representative of Latin America, a geographical region that stands out for the vastness and richness of its NP resources. Herein, we report a significant update of LANaPDB, which gathers NPs from eight countries. Major updates to the database include adding 1,164 new compounds obtained from NaturAr, an NP collection from Argentina published in 2025, and 132 new compounds from Panama. The updated LANaPDB has 14,742 nonduplicate compounds. Moreover, a comprehensive evaluation of 41 ADMET (absorption, distribution, metabolism, excretion, and toxicity)-related parameters was carried out for LANaPDB, and the results were compared with one of the largest NP databases, the Universal Natural Product Database, and the approved small-molecule drugs. The results indicated that the three databases have a very similar ADMET profile. In addition, most of the LANaPDB compounds presented high bioavailability, volume of distribution, plasma protein binding rate, blood-brain barrier penetration, susceptibility to CYP3A4, and a half-life of less than 12 h. Moreover, most of the LANaPDB compounds were predicted with a low probability of inducing toxicity-related reactions. The third version of LANaPDB and the codes for the curation and determination of 41 ADMET-related parameters are freely available at https://doi.org/10.5281/zenodo.15595030. The code is general and can be used to analyze other compound libraries.
{"title":"Update and ADMET Profile of the Latin American Natural Product Database: LANaPDB.","authors":"Alejandro Gómez-García, Martin J Lavecchia, Dionisio A Olmedo, Pablo N Solís, José L Medina-Franco","doi":"10.1002/minf.70013","DOIUrl":"10.1002/minf.70013","url":null,"abstract":"<p><p>For more than 5 years, several countries in Latin America have been developing and updating compound databases of natural products (NPs) isolated and characterized in those countries. In parallel, multiple research groups have been collaborating and assembling a unified Latin American Natural Product Database (LANaPDB), an open-access compound collection representative of Latin America, a geographical region that stands out for the vastness and richness of its NP resources. Herein, we report a significant update of LANaPDB, which gathers NPs from eight countries. Major updates to the database include adding 1,164 new compounds obtained from NaturAr, an NP collection from Argentina published in 2025, and 132 new compounds from Panama. The updated LANaPDB has 14,742 nonduplicate compounds. Moreover, a comprehensive evaluation of 41 ADMET (absorption, distribution, metabolism, excretion, and toxicity)-related parameters was carried out for LANaPDB, and the results were compared with one of the largest NP databases, the Universal Natural Product Database, and the approved small-molecule drugs. The results indicated that the three databases have a very similar ADMET profile. In addition, most of the LANaPDB compounds presented high bioavailability, volume of distribution, plasma protein binding rate, blood-brain barrier penetration, susceptibility to CYP3A4, and a half-life of less than 12 h. Moreover, most of the LANaPDB compounds were predicted with a low probability of inducing toxicity-related reactions. The third version of LANaPDB and the codes for the curation and determination of 41 ADMET-related parameters are freely available at https://doi.org/10.5281/zenodo.15595030. 
The code is general and can be used to analyze other compound libraries.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 11-12","pages":"e70013"},"PeriodicalIF":3.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694011/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145715146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of (Ultra)Big Chemical Spaces.","authors":"José L Medina-Franco","doi":"10.1002/minf.70012","DOIUrl":"https://doi.org/10.1002/minf.70012","url":null,"abstract":"","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 10","pages":"e70012"},"PeriodicalIF":3.1,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145588154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Docking is a structure-based cheminformatics tool broadly employed in early drug discovery. Based on the three-dimensional structure of the protein target, docking is used to predict the binding interactions between the protein and a ligand, estimate the corresponding binding affinity, or perform virtual screenings (VSs) to identify new active compounds. This study introduces the ligand B-factor index (LBI), a novel computational metric for prioritizing protein-ligand complexes for docking. Unlike other metrics, LBI directly compares atomic displacements in the ligand and binding site. LBI is defined as the ratio of the median atomic B-factor of the binding site to that of the bound ligand. Using the comparative assessment of scoring functions (CASF-2016) dataset, we evaluated the effectiveness of LBI in guiding the selection of protein-ligand complexes to enhance docking performance. Our results show a moderate correlation (Spearman ρ ~ 0.48) between LBI and the experimental binding affinities, outperforming several docking scoring functions. Additionally, LBI correlates with improved redocking success (root mean square deviation < 2 Å), underscoring the significance of a ligand-focused metric. While LBI outperforms other metrics such as the protein B-factor index and resolution, its utility in VS docking remains to be further investigated. LBI is easy to compute, interpretable, applicable in structure-based cheminformatics, and freely available for calculation at https://chembioinf.ro/tool-bi-computing.html.
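The definition of LBI given in the abstract (median binding-site B-factor divided by median ligand B-factor) can be illustrated with a few lines of Python. This is only a minimal sketch: it assumes the per-atom B-factors have already been extracted from the structure file, and it leaves open how binding-site atoms are selected (for example, by a distance cutoff around the ligand), since the abstract does not specify that. The function name and the example values are hypothetical.

```python
from statistics import median


def ligand_b_factor_index(site_b_factors, ligand_b_factors):
    """Ratio of the median binding-site B-factor to the median ligand B-factor."""
    if not site_b_factors or not ligand_b_factors:
        raise ValueError("both atom sets must contain at least one B-factor")
    ligand_median = median(ligand_b_factors)
    if ligand_median <= 0:
        raise ValueError("median ligand B-factor must be positive")
    return median(site_b_factors) / ligand_median


if __name__ == "__main__":
    # Made-up B-factors (in square angstroms) for illustration only.
    site = [18.2, 21.5, 19.9, 25.0, 22.3]
    ligand = [30.1, 28.4, 35.6, 31.0]
    print(round(ligand_b_factor_index(site, ligand), 3))
```

Under this definition, a ligand whose atoms are as well ordered as the surrounding pocket gives a value near 1, while a ligand that is more mobile than the pocket gives a value below 1. Using medians rather than means keeps a few poorly resolved atoms from dominating the ratio.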
{"title":"Ligand B-Factor Index: A Metric for Prioritizing Protein-Ligand Complexes in Docking.","authors":"Liliana Halip, Cristian Neanu, Sorin Avram","doi":"10.1002/minf.70010","DOIUrl":"10.1002/minf.70010","url":null,"abstract":"<p><p>Docking is a structure-based cheminformatics tool broadly employed in early drug discovery. Based on the three-dimensional structure of the protein target, docking is used to predict the binding interactions between the protein and a ligand, estimate the corresponding binding affinity, or perform virtual screenings (VSs) to identify new active compounds. This study introduces the ligand B-factor index (LBI), a novel computational metric for prioritizing protein-ligand complexes for docking. Unlike other metrics, LBI directly compares atomic displacements in the ligand and binding site. LBI is defined as the ratio of the median atomic B-factor of the binding site to that of the bound ligand. Using the comparative assessment of scoring functions (CASF-2016) dataset, we evaluated the effectiveness of LBI in guiding the selection of protein-ligand complexes to enhance docking performance. Our results show a moderate correlation (Spearman ρ ~ 0.48) between LBI and the experimental binding affinities, outperforming several docking scoring functions. Additionally, LBI correlates with improved redocking success (root mean square deviation < 2 Å), underscoring the significance of a ligand-focused metric. While LBI outperforms other metrics such as the protein B-factor index and resolution, its utility in VS docking remains to be further investigated. 
LBI is easy to compute, interpretable, applicable in structure-based cheminformatics, and freely available for calculation at https://chembioinf.ro/tool-bi-computing.html.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 9","pages":"e202500127"},"PeriodicalIF":3.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12423484/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145033654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Omer Kaspi, Yaniv Y Avissar, Arnon Grafit, Ron Chibel, Olga Girshevitz, Hanoch Senderowitz
Ignition cases involving arsons are typically handled by forensic experts who examine spectra of samples collected from scenes of fire to test for the existence or absence of ignitable liquids. This is tedious work, since many cases do not involve such liquids. To facilitate this process, we have developed in this work a Machine Learning (ML)-based workflow for samples' classification based on their gas chromatography (GC) chromatograms (i.e., spectra). To this end, annotated spectra of 181 samples containing three groups of liquids (petroleum distillates, gasoline, and an assortment of other substances) collected from fire scenes as well as two reference databases were obtained from the Israeli Department of Identification and Forensic Sciences (DIFS). These spectra were used for the derivation of ML-based classification models using three algorithms, namely, kNN, representative spectrum, and random forest (RF) giving rise to reliable predictions. To increase the size of the dataset to a level that would enable the usage of more advanced ML algorithms, we have used the experimental spectra to develop a new spectra synthesis algorithm and utilized it to generate a large dataset of synthetic spectra. This dataset was used for the derivation of new kNN, RF, and representative spectrum models as well as deep learning (DL) models producing F1-scores over an independent test set composed entirely of "real" spectra ranging from 0.74-0.95, 0.86-0.95, 0.30-0.75, and 0.85-0.96 for kNN, RF, representative spectrum, and DL, respectively. Following the completion of the work, a second set of real spectra was provided to us by DIFS, and modeling it with the second set of models yielded F1-scores ranging from 0.92-0.96, 0.96-1.00, 0.71-0.82, and 0.95-0.98 for kNN, RF, representative spectrum, and DL, respectively. These results therefore suggest that for this dataset, performances depend more on the size of the dataset used for model training than on the ML algorithm. 
We propose that the workflow and spectra synthesis algorithm developed in this work could be readily applied to other forensic domains where samples are characterized by spectra, either solely or in combination with other parameters.
{"title":"Machine Learning-Based Identification of Petroleum Distillates and Gasoline Traces Using Measured and Synthetic GC Spectra from Collected Samples.","authors":"Omer Kaspi, Yaniv Y Avissar, Arnon Grafit, Ron Chibel, Olga Girshevitz, Hanoch Senderowitz","doi":"10.1002/minf.70008","DOIUrl":"https://doi.org/10.1002/minf.70008","url":null,"abstract":"<p><p>Ignition cases involving arsons are typically handled by forensic experts who examine spectra of samples collected from scenes of fire to test for the existence or absence of ignitable liquids. This is tedious work, since many cases do not involve such liquids. To facilitate this process, we have developed in this work a Machine Learning (ML)-based workflow for samples' classification based on their gas chromatography (GC) chromatograms (i.e., spectra). To this end, annotated spectra of 181 samples containing three groups of liquids (petroleum distillates, gasoline, and an assortment of other substances) collected from fire scenes as well as two reference databases were obtained from the Israeli Department of Identification and Forensic Sciences (DIFS). These spectra were used for the derivation of ML-based classification models using three algorithms, namely, kNN, representative spectrum, and random forest (RF) giving rise to reliable predictions. To increase the size of the dataset to a level that would enable the usage of more advanced ML algorithms, we have used the experimental spectra to develop a new spectra synthesis algorithm and utilized it to generate a large dataset of synthetic spectra. This dataset was used for the derivation of new kNN, RF, and representative spectrum models as well as deep learning (DL) models producing F1-scores over an independent test set composed entirely of \"real\" spectra ranging from 0.74-0.95, 0.86-0.95, 0.30-0.75, and 0.85-0.96 for kNN, RF, representative spectrum, and DL, respectively. 
Following the completion of the work, a second set of real spectra was provided to us by DIFS, and modeling it with the second set of models yielded F1-scores ranging from 0.92-0.96, 0.96-1.00, 0.71-0.82, and 0.95-0.98 for kNN, RF, representative spectrum, and DL, respectively. These results therefore suggest that for this dataset, performances depend more on the size of the dataset used for model training than on the ML algorithm. We propose that the workflow and spectra synthesis algorithm developed in this work could be readily applied to other forensic domains where samples are characterized by spectra, either solely or in combination with other parameters.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202400371"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371388/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144961933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In computer-aided drug design, molecular generation models play a crucial role in accelerating the drug development process. Current models mainly fall into two categories: deep learning models with high performance but poor interpretability and heuristic algorithms with better interpretability but limited performance. In this study, we introduce an innovative molecular generation model, the compound construction model (CCMol), which integrates the powerful generative capabilities of the generative pretrained transformer (GPT) and the efficient optimization mechanisms of genetic algorithms (GA) to achieve effective and innovative molecular structures. Specifically, our approach uses structure-based drug design comprising both ligand and protein primary structure-based aspects. CCMol integrates GPT for initial molecular generation and GA for iterative optimization of physicochemical and biological properties. The model's reliability was validated by generating molecules targeting three critical disease-related proteins (GLP1, WRN, and JAK2). The results indicate that CCMol is on par with current advanced models on multiple indicators and performs better than the baseline model in terms of structure diversity and drug-related property indicators, demonstrating that CCMol exhibits outstanding performance in developing novel and effective candidate drug molecules, particularly suitable for expanding the chemical validity of candidate structures at the early stages of drug discovery.
{"title":"Integrating Generative Pretrained Transformer and Genetic Algorithms for Efficient and Diverse Molecular Generation.","authors":"Chengcheng Xu, Chen Zeng, Xi Yang, Yingxu Liu, Xiangzhen Ning, Lidan Zheng, Yang Liu, Qing Fan, Chao Xu, Haichun Liu, Xian Wei, Yadong Chen, Yanmin Zhang, Rui Gu","doi":"10.1002/minf.70005","DOIUrl":"https://doi.org/10.1002/minf.70005","url":null,"abstract":"<p><p>In computer-aided drug design, molecular generation models play a crucial role in accelerating the drug development process. Current models mainly fall into two categories: deep learning models with high performance but poor interpretability and heuristic algorithms with better interpretability but limited performance. In this study, we introduce an innovative molecular generation model, the compound construction model (CCMol), which integrates the powerful generative capabilities of the generative pretrained transformer (GPT) and the efficient optimization mechanisms of genetic algorithms (GA) to achieve effective and innovative molecular structures. Specifically, our approach uses structure-based drug design comprising both ligand and protein primary structure-based aspects. CCMol integrates GPT for initial molecular generation and GA for iterative optimization of physicochemical and biological properties. The model's reliability was validated by generating molecules targeting three critical disease-related proteins (GLP1, WRN, and JAK2). 
The results indicate that CCMol is on par with current advanced models on multiple indicators and performs better than the baseline model in terms of structure diversity and drug-related property indicators, demonstrating that CCMol exhibits outstanding performance in developing novel and effective candidate drug molecules, particularly suitable for expanding the chemical validity of candidate structures at the early stages of drug discovery.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202500094"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144784859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Esteban Bertsch-Aguilar, Antonio Piedra, Daniel Acuña, Sebastián Suñer, Sylvana Pinheiro, William J Zamora
Lipophilicity is a fundamental physicochemical property widely used to evaluate key parameters in drug design, materials science, and food engineering. It plays a critical role in predicting membrane permeability, absorption, and distribution of compounds. Moreover, lipophilicity is commonly integrated into scoring functions to model biomolecular interactions and serves as an important molecular descriptor in machine learning models for property prediction and compound classification. Selecting the appropriate pH-dependent lipophilicity ($\log D_{\mathrm{pH}}$) model is important to ensure accurate predictions. Incorporating the ion apparent partition coefficient ($P_{\mathrm{I}}^{\mathrm{app}}$) into predictions of pH-dependent lipophilicity profiles can be essential for accurately reproducing experimental results. In accordance with the FAIR principles (findable, accessible, interoperable, and reusable data) for improving data management and sharing, we introduce LiProS, a FAIR workflow that is easily accessible through a Google Colab notebook. LiProS assists researchers in efficiently determining the appropriate pH-dependent lipophilicity profile based on the SMILES code of their molecules of interest. In addition, LiProS demonstrated its utility in the analysis of ionizable compounds within the NAPRORE-CR natural products database, enabling the identification of the most appropriate lipophilicity formalism tailored to the physicochemical characteristics of these compounds.
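To illustrate why the ion apparent partition coefficient can matter, the following is a minimal sketch of the standard mass-balance expression for a monoprotic compound, in which the distribution coefficient is the fraction-weighted sum of the neutral and ionic partition coefficients. It is not the LiProS implementation; the function name `log_d` and its parameters are hypothetical and chosen only for this example.

```python
import math


def log_d(log_p_neutral, pka, ph, log_p_ion=None, acid=True):
    """Illustrative log D at a given pH for a monoprotic acid or base.

    Hypothetical helper, not part of LiProS. If log_p_ion is None the
    ionized species is assumed not to partition at all, which gives the
    classic expression log D = log P_N - log(1 + 10**delta).
    """
    # delta = pH - pKa for acids, pKa - pH for bases
    delta = (ph - pka) if acid else (pka - ph)
    ratio = 10.0 ** delta  # concentration ratio [ionized] / [neutral]
    p_neutral = 10.0 ** log_p_neutral
    p_ion = 0.0 if log_p_ion is None else 10.0 ** log_p_ion
    # D = f_N * P_N + f_I * P_I, with f_N = 1 / (1 + ratio), f_I = ratio / (1 + ratio)
    return math.log10((p_neutral + p_ion * ratio) / (1.0 + ratio))


# Hypothetical acid: log P_N = 3.0, pKa = 4.0, evaluated at pH 10
without_ion = log_d(3.0, 4.0, 10.0)
with_ion = log_d(3.0, 4.0, 10.0, log_p_ion=0.0)
print(f"log D without ion partitioning: {without_ion:.2f}")
print(f"log D with ion partitioning:    {with_ion:.2f}")
```

Far from the pKa, the two formalisms diverge by several log units for this made-up compound, which is the kind of difference that makes choosing the formalism per molecule worthwhile.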
{"title":"LiProS: Findable, Accessible, Interoperable, and Reusable Data Simulation Workflow to Predict Accurate Lipophilicity Profiles for Small Molecules.","authors":"Esteban Bertsch-Aguilar, Antonio Piedra, Daniel Acuña, Sebastián Suñer, Sylvana Pinheiro, William J Zamora","doi":"10.1002/minf.70007","DOIUrl":"https://doi.org/10.1002/minf.70007","url":null,"abstract":"<p><p>Lipophilicity is a fundamental physicochemical property widely used to evaluate key parameters in drug design, materials science, and food engineering. It plays a critical role in predicting membrane permeability, absorption, and distribution of compounds. Moreover, lipophilicity is commonly integrated into scoring functions to model biomolecular interactions and serves as an important molecular descriptor in machine learning models for property prediction and compound classification. The selection of the appropriate pH-dependent lipophilicity ( <math> <semantics><mrow><mi>log</mi> <msub><mi>D</mi> <mrow><mtext>pH</mtext></mrow> </msub> </mrow> <annotation>$$ \\mathrm{log} {D}_{pH} $$</annotation></semantics> </math> ) model is important to ensure its accuracy. The incorporation of the ion apparent partition coefficient ( <math> <semantics> <mrow><msubsup><mi>P</mi> <mi>I</mi> <mtext>app</mtext></msubsup> </mrow> <annotation>$$ {P}_{\\text{I}}^{\\text{app}}$$</annotation></semantics> </math> ) into predictions of pH-dependent lipophilicity profiles can be essential for accurately reproducing experimental results. In accordance with the principles for findable, accessible, interoperable, and reusable data to improve data management and sharing, here, we introduce LiProS, a FAIR workflow that is easily accessible through a Google Colab notebook. LiProS assists researchers in efficiently determining the appropriate pH-dependent lipophilicity profile based on the SMILES code of their molecules of interest. 
In addition, LiProS demonstrated its utility in the analysis of ionizable compounds within the NAPRORE-CR natural products database, enabling the identification of the most appropriate lipophilicity formalism tailored to the physicochemical characteristics of these compounds.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202500136"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144962005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}