Molecular Informatics: Latest Articles

MetaStab-Analyzer: Classification and Regression Models for Metabolic Stability Prediction.

IF 3.1, Region 4 (Medicine), Q3, Chemistry, Medicinal

Molecular Informatics

Pub Date : 2026-02-01 DOI: 10.1002/minf.70018

Anastasia Rudik, Leonid Stolbov, Alexey Lagunin, Dmitry Filimonov, Vladimir Poroikov

The pharmacokinetic profile of a potential drug is largely determined by its metabolic stability, which reflects its susceptibility to biotransformation. Metabolic stability data allow one to assess the therapeutic value of a compound and its toxicological risk. This assessment relies primarily on pharmacokinetic parameters, particularly half-life (t_1/2) and clearance (CL), which are typically determined using in vitro systems including hepatocytes and liver microsomal fractions. Using the publicly available ChEMBL v. 35 and PubChem databases, we collected over 8000 chemical compounds with experimental intrinsic CL and/or half-life data from liver microsome assays obtained in mice, rats, and humans. Different thresholds were applied to differentiate stable and unstable molecules. The Naive Bayesian classifier with MNA (Multilevel Neighborhoods of Atoms) descriptors and the Self-Consistent Extreme Classifier (SCEC) with QNA (Quantitative Neighborhoods of Atoms) descriptors were used to create classification models. The accuracy (AUC) of most classification models exceeded 0.85. Self-Consistent Regression (SCR) was used to create quantitative models. The coefficient of determination of the regression models varied from 0.35 (rat, t_1/2) to 0.7 (human, CL_int). These models were integrated into the freely available web application MetaStab-Analyzer, which provides a unique combination of qualitative (stable/unstable/moderate) and quantitative predictions for three species. A key feature of the application is that it provides numerical metrics for each prediction, which increases its interpretability. This combination of innovative algorithms (SCR and SCEC), dual qualitative-quantitative assessment, and a user-friendly interface is not available in any existing tool. MetaStab-Analyzer is freely available at https://www.way2drug.com/metastab/.
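
The abstract relates half-life and intrinsic clearance from microsomal assays and mentions thresholds for stable and unstable compounds. As a rough illustration only, the sketch below uses the common substrate-depletion relation CL_int = (ln 2 / t_1/2) / (protein concentration); the function names and the 30/60 min cut-offs are assumptions made for this example, not the thresholds or code used by MetaStab-Analyzer.

```python
import math


def intrinsic_clearance(half_life_min: float, protein_mg_per_ml: float) -> float:
    """Return in vitro CL_int in uL/min/mg protein from a microsomal half-life."""
    if half_life_min <= 0 or protein_mg_per_ml <= 0:
        raise ValueError("half-life and protein concentration must be positive")
    k_depletion = math.log(2) / half_life_min          # first-order rate constant, 1/min
    return k_depletion / protein_mg_per_ml * 1000.0    # mL/min/mg converted to uL/min/mg


def stability_class(half_life_min: float,
                    unstable_below: float = 30.0,
                    stable_above: float = 60.0) -> str:
    """Three-way label; both thresholds are hypothetical placeholders."""
    if half_life_min < unstable_below:
        return "unstable"
    if half_life_min > stable_above:
        return "stable"
    return "moderate"


if __name__ == "__main__":
    print(round(intrinsic_clearance(30.0, 0.5), 1), stability_class(30.0))
```

Different assays and species would call for different cut-offs, which is presumably why the authors report applying several thresholds.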

Volume 45, issue 2, article e70018.

Citations: 0

11th National Conference of the French Chemoinformatics Society (SFCi).

Pub Date : 2026-01-01 DOI: 10.1002/minf.70017

Alban Lepailleur, Ronan Bureau

The French Chemoinformatics Society (SFCi) is a learned society that unites French academics, students and industrial scientists in chemoinformatics, promoting this discipline at the interface of chemistry, computer science, data science and AI through conferences, training initiatives, networking and community-building.

Citations: 0

Transformer Learning in Sequence-Based Drug Design Depends on Compound Memorization and Similarity of Sequence-Compound Pairs.

Pub Date : 2026-01-01 DOI: 10.1002/minf.70016

Jürgen Bajorath

Chemical language models (CLMs), particularly encoder-decoder transformers, have advanced generative molecular design. Transformer CLMs are able to learn a variety of molecular mappings for compound design that can be conditioned using context-dependent rules. However, their black-box nature complicates the interpretation of predictions. Current analysis methods mostly focus on attention weights of token relationships or attention flow in encoder and decoder modules and cannot explain predictions at the molecular level. Sequence-based compound design was used as a model system to investigate transformer learning characteristics through systematic control calculations involving modifications of protein sequences and sequence-compound pairs. The analysis revealed that compound reproducibility depended on similarity relationships between training and test data and on compound memorization, while specific sequence information was not learned. These findings indicate that predictions of transformer CLMs are driven by memorization effects and statistical correlations rather than by learning specific chemical or biological information. Understanding this learning behavior aids in avoiding over-interpretation of model outputs and informs the appropriate application of transformer-based CLMs in molecular design.
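
The control calculations described above modify protein sequences and then check whether known compounds are still reproduced. A minimal sketch of that idea follows, assuming a residue-shuffling control and an exact-match reproducibility measure over canonical SMILES strings; both helper names and the choice of shuffling as the modification are assumptions for illustration, not the study's protocol.

```python
import random


def shuffle_sequence(sequence: str, seed: int = 0) -> str:
    """Return a residue-shuffled copy that keeps the amino acid composition."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)   # seeded so the control is repeatable
    return "".join(residues)


def reproducibility(generated: set[str], known: set[str]) -> float:
    """Fraction of known test compounds (as SMILES strings) that were regenerated."""
    if not known:
        return 0.0
    return len(generated & known) / len(known)


if __name__ == "__main__":
    native = "MKTAYIAKQR"
    control = shuffle_sequence(native, seed=1)
    print(native, control)
    print(reproducibility({"CCO", "CCN"}, {"CCO", "c1ccccc1"}))
```

If a model reproduces a similar fraction of known compounds from native and shuffled sequences, that is consistent with the abstract's conclusion that specific sequence information was not learned.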

Volume 45, issue 1, article e70016. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12782052/pdf/

Citations: 0

Structure-Activity Relationships and Design of Focused Libraries Tailored for Staphylococcus Aureus Inhibition.

Pub Date : 2025-11-01 DOI: 10.1002/minf.70015

Alberto Marbán-González, José L Medina-Franco

Staphylococcus aureus is a bacterium classified among the ESKAPE pathogens, which are anticipated to pose a significant global health emergency in the coming decades. The FabI enzyme, present in both Gram-positive and Gram-negative bacteria, is a key enzyme involved in fatty acid synthesis II (FAS-II). In this study, we utilized transformation rules to expand the chemical space from the most potent S. aureus FabI inhibitors. Three newly generated focused libraries, named INDDS, DIADS, and PYRDS, encompassed 172,026 compounds. These compounds were ranked based on structural similarity and predicted pIC₅₀ values obtained from machine learning models. This approach allowed us to prioritize compounds in each focused library targeting S. aureus FabI. We analyzed the pharmacological properties and chemical space diversity of the S. aureus FabI inhibitors to gather relevant insights and support the prioritization of compounds for further study. The three newly generated libraries are freely available at https://github.com/DIFACQUIM/S.aureus_inhibitors.
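
Ranking by structural similarity and predicted pIC₅₀ can be sketched as follows. This is a toy illustration under stated assumptions: fingerprints are represented as plain Python sets of feature identifiers, similarity is the Tanimoto coefficient, and candidates are sorted by predicted pIC₅₀ with similarity as a tie-breaker. The abstract does not specify how the two criteria were combined, so the ordering rule and the function names are hypothetical.

```python
import math


def pic50(ic50_molar: float) -> float:
    """Convert an IC50 in mol/L to pIC50 = -log10(IC50)."""
    return -math.log10(ic50_molar)


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_candidates(candidates, reference_features):
    """Sort (name, feature_set, predicted_pic50) tuples, best first.

    Primary key: predicted pIC50; secondary key: similarity to the reference.
    """
    scored = [(name, tanimoto(features, reference_features), potency)
              for name, features, potency in candidates]
    return sorted(scored, key=lambda row: (row[2], row[1]), reverse=True)


if __name__ == "__main__":
    library = [("a", {1, 2}, 6.5), ("b", {1, 2, 3}, 7.2), ("c", {4}, 6.5)]
    for name, similarity, potency in rank_candidates(library, {1, 2, 3}):
        print(name, round(similarity, 2), potency)
```

In practice the feature sets would come from a cheminformatics toolkit and the pIC₅₀ values from a trained model; both are stubbed here with literals.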

Citations: 0

Update and ADMET Profile of the Latin American Natural Product Database: LANaPDB.

Pub Date : 2025-11-01 DOI: 10.1002/minf.70013

Alejandro Gómez-García, Martin J Lavecchia, Dionisio A Olmedo, Pablo N Solís, José L Medina-Franco

For more than 5 years, several countries in Latin America have been developing and updating compound databases of natural products (NPs) isolated and characterized in their countries. In parallel, multiple research groups have been collaborating to assemble a unified Latin American Natural Product Database (LANaPDB), an open-access compound collection representative of Latin America, a geographical region that stands out for the vastness and richness of its NP resources. Herein, we report a significant update of LANaPDB, which gathers NPs from eight countries. Major updates to the database include adding 1,164 new compounds obtained from NaturAr, an NP collection from Argentina published in 2025, and 132 new compounds from Panama. The updated LANaPDB has 14,742 nonduplicate compounds. Moreover, a comprehensive evaluation of 41 ADMET (absorption, distribution, metabolism, excretion, and toxicity)-related parameters was carried out for LANaPDB, and the results were compared with one of the largest NP databases, the Universal Natural Product Database, and with approved small-molecule drugs. The results indicated that the three databases have very similar ADMET profiles. In addition, most of the LANaPDB compounds presented high bioavailability, volume of distribution, plasma protein binding rate, blood-brain barrier penetration, and susceptibility to CYP3A4, and a half-life of less than 12 h. Moreover, most of the LANaPDB compounds were predicted to have a low probability of inducing toxicity-related reactions. The third version of LANaPDB and the code for the curation and determination of the 41 ADMET-related parameters are freely available at https://doi.org/10.5281/zenodo.15595030. The code is general and can be used to analyze other compound libraries.

Volume 44, issue 11-12, article e70013. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694011/pdf/

Citations: 0

Exploration of (Ultra)Big Chemical Spaces.

Pub Date : 2025-10-01 DOI: 10.1002/minf.70012

José L Medina-Franco

Citations: 0

Ligand B-Factor Index: A Metric for Prioritizing Protein-Ligand Complexes in Docking.

Pub Date : 2025-09-01 DOI: 10.1002/minf.70010

Liliana Halip, Cristian Neanu, Sorin Avram

Docking is a structure-based cheminformatics tool broadly employed in early drug discovery. Based on the three-dimensional structure of the protein target, docking is used to predict the binding interactions between the protein and a ligand, estimate the corresponding binding affinity, or perform virtual screenings (VSs) to identify new active compounds. This study introduces the ligand B-factor index (LBI), a novel computational metric for prioritizing protein-ligand complexes for docking. Unlike other metrics, LBI directly compares atomic displacements in the ligand and binding site. LBI is defined as the ratio of the median atomic B-factor of the binding site to that of the bound ligand. Using the comparative assessment of scoring functions (CASF-2016) dataset, we evaluated the effectiveness of LBI in guiding the selection of protein-ligand complexes to enhance docking performance. Our results show a moderate correlation (Spearman ρ ~ 0.48) between LBI and the experimental binding affinities, outperforming several docking scoring functions. Additionally, LBI correlates with improved redocking success (root mean square deviation < 2 Å), underscoring the significance of a ligand-focused metric. While LBI outperforms other metrics such as the protein B-factor index and resolution, its utility in VS docking remains to be further investigated. LBI is easy to compute, interpretable, applicable in structure-based cheminformatics, and freely available for calculation at https://chembioinf.ro/tool-bi-computing.html.
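
The definition above (median binding-site B-factor divided by median ligand B-factor) is simple enough to sketch directly. The example assumes the two lists of atomic B-factors have already been extracted from a structure file; how binding-site atoms are selected (for instance by a distance cut-off around the ligand) is not stated in the abstract and is left outside this hypothetical helper.

```python
from statistics import median


def ligand_b_factor_index(site_b_factors, ligand_b_factors):
    """Ratio of the median binding-site B-factor to the median ligand B-factor."""
    if not site_b_factors or not ligand_b_factors:
        raise ValueError("both B-factor lists must be non-empty")
    ligand_median = median(ligand_b_factors)
    if ligand_median <= 0:
        raise ValueError("median ligand B-factor must be positive")
    return median(site_b_factors) / ligand_median


if __name__ == "__main__":
    site = [20.0, 30.0, 40.0]      # illustrative values in square angstroms
    ligand = [10.0, 15.0, 20.0]
    print(ligand_b_factor_index(site, ligand))
```

Under this definition, a value above 1 means the ligand atoms have lower median B-factors than the surrounding site atoms.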

Volume 44, issue 9, article e202500127. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12423484/pdf/

Citations: 0

Machine Learning-Based Identification of Petroleum Distillates and Gasoline Traces Using Measured and Synthetic GC Spectra from Collected Samples.

Pub Date : 2025-08-01 DOI: 10.1002/minf.70008

Omer Kaspi, Yaniv Y Avissar, Arnon Grafit, Ron Chibel, Olga Girshevitz, Hanoch Senderowitz

Ignition cases involving arson are typically handled by forensic experts who examine spectra of samples collected from fire scenes to test for the presence or absence of ignitable liquids. This is tedious work, since many cases do not involve such liquids. To facilitate this process, we have developed in this work a Machine Learning (ML)-based workflow for sample classification based on gas chromatography (GC) chromatograms (i.e., spectra). To this end, annotated spectra of 181 samples containing three groups of liquids (petroleum distillates, gasoline, and an assortment of other substances) collected from fire scenes as well as two reference databases were obtained from the Israeli Department of Identification and Forensic Sciences (DIFS). These spectra were used for the derivation of ML-based classification models using three algorithms, namely, kNN, representative spectrum, and random forest (RF), giving rise to reliable predictions. To increase the size of the dataset to a level that would enable the usage of more advanced ML algorithms, we have used the experimental spectra to develop a new spectra synthesis algorithm and utilized it to generate a large dataset of synthetic spectra. This dataset was used for the derivation of new kNN, RF, and representative spectrum models as well as deep learning (DL) models producing F1-scores over an independent test set composed entirely of "real" spectra ranging from 0.74-0.95, 0.86-0.95, 0.30-0.75, and 0.85-0.96 for kNN, RF, representative spectrum, and DL, respectively. Following the completion of the work, a second set of real spectra was provided to us by DIFS, and modeling it with the second set of models yielded F1-scores ranging from 0.92-0.96, 0.96-1.00, 0.71-0.82, and 0.95-0.98 for kNN, RF, representative spectrum, and DL, respectively. These results therefore suggest that for this dataset, performances depend more on the size of the dataset used for model training than on the ML algorithm. We propose that the workflow and spectra synthesis algorithm developed in this work could be readily applied to other forensic domains where samples are characterized by spectra, either solely or in combination with other parameters.
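
The reported F1 ranges appear to span the three sample classes. As a reminder of how a per-class F1-score is obtained, here is a small self-contained sketch using F1 = 2TP / (2TP + FP + FN); the class labels and the toy predictions are made up for illustration and are not data from the study.

```python
def f1_per_class(y_true, y_pred):
    """Return a dict mapping each class label to its F1-score."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


if __name__ == "__main__":
    truth = ["gasoline", "gasoline", "distillate", "other"]
    predicted = ["gasoline", "distillate", "distillate", "other"]
    for label, score in f1_per_class(truth, predicted).items():
        print(label, round(score, 3))
```

Reporting the minimum and maximum of these per-class values gives a range of the kind quoted in the abstract.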

Volume 44, issue 8, article e202400371. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371388/pdf/

Citations: 0

Integrating Generative Pretrained Transformer and Genetic Algorithms for Efficient and Diverse Molecular Generation.

Pub Date : 2025-08-01 DOI: 10.1002/minf.70005

Chengcheng Xu, Chen Zeng, Xi Yang, Yingxu Liu, Xiangzhen Ning, Lidan Zheng, Yang Liu, Qing Fan, Chao Xu, Haichun Liu, Xian Wei, Yadong Chen, Yanmin Zhang, Rui Gu

In computer-aided drug design, molecular generation models play a crucial role in accelerating the drug development process. Current models mainly fall into two categories: deep learning models with high performance but poor interpretability and heuristic algorithms with better interpretability but limited performance. In this study, we introduce an innovative molecular generation model, the compound construction model (CCMol), which integrates the powerful generative capabilities of the generative pretrained transformer (GPT) and the efficient optimization mechanisms of genetic algorithms (GA) to generate effective and innovative molecular structures. Specifically, our approach uses structure-based drug design comprising both ligand and protein primary structure-based aspects. CCMol integrates GPT for initial molecular generation and GA for iterative optimization of physicochemical and biological properties. The model's reliability was validated by generating molecules targeting three critical disease-related proteins (GLP1, WRN, and JAK2). The results indicate that CCMol is on par with current advanced models on multiple indicators and performs better than the baseline model in terms of structural diversity and drug-related property indicators, demonstrating that CCMol exhibits outstanding performance in developing novel and effective candidate drug molecules and is particularly suitable for expanding the chemical validity of candidate structures at the early stages of drug discovery.

Volume 44, issue 8, article e202500094.

Citations: 0

LiProS: Findable, Accessible, Interoperable, and Reusable Data Simulation Workflow to Predict Accurate Lipophilicity Profiles for Small Molecules.

IF 3.1 Zone 4 (Medicine) Q3 CHEMISTRY, MEDICINAL

Molecular Informatics

Pub Date : 2025-08-01 DOI: 10.1002/minf.70007

Esteban Bertsch-Aguilar, Antonio Piedra, Daniel Acuña, Sebastián Suñer, Sylvana Pinheiro, William J Zamora

Lipophilicity is a fundamental physicochemical property widely used to evaluate key parameters in drug design, materials science, and food engineering. It plays a critical role in predicting the membrane permeability, absorption, and distribution of compounds. Moreover, lipophilicity is commonly integrated into scoring functions to model biomolecular interactions and serves as an important molecular descriptor in machine learning models for property prediction and compound classification. Selecting the appropriate pH-dependent lipophilicity ($\log D_{\mathrm{pH}}$) model is important for accurate predictions. Incorporating the ion apparent partition coefficient ($P_{I}^{app}$) into predictions of pH-dependent lipophilicity profiles can be essential for accurately reproducing experimental results. In accordance with the FAIR (findable, accessible, interoperable, and reusable) data principles for improving data management and sharing, we introduce LiProS, a FAIR workflow that is easily accessible through a Google Colab notebook. LiProS helps researchers efficiently determine the appropriate pH-dependent lipophilicity profile from the SMILES strings of their molecules of interest. In addition, LiProS proved useful in analyzing ionizable compounds in the NAPRORE-CR natural products database, identifying the most appropriate lipophilicity formalism for the physicochemical characteristics of these compounds.
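To make the role of the ion partition term concrete, here is a minimal sketch comparing two textbook formalisms for a monoprotic compound. The abstract does not state the equations LiProS implements, so the expressions below are an assumption: the standard forms $\log D = \log P_N - \log(1 + 10^{\delta})$ (only the neutral species partitions) and $\log D = \log(P_N + P_{I}^{app} \cdot 10^{\delta}) - \log(1 + 10^{\delta})$ (the ion also partitions), with $\delta = \mathrm{pH} - \mathrm{p}K_a$ for acids and $\delta = \mathrm{p}K_a - \mathrm{pH}$ for bases. Function and parameter names are illustrative, and the numeric inputs are made-up values, not data from the paper.

```python
import math

def _delta(ph, pka, acid=True):
    """Ionization exponent: pH - pKa for acids, pKa - pH for bases."""
    return (ph - pka) if acid else (pka - ph)

def log_d_neutral_only(log_p_n, pka, ph, acid=True):
    """log D assuming only the neutral species enters the organic phase."""
    d = _delta(ph, pka, acid)
    return log_p_n - math.log10(1.0 + 10.0 ** d)

def log_d_with_ion(log_p_n, log_p_i_app, pka, ph, acid=True):
    """log D when the ionized species also partitions (apparent P_I)."""
    d = _delta(ph, pka, acid)
    p_n = 10.0 ** log_p_n        # partition coefficient, neutral species
    p_i = 10.0 ** log_p_i_app    # apparent partition coefficient, ion
    return math.log10(p_n + p_i * 10.0 ** d) - math.log10(1.0 + 10.0 ** d)

# Illustrative (made-up) monoprotic acid: log P_N = 3.0, log P_I^app = -0.5, pKa = 4.5
profile = [
    (ph,
     log_d_neutral_only(3.0, 4.5, ph),
     log_d_with_ion(3.0, -0.5, 4.5, ph))
    for ph in range(0, 15, 2)
]
for ph, neutral, with_ion in profile:
    print(f"pH {ph:2d}: neutral-only {neutral:6.2f}, with ion {with_ion:6.2f}")
```

The two curves coincide while the compound is mostly neutral and diverge once ionization dominates: the neutral-only form keeps falling with pH, whereas the form with the ion term levels off near $\log P_{I}^{app}$. That divergence is why the choice of formalism matters for ionizable compounds.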

Molecular Informatics, vol. 44, no. 8, e202500136 (2025-08-01). https://doi.org/10.1002/minf.70007

Citations: 0