Journal of Cheminformatics: Latest Articles

Generative artificial intelligence based models optimization towards molecule design enhancement
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-08-04 | DOI: 10.1186/s13321-025-01059-4
Tarek Khater, Sara Awni Alkhatib, Aamna AlShehhi, Charalampos Pitsalidis, Anna Maria Pappa, Son Tung Ngo, Vincent Chan, Vi Khanh Truong

Generative artificial intelligence (GenAI) models have emerged as a transformative tool for addressing the complex challenges of drug discovery, enabling the design of structurally diverse, chemically valid, and functionally relevant molecules. Despite significant advancements, the rapid expansion of GenAI applications still faces challenges related to prediction accuracy, molecular validity, and optimization for drug-like properties. This review provides a comprehensive analysis of recent techniques and strategies aimed at enhancing the performance of GenAI models in molecular design. We explore key generative architectures, including variational autoencoders, generative adversarial networks, and transformer-based models, highlighting their unique contributions to drug discovery. Additionally, we discuss critical advancements such as reinforcement learning, multi-objective optimization, and the integration of domain-specific chemical knowledge, which collectively enhance molecular validity, novelty, and drug-likeness. Also, the review examines persistent challenges, including data quality limitations, model interpretability, and the need for improved objective functions, while offering insights into future research directions. By mapping the evolving landscape of GenAI-driven molecular design and providing strategic guidance for overcoming existing limitations, this review serves as an essential resource for researchers leveraging GenAI in drug discovery.

Citations: 0
Correction: A transformer based generative chemical language AI model for structural elucidation of organic compounds
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-08-04 | DOI: 10.1186/s13321-025-01065-6
Xiaofeng Tan
Citations: 0
Neural SHAKE: geometric constraints in neural differential equations
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-08-04 | DOI: 10.1186/s13321-025-01053-w
Justin S. Diamond, Markus A. Lill

Generating accurate molecular conformations hinges on sampling effectively from a high-dimensional space of atomic arrangements, which grows exponentially with system size. To ensure physically valid geometries and increase the likelihood of reaching low-energy conformations, it is useful to incorporate prior physics-based information by recasting it as geometric constraints that naturally arise as nonlinear constraint satisfaction problems. In this work, we propose an approach to embed these strict constraints into neural differential equations, leveraging the denoising diffusion framework. By projecting the stochastic generative dynamics onto a manifold defined by constraint sets, our method enforces exact feasibility at each step, unlike alternative approaches that merely impose soft constraints through probabilistic guidance. This technique generates lower-energy molecular conformations, enables more efficient subspace exploration, and formally subsumes classifier-guidance-type methods by treating geometric constraints as strict algebraic conditions within the diffusion process.

Scientific contribution: Neural SHAKE formulates exact manifold-projected score-based diffusion: each reverse-SDE increment is orthogonally projected, via a Lagrange-multiplier solve, onto the constraint surface σₐ(x) = 0 for a = 1, …, A, with A the number of independent constraints and thus the manifold's codimension. This projection preserves global SE(3) symmetry and enforces constraints to solver tolerance. It induces a well-posed surface Fokker–Planck flow on the (3N − A)-dimensional manifold, while a coarea/Fixman Jacobian carries the ambient 3N-dimensional density to a normalized density on that manifold, preserving probability mass after the dimensionality reduction.
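
The projection step described in this abstract can be illustrated with a minimal sketch in the spirit of the classical SHAKE algorithm. It is not the authors' implementation. It assumes equal atomic masses, only pairwise distance constraints of the form σ(x) = |xᵢ − xⱼ|² − d², and a sequential sweep over constraints rather than a simultaneous Lagrange-multiplier solve. The names `project_onto_constraints` and `bond_length` are hypothetical.

```python
import math


def bond_length(points, i, j):
    """Euclidean distance between atoms i and j."""
    return math.dist(points[i], points[j])


def project_onto_constraints(points, bonds, tol=1e-10, max_iter=100):
    """Move 3D points onto the surface where every bond (i, j, d) has length d.

    Each constraint is sigma = |x_i - x_j|^2 - d^2. One Newton step along the
    constraint gradient is applied per bond, and sweeps repeat until every
    |sigma| seen at the start of a sweep is below tol (assumed tolerance).
    """
    pts = [list(p) for p in points]  # work on a copy
    for _ in range(max_iter):
        worst = 0.0
        for i, j, d in bonds:
            diff = [a - b for a, b in zip(pts[i], pts[j])]
            r2 = sum(c * c for c in diff)
            if r2 == 0.0:
                raise ValueError("coincident atoms: gradient is zero")
            sigma = r2 - d * d
            worst = max(worst, abs(sigma))
            # Gradient is +2*diff on atom i and -2*diff on atom j,
            # so its squared norm is 8 * r2.
            lam = sigma / (8.0 * r2)
            for k in range(3):
                pts[i][k] -= lam * 2.0 * diff[k]
                pts[j][k] += lam * 2.0 * diff[k]
        if worst < tol:
            break
    return pts
```

Because both atoms of a bond move by equal and opposite amounts, the centroid is unchanged, which is a simple instance of the symmetry preservation the abstract mentions. In a diffusion sampler, a function like this would be called on the positions after each unconstrained reverse step.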

Citations: 0
How to crack a SMILES: automatic crosschecked chemical structure resolution across multiple services using MoleculeResolver
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-08-04 | DOI: 10.1186/s13321-025-01064-7
Simon Müller

Accurate chemical structure resolution from textual identifiers such as names and CAS RN® is critical for computational modeling in chemistry and related fields. This paper introduces MoleculeResolver, an automated, robust Python-based tool designed to address inconsistencies and inaccuracies commonly encountered when converting chemical identifiers to canonical SMILES strings. MoleculeResolver systematically crosschecks structures retrieved from multiple reputable chemical databases, implements rigorous identifier plausibility checks, standardizes molecular structures, and intelligently selects the most accurate representation based on a unique resolution algorithm.

Scientific contribution: Benchmarks across diverse datasets confirm that MoleculeResolver significantly enhances precision, recall, and overall reliability compared to traditional single-source methods, proving its utility as a valuable resource for chemists, data scientists, and researchers engaged in high-quality molecular data analysis and predictive model development.
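
The idea of crosschecking one identifier against several services can be sketched as a simple majority vote. This is only an illustration of the general principle and does not reproduce MoleculeResolver's resolution algorithm or API. It assumes the SMILES strings have already been standardized to one canonical form (which in practice needs a cheminformatics toolkit), and the function name, service names, and `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def pick_consensus(candidates, min_agreement=2):
    """Return (smiles, supporting_services) or None if there is no clear winner.

    candidates maps a service name to the standardized SMILES it returned,
    or to None when the service found nothing.
    """
    votes = Counter(s for s in candidates.values() if s)
    if not votes:
        return None
    smiles, count = votes.most_common(1)[0]
    # Refuse to choose when support is too thin or two structures tie.
    tied = [s for s, c in votes.items() if c == count]
    if count < min_agreement or len(tied) > 1:
        return None
    supporters = sorted(name for name, s in candidates.items() if s == smiles)
    return smiles, supporters
```

Returning `None` on ties or thin support, instead of guessing, keeps a wrong structure from silently entering a dataset; callers can then flag the identifier for manual review.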

Citations: 0
Deep learning molecular interaction motifs from receptor structures alone
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-30 | DOI: 10.1186/s13321-025-01055-8
Seeun Kim, Simaek Oh, Hyeonuk Woo, Jiho Sim, Chaok Seok, Hahnbeom Park

Interactions of proteins with other molecules are often mediated by a set of critical binding motifs on their surfaces. Most traditional binder designs relied on motifs borrowed from known binder molecules, which highly restricted their applicability to novel targets or new binding sites. This work presents a deep learning network MotifGen that predicts potential binder motifs directly from receptor structures without further supporting information. MotifGen generates motif profiles at the receptor surface for 14 types of functional groups or 6 chemical interaction classes. These profiles are highly human-interpretable and can be further utilized as pre-trained embedding inputs for versatile few-shot binder design applications. We demonstrate MotifGen's effectiveness through its applications to peptide binder design and small molecule binding site prediction, where it either surpassed existing methods or added significant value when integrated. Our motif-centric approach can offer a new design strategy for novel binder discovery for challenging receptor targets.

We introduce a novel deep-learning-based computational strategy for identifying potential binding motifs for a given receptor structure. These predicted binding motifs can be applied directly to the design of various drug types, including peptides and small molecules. To demonstrate its utility, we show its application to peptide binder sequence identification and binding site prediction, both of which are key tasks in structure-based drug design.
Citations: 0
DiffER: categorical diffusion ensembles for single-step chemical retrosynthesis
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-29 | DOI: 10.1186/s13321-025-01056-7
Sean Current, Ziqi Chen, Daniel Adu-Ampratwum, Xia Ning, Srinivasan Parthasarathy

Methods for automatic chemical retrosynthesis have found recent success through the application of models traditionally built for natural language processing, primarily through transformer neural networks. These models have demonstrated significant ability to translate between the SMILES encodings of chemical products and reactants, but are constrained as a result of their autoregressive nature. We propose DiffER, an alternative template-free method for single-step retrosynthesis prediction in the form of categorical diffusion, which allows the entire output SMILES sequence to be predicted in unison. We construct an ensemble of diffusion models which achieves state-of-the-art performance for top-1 accuracy and competitive performance for top-3, top-5, and top-10 accuracy among template-free methods. We prove that DiffER is a strong baseline for a new class of template-free models and is capable of learning a variety of synthetic techniques used in laboratory settings.
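
To make "categorical diffusion" over a SMILES sequence concrete, the following sketch shows one common forward (noising) process for discrete tokens: at every step each token is independently resampled from the vocabulary with probability `beta`. The abstract does not say which transition scheme DiffER uses, so the uniform-resampling choice, the function names, and the toy vocabulary are assumptions made for illustration only. A trained model would learn to reverse this corruption for all positions at once.

```python
import random


def corrupt(tokens, vocab, beta, steps, rng):
    """Apply `steps` rounds of uniform categorical noise to a token sequence.

    In each round every position is, with probability beta, replaced by a
    token drawn uniformly from vocab (possibly the same token again).
    """
    tokens = list(tokens)
    for _ in range(steps):
        tokens = [rng.choice(vocab) if rng.random() < beta else t for t in tokens]
    return tokens


def untouched_probability(beta, steps):
    """Probability that a given position was never resampled in `steps` rounds."""
    return (1.0 - beta) ** steps
```

Because every position is noised independently, the whole sequence can be denoised in parallel, in contrast to the left-to-right decoding of an autoregressive transformer.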

Citations: 0
mineMS2: annotation of spectral libraries with exact fragmentation patterns
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-24 | DOI: 10.1186/s13321-025-01051-y
Alexis Delabrière, Coline Gianfrotta, Sylvain Dechaumet, Annelaure Damont, Thaïs Hautbergue, Pierrick Roger, Emilien L. Jamin, Olivier Puel, Christophe Junot, François Fenaille, Etienne A. Thévenot

Identification is a major challenge in metabolomics due to the large structural diversity of metabolites. Tandem mass spectrometry is a reference technology for studying the fragmentation of molecules and characterizing their structure. Recent instruments can fragment large amounts of compounds in a single acquisition. The search for similarities within a collection of MS/MS spectra is a powerful approach to facilitate the identification of new metabolites. We propose an innovative de novo strategy for searching for exact fragmentation patterns within collections of MS/MS spectra. This approach is based on (i) a new representation of spectra as graphs of m/z differences, and (ii) an efficient frequent-subgraph mining algorithm. We demonstrate both on a spectral database from standards and on acquisitions in biological matrices that these new fragmentation patterns capture similarities that are not extracted by existing methods, and facilitate the structural interpretation of molecular network components and the elucidation of unknown spectra. The mineMS2 software is publicly available as an R package (https://github.com/odisce/mineMS2).

We present an innovative strategy for structural elucidation, which extracts exact fragmentation patterns of m/z differences within collections of MS/MS spectra. The algorithms are implemented in a software library enabling efficient mining of MS/MS data and coupling to molecular networks. We show on real datasets the specific value of the patterns as fragmentation graphs for structural interpretation and de novo identification, and their complementarity to existing approaches.
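
The representation of a spectrum as a graph of m/z differences can be sketched as follows. This is a simplified illustration, not the mineMS2 R package: it is written in Python, it builds edges between all peak pairs, and it only counts in how many spectra each single difference occurs, which is the simplest special case of frequent-subgraph mining. The function names, the rounding precision, and the `min_diff` cutoff are assumptions.

```python
from collections import Counter
from itertools import combinations


def mz_difference_edges(peaks, decimals=2, min_diff=1.0):
    """Return edges (low_mz, high_mz, difference) for one spectrum.

    peaks is a list of m/z values; differences are rounded so that the same
    loss observed in different spectra gets the same label.
    """
    edges = []
    for low, high in combinations(sorted(peaks), 2):
        diff = round(high - low, decimals)
        if diff >= min_diff:
            edges.append((low, high, diff))
    return edges


def difference_support(spectra, **kwargs):
    """Count, for each m/z difference, the number of spectra containing it."""
    support = Counter()
    for peaks in spectra:
        support.update({d for _, _, d in mz_difference_edges(peaks, **kwargs)})
    return support
```

With made-up peak lists such as `[100.0, 118.01, 146.0]` and `[200.0, 218.01]`, the difference 18.01 is found in both spectra even though no absolute m/z value is shared, which is the kind of similarity that difference-based patterns are meant to capture.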

Citations: 0
HERGAI: an artificial intelligence tool for structure-based prediction of hERG inhibitors
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-24 | DOI: 10.1186/s13321-025-01063-8
Viet-Khoa Tran-Nguyen, Ulrick Fineddie Randriharimanamizara, Olivier Taboureau

The human Ether-à-go-go-Related Gene (hERG) potassium channel is crucial for repolarizing the cardiac action potential and regulating the heartbeat. Molecules that inhibit this protein can cause acquired long QT syndrome, increasing the risk of arrhythmias and sudden fatal cardiac arrests. Detecting compounds with potential hERG inhibitory activity is therefore essential to mitigate cardiotoxicity risks. In this article, we present a new hERG data set of unprecedented size, comprising nearly 300,000 molecules reported in PubChem and ChEMBL, approximately 2000 of which were confirmed hERG blockers identified through in vitro assays. Multiple structure-based artificial intelligence (AI) binary classifiers for predicting hERG inhibitors were developed, employing, as descriptors, protein–ligand extended connectivity (PLEC) fingerprints fed into random forest, extreme gradient boosting, and deep neural network (DNN) algorithms. Our best-performing model, a stacking ensemble classifier with a DNN meta-learner, achieved state-of-the-art classification performance, accurately identifying 86% of molecules having half-maximal inhibitory concentrations (IC50s) not exceeding 20 µM in our challenging test set, including 94% of hERG blockers whose IC50s were not greater than 1 µM. It also demonstrated superior screening power compared to virtual screening schemes that used existing scoring functions. This model, named “HERGAI,” along with relevant input/output data and user-friendly source code, is available in our GitHub repository (https://github.com/vktrannguyen/HERGAI) and can be used to predict drug-induced hERG blockade, even on large data sets.

We provide the largest and most complex hERG inhibition data set for AI research, integrating carefully curated experimental data from PubChem and ChEMBL. This realistic and challenging data set enables the training and evaluation of advanced models for predicting hERG blockers. We also introduce “HERGAI,” a novel stacking ensemble classifier with strong classification and screening performance, which leverages state-of-the-art machine learning and deep learning techniques and, for the first time, uses PLEC fingerprints as descriptors of hERG-bound ligand conformations.
Citations: 0
The topology of molecular representations and its influence on machine learning performance
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-21 | DOI: 10.1186/s13321-025-01045-w
Florian Rottach, Sebastian Schieferdecker, Carsten Eickhoff

Advancements in cheminformatics have led to numerous methods for encoding molecules numerically. The choice of molecular representation impacts the accuracy and generalizability of learning algorithms applied to chemical datasets. Designing and selecting the appropriate representation often lacks a systematic approach and follows computationally exhaustive empirical testing. Moreover, research has shown that deep learning models do not substantially outperform traditional approaches across many tasks with no clear explanation for this shortfall. In this work, we present TopoLearn, a model that predicts the effectiveness of representations on datasets based on the topological characteristics of the corresponding feature space. Using interpretability techniques, we find that persistent homology descriptors are linked with the error metrics of trained machine learning models, offering a new method to better understand and select molecular representations.

Scientific contribution

Our research is the first to establish an empirical connection between the topology of feature spaces and the machine learning performance of molecular representations. In addition, we facilitate future research endeavors by providing open access to our developed model.
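
As a rough illustration of what a persistent homology descriptor of a feature space can look like, the sketch below computes the 0-dimensional persistence (connected components) of a small point cloud. It is not the TopoLearn implementation: the function name `h0_persistence` and the toy `cloud` are hypothetical, and the sketch assumes a Vietoris-Rips filtration in which an edge appears at a scale equal to the Euclidean distance between its endpoints.

```python
from itertools import combinations
from math import dist


def h0_persistence(points):
    """Return the finite death scales of connected components (H0 features).

    Every point is born as its own component at scale 0. Processing edges in
    order of increasing length, a component dies whenever an edge merges two
    previously separate components (Kruskal's algorithm with union-find).
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length.
    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge the two components
            deaths.append(length)    # one component dies at this scale
    return deaths  # n - 1 values; the last surviving component never dies


# Hypothetical 2-D feature vectors: two tight pairs that are far apart.
cloud = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(cloud)
print(deaths)
```

For this toy cloud, two components die at scale 1 (each pair merges) and one dies at scale 10 (the pairs join). A large gap between death scales like this is the kind of topological signal such descriptors summarize.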

Citations: 0
Benchmarking ML in ADMET predictions: the practical impact of feature representations in ligand-based models
IF 5.7 | Zone 2 (Chemistry) | Q1 Chemistry, Multidisciplinary | Pub Date: 2025-07-21 | DOI: 10.1186/s13321-025-01041-0
Gintautas Kamuntavičius, Tanya Paquet, Orestis Bastas, Dainius Šalkauskas, Alvaro Prat, Hisham Abdel Aty, Aurimas Pabrinkis, Povilas Norvaišas, Roy Tal

This study, focusing on predicting Absorption, Distribution, Metabolism, Excretion, and Toxicology (ADMET) properties, addresses the key challenges of ML models trained using ligand-based representations. We propose a structured approach to data feature selection, taking a step beyond the conventional practice of combining different representations without systematic reasoning. Additionally, we enhance model evaluation methods by integrating cross-validation with statistical hypothesis testing, adding a layer of reliability to the model assessments. Our final evaluations include a practical scenario, where models trained on one source of data are evaluated on a different one. This approach aims to bolster the reliability of ADMET predictions, providing more dependable and informative model evaluations.

Scientific contribution

This study provided a structured approach to feature selection. We improve model evaluation by combining cross-validation with statistical hypothesis testing, making results more reliable. The methodology used in our study can be generalized beyond feature selection, boosting the confidence in selected models which is crucial in a noisy domain such as the ADMET prediction tasks. Additionally, we assess how well models trained on one dataset perform on another, offering practical insights for using external data in drug discovery.
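
One way to picture combining cross-validation with statistical hypothesis testing is sketched below. It is not the procedure from the paper, which the abstract does not specify: the sketch assumes an exact paired sign-flip permutation test, the function name `paired_sign_flip_test` is hypothetical, and the per-fold scores are made-up illustrative values. Fold scores from cross-validation are also not fully independent, so the p-value should be read as a rough guide only.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-fold scores.

    Null hypothesis: the per-fold score differences are symmetric around
    zero, i.e. neither feature representation is systematically better.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n

    extreme = 0
    total = 0
    # Enumerate every assignment of +/- signs to the differences (2**n cases).
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs))) / n
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            extreme += 1
    return extreme / total


# Made-up scores from the same 5 folds for two hypothetical representations.
fingerprint_scores = [0.81, 0.79, 0.83, 0.80, 0.82]
descriptor_scores = [0.78, 0.77, 0.80, 0.79, 0.80]

p_value = paired_sign_flip_test(fingerprint_scores, descriptor_scores)
print(p_value)  # 0.0625: only the all-plus and all-minus sign patterns are as extreme
```

With only five folds, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, which shows why more folds or repeated cross-validation are needed before a difference between representations can be called significant at the usual 0.05 level.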

Citations: 0