Journal of Cheminformatics: Latest Articles

Structure-free drug-target affinity prediction using protein and molecule language models.
IF 8.6 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2026-01-03 | DOI: 10.1186/s13321-025-01146-6
Amir Hallaji Bidgoli, Morteza Mahdavi, Hamed Malek
Accurate prediction of drug-target affinity (DTA) is crucial for advancing drug discovery and optimizing experimental processes. Traditional DTA models often rely on handcrafted features or structural data, which can limit their generalizability and scalability. In this study, we propose a novel, sequence-centric approach for DTA prediction that leverages pretrained large language models (LLMs), namely ChemBERTa and ESM2, to encode protein and molecule sequences. These models produce semantically rich embeddings without the need for structural data. We introduce a customized Residual Inception architecture that efficiently integrates these sequence embeddings through multi-scale convolutions and residual connections, significantly improving prediction accuracy. Our method is evaluated on benchmark datasets Davis, KIBA, and BindingDB, achieving state-of-the-art performance with MSE = 0.182 and CI = 0.915 on Davis, MSE = 0.135 and CI = 0.902 on KIBA, and MSE = 0.467 and CI = 0.888 on BindingDB. These results highlight the potential of sequence-based approaches to provide scalable, accurate, and robust solutions for DTA prediction, offering valuable insights into drug-target interactions even in data-sparse settings. SCIENTIFIC CONTRIBUTION: The combination of pretrained language models and a lightweight neural architecture paves the way for more effective and adaptable DTA frameworks in real-world drug discovery applications.
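The abstract reports two metrics, mean squared error (MSE) and concordance index (CI). The following is a minimal sketch of how these are commonly computed for affinity regression; the function names and the toy values are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true affinity are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # not comparable: no true ordering to recover
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Toy values (hypothetical, for illustration only).
measured = [5.0, 6.2, 7.1, 8.4]
predicted = [5.3, 6.0, 7.5, 7.4]
mse = mean_squared_error(measured, predicted)
ci = concordance_index(measured, predicted)
print(f"MSE = {mse:.4f}, CI = {ci:.4f}")
```

Lower MSE is better, while a CI of 1.0 means every comparable pair is ranked correctly and 0.5 corresponds to random ordering.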
Citations: 0
Molecular graph-based invariant representation learning with environmental inference and subgraph generation for out-of-distribution generalization.
IF 8.6 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2026-01-02 | DOI: 10.1186/s13321-025-01142-w
Hang Zhu, Sisi Yuan, Mingjing Tang, Guifei Zhou, Zhanxuan Hu, Zhaoyang Liu, Jin Li, Jianmin Wang, Chunyan Li
Molecular representation learning (MRL) is a crucial link between machine learning and chemistry. By encoding molecules as numerical vectors, it plays a vital role in predicting molecular properties and in complex tasks such as drug discovery. While existing methods perform excellently when training and testing data come from the same distribution, their generalization ability is often insufficient under distribution shifts. Enhancing model generalization to out-of-distribution (OOD) data remains a significant challenge, as real-world molecular environments are often dynamic and uncertain. To address this issue, we propose a framework called EISG (Integrating Environmental Inference and Subgraph Generation) for molecular representation learning, aimed at improving model performance on OOD data by capturing the invariance of molecular graphs across environments. Specifically, we introduce an unsupervised environmental classification model to identify latent variables generated by different distributions and design a subgraph extractor based on information bottleneck theory to extract invariant representations from molecular graphs that are closely related to the prediction labels. By combining new learning objectives, the environmental classifier and the subgraph extractor work in tandem to help the model identify invariant graph representations in different environments, leading to more robust OOD generalization. Experimental results demonstrate that our model exhibits strong generalization capabilities across various OOD settings. Code is available on GitHub.
Citations: 0
Optimizing SMILES token sequences via trie-based refinement and transition graph filtering.
IF 8.6 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2026-01-02 | DOI: 10.1186/s13321-025-01143-9
Sridhar Radhakrishnan, Krish Mody, Arvind Venkatesh, Ananth Venkatesh
Tokenization plays a critical role in preparing SMILES strings for molecular foundation models. Poor token units can fragment chemically meaningful substructures, inflate sequence length, and hinder model learning and interpretability. Existing approaches such as SMILES Pair Encoding (SPE) and Atom Pair Encoding (APE) compress token sequences but often ignore domain-specific chemistry or fail to generalize to larger or more diverse molecules. We propose a domain-aware method for SMILES compression that combines frequency-guided substring mining using a prefix trie with an optional entropy-based refinement step using a token transition graph (TTG). On a corpus of 100,000 PubChem molecules, the Trie+TTG method reduces token sequences by more than 50% compared to APE while preserving chemically coherent substructures. The method generalizes effectively to large, out-of-distribution molecules, achieving compression rates of up to 90% with minimal sensitivity to molecule size. To assess downstream utility, we evaluate latent-space structure using unsupervised clustering and perform QSAR regression on ESOL. Trie+TTG produces more separable molecular representations and stronger predictive performance than Trie-only and APE. In addition, on peptide corpora, our method substantially outperforms SPE and the PeptideCLM tokenizer in compression and entropy metrics. These results show that combining trie-based mining with TTG refinement yields compact, stable, and chemically meaningful tokenizations suitable for modern molecular representation learning. Scientific contributions: We present a trie-based framework that compresses SMILES sequences into shorter, chemically coherent units while guaranteeing lossless reconstruction. By incorporating a token transition graph for entropy-guided refinement, our method selects contextually stable merges that improve both compression efficiency and generalization. Unlike prior approaches such as APE and SPE, our tokenizer combines frequency and context awareness, yielding more compact, interpretable, and transferable molecular representations.
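To make the idea of frequency-guided substring mining with a prefix trie concrete, here is a minimal sketch. It is not the authors' implementation: it works on raw characters (so a two-letter atom such as Cl is not treated as one unit), it omits the TTG refinement step entirely, and the names `build_trie`, `tokenize`, `max_len`, and `min_count` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of times this prefix occurred as a substring


def build_trie(smiles_list, max_len=4):
    """Insert every substring of length <= max_len and count its occurrences."""
    root = TrieNode()
    for smi in smiles_list:
        for start in range(len(smi)):
            node = root
            for ch in smi[start:start + max_len]:
                node = node.children.setdefault(ch, TrieNode())
                node.count += 1
    return root


def tokenize(smi, root, min_count=2):
    """Greedy longest match: emit the longest substring seen >= min_count times."""
    tokens, i = [], 0
    while i < len(smi):
        node, best = root, 1  # fall back to a single character
        for length, ch in enumerate(smi[i:], start=1):
            node = node.children.get(ch)
            if node is None:
                break
            if node.count >= min_count:
                best = length
        tokens.append(smi[i:i + best])
        i += best
    return tokens


# Tiny toy corpus (illustrative only).
corpus = ["CCO", "CCN", "CC(=O)O", "c1ccccc1O"]
trie = build_trie(corpus)
for smi in corpus:
    toks = tokenize(smi, trie)
    # Concatenating the tokens always restores the input, so the encoding is lossless.
    print(smi, toks, "".join(toks) == smi)
```

Because every token is a contiguous slice of the input, joining the tokens reproduces the original string exactly, which is the lossless-reconstruction property the abstract mentions.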
Citations: 0
CalVSP: a program for analyzing the molecular surface areas, volumes, and polar surface areas
IF 5.7 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-29 | DOI: 10.1186/s13321-025-01120-2
Yuzhu Li, Daiju Yang, Qingyi Shi, Weidong Zhang, Qingyan Sun

The molecular volume, surface area, and polar molecular surface area are important descriptors for characterizing and predicting the molecular properties of lead compounds. Existing computational tools for calculating these parameters often have complex workflows and are not well suited to high-throughput conditions. CalVSP is open-source software for computing molecular volume, molecular surface area, and polar surface area. The software implements a grid-based algorithm that dynamically optimizes grid spacing via quantum chemical reference data to ensure precise parameter calculations. CalVSP was tested on 9489 3D molecular structures, and the results revealed a mean absolute percentage error of 1.25% (95% CI: 1.23–1.27%) for the molecular volume and 1.33% (95% CI: 1.31–1.35%) for the molecular surface area compared with the quantum chemical data. For the molecular polar surface area calculations, the mean absolute percentage error was 4.59% (95% CI: 4.16–5.04%) across the 388 tested molecular structures. Written in the C programming language, CalVSP is a lightweight and easy-to-use tool. It can be integrated with other molecular property prediction tools to increase computational accuracy and to support large-scale molecular calculations.
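As a rough illustration of what a grid-based volume estimate looks like, the sketch below counts grid cells whose centers fall inside any atomic sphere. It is only a sketch under simple assumptions: the spacing is fixed rather than tuned against quantum chemical reference data as the abstract describes, the radius used is an arbitrary example value, and `grid_volume` is a hypothetical name, not part of CalVSP.

```python
import math


def grid_volume(atoms, spacing=0.1):
    """Estimate the volume of a union of spheres on a regular grid.

    atoms: list of (x, y, z, radius) tuples, all in angstroms.
    Returns the number of cell centers inside any sphere times the cell volume.
    """
    lo = [min(a[k] - a[3] for a in atoms) for k in range(3)]
    hi = [max(a[k] + a[3] for a in atoms) for k in range(3)]
    n = [int(math.ceil((hi[k] - lo[k]) / spacing)) for k in range(3)]
    inside = 0
    for i in range(n[0]):
        x = lo[0] + (i + 0.5) * spacing
        for j in range(n[1]):
            y = lo[1] + (j + 0.5) * spacing
            for k in range(n[2]):
                z = lo[2] + (k + 0.5) * spacing
                if any((x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= r * r
                       for ax, ay, az, r in atoms):
                    inside += 1
    return inside * spacing ** 3


# Sanity check on a single sphere, where the exact volume is known.
radius = 1.7  # example radius in angstroms (an assumption for illustration)
estimate = grid_volume([(0.0, 0.0, 0.0, radius)])
exact = 4.0 / 3.0 * math.pi * radius ** 3
print(f"grid: {estimate:.3f}  exact: {exact:.3f}")
```

Finer spacing reduces the discretization error at a cubic cost in grid points, which is the trade-off that motivates choosing the spacing carefully.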

Citations: 0
Capsule graph networks for accurate and interpretable crystalline materials property prediction.
IF 5.7 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-29 | DOI: 10.1186/s13321-025-01139-5
Eddah K Sure, Xing Wu, Quan Qian

Accurate and interpretable modeling of crystalline materials is essential for understanding structure-property relationships, which is critical for accelerating materials discovery. While recent graph neural networks (GNNs) have achieved high predictive accuracy, they often struggle to provide physical interpretability and fail to explicitly model the hierarchical and symmetrical nature of crystals. In this work, we introduce Capsule Graph Networks with E(3)-Equivariance (CGN-e3), a novel deep learning framework that integrates equivariant message passing with capsule networks to capture both geometric symmetries and hierarchical motif structures. CGN-e3 leverages E(3)-equivariant message passing to learn physically consistent features and organize them into capsule representations that can disentangle local motifs, such as polyhedral environments, and connect them to global properties. We validate the effectiveness of our framework on bandgap and formation energy prediction, as well as material classification, using Materials Project and Matbench datasets. Our model achieves an MAE of 0.054 eV/atom and 0.379 eV on formation energy and bandgap prediction, respectively, outperforming CGCNN and matching the performance of MEGNet on the same dataset, while also providing insightful interpretations of the learned capsule representations. Scientific contribution: We present the first integration of E(3)-equivariant graph neural networks with capsule networks for modeling crystalline materials. This unified architecture captures both the fundamental physical symmetries of 3D space (rotation, translation, reflection) and the intrinsic hierarchical part-whole relationships (e.g., atoms to polyhedra to extended motifs) found in crystal structures. The framework provides an unsupervised pathway for interpretable motif discovery. The dynamic routing-by-agreement mechanism identifies and aggregates structurally significant local environments, such as the TiO₆ octahedra, into higher-order graph-level capsules. This process yields human-intelligible insights by explicitly quantifying the contribution of specific structural motifs to target material properties, moving beyond "black-box" predictions. We validate our framework on key property prediction tasks and provide capsule-level interpretation of the results.

Citations: 0
The Human Omnibus of Targetable Pockets
IF 5.7 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-24 | DOI: 10.1186/s13321-025-01125-x
Kristy A. Carpenter, Russ B. Altman

Hundreds of computational methods for predicting ligand binding pockets exist, but the problem of finding druggable pockets throughout the human proteome persists. Different strategies for pocket-finding excel in different use cases. Ensemble models that leverage multiple different pocket-finding strategies can best capture diverse pockets at scale. Despite this, no publicly available human-proteome-wide datasets of pocket predictions from multiple pocket-finding methods exist. We present the Human Omnibus of Targetable Pockets (HOTPocket), a dataset of over 2.4 million predicted pockets over the entire human proteome that utilizes both experimentally-determined and computationally-predicted protein structures. We assembled this dataset by running seven diverse, established pocket-finding methods over all PDB and AlphaFold2 structures of the canonical human proteome. We created a novel pocket scoring method, hotpocketNN, which we used to filter candidate pockets and assemble the final proteome-wide dataset. Our hotpocketNN method is able to recover known ligand binding pockets, including those which are dissimilar from any pocket seen in its training set. The hotpocketNN method outperforms all constituent methods, including P2Rank and Fpocket, when assessing the precision with DCA criterion on the Astex Diverse Set and PoseBusters dataset. Additionally, hotpocketNN was able to identify recently-discovered druggable pockets on KRAS and the mu opioid receptor. We make both the HOTPocket dataset and the hotpocketNN method freely available.
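The DCA criterion mentioned above is usually defined as the distance from a predicted pocket center to the closest ligand atom, with a prediction counted as correct when that distance falls under a cutoff. The sketch below assumes the commonly used 4 Å cutoff, which the abstract does not state; the function names and coordinates are hypothetical.

```python
import math


def dca(pocket_center, ligand_atoms):
    """Distance from a predicted pocket center to the closest ligand atom."""
    return min(math.dist(pocket_center, atom) for atom in ligand_atoms)


def dca_precision(pocket_centers, ligand_atoms, threshold=4.0):
    """Fraction of predicted pockets whose center lies within threshold of the ligand.

    threshold is in angstroms; 4.0 is an assumed, commonly used value.
    """
    if not pocket_centers:
        return 0.0
    hits = sum(dca(center, ligand_atoms) <= threshold for center in pocket_centers)
    return hits / len(pocket_centers)


# Toy coordinates in angstroms (illustrative only).
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
predicted_centers = [(3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
precision = dca_precision(predicted_centers, ligand)
print(precision)
```

In this toy case the first center is 1.5 Å from the nearest ligand atom and the second is 8.5 Å away, so one of the two predictions passes the cutoff.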

Citations: 0
CaliciBoost: Performance-driven evaluation of molecular representations for caco-2 permeability prediction
IF 5.7 | Region 2 (Chemistry) | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2025-12-22 | DOI: 10.1186/s13321-025-01137-7
Huong Van Le, Weibin Ren, Junhong Kim, Yukyung Yun, Young Bin Park, Young Jun Kim, Bok Kyung Han, Inho Choi, Jong-Il Park, Hwi-yeol Yun, Jae-Mun Choi

Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types, including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings, combined with automated machine learning techniques to predict Caco-2 permeability. We evaluated model performance across various molecular representations using two datasets differing in scale and chemical diversity, namely the TDC benchmark and curated OCHEM data. Among the tested fingerprints and descriptors, PaDEL, Mordred, and RDKit emerged as particularly effective for predicting Caco-2 permeability. Notably, our model CaliciBoost, identified through training optimization, achieved the lowest MAE and secured the top position on the TDC Caco-2 Leaderboard. Furthermore, for both PaDEL and Mordred on the TDC data, incorporating 3D descriptors seems to lead to improvements over using 2D features alone, as supported by feature importance analyses. These findings highlight the effectiveness of automated machine learning approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.

This work provides a systematic benchmarking of eight molecular feature representation types in conjunction with AutoML for Caco-2 permeability prediction. It highlights the critical role of 3D descriptors in enhancing predictive accuracy and establishes a PaDEL-based AutoML model that achieves top-ranked performance on a public leaderboard. The study also emphasizes the value of interpretable feature selection (via SHAP and permutation importance), offering insights into feature contributions and generalizable modeling strategies for cheminformatics applications.
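As an illustration of the permutation importance idea mentioned in the abstract, the following is a minimal sketch and not the authors' pipeline. It assumes a generic `predict` callable and a small in-memory feature table. The function names `mae` and `permutation_importance` are hypothetical helpers introduced here. The importance of a feature is estimated as the average increase in MAE after that feature's column is shuffled.

```python
import random


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average MAE increase per feature when that feature's column is shuffled.

    predict: callable mapping one feature row (a list) to a prediction.
    X: list of feature rows (lists of equal length); y: list of targets.
    This is an illustrative helper, not a library API.
    """
    rng = random.Random(seed)
    baseline = mae(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            shuffled_error = mae(y, [predict(row) for row in X_perm])
            increases.append(shuffled_error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy usage: the model reads only feature 0, so feature 1 has zero importance.
X_demo = [[1.0, 5.0], [2.0, 3.0], [3.0, 9.0], [4.0, 1.0]]
y_demo = [1.0, 2.0, 3.0, 4.0]
imp_demo = permutation_importance(lambda row: row[0], X_demo, y_demo)
```

A feature the model never reads scores exactly zero, while a feature the model depends on tends to score above zero. In practice the importance would be computed on held-out data rather than on the training set.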

Citations: 0
A quantum chemical dataset of interacting molecular pairs for chemical reaction studies
IF 5.7 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-12-19 DOI: 10.1186/s13321-025-01124-y
Seunghun Jang, Gyoung S. Na

Understanding molecular interactions beyond single-molecule properties is critical for studying real-world chemical systems. Quantum chemical calculations of molecule–molecule interactions are computationally demanding, making large, publicly available datasets scarce. Here, we present an efficient framework for generating initial configurations of molecular interaction systems and construct a molecular interaction dataset, containing 49,620 individual molecules and 247,741 molecular pairs spanning chromophore–solvent, solute–solvent, and drug–drug interactions, each associated with experimentally characterized equilibrium structures. Our dataset can be used for theoretical studies and machine learning applications in chemical sciences, particularly for modeling intermolecular interactions and structure-based prediction of experimental properties. In future work, we plan to expand the dataset to include non-equilibrium structures and atomic forces, thereby broadening its applicability to reaction modeling and force field development.

Citations: 0
RetroScore: graph edit distance-guided retrosynthesis for accessibility scoring with route metrics
IF 5.7 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-12-19 DOI: 10.1186/s13321-025-01138-6
Sinuo Gao, Xiaofei Zhou, Lu Liang, Jianping Lin

Molecular generation is a critical method in drug design, but its practical application is often limited by the difficulty of synthesizing the generated molecules. To address this challenge, we present RetroScore, a synthetic accessibility evaluation framework guided by multistep retrosynthesis. Our methodology integrates the semi-template model Graph2Edits with the multistep retrosynthesis planning algorithm Retro*, forming the Graph2Edits-Retro*d system. By incorporating the green chemistry metric of graph edit distance into the reaction cost function and a multistage screening protocol, this system identifies optimal routes while balancing reliability, synthetic efficiency, and economic feasibility. Benchmark evaluations demonstrate a 97.37% planning success rate with balanced optimization across route length, confidence score, and graph edit distance. In the molecular generation task, the RetroScore outperforms six of the seven synthetic accessibility metrics, yielding molecules with enhanced synthetic accessibility profiles across heterogeneous evaluation frameworks. To facilitate practical implementation, we developed an open-access web platform for automated retrosynthesis route prediction and RetroScore calculation, providing researchers with rapid synthetic accessibility assessments. The RetroScore web server is publicly accessible at http://aidd.bioai-global.com/RetroScore/, and the source code is available at https://github.com/Snowgao320/RetroScore.

Citations: 0
ProjFusNet: deep neural network for peptide precursor prediction using projection-fused protein language model and structural features
IF 5.7 Zone 2, Chemistry Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-12-19 DOI: 10.1186/s13321-025-01117-x
Jinjin Li, Fang Fang, Changhang Lin, Hua Shi, Feifei Cui, Zilong Zhang, Leyi Wei

Peptide precursors, as the source molecules of bioactive peptides, play essential roles in neuroregulation, immune defense, and drug development. Their accurate identification is crucial for elucidating mechanisms of life regulation and developing novel therapeutics. However, the complexity and diversity of peptide precursor sequences pose significant challenges to prediction tasks. Existing methods predominantly rely on sequence features or structural features, hindering the full exploitation of complementary information between modalities and consequently limiting prediction performance. We introduce ProjFusNet, a deep learning framework that integrates evolutionary-scale protein sequence representations from ESM-2 with structural features via a projected multimodal fusion strategy. A bidirectional LSTM is further employed to model the complex interactions between sequence and structure. In rigorous five-fold cross-validation, ProjFusNet demonstrates improved performance across key metrics, including ACC, SN, AUC, SP, and MCC, compared to single-feature models.
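For readers unfamiliar with the metric abbreviations, the sketch below shows how ACC (accuracy), SN (sensitivity), SP (specificity), and MCC (Matthews correlation coefficient) are conventionally derived from a binary confusion matrix. It is a generic illustration, not code from the paper. The function name `binary_metrics` and the toy labels are assumptions made here. AUC is omitted because it requires ranked scores rather than hard labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute ACC, SN, SP, and MCC for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; use 0.0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


# Toy usage: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
demo_metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

In a five-fold cross-validation setting, these values would typically be computed on each held-out fold and then averaged.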

Citations: 0