Pub Date : 2026-01-03 · DOI: 10.1186/s13321-025-01146-6
Amir Hallaji Bidgoli, Morteza Mahdavi, Hamed Malek
Accurate prediction of drug-target affinity (DTA) is crucial for advancing drug discovery and optimizing experimental processes. Traditional DTA models often rely on handcrafted features or structural data, which can limit their generalizability and scalability. In this study, we propose a novel, sequence-centric approach for DTA prediction that leverages pretrained large language models (LLMs), namely ChemBERTa and ESM2, to encode protein and molecule sequences. These models produce semantically rich embeddings without the need for structural data. We introduce a customized Residual Inception architecture that efficiently integrates these sequence embeddings through multi-scale convolutions and residual connections, significantly improving prediction accuracy. Our method is evaluated on benchmark datasets Davis, KIBA, and BindingDB, achieving state-of-the-art performance with MSE = 0.182 and CI = 0.915 on Davis, MSE = 0.135 and CI = 0.902 on KIBA, and MSE = 0.467 and CI = 0.888 on BindingDB. These results highlight the potential of sequence-based approaches to provide scalable, accurate, and robust solutions for DTA prediction, offering valuable insights into drug-target interactions even in data-sparse settings. SCIENTIFIC CONTRIBUTION: The combination of pretrained language models and a lightweight neural architecture paves the way for more effective and adaptable DTA frameworks in real-world drug discovery applications.
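The multi-scale fusion idea behind the Residual Inception block can be illustrated with a minimal NumPy sketch. The shapes, the kernel sizes, and the fixed mean filters below are illustrative assumptions, not the authors' implementation, which uses learned convolution kernels over ChemBERTa and ESM2 embeddings:

```python
import numpy as np

def multi_scale_block(x, kernel_sizes=(3, 5, 7)):
    """Apply parallel 1D filters of several widths along the sequence
    axis of a (length, channels) embedding, average the branches, and
    add a residual connection. Fixed mean filters stand in for the
    learned convolution kernels of the actual model."""
    length, channels = x.shape
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k  # placeholder for a learned kernel
        conv = np.stack(
            [np.convolve(x[:, c], kernel, mode="same") for c in range(channels)],
            axis=1,
        )
        branches.append(conv)
    fused = np.mean(branches, axis=0)  # merge multi-scale features
    return fused + x                   # residual connection

# Toy stand-ins for ESM2 protein and ChemBERTa molecule embeddings
rng = np.random.default_rng(0)
protein = rng.normal(size=(120, 16))
molecule = rng.normal(size=(60, 16))

joint = np.concatenate(
    [multi_scale_block(protein), multi_scale_block(molecule)], axis=0
)
pred_affinity = float(joint.mean())  # stand-in for the regression head
```

Because every branch uses same-padding, the block preserves the embedding shape, so blocks of this kind can be stacked and the residual path keeps gradients well-behaved.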
"Structure-free drug-target affinity prediction using protein and molecule language models." Journal of Cheminformatics, 2026-01-03. DOI: 10.1186/s13321-025-01146-6.
Pub Date : 2026-01-02 · DOI: 10.1186/s13321-025-01142-w
Hang Zhu, Sisi Yuan, Mingjing Tang, Guifei Zhou, Zhanxuan Hu, Zhaoyang Liu, Jin Li, Jianmin Wang, Chunyan Li
Molecular representation learning (MRL) is a crucial link between machine learning and chemistry. By encoding molecules as numerical vectors, it plays a vital role in predicting molecular properties and in complex tasks such as drug discovery. While existing methods perform excellently when training and testing data come from the same distribution, their generalization ability is often insufficient under distribution shifts. Enhancing model generalization for out-of-distribution (OOD) data remains a significant challenge, as real-world molecular environments are often dynamic and uncertain. To address this issue, we propose an innovative framework called EISG (Integrating Environmental Inference and Subgraph Generation) for molecular representation learning, aimed at improving model performance on OOD data by capturing the invariance of molecular graphs across different environments. Specifically, we introduce an unsupervised environmental classification model to identify latent variables generated by different distributions and design a subgraph extractor, based on information bottleneck theory, that extracts invariant representations from molecular graphs closely related to the prediction labels. By combining new learning objectives, the environmental classifier and the subgraph extractor work in tandem to help the model identify invariant graph representations in different environments, leading to more robust OOD generalization. Experimental results demonstrate that our model exhibits strong generalization capabilities across various OOD settings. Code is available on GitHub.
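The two-stage idea (infer latent environments, then keep only the label-relevant subgraph) can be sketched conceptually in NumPy. Both functions below are toy stand-ins invented for illustration: EISG learns its environment classifier and derives node scores from an information-bottleneck objective, neither of which is reproduced here.

```python
import numpy as np

def assign_environments(graph_feats, k=2, iters=20):
    """Toy 'environment inference': cluster graph-level feature vectors
    with plain k-means (deterministic farthest-point initialization),
    treating each cluster as a latent environment. A stand-in for the
    learned unsupervised environment classifier in EISG."""
    centers = [graph_feats[0]]
    while len(centers) < k:
        dists = np.min(
            [np.linalg.norm(graph_feats - c, axis=1) for c in centers], axis=0
        )
        centers.append(graph_feats[dists.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(graph_feats), dtype=int)
    for _ in range(iters):
        d = np.linalg.norm(graph_feats[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = graph_feats[labels == j].mean(axis=0)
    return labels

def extract_subgraph(node_feats, scores, keep=0.5):
    """Keep the highest-scoring fraction of nodes as the candidate
    invariant subgraph; in the paper these scores come from an
    information-bottleneck objective rather than being given."""
    n_keep = max(1, int(len(scores) * keep))
    idx = np.argsort(scores)[::-1][:n_keep]
    return node_feats[np.sort(idx)]
```

The point of the pairing is that predictions made from the extracted subgraph should be stable across the inferred environments, which is what the joint learning objectives in EISG enforce.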
"Molecular graph-based invariant representation learning with environmental inference and subgraph generation for out-of-distribution generalization." Journal of Cheminformatics, 2026-01-02. DOI: 10.1186/s13321-025-01142-w.
Tokenization plays a critical role in preparing SMILES strings for molecular foundation models. Poor token units can fragment chemically meaningful substructures, inflate sequence length, and hinder model learning and interpretability. Existing approaches such as SMILES Pair Encoding (SPE) and Atom Pair Encoding (APE) compress token sequences but often ignore domain-specific chemistry or fail to generalize to larger or more diverse molecules. We propose a domain-aware method for SMILES compression that combines frequency-guided substring mining using a prefix trie with an optional entropy-based refinement step using a token transition graph (TTG). On a corpus of 100,000 PubChem molecules, the Trie+TTG method reduces token sequences by more than 50% compared to APE while preserving chemically coherent substructures. The method generalizes effectively to large, out-of-distribution molecules, achieving compression rates of up to 90% with minimal sensitivity to molecule size. To assess downstream utility, we evaluate latent-space structure using unsupervised clustering and perform QSAR regression on ESOL. Trie+TTG produces more separable molecular representations and stronger predictive performance than Trie-only and APE. In addition, on peptide corpora, our method substantially outperforms SPE and the PeptideCLM tokenizer in compression and entropy metrics. These results show that combining trie-based mining with TTG refinement yields compact, stable, and chemically meaningful tokenizations suitable for modern molecular representation learning. Scientific contributions: We present a trie-based framework that compresses SMILES sequences into shorter, chemically coherent units while guaranteeing lossless reconstruction. By incorporating a token transition graph for entropy-guided refinement, our method selects contextually stable merges that improve both compression efficiency and generalization.
Unlike prior approaches such as APE and SPE, our tokenizer combines frequency and context awareness, yielding more compact, interpretable, and transferable molecular representations.
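A minimal sketch of the frequency-guided mining and lossless greedy tokenization, assuming a flat substring-count dictionary in place of the paper's prefix trie and omitting the entropy-guided TTG refinement entirely:

```python
from collections import defaultdict

def mine_tokens(corpus, max_len=4, min_count=2):
    """Frequency-guided substring mining: count every substring up to
    max_len characters (a flat dictionary standing in for the paper's
    prefix trie) and keep those seen at least min_count times."""
    counts = defaultdict(int)
    for s in corpus:
        for i in range(len(s)):
            for j in range(i + 1, min(i + 1 + max_len, len(s) + 1)):
                counts[s[i:j]] += 1
    return {t for t, c in counts.items() if c >= min_count}

def tokenize(smiles, vocab, max_len=4):
    """Greedy longest-match tokenization against the mined vocabulary;
    single characters are the fallback, so reconstruction by joining
    the tokens is always lossless."""
    tokens, i = [], 0
    while i < len(smiles):
        for j in range(min(len(smiles), i + max_len), i, -1):
            if smiles[i:j] in vocab or j == i + 1:
                tokens.append(smiles[i:j])
                i = j
                break
    return tokens
```

For example, on the toy corpus ["CCO", "CCN", "CCOC"] the substring "CCO" is mined (it appears twice), so "CCOC" tokenizes to ["CCO", "C"]; joining the tokens always reconstructs the original string, which mirrors the lossless-reconstruction guarantee stated above.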
"Optimizing SMILES token sequences via trie-based refinement and transition graph filtering." Sridhar Radhakrishnan, Krish Mody, Arvind Venkatesh, Ananth Venkatesh. Journal of Cheminformatics, 2026-01-02. DOI: 10.1186/s13321-025-01143-9.
Pub Date : 2025-12-29 · DOI: 10.1186/s13321-025-01120-2
Yuzhu Li, Daiju Yang, Qingyi Shi, Weidong Zhang, Qingyan Sun
The molecular volume, surface area, and polar molecular surface area are important descriptors for characterizing and predicting the molecular properties of lead compounds. Existing computational tools for calculating these parameters often have complex workflows and are not well suited to high-throughput conditions. CalVSP is open-source software for computing molecular volume, molecular surface area, and polar surface area. The software implements a grid-based algorithm that dynamically optimizes grid spacing against quantum chemical reference data to ensure precise parameter calculations. CalVSP was tested on 9489 3D molecular structures, and the results revealed a mean absolute percentage error of 1.25% (95% CI: 1.23–1.27%) for the molecular volume and 1.33% (95% CI: 1.31–1.35%) for the molecular surface area compared with the quantum chemical data. For the molecular polar surface area calculations, the mean absolute percentage error was 4.59% (95% CI: 4.16–5.04%) across the 388 tested molecular structures. CalVSP, written in the C programming language, is a lightweight and easy-to-use tool. It can be integrated with other molecular property prediction tools to improve computational accuracy and to support large-scale molecular calculations.
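The core grid-based volume idea can be sketched in a few lines of Python: place the molecule's atoms (as van der Waals spheres) in a regular grid and count the grid points falling inside any sphere. The fixed spacing and the single carbon-radius sanity check below are illustrative assumptions; CalVSP itself is written in C and tunes the grid spacing against quantum-chemical reference data.

```python
import numpy as np

def grid_volume(centers, radii, spacing=0.2):
    """Estimate the volume (in cubic angstroms) of a union of atomic
    spheres by counting grid points inside any sphere: each inside
    point contributes one cell of volume spacing**3."""
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    # Bounding box around all spheres, padded by one grid cell
    lo = (centers - radii[:, None]).min(axis=0) - spacing
    hi = (centers + radii[:, None]).max(axis=0) + spacing
    axes = [np.arange(l, h, spacing) for l, h in zip(lo, hi)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    pts = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    inside = np.zeros(len(pts), dtype=bool)
    for c, r in zip(centers, radii):
        inside |= ((pts - c) ** 2).sum(axis=1) <= r * r
    return inside.sum() * spacing ** 3

# Sanity check: one sphere with the carbon van der Waals radius (1.7 A);
# the analytic volume is (4/3)*pi*1.7**3, roughly 20.6 cubic angstroms.
v = grid_volume([[0.0, 0.0, 0.0]], [1.7], spacing=0.1)
```

Shrinking the spacing trades runtime for accuracy, which is why choosing (or, as in CalVSP, dynamically optimizing) the grid spacing matters for hitting a target error at high throughput.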