Pub Date: 2026-01-17. DOI: 10.1186/s13321-025-01096-z
Matt Raymond, Jacob Charles Saldinger, Paolo Elvati, Angela Violi
Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS, a general and scalable feature selection algorithm, overcomes these limitations by identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while generally maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets, and we expect these results to be broadly useful for manually guided inverse problems. Beyond its current application, BoUTS holds potential for elucidating data-poor systems by leveraging information from similar data-rich systems.
Scientific Contribution: BoUTS selects nonlinear, universally informative features across multiple datasets. We identify crucial "universal features" across seven real-world chemistry datasets, which enhance cross-dataset interpretability and selection stability. BoUTS is highly scalable and applicable to tabular data from many domains, and our results identify connections between seemingly unrelated chemical domains.
Title: Universal feature selection for simultaneous interpretability of multitask datasets. Journal: Journal of Cheminformatics.
Pub Date: 2026-01-16. DOI: 10.1186/s13321-026-01153-1
Adam Bess, Sean Rowland, Chris Alvin, Supratik Mukhopadhyay
PURPOSE: The rapid evolution of antibiotic-resistant bacteria poses an urgent global health crisis. A key gap in current antibiotic discovery approaches is the absence of automated chemical synthesis methods designed to systematically generate and evaluate compounds within specific antibiotic classes. We address this gap through fragment-based computational experiments that systematically explore antibiotic chemical spaces. METHODS: Our computational methodology consists of three steps: fragmentation of known compounds (eMolFrag), generation of new molecular structures by recombining fragments (eSynth), and filtering of candidates based on desired properties (eFilter). eFilter combines structural analysis, pathway information, and protein targets to predict pharmacokinetic properties and therapeutic efficacy. We conducted three experiments: historical reconstruction of penicillin derivatives, hybrid molecule design combining functional groups from multiple antibiotic classes, and chemical space exploration of recently discovered antibiotics. RESULTS: Starting from Penicillin G and Methicillin, eSynth generated over 1.4 million potential penicillin derivatives. eFilter computationally predicted ampicillin, amoxicillin, and 10 other penicillin derivatives as high-scoring candidates, demonstrating that the pipeline can navigate the chemical space of β-lactam antibiotics. Of the hybrid molecules, 1.53% had computational predictions suggesting broad-spectrum activity against penicillin and quinolone targets, with predicted binding scores higher than those of reference antibiotics for all protein targets evaluated. Chemical space exploration generated computational candidates resembling Halicin, with the top compound showing a binding score of 13.4 against JNK1. CONCLUSIONS: Our fragment-based pipeline demonstrates the feasibility of systematically exploring antibiotic chemical spaces through computational reconstruction of historical development pathways and generation of hybrid molecules with predicted multi-target binding profiles. All results represent computational predictions requiring experimental validation.
Title: AI derivation and exploration of antibiotic class spaces. Journal: Journal of Cheminformatics.
This study presents an integrated computational modeling framework combining deep learning and Quantitative Systems Pharmacology (QSP) to predict the efficacy of PROTAC (PROteolysis Targeting Chimera) molecules. PROTACs have emerged as promising therapeutics for targeted protein degradation (TPD), offering significant advantages in addressing proteins that traditional small-molecule inhibitors cannot target. However, experimental evaluation of PROTAC efficacy is hindered by extensive variability in molecular configurations, necessitating efficient computational prediction methods. The proposed model integrates binding affinity predictions from DeepCalici, a convolutional neural network-based deep learning model, with a mechanistic QSP Hook model to estimate key pharmacodynamic parameters, notably the half-maximal degradation concentration (DC50) and maximal degradation (Dmax). This study utilized curated experimental data from PROTAC-DB, including experimentally validated DC50 and Dmax values. The dissociation constants (Kd) between PROTAC molecules and their protein targets (proteins of interest, POI) or E3 ligases were predicted using DeepCalici and then incorporated into the Hook model. To enhance prediction accuracy, a supplementary deep neural network adjusted the Hook model parameters based on chemical and biochemical features. The integrated modeling approach achieved strong predictive performance for DC50, demonstrating its practical value in prioritizing effective PROTAC candidates. However, the predictions for Dmax were less accurate, likely reflecting variability in experimental conditions not captured in the current dataset. This study highlights the critical importance of comprehensive structural data for accurate modeling of PROTAC efficacy and suggests future improvements using standardized experimental data. Such integrative modeling approaches promise to accelerate the discovery and optimization of PROTAC therapeutics.
Title: Proteolysis-targeting Chimera efficacy prediction using a deep-learning-QSP model. Authors: Sungwoo Goo, Jina Kim, Soyoung Lee, Sangkeun Jung, Jung-Woo Chae, Jae-Mun Choi, Hwi-Yeol Yun. Journal: Journal of Cheminformatics. Pub Date: 2026-01-12. DOI: 10.1186/s13321-026-01152-2
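The "hook" behaviour mentioned in the PROTAC abstract above can be illustrated with a textbook-style equilibrium sketch. This is not the authors' Hook model: it assumes no cooperativity, PROTAC in excess, and a POI that is scarce relative to the E3 ligase. The function name `ternary_fraction` and all parameter values are hypothetical, chosen only to show the bell-shaped dependence on PROTAC concentration.

```python
def ternary_fraction(protac, kd_poi, kd_e3, e3_total):
    """Fraction of POI held in the POI:PROTAC:E3 ternary complex.

    Simplifying assumptions (not taken from the paper):
    non-cooperative binding, free PROTAC ~ total PROTAC,
    and POI scarce enough that it does not deplete the E3 pool.
    All concentrations share one unit (e.g. micromolar).
    """
    # E3 not occupied by PROTAC in a binary complex
    e3_free = e3_total / (1.0 + protac / kd_e3)
    # Relative weights of the POI states: free = 1, binary, ternary
    binary = protac / kd_poi
    ternary = protac * e3_free / (kd_poi * kd_e3)
    return ternary / (1.0 + binary + ternary)


# Scan PROTAC concentration from 1e-4 to 1e3 on a half-log grid
concs = [10 ** (e / 2) for e in range(-8, 7)]
fractions = [ternary_fraction(c, kd_poi=0.1, kd_e3=1.0, e3_total=1.0) for c in concs]
peak = max(range(len(concs)), key=fractions.__getitem__)

for c, f in zip(concs, fractions):
    print(f"{c:10.4f}  {f:.4f}")
```

Under these assumptions the ternary fraction rises, peaks at an intermediate concentration, and falls again as binary complexes saturate both partners. A degradation model would map this fraction to DC50 and Dmax through further kinetic terms that are not sketched here.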
Pub Date: 2026-01-11. DOI: 10.1186/s13321-025-01140-y
Yoonsuk Jang, Juyeon Lee, Keunhong Jeong, Jaeoh Kim
Graph convolutional networks (GCNs) are effective for learning molecular representations, but their reliance on local message passing and simple feature concatenation limits their ability to capture global physicochemical properties. We present KROnecker-product based multimodal fusion with Variable sElection for eXpressive molecular representation learning (KROVEX), a method that integrates graph embeddings with molecular descriptors through a Kronecker product to explicitly model second-order interactions. Informative descriptors are identified using a two-stage procedure that combines iterative sure independence screening with Elastic Net regularization. The proposed approach was evaluated on two benchmark datasets (FreeSolv and ESOL) as well as two self-curated datasets with vapor pressure and aqueous solubility as the target properties. Overall, our method outperformed not only GCN but also fusion-based baselines such as EGCN, D-MPNN, and BAN under both random and scaffold splits. More importantly, the fusion operates at the final embedding level, enabling consistent performance across different GNN backbones (e.g., GAT and GIN). KROVEX achieves state-of-the-art performance on vapor pressure prediction, establishing a new benchmark for this safety-critical property, which is essential for environmental monitoring and industrial process design. Ablation studies further demonstrated that (1) statistically guided descriptor selection yields more informative features than predefined descriptors, and (2) Kronecker-product fusion provides greater improvements than simple concatenation as the number of descriptors increases. These results demonstrate that parsimonious descriptor selection combined with multimodal graph fusion enhances predictive performance and interpretability, providing a generalizable framework for molecular property prediction.
Title: Multimodal graph fusion with statistically guided parsimonious descriptor selection for molecular property prediction. Journal: Journal of Cheminformatics.
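The Kronecker-product fusion described in the KROVEX abstract above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `kron_fuse`, the array sizes, and the choice to prepend a constant 1 to each vector (so that first-order terms survive alongside the pairwise products) are all assumptions.

```python
import numpy as np


def kron_fuse(graph_emb, desc):
    """Row-wise Kronecker product of graph embeddings and descriptors.

    graph_emb: (batch, d_g) array from any GNN backbone (assumed given).
    desc:      (batch, d_z) array of selected molecular descriptors.
    Returns a (batch, (d_g + 1) * (d_z + 1)) array. Prepending 1 to each
    vector is an assumption made here so the output also contains the
    original embedding and descriptor values, not only their products.
    """
    ones = np.ones((graph_emb.shape[0], 1))
    g = np.concatenate([ones, graph_emb], axis=1)
    z = np.concatenate([ones, desc], axis=1)
    # Outer product per sample, flattened: entry i * (d_z + 1) + j is g_i * z_j
    return np.einsum("bi,bj->bij", g, z).reshape(g.shape[0], -1)


rng = np.random.default_rng(0)
graph_emb = rng.normal(size=(4, 8))  # hypothetical 8-dim graph embeddings
desc = rng.normal(size=(4, 3))       # hypothetical 3 selected descriptors
fused = kron_fuse(graph_emb, desc)
print(fused.shape)
```

Plain concatenation would give 8 + 3 = 11 features per molecule. The fused vector has (8 + 1) × (3 + 1) = 36, one for every embedding-descriptor pair, which is why keeping the descriptor set small matters for this kind of fusion.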
Accurately predicting novel compound-protein interactions (CPIs) is essential for accelerating drug discovery. The generalizability of machine learning-based CPI prediction models relies significantly on the availability and diversity of CPI datasets. To maximize data utility, particularly for highly confidential datasets maintained by industry, federated learning (FL), which integrates multi-site data while preserving privacy, has emerged as a promising approach. Nonetheless, its effectiveness when applied to heterogeneous data from diverse molecular domains, a common real-world scenario, remains unclear, thereby limiting its broader adoption. This study evaluates FL for CPI prediction using datasets spanning multiple chemical and protein domains, providing practical guidance for optimizing the FL approach. Results indicate that the FL model enhanced out-of-domain prediction performance but was surpassed by local models for in-domain data under data heterogeneity. Drawing on these findings, a new strategy was developed to achieve robust performance for in- and out-of-domain tasks: a similarity-guided ensemble (SGE) that combines the global FL model with fine-tuned models based on each client's local data. This method demonstrated effectiveness with real-world industry data, including samples from the public database and 13 pharmaceutical companies. Cumulatively, these findings offer practical guidance for implementing FL in contemporary drug discovery processes.
SCIENTIFIC CONTRIBUTION: This study identifies the performance trade-offs caused by heterogeneous data distributions in FL for CPI prediction. To overcome these challenges, we developed a workflow integrating local fine-tuning and an SGE, ensuring robust accuracy for both in-domain and out-of-domain predictions. The effectiveness of this approach was validated using both public datasets and real-world in-house datasets from 13 pharmaceutical companies.
Title: Empowering federated learning for robust compound-protein interaction prediction across heterogeneous cross-pharma domains. Authors: Takuto Koyama, Hiroaki Iwata, Ryosuke Kojima, Takao Otsuka, Aki Hasegawa, Seungeon Lee, Hiroshi Ueda, Toshiharu Morimoto, Ryoko Sasaki, Nao Torimoto, Sei Murakami, Manabu Tojo, Teruki Honma, Shigeyuki Matsumoto, Yasushi Okuno. Journal: Journal of Cheminformatics. Pub Date: 2026-01-10. DOI: 10.1186/s13321-025-01147-5
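The federated learning abstract above does not say how the similarity-guided ensemble weights its two models, so the following is only one plausible reading, with every detail assumed: fingerprints are represented as Python sets of "on" bit indices, similarity is the Tanimoto coefficient to the nearest compound in the client's local training set, and that similarity is used directly as the weight on the locally fine-tuned model. The names `tanimoto` and `sge_predict` are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sge_predict(query_fp, local_fps, p_global, p_local):
    """Blend global and local predictions by similarity to local data.

    Assumed rule (not from the paper): the weight on the fine-tuned local
    model is the query's maximum Tanimoto similarity to the client's local
    training compounds; the remainder goes to the global FL model.
    """
    w = max((tanimoto(query_fp, fp) for fp in local_fps), default=0.0)
    return w * p_local + (1.0 - w) * p_global


local_fps = [{1, 2, 3, 4}, {10, 11, 12}]

# In-domain query: identical to a local compound, so the local model dominates
in_domain = sge_predict({1, 2, 3, 4}, local_fps, p_global=0.2, p_local=0.9)

# Out-of-domain query: no shared bits, so the global FL model is used as-is
out_domain = sge_predict({50, 51}, local_fps, p_global=0.2, p_local=0.9)

print(in_domain, out_domain)
```

The design intent matches the trade-off reported in the abstract: local models are stronger in-domain and the global model is stronger out-of-domain, so a similarity-dependent weight lets each one answer where it is most reliable.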
Protein structural data are highly valuable for research, and many significant results have been published on their basis. An important part of protein structures is their ligands. The conformation of rings in ligands is crucial for the ligands' scaffold and shape and, therefore, for interactions with their surroundings and the subsequent biological function. For this reason, we developed a workflow to detect conformations of cyclohexane, cyclopentane, and benzene rings. The workflow can process rings originating from ligands that are part of experimental protein structures deposited in the Protein Data Bank (PDB) and determined by X-ray crystallography. This fully automatic workflow utilises the Hill-Reilly approach to calculate puckering angles that quantitatively describe ring conformation. The reproducibility of the workflow is guaranteed by storing datasets within Onedata, which enables automatic dataset retrieval and workflow execution. We analysed 128 012 ring structures originating from 25 479 different ligands. We found that more than 22 % of cyclohexane ring structures adopt unfavourable conformations, compared with about 5 % of cyclopentane ring structures and only 0.01 % of benzene ring structures. We discovered that energetically unfavourable ring structures can occur in cyclohexane and cyclopentane ligands for legitimate chemical reasons. Their examination can help us to understand the binding of these ligands, which can be helpful for pharmacology, chemoinformatics, etc. On the other hand, energetically unfavourable ring conformations are often caused by model quality issues. Therefore, their occurrence should motivate researchers to inspect the quality of the protein model and also the ring's fit to the experimental data.
Scientific Contribution: Our analysis uncovers the conformational behaviour of cyclohexane, cyclopentane, and benzene rings occurring in ligands that are part of experimental protein structures deposited in the PDB. This paper's other substantial contribution is presenting the first successful application of the Hill-Reilly approach to cyclohexane, cyclopentane, and benzene rings. Moreover, we provide Hill-Reilly parameters for these ring conformations, which can be used in other analyses.
Title: Analysis of cyclohexane, cyclopentane, and benzene conformations in ligands for PDB X-ray structures using the Hill-Reilly approach. Authors: Gabriela Bučeková, Viktoriia Doshchenko, Tomáš Svoboda, Jana Porubská, Aliaksei Chareshneu, Tomáš Raček, Vladimír Horský, Radka Svobodová, Ondřej Schindler. Journal: Journal of Cheminformatics. Pub Date: 2026-01-10. DOI: 10.1186/s13321-026-01154-0
Pub Date: 2026-01-09. DOI: 10.1186/s13321-025-01145-7
Rachit Kumar, Joseph Romano, Marylyn Ritchie
In this review, we discuss the various types of learnable protein representations that have been used in computational biology, with a particular focus on representations used for predicting drug-target affinity. We explore this from multiple perspectives: the source of protein information used, the training paradigms used in generating and applying such representations, and the types of (deep-learning-based) encoding or embedding methods that have been used to generate and operate on such representations. We focus on drug-target affinity due to its particular relevance and utility in the field of drug development and assessment, and we suggest how the development of drug-target affinity prediction methods can be further improved by examining the current literature from these perspectives. This survey thus serves as a resource for researchers seeking to develop methods for predicting drug-target affinity by exploring how protein information has been used, and could be used, to improve such predictions.
Title: Learnable protein representations in computational biology for predicting drug-target affinity. Journal: Journal of Cheminformatics.
The prediction of antimicrobial peptides (AMPs) is a critical research area in drug discovery. Traditional methods, which rely on sequence alignment or handcrafted features, often fail to capture complex sequence-function relationships. Recently, large language models (LLMs) such as ESM2 have demonstrated remarkable success in extracting deep semantic features from protein sequences. Meanwhile, graph neural networks (GNNs), particularly graph attention networks (GATs), can effectively learn inter-node relationships, specifically capturing peptide-residue compositional links and inter-residue co-occurrence patterns, to aggregate neighborhood information. In this work, we propose PepGraphormer, a novel fusion model for AMP prediction that combines the strengths of large-scale pretraining from ESM2 with the structural learning advantages of GATs. We first construct a heterogeneous graph with peptide sequences and amino acids as nodes, where ESM2 is used to generate high-quality initial embeddings for the peptide sequence nodes. The classifier then fuses the direct predictions from ESM2 with the graph-based predictions from the GAT. The ESM2 and GAT modules are trained jointly, and the embeddings for the nodes in the graph are learned during training. Comparisons with current state-of-the-art models on multiple datasets demonstrate that PepGraphormer achieves excellent accuracy and stability in the AMP prediction task. Further ablation and generalization experiments confirm the effectiveness and robustness of this fusion framework, presenting a new avenue for computationally driven therapeutic peptide discovery. Scientific contribution: This work proposes a novel framework, PepGraphormer, that combines a transformer-based large language model (ESM2) with a graph attention network for antimicrobial peptide prediction, without requiring the 3D protein structural information used in previous studies. The model significantly outperforms state-of-the-art methods and various deep learning baselines on multiple AMP benchmark datasets.
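To make the graph described in the PepGraphormer abstract more concrete, the following is a minimal sketch of how a heterogeneous peptide/amino-acid graph could be assembled from raw sequences. It is not the authors' implementation: the function name `build_heterogeneous_graph`, the `window` parameter, and the choice to count co-occurrences between neighbouring residues are all assumptions made for illustration, since the abstract does not specify how edges are defined or weighted.

```python
from collections import defaultdict

# The 20 standard amino acids, used as the residue node set (an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_heterogeneous_graph(peptides, window=1):
    """Build two hypothetical edge sets from a list of peptide sequences.

    Returns:
        peptide_residue_edges: sorted list of (peptide_index, residue) pairs,
            one per distinct residue occurring in each peptide.
        residue_cooccurrence: dict mapping an alphabetically ordered residue
            pair to the number of times the two residues occur within
            `window` positions of each other across all peptides.
    """
    peptide_residue_edges = set()
    residue_cooccurrence = defaultdict(int)

    for pep_idx, seq in enumerate(peptides):
        for pos, res in enumerate(seq):
            if res not in AMINO_ACIDS:
                raise ValueError(f"Unknown residue {res!r} in peptide {pep_idx}")
            # Compositional link: this peptide contains this residue.
            peptide_residue_edges.add((pep_idx, res))
            # Co-occurrence link: residues that follow within the window.
            for other in seq[pos + 1 : pos + 1 + window]:
                pair = tuple(sorted((res, other)))
                residue_cooccurrence[pair] += 1

    return sorted(peptide_residue_edges), dict(residue_cooccurrence)


edges, cooccurrence = build_heterogeneous_graph(["GIG", "KK"], window=1)
```

In a full model, the peptide nodes would carry ESM2 embeddings and the edge lists would be passed to a graph attention layer; both steps are omitted here because they depend on details the abstract does not give.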
PepGraphormer: an ESM-GAT hybrid deep learning framework for antimicrobial peptide prediction. Changhang Lin, Shuwen Xiong, Jinjin Li, Feifei Cui, Zilong Zhang, Hua Shi, Leyi Wei. Journal of Cheminformatics. Pub Date: 2026-01-05. DOI: 10.1186/s13321-025-01144-8
Pub Date: 2026-01-03 DOI: 10.1186/s13321-025-01146-6
Amir Hallaji Bidgoli, Morteza Mahdavi, Hamed Malek
Accurate prediction of drug-target affinity (DTA) is crucial for advancing drug discovery and optimizing experimental processes. Traditional DTA models often rely on handcrafted features or structural data, which can limit their generalizability and scalability. In this study, we propose a novel, sequence-centric approach for DTA prediction that leverages pretrained large language models (LLMs), namely ChemBERTa and ESM2, to encode protein and molecule sequences. These models produce semantically rich embeddings without the need for structural data. We introduce a customized Residual Inception architecture that efficiently integrates these sequence embeddings through multi-scale convolutions and residual connections, significantly improving prediction accuracy. Our method is evaluated on benchmark datasets Davis, KIBA, and BindingDB, achieving state-of-the-art performance with MSE = 0.182 and CI = 0.915 on Davis, MSE = 0.135 and CI = 0.902 on KIBA, and MSE = 0.467 and CI = 0.888 on BindingDB. These results highlight the potential of sequence-based approaches to provide scalable, accurate, and robust solutions for DTA prediction, offering valuable insights into drug-target interactions even in data-sparse settings. SCIENTIFIC CONTRIBUTION: The combination of pretrained language models and a lightweight neural architecture paves the way for more effective and adaptable DTA frameworks in real-world drug discovery applications.
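The abstract above reports results as mean squared error (MSE) and concordance index (CI). As a reading aid, here is a small pure-Python sketch of how these two metrics are commonly computed for affinity regression. It assumes the standard pairwise definition of CI (the fraction of pairs with different true affinities whose predictions are ordered the same way, with tied predictions counted as 0.5); the abstract does not state the exact evaluation protocol, so the function names and tie handling are assumptions.

```python
from itertools import combinations


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs that the predictions rank correctly.

    A pair is comparable when its two true values differ. A pair with
    tied predictions contributes 0.5 (an assumed, commonly used convention).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    concordant = 0.0
    comparable = 0
    for (t_i, p_i), (t_j, p_j) in combinations(zip(y_true, y_pred), 2):
        if t_i == t_j:
            continue  # pairs with equal true affinity carry no ranking signal
        comparable += 1
        if p_i == p_j:
            concordant += 0.5
        elif (p_i > p_j) == (t_i > t_j):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy values for illustration only; these are not data from the paper.
y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.1, 5.9, 7.2, 7.0]
mse = mean_squared_error(y_true, y_pred)
ci = concordance_index(y_true, y_pred)
```

Lower MSE and higher CI are better; a CI of 0.5 corresponds to random ranking and 1.0 to a perfect ordering. The pairwise loop is quadratic in the number of samples, so a benchmark-scale evaluation would normally use a vectorized or sorted implementation.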
Structure-free drug-target affinity prediction using protein and molecule language models. Journal of Cheminformatics. https://doi.org/10.1186/s13321-025-01146-6
Pub Date: 2026-01-02 DOI: 10.1186/s13321-025-01142-w
Hang Zhu, Sisi Yuan, Mingjing Tang, Guifei Zhou, Zhanxuan Hu, Zhaoyang Liu, Jin Li, Jianmin Wang, Chunyan Li
Molecular representation learning (MRL) is a crucial link between machine learning and chemistry. By encoding molecules as numerical vectors, it plays a vital role in predicting molecular properties and in complex tasks such as drug discovery. While existing methods perform well when training and testing data come from the same distribution, their generalization ability is often insufficient under distribution shift. Enhancing model generalization to out-of-distribution (OOD) data remains a significant challenge, as real-world molecular environments are often dynamic and uncertain. To address this issue, we propose a framework called EISG (Integrating Environmental Inference and Subgraph Generation) for molecular representation learning, which aims to improve model performance on OOD data by capturing the invariance of molecular graphs across different environments. Specifically, we introduce an unsupervised environmental classification model to identify latent variables generated by different distributions, and we design a subgraph extractor based on information bottleneck theory to extract invariant representations from molecular graphs that are closely related to the prediction labels. By combining new learning objectives, the environmental classifier and the subgraph extractor work in tandem to help the model identify invariant graph representations in different environments, leading to more robust OOD generalization. Experimental results demonstrate that our model exhibits strong generalization capabilities across various OOD settings. Code is available on GitHub.
Molecular graph-based invariant representation learning with environmental inference and subgraph generation for out-of-distribution generalization. Journal of Cheminformatics. https://doi.org/10.1186/s13321-025-01142-w