Pub Date: 2026-01-22 DOI: 10.1021/acs.jcim.5c02572
Lee Bin Choi,Ohyeon Lee,Sanghun Lee
Lithium iron phosphate (LiFePO4, LFP) has regained prominence as a cathode for lithium-ion batteries thanks to its intrinsic safety, thermal stability, long cycle life, and cost advantages. We present an agentic knowledge-graph pipeline that converts titles/abstracts into directed, signed agent → property relations. Using a Scopus corpus of the 9500 most-cited LFP journal articles (2000-present), we benchmark three matched modes: A, rules with a closed vocabulary; B, LLM-only with an open vocabulary; and C, mixed LLM with a hybrid vocabulary. A yields a compact, high-precision core; B expands recall but increases label dispersion; C preserves much of B's breadth while maintaining schema alignment via canonicalization and role gating. Robustness tests with eight bootstrap passes show rapid convergence: requiring recurrence across ∼6 passes plus a modest publication-support threshold yields a compact, high-confidence backbone. The resulting network is predominantly positive and centers on transport and interfacial outcomes, with a small number of mixed and negative ties indicating condition dependence. Beyond LFP, the workflow can be adapted to other battery chemistries with modest retuning of vocabularies and projection rules alongside routine validation on held-out annotations, enabling a stability-aware, literature-scale synthesis of direction-of-effect relations.
{"title":"Agentic Knowledge Graphs of the LiFePO4 Cathode for Lithium Ion Battery: Balancing Discovery and Stability with LLMs.","authors":"Lee Bin Choi,Ohyeon Lee,Sanghun Lee","doi":"10.1021/acs.jcim.5c02572","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02572","journal":{"name":"Journal of Chemical Information and Modeling","volume":"62 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146015149","RegionNum":2,"RegionCategory":"Chemistry"}
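The recurrence filter described in the LFP knowledge-graph abstract above (keep a relation only if it recurs in about 6 of 8 bootstrap passes and has enough supporting publications) can be illustrated with a minimal Python sketch. The tuple layout, the function name `stable_backbone`, the `"+"`/`"-"` sign encoding, and the default `min_papers=2` are assumptions made for illustration; the abstract only says the publication-support threshold is "modest" and does not describe the authors' implementation.

```python
from collections import defaultdict


def stable_backbone(passes, min_passes=6, min_papers=2):
    """Keep agent -> property edges that recur across bootstrap passes.

    passes: list of passes; each pass is an iterable of hypothetical
            (agent, prop, sign, paper_id) tuples, with sign "+" or "-".
    min_passes / min_papers: illustrative thresholds (assumptions).
    """
    pass_hits = defaultdict(set)  # edge -> indices of passes where it appeared
    papers = defaultdict(set)     # edge -> distinct supporting paper ids
    signs = defaultdict(set)      # edge -> distinct signs observed

    for i, extracted in enumerate(passes):
        for agent, prop, sign, paper_id in extracted:
            edge = (agent, prop)
            pass_hits[edge].add(i)
            papers[edge].add(paper_id)
            signs[edge].add(sign)

    backbone = {}
    for edge, hits in pass_hits.items():
        if len(hits) >= min_passes and len(papers[edge]) >= min_papers:
            observed = signs[edge]
            # An edge seen with both signs is labeled "mixed".
            label = "mixed" if len(observed) > 1 else next(iter(observed))
            backbone[edge] = {
                "passes": len(hits),
                "papers": len(papers[edge]),
                "sign": label,
            }
    return backbone
```

Calling `stable_backbone(passes)` on eight extraction passes returns only the edges that clear both thresholds. Loosening `min_passes` trades stability for breadth, which mirrors the discovery-versus-stability balance named in the title.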
Traditional molecular screening methods are often limited by high computational cost, long design cycles, and a strong reliance on high-quality 3D protein structures, which are not always available or reliable. To address these limitations, we propose CoDrug, an innovative multimodal fusion framework that integrates textual information with structural representations of proteins and compounds. CoDrug employs two complementary fusion strategies: text-protein sequence fusion, in which SciBERT encodes functional descriptions and ESM extracts sequence-level features, and text-compound structure fusion, in which ChemFormer encodes SMILES and SciBERT processes compound-related textual descriptions. Using contrastive learning, CoDrug aligns textual and structural embeddings in a shared latent space, enabling effective cross-modal representation learning. This architecture supports novel functionalities, including text-driven virtual screening and text-driven molecular optimization, enhancing representation expressiveness and generalization while delivering strong performance under zero-shot settings. Evaluations on diverse benchmarks demonstrate that CoDrug achieves competitive or superior results compared with state-of-the-art baselines, particularly when 3D structural data are incomplete or unavailable. The framework's natural language interface lowers the technical barrier for AI-assisted drug discovery, allowing chemists to efficiently navigate and optimize chemical space without specialized computational expertise. By bridging language-driven hypotheses and structure-guided molecular design, CoDrug offers a scalable and flexible paradigm for accelerating the early stages of drug discovery.
{"title":"CoDrug: A Text-Driven Molecular Virtual Screening and Multiproperty Optimization Framework via Multimodal Language Model.","authors":"Rui Gu,Yingxu Liu,Bingxing Zhu,Li Liang,Haichun Liu,Yanmin Zhang,Yadong Chen","doi":"10.1021/acs.jcim.5c02499","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02499","journal":{"name":"Journal of Chemical Information and Modeling","volume":"263 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146021461","RegionNum":2,"RegionCategory":"Chemistry"}
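The CoDrug abstract above says that contrastive learning aligns textual and structural embeddings in a shared latent space, without naming the loss. A common choice for this kind of alignment is a symmetric InfoNCE (CLIP-style) objective, sketched below in plain Python as an assumption rather than as the authors' method. The function name, the temperature value of 0.07, and the list-of-lists embedding format are all illustrative.

```python
import math


def _normalize(vec):
    # L2-normalize so that dot products become cosine similarities.
    # Assumes vec is not the zero vector.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    # Mean of -log softmax(row)[i]: row i should score highest at column i.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(text_emb, struct_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    text_emb[i] and struct_emb[i] are assumed to describe the same
    protein or compound; every other pairing in the batch is a negative.
    """
    text = [_normalize(v) for v in text_emb]
    struct = [_normalize(v) for v in struct_emb]
    logits = [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in struct]
        for t in text
    ]
    logits_t = [list(col) for col in zip(*logits)]  # structure -> text view
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

The loss is near zero when each text embedding is closest to its own structural embedding and grows when pairs are mismatched. That property is what would let a text query retrieve matching structures in a screening setting.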
Pub Date: 2026-01-22 DOI: 10.1021/acs.jcim.5c01681
Ziyu Song,Ruixuan Wang,Xun Jiao,Zuyi Huang
The pKa value of a protein-ionizable residue reflects its propensity to donate a proton at a given pH value, which is essential for understanding a wide range of biological activities. Therefore, the accurate prediction of pKa values of protein residues is crucial for understanding enzymatic activity and protein-ligand binding, which are fundamental to drug discovery. Despite significant time and resources being invested to develop computational methods for protein residue pKa prediction, the accuracy of existing tools, such as the widely used PROPKA, remains limited. In this study, an integrated framework that fuses molecular dynamics simulations and deep learning models is proposed to improve the predictive accuracy of pKa values for ionizable residues. Specifically, we employ high-throughput molecular modeling using the AMOEBA polarizable force field to construct a protein structure data set enriched with atomic electrostatics and other physics-inspired features. Using the experimentally determined pKa values from the PKAD-2 data set, we trained three graph-based neural network models. All three models demonstrated substantial improvements in prediction accuracy across four ionizable residue types (aspartic acid, glutamic acid, lysine, and histidine) when compared to PROPKA3.5.1, with the graph attention network-based model exhibiting both high accuracy and strong generalizability when benchmarked against several recently published machine learning models. Beyond these improvements in predictive performance, feature importance analysis of the best-performing models revealed physically meaningful patterns of the descriptive features that aligned with the underlying biophysical principles governing protein residue pKa values, most notably the complexity of the local microenvironment and the atomic geometric arrangement within the protein structure. Together, the trained pKa models and the curated dipole moment-enhanced data set based on a polarizable force field offer a valuable resource for the research community, with potential applications in early-stage drug target identification and protein engineering.
{"title":"Graph-Based Deep Learning Models for Predicting pKa Values of Protein-Ionizable Residues via Physically Inspired Feature Engineering.","authors":"Ziyu Song,Ruixuan Wang,Xun Jiao,Zuyi Huang","doi":"10.1021/acs.jcim.5c01681","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c01681","journal":{"name":"Journal of Chemical Information and Modeling","volume":"30 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146021386","RegionNum":2,"RegionCategory":"Chemistry"}
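The pKa abstract above opens by relating a residue's pKa to its protonation behavior at a given pH. The standard Henderson-Hasselbalch relation makes that link concrete, and the short Python sketch below shows why errors of even one pKa unit matter near physiological pH. The function names and the example pKa values in the usage note are illustrative assumptions, not values from the paper.

```python
def fraction_deprotonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a titratable site that has lost its proton."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def mean_charge(pka, ph, acidic):
    """Average charge of a single titratable site.

    acidic=True  models Asp/Glu-like sites (neutral -> -1 on deprotonation).
    acidic=False models Lys/His-like sites (+1 -> neutral on deprotonation).
    """
    f = fraction_deprotonated(pka, ph)
    return -f if acidic else 1.0 - f
```

For instance, a hypothetical histidine with a pKa of 6.5 has a mean charge of about +0.24 at pH 7.0, while shifting its pKa to 7.5 raises that to about +0.76. A one-unit prediction error can therefore flip the dominant protonation state.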
Pub Date: 2026-01-22 DOI: 10.1021/acs.jcim.5c02735
Shijia Yan,Junliang Shang,Shoujia Jiang,Xiaohan Zhang,Fanyu Zhang,Yan Sun,Jin-Xing Liu
Single-cell RNA sequencing (scRNA-seq) technology has become an essential tool for dissecting cellular heterogeneity and elucidating complex biological systems. Nevertheless, the uneven distribution of cell types and the limited representation of rare cell populations present substantial challenges for effective modeling and accurate identification. Most existing methods primarily focus on the annotation of abundant cell types, often overlooking rare yet biologically significant subpopulations. In addition, the variability of cellular distributions across different biological contexts highlights the need for models with greater adaptability and a stronger capacity for contextual information integration. To overcome these challenges, we introduce scACAN, an adaptive graph construction framework that leverages aggregated local graph context information to design a positive sample selection strategy. By incorporating adaptive sampling and iterative optimization based on clustering results, scACAN effectively enhances the identification of both major and rare cell types. Comprehensive experiments on multiple real-world scRNA-seq data sets demonstrate that scACAN achieves superior performance and reveals additional biologically meaningful rare cell subpopulations, providing a robust and generalizable solution for single-cell data analysis.
{"title":"scACAN: An Adaptive Learning Framework Aggregating Local Graph Structure Context for Rare Cell Type Identification.","authors":"Shijia Yan,Junliang Shang,Shoujia Jiang,Xiaohan Zhang,Fanyu Zhang,Yan Sun,Jin-Xing Liu","doi":"10.1021/acs.jcim.5c02735","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02735","journal":{"name":"Journal of Chemical Information and Modeling","volume":"31 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146021385","RegionNum":2,"RegionCategory":"Chemistry"}
Molecular glues, including protein degraders and protein-protein interaction (PPI) stabilizers, have emerged as a new paradigm of drug design for regulating interactions between biomacromolecules; yet their rational design remains a challenge. KRAS, a prevalent oncogenic driver, is notoriously difficult to target with traditional small-molecule drugs due to its challenging binding surface and frequent mutations. Although the small-molecule drug RMC7977 has been designed as a PPI stabilizer for the inherently weak RAS-CYPA interaction, the precise molecular mechanism underlying its stabilization effect and selectivity difference requires a deeper understanding. To this end, we leverage an integrated computational strategy combining molecular dynamics (MD) simulation, end-point binding free-energy calculation, and enhanced sampling technologies to elucidate the dynamic characteristics of RAS-ligand-CYPA interactions. Our results show a high correlation between the predicted binding affinities and the experimental observations, demonstrating that RMC7977, acting as a strong PPI stabilizer, significantly enhances the stability of the KRAS-CYPA interaction by delicately remodeling the protein-protein interface and thereby optimizing various interactions. Moreover, the results uncover the dynamic process of stabilizer-mediated KRAS-CYPA stabilization and the mechanistic origin of the binding selectivity. This study provides essential molecular-level insights into RMC7977's function and offers a valuable computational framework for evaluating the stabilization effect of ligands targeting the KRAS-CYPA and other challenging PPI systems.
{"title":"Understanding the Kinetic Mechanism of Ligands Stabilizing the RAS-CYPA Interaction.","authors":"Kexin Xu,Mingyun Shen,Zhe Wang,Sutong Xiang,Qirui Deng,Kaimo Yang,Zhiliang Jiang,Zihao Wang,Chen Yin,Tingjun Hou,Huiyong Sun","doi":"10.1021/acs.jcim.5c02966","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02966","journal":{"name":"Journal of Chemical Information and Modeling","volume":"263 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146021388","RegionNum":2,"RegionCategory":"Chemistry"}
Pub Date: 2026-01-21 DOI: 10.1021/acs.jcim.5c02451
Duoyun Yi,Yanpeng Zhao,Huiyan Xu,Yixin Zhang,Mengxuan Wan,Peng Zan,Song He,Xiaochen Bo
Accurate prediction of protein-ligand binding affinity is essential in drug discovery. However, the limited availability and high cost of experimentally resolved protein-ligand complex structures significantly hinder the generalizability and broad applicability of current structure-based deep learning approaches. To address this challenge, we present CompBind, a novel framework for binding affinity prediction that leverages latent interaction patterns learned from existing complex structures while eliminating the need for 3D structural inputs during inference. Specifically, CompBind integrates bidirectional cross-attention with a dual-objective pretraining strategy, where contrastive learning enforces feature-space consistency between monomer pairs and their corresponding complex structures, while generative learning reconstructs interaction features to model the bidirectional mapping between monomeric and complex representations. This enables the model to infer binding representations from protein and ligand sequences alone. Across challenging affinity prediction scenarios, including cold-start and sparse-label conditions, CompBind not only outperforms noncomplex-based methods but also rivals complex-based prediction approaches. In a drug repurposing case study targeting glutathione peroxidase 4 (GPX4), a clinically relevant but traditionally undruggable protein, CompBind successfully ranked known inhibitors among the top candidates. Furthermore, the built-in attention mechanism enhances model interpretability by identifying key binding residues. By decoupling predictive accuracy from the availability of experimental complex structures, CompBind offers a scalable, generalizable, and practical solution for accelerating drug discovery pipelines.
{"title":"CompBind: Complex Guided Pretraining-Based Structure-Free Protein-Ligand Affinity Prediction.","authors":"Duoyun Yi,Yanpeng Zhao,Huiyan Xu,Yixin Zhang,Mengxuan Wan,Peng Zan,Song He,Xiaochen Bo","doi":"10.1021/acs.jcim.5c02451","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02451","journal":{"name":"Journal of Chemical Information and Modeling","volume":"29 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146005482","RegionNum":2,"RegionCategory":"Chemistry"}
Per- and polyfluoroalkyl substances (PFAS), also known as forever chemicals, are persistent synthetic chemicals with widespread use in a variety of consumer and industrial products. Some of these chemicals have been studied exhaustively through experimental toxicity testing and human epidemiological inference; however, for most compounds little or no information is available about their hazards or safety. ToxFCDB prioritizes these data-poor compounds for detailed toxicity investigations by providing a web-based database for in silico preliminary evaluations employing more than 50 QSAR models/databases. The database compiles 8204 PFAS with their molecular structures, chemical classification, physicochemical and toxicokinetic properties, molecular descriptors, toxicological data, chemical genes, and human targets. This database aims to assist industrialists, policymakers, and researchers in assessing state-of-the-art data-centric information to make informed decisions to safeguard public health and the environment. In addition, the ToxFCDB could be a valuable tool for encouraging additional toxicological research in the domain of redesigning chemicals and polymers. The ToxFCDB is accessible online at http://ctf.iitr.res.in/toxfcdb/.
{"title":"ToxFCDB: Toxicity Database for Forever Chemicals","authors":"Meetali Sinha,Deepak Kumar Sachan,Joy Chakraborty,Anamta Ali,Anshika Gupta,Tanya Jamal,Ramakrishnan Parthasarathi","doi":"10.1021/acs.jcim.5c01917","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c01917","journal":{"name":"Journal of Chemical Information and Modeling","volume":"38 1"},"PeriodicalIF":5.6,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"146006344","RegionNum":2,"RegionCategory":"Chemistry"}
Pub Date: 2026-01-21 DOI: 10.1021/acs.jcim.5c02637
Kamran Arshad,Muhammad Arif,Dong-Jun Yu
Motivation: DNA-binding proteins (DBPs) play a significant role throughout biological systems. Many DNA-related studies investigate whether a given protein binds to DNA. Conventionally, wet-lab experiments are conducted to characterize DBP functions. However, these methods are often expensive and time-intensive. With the rapid advancement of bioinformatics, there is a growing demand for efficient computational protocols to predict DBPs. Several sequence-based computational tools have been designed to predict DBPs; however, there remains room for improvement. Method: We developed a novel deep learning (DL)-based predictor, called DeepDBPI, for enhancing DBP prediction. The proposed DeepDBPI model leverages the evolutionary and graphical properties of protein sequences using novel descriptors, namely covariance correlation-based position-specific scoring matrix (CC-PSSM), binary-profile-based (BP-PSSM), Trigram (TRG-PSSM), and feature encoding based on graphical and statistical (FEGS) methods. Then, we applied the wavelet denoising (WD) algorithm to remove the noise from sequence-derived features. We fed the filtered features to ResNet, LSTM, BiLSTM, RNN, BiRNN, and BiGRU. Results: The DeepDBPI model achieved the best prediction performance with BiGRU using the denoised FEGS encoding method under 5-fold cross-validation, evaluated by ACC, SN, SP, and MCC. Our proposed model achieved 92.13% ACC, 93.07% SN, 91.19% SP, and 0.8427 MCC on the independent test. We believe the effectiveness of the developed bioinformatics protocol provides insights for drug discovery and other proteomic problems. All data, including the dataset, feature extraction techniques, and models, are available at: https://doi.org/10.5281/zenodo.17496063
{"title":"DeepDBPI: DNA-Binding Protein Identifier Using a Deep Learning Model with Transformed Denoised Features","authors":"Kamran Arshad,Muhammad Arif,Dong-Jun Yu","doi":"10.1021/acs.jcim.5c02637","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02637","journal":{"name":"Journal of Chemical Information and Modeling"},"publicationDate":"2026-01-21","publicationTypes":"Journal Article"}
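The four evaluation metrics named in the DeepDBPI abstract (ACC, SN, SP, MCC) all follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions only; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper or its Zenodo archive.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute ACC, SN, SP and MCC from binary confusion-matrix counts.

    tp/fn count true DBPs predicted as binding / non-binding;
    tn/fp count non-DBPs predicted as non-binding / binding.
    """
    total = tp + tn + fp + fn
    acc = (tp + tn) / total          # accuracy: fraction of correct calls
    sn = tp / (tp + fn)              # sensitivity (recall on positives)
    sp = tn / (tn + fp)              # specificity (recall on negatives)
    # MCC denominator is zero when any row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention, assumed here.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "SP": sp, "MCC": mcc}


if __name__ == "__main__":
    # Made-up counts, used only to exercise the function.
    for name, value in binary_metrics(tp=90, tn=80, fp=20, fn=10).items():
        print(f"{name}: {value:.4f}")
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside SN and SP when the two classes are not equally represented.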
Pub Date : 2026-01-21 DOI: 10.1021/acs.jcim.5c02441
Oleksandra Herasymenko,Madhushika Silva,Galen J. Correy,Abd Al-Aziz A. Abu-Saleh,Suzanne Ackloo,Cheryl Arrowsmith,Alan Ashworth,Fuqiang Ban,Hartmut Beck,Kevin P. Bishop,Hugo J. Bohórquez,Albina Bolotokova,Marko Breznik,Irene Chau,Yu Chen,Artem Cherkasov,Wim Dehaen,Dennis Della Corte,Katrin Denzinger,Niklas P. Doering,Kristina Edfeldt,Aled Edwards,Darren Fayne,Francesco Gentile,Elisa Gibson,Ozan Gokdemir,Anders Gunnarsson,Judith Günther,John J. Irwin,Jan Halborg Jensen,Rachel J. Harding,Alexander Hillisch,Laurent Hoffer,Anders Hogner,Ashley Hutchinson,Shubhangi Kandwal,Andrea Karlova,Kushal Koirala,Sergei Kotelnikov,Dima Kozakov,Juyong Lee,Soowon Lee,Uta Lessel,Sijie Liu,Xuefeng Liu,Peter Loppnau,Jens Meiler,Rocco Moretti,Yurii S. Moroz,Charuvaka Muvva,Tudor I. Oprea,Brooks Paige,Amit Pandit,Keunwan Park,Gennady Poda,Mykola V. Protopopov,Vera Pütter,Rahul Ravichandran,Didier Rognan,Edina Rosta,Yogesh Sabnis,Thomas Scott,Almagul Seitova,Purshotam Sharma,François Sindt,Minghu Song,Casper Steinmann,Rick Stevens,Valerij Talagayev,Valentyna V. Tararina,Olga Tarkhanova,Damon Tingey,John F. Trant,Dakota Treleaven,Alexander Tropsha,Patrick Walters,Jude Wells,Yvonne Westermaier,Gerhard Wolber,Lars Wortmann,Shuangjia Zheng,James S. Fraser,Matthieu Schapira
The third Critical Assessment of Computational Hit-finding Experiments (CACHE) challenged computational teams to identify chemically novel ligands targeting macrodomain 1 (Mac1) of SARS-CoV-2 Nsp3, a promising coronavirus drug target. Twenty-three groups deployed diverse design strategies to collectively select 1739 ligand candidates. While over 85% of the designed molecules were chemically novel, the best experimentally confirmed hits were structurally similar to previously published compounds. Confirming a trend observed in CACHE #1 and #2, two of the best-performing workflows used compounds selected by physics-based computational screening methods to train machine learning models able to rapidly screen large chemical libraries, while four others used exclusively physics-based approaches. Three pharmacophore searches and one fragment-growing strategy were also part of the seven winning workflows. While active molecules discovered by CACHE #3 participants largely mimicked the adenine ring of the endogenous substrate, ADP-ribose, preserving the canonical chemotype commonly observed in previously reported Nsp3-Mac1 ligands, they still provide novel structure–activity relationship insights that may inform the development of future antivirals. Collectively, these results show that multiple molecular design strategies can efficiently converge on similar potent molecules.
{"title":"CACHE Challenge #3: Targeting the Nsp3 Macrodomain of SARS-CoV-2","doi":"10.1021/acs.jcim.5c02441","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02441","journal":{"name":"Journal of Chemical Information and Modeling"},"publicationDate":"2026-01-21","publicationTypes":"Journal Article"}
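The CACHE #3 abstract contrasts "chemically novel" designs with hits that were "structurally similar to previously published compounds". One common way to make such a comparison concrete is the Tanimoto coefficient between molecular fingerprints. The following is a minimal sketch under stated assumptions: fingerprints are represented as plain sets of integer feature IDs, the 0.4 cutoff is an arbitrary illustrative value, and neither is the novelty criterion actually used by CACHE, which the abstract does not specify.

```python
from typing import Iterable, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def is_novel(candidate: Set[int], known: Iterable[Set[int]],
             threshold: float = 0.4) -> bool:
    """Call a candidate novel if it stays below `threshold` against every
    known ligand. The threshold is a hypothetical value for illustration."""
    return all(tanimoto(candidate, ref) < threshold for ref in known)


if __name__ == "__main__":
    # Toy fingerprints: each integer stands for one substructure feature.
    known_ligands = [{1, 2, 3, 4}, {2, 3, 5, 8}]
    close_analog = {1, 2, 3, 9}     # shares most features with the first ligand
    new_scaffold = {10, 11, 12, 2}  # shares a single feature

    print(is_novel(close_analog, known_ligands))
    print(is_novel(new_scaffold, known_ligands))
```

In practice the sets would come from a cheminformatics toolkit's fingerprint routine; plain Python sets are used here only to keep the example self-contained.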
Pub Date : 2026-01-20 DOI: 10.1021/acs.jcim.5c01924
Jean V Sampaio,Andrielly H S Costa,Aline O Albuquerque,Júlia S Souza,Diego S Almeida,Eduardo M Gaieta,Matheus V Almeida,Geraldo R Sartori,João H M Silva
The use of predictive tools has become increasingly prevalent in the development of biopharmaceuticals, reducing the time and cost of research. However, most methods for computational antibody design are hampered by their reliance on scarcely available antibody structures, their potential to introduce immunogenic modifications, and their restricted exploration of the paratope's chemical and conformational space. We propose Ab-SELDON, a modular and easily customizable antibody design pipeline capable of iteratively optimizing an antibody-antigen (Ab-Ag) interaction in five different modification steps, including CDR and framework grafting, and mutagenesis. The optimization process is guided by diversity data collected from millions of publicly available human antibody sequences. In computational tests optimizing an anti-HER2 antibody, this approach enhanced the exploration of the chemical and conformational space of the paratope. Optimization of another antibody, against Gal-3BP, stabilized the Ab-Ag interaction in molecular dynamics simulations with a lower runtime than alternative pipelines. Tests with Ab-Ag mutations from SKEMPI also demonstrated the pipeline's ability to correctly identify the effect of the majority of mutations, especially multipoint mutations and those that increased binding affinity. This freely available pipeline presents a new approach for computationally efficient and automated in silico antibody design, thereby facilitating the development of new biopharmaceuticals.
{"title":"Ab-SELDON: Leveraging Diversity Data for an Efficient Automated Computational Pipeline for Antibody Design.","doi":"10.1021/acs.jcim.5c01924","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c01924","journal":{"name":"Journal of Chemical Information and Modeling"},"publicationDate":"2026-01-20","publicationTypes":"Journal Article"}
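The Ab-SELDON abstract describes iterative optimization guided by diversity data from human antibody sequences. The general idea of restricting mutations to residues already observed at each position, and keeping only those that improve a score, can be sketched as a greedy loop. This is not the Ab-SELDON implementation: the function `optimize`, the `allowed` table, and the toy scoring function are all hypothetical stand-ins, and the real pipeline's five modification steps and scoring are not reproduced here.

```python
from typing import Callable, Dict, Tuple


def optimize(sequence: str,
             allowed: Dict[int, str],
             score: Callable[[str], float],
             rounds: int = 3) -> Tuple[str, float]:
    """Greedy point-mutation search.

    `allowed` maps a 0-based position to the residues observed there in a
    (hypothetical) diversity table; only those substitutions are tried.
    A mutation is kept only if it strictly improves `score`.
    """
    best, best_score = sequence, score(sequence)
    for _ in range(rounds):
        improved = False
        for pos, residues in allowed.items():
            for aa in residues:
                candidate = best[:pos] + aa + best[pos + 1:]
                s = score(candidate)
                if s > best_score:
                    best, best_score, improved = candidate, s, True
        if not improved:  # converged: no allowed mutation helps any more
            break
    return best, best_score


def toy_score(seq: str) -> float:
    """Placeholder objective counting aromatic residues. A real pipeline
    would use a binding-energy or interface score instead."""
    return float(sum(seq.count(aa) for aa in "FWY"))


if __name__ == "__main__":
    diversity = {1: "AY", 3: "TW"}  # made-up per-position residue sets
    print(optimize("GASTS", diversity, toy_score))
```

Limiting the search to residues seen in natural repertoires is one plausible way to reduce the risk of immunogenic modifications that the abstract mentions, at the cost of a narrower search space.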