Identifying transition states (TSs), the high-energy configurations that molecules pass through during chemical reactions, is essential for understanding and designing chemical processes. However, accurately and efficiently identifying these states remains one of the most challenging problems in computational chemistry. In this work, we introduce a new generative AI approach that improves the quality of initial guesses for TS structures. Our method can be combined with a variety of existing techniques, including both machine-learning models and fast, approximate quantum methods, to refine their predictions and bring them closer to chemically accurate results. Applied to TS guesses from a state-of-the-art machine-learning model, our approach reduces the median structural error to 0.077 Å and lowers the median absolute error in reaction barrier heights to 0.40 kcal mol⁻¹. When starting from a widely used tight-binding approximation, it increases the success rate of locating valid TSs by 41% and speeds up high-level quantum optimization by a factor of 3. By making TS searches more accurate, robust, and efficient, this method could accelerate reaction mechanism discovery and support the development of new materials, catalysts, and pharmaceuticals.
Title: Adaptive Transition-State Refinement with Learned Equilibrium Flows. Journal: Journal of Chemical Information and Modeling.
Authors: Samir Darouich, Vinh Tong, Tanja Bien, Johannes Kästner, Mathias Niepert
Pub Date: 2026-02-02; DOI: 10.1021/acs.jcim.5c02902
Pub Date: 2026-02-02; DOI: 10.1021/acs.jcim.5c02465
Shangyu Li, Peizhe Sun
With the surge in QSAR model development, concerns about evaluation rigor, particularly regarding the influence of data splitting, have grown. Using five data sets of various sizes, we systematically assessed the effects of random splits (RS), similarity-based splits (SS), and random-seed variability on model generalizability under two scenarios: limited data for chemical screening and standard modeling with ample data. Both the choice of data set partitioning method and the selection of random seeds can substantially affect internal test performance, which may not reliably reflect true predictive capability. Although SS can improve internal test performance in many settings, these gains do not necessarily translate into stronger external generalizability. Moreover, under low sampling ratios, SS may perform worse than RS on both internal and external tests. This challenges the implicit assumption that rational splits optimized for internal performance universally improve model performance. Notably, variability across random seeds was high on internal tests in the smallest data set (R²: 0.453–0.783), whereas on the fixed external data set R² varied less (0.633–0.672), regardless of applicability domain (AD) filtering. This undermined cross-study comparability and underscored the risk of overly optimistic conclusions. Our findings highlighted that test-set construction must be aligned with real-world application scenarios. Researchers should avoid relying on single or cherry-picked random seeds or unsuitable rational partitioning. Transparent, application-aligned partitioning protocols and AD methods should be employed to emphasize true external generalizability over potentially inflated internal metrics.
Title: Toward More Trustworthy QSAR: A Systematic Discussion on Data Set Partitioning. Journal: Journal of Chemical Information and Modeling.
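The seed-variability finding in the QSAR abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic (`make_data` is a hypothetical generator), the model is a one-descriptor least-squares line rather than a real QSAR model, and the 30/200 set sizes and 20% test fraction are arbitrary. The sketch refits the model under several random splits and compares the spread of R² on the internal test split with the spread on one fixed external set.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def r2(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    my = statistics.fmean(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - my) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def make_data(n, seed):
    """Hypothetical stand-in data: y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 4.0) for x in xs]
    return xs, ys


def seed_sweep(xs, ys, ext_xs, ext_ys, seeds, test_frac=0.2):
    """Refit under each random split; return internal and external R2 lists."""
    internal, external = [], []
    for seed in seeds:
        idx = list(range(len(xs)))
        random.Random(seed).shuffle(idx)          # the split depends only on the seed
        n_test = max(2, int(len(idx) * test_frac))
        test, train = idx[:n_test], idx[n_test:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        internal.append(r2([ys[i] for i in test], [a * xs[i] + b for i in test]))
        external.append(r2(ext_ys, [a * x + b for x in ext_xs]))
    return internal, external


xs, ys = make_data(30, seed=0)            # small modelling set
ext_xs, ext_ys = make_data(200, seed=1)   # fixed external set, never used for fitting
internal, external = seed_sweep(xs, ys, ext_xs, ext_ys, seeds=range(20))
print(f"internal R2: {min(internal):.3f} to {max(internal):.3f}")
print(f"external R2: {min(external):.3f} to {max(external):.3f}")
```

Reporting the full range over seeds, rather than one seed's value, is the practice the abstract argues for. The fixed external set isolates how much of the spread comes from the test split itself rather than from the fitted model.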
Pub Date: 2026-02-01; DOI: 10.1021/acs.jcim.5c02513
Haoxing Luo, Yue Hu, Chaolin Song, Xinhui Li, Yuyin Ma, Yurong Qian, Lei Deng
Proteins, as essential components of living organisms, play a critical role in both drug discovery and disease mechanism research. Multiple empirical studies have shown that there is a significant correlation between protein function and drug targets with therapeutic potential. Therefore, how to accurately and efficiently predict protein function is an urgent issue that needs to be addressed. Existing research faces challenges such as insufficient utilization of protein data and low heterogeneous fusion performance. In this paper, we propose ME-PFP, a novel ensemble learning framework that integrates sequence representations from a protein language model, domain, and protein–protein interaction data to improve protein function prediction. To effectively capture and utilize heterogeneous features, we design three specialized attention-based feature extractors tailored to each data modality. These features are then fused through a dynamic weighting strategy to enable complementary information exchange between different modalities, thereby improving protein function prediction performance. Extensive experiments on benchmark data sets show that ME-PFP significantly outperforms sequence-based and multisource fusion models. Notably, it achieved an average improvement of 13.23% on the human data set and 11.11% on the yeast data set. The experimental results show that this study not only improves the accuracy of protein function prediction, but also promotes progress in the field of computational biology.
Title: ME-PFP: An Ensemble Learning Approach Fusing Multi-Source Features for Protein Function Prediction. Journal: Journal of Chemical Information and Modeling.
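The ME-PFP abstract above describes fusing per-modality features through a dynamic weighting strategy but gives no formula. The sketch below shows one common way such a fusion can be written, not the authors' implementation: each modality gets a scalar gate score, the scores are normalised with a softmax, and the fused vector is the weighted sum. The gate vectors, the modality names (`seq`, `domain`, `ppi`) and the function `fuse` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, gates):
    """Fuse equal-length modality vectors with input-dependent weights.

    features: dict modality -> feature vector
    gates:    dict modality -> gate vector (hypothetical learned parameters)
    Returns (weights per modality, fused vector).
    """
    names = sorted(features)
    # One scalar score per modality: dot product of gate and feature vector.
    logits = [sum(g * x for g, x in zip(gates[n], features[n])) for n in names]
    alphas = softmax(logits)
    dim = len(features[names[0]])
    fused = [sum(a * features[n][i] for a, n in zip(alphas, names))
             for i in range(dim)]
    return dict(zip(names, alphas)), fused


features = {"seq": [1.0, 2.0], "domain": [3.0, 4.0], "ppi": [5.0, 6.0]}
gates = {"seq": [0.2, 0.1], "domain": [0.0, 0.1], "ppi": [-0.1, 0.1]}
weights, fused = fuse(features, gates)
print(weights, fused)
```

Because the weights depend on the features themselves, a modality can contribute more for one protein and less for another. A fixed weighted average cannot do this.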
Pub Date: 2026-01-31; DOI: 10.1021/acs.jcim.5c02424
Chris John, Edoardo Cignoni, Lorenzo Cupellini, Benedetta Mennucci
Excited states of embedded chromophores are highly influenced by their interaction with the environment. Herein, we present a machine-learning (ML) framework capable of predicting the different environmental contributions to excitation energies of chromophores in a polarizable embedding. Our ML models are built in a hierarchical structure to capture both the effect of ground-state polarization and the response of the polarizable environment to the electronic transition. With the use of the right descriptors, the models trained on the quantum mechanics/molecular mechanics (QM/MM) calculations in a nonpolarizable environment are able to successfully predict the effects of a polarizable environment on excitation energies. The ML models are applied to three chromophores present in light-harvesting complexes (chlorophyll a, chlorophyll b, and lutein) and are used to reproduce the excitonic structure of a multichromophoric system unseen in the training set to a level of accuracy offered by a polarizable QM/MM calculation, while taking a fraction of its time.
Title: Machine-Learning Framework for Excitation Energies of Chromophores in Polarizable Environments. Journal: Journal of Chemical Information and Modeling.
Pub Date: 2026-01-30; DOI: 10.1021/acs.jcim.5c02703
Zhiyuan Liu, Quan Qian
Structure relaxation plays a crucial role in atomic simulation and materials modeling, yet traditional first-principles approaches remain computationally expensive and, therefore, difficult to scale to high-throughput applications. In this work, we propose Traj2Relax, a trajectory-supervised structure relaxation framework based on conditional velocity field modeling. Instead of relying on explicit energy or force evaluations, Traj2Relax learns a time-dependent velocity field from geometric differences between successive configurations along real relaxation trajectories, enabling physically consistent structural convergence across a wide range of perturbation magnitudes. A time-scheduled noise mechanism is introduced during training to improve stability under highly distorted inputs, while deterministic integration during inference produces smooth, interpretable relaxation trajectories. Experimental results show that Traj2Relax achieves competitive accuracy under near-equilibrium conditions and demonstrates clear advantages under moderate to strong perturbations, where energy-driven and distribution-based relaxation methods tend to degrade. On representative inorganic crystal systems, Traj2Relax attains a root-mean-square deviation of 0.26 Å and a space-group consistency of 82.3% under equilibrium settings and maintains a root-mean-square deviation of 0.38 Å with a recovery rate of 5.8% under strong perturbations. The framework further supports deterministic, batch-parallel relaxation, yielding an order-of-magnitude improvement in inference throughput compared with iterative energy-minimization-based approaches. Overall, Traj2Relax provides an efficient and physically grounded alternative for learning-driven structure relaxation, particularly suited for high-throughput screening scenarios involving nonequilibrium or highly perturbed structures.
Title: Traj2Relax: A Trajectory-Supervised Method for Robust Structure Relaxation. Journal: Journal of Chemical Information and Modeling.
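The Traj2Relax abstract above describes inference as deterministic integration of a learned velocity field. A minimal sketch of that inference loop follows, under loud assumptions. `toy_velocity` is a hypothetical stand-in that simply pulls atoms toward a known reference geometry, whereas the real model would be a trained network with no access to the answer. The explicit-Euler scheme, the step count and the coordinates are likewise illustrative choices, not details from the paper.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def toy_velocity(coords, t, reference, rate=5.0):
    """Hypothetical stand-in for a learned field v(x, t).

    It pulls every atom toward a known reference; t is accepted to mirror a
    time-dependent field but is unused in this toy.
    """
    return [tuple(rate * (r - c) for c, r in zip(atom, ref))
            for atom, ref in zip(coords, reference)]


def relax(coords, velocity, n_steps=50):
    """Deterministic explicit-Euler integration of dx/dt = v(x, t) for t in [0, 1]."""
    dt = 1.0 / n_steps
    trajectory = [coords]
    for k in range(n_steps):
        v = velocity(coords, k * dt)
        coords = [tuple(c + dt * vc for c, vc in zip(atom, vel))
                  for atom, vel in zip(coords, v)]
        trajectory.append(coords)
    return trajectory


reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
perturbed = [(0.3, -0.2, 0.1), (1.1, 0.4, -0.3), (-0.2, 1.9, 0.2)]
field = lambda x, t: toy_velocity(x, t, reference)
trajectory = relax(perturbed, field)
print(f"RMSD before: {rmsd(perturbed, reference):.3f}")
print(f"RMSD after:  {rmsd(trajectory[-1], reference):.5f}")
```

No random numbers enter the loop, so repeated runs give identical trajectories. The per-atom update is also independent of other structures, which is the property that makes batch-parallel relaxation straightforward.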
Pub Date: 2026-01-30; DOI: 10.1021/acs.jcim.5c02565
Riccardo Nifosì, Luca Bellucci
Coiled coils, owing to their simple yet versatile architecture, serve as valuable model systems for both experimental and computational studies in protein science. Whereas the sequence-structure relationships that govern their oligomeric state and stability have been thoroughly investigated, important gaps remain, most notably regarding the role of central chloride ions coordinated by asparagine triads observed in several trimeric coiled-coil (TCC) crystal structures. To investigate the thermodynamics of chloride binding at this site, we performed extensive molecular simulations using metadynamics and alchemical free-energy calculations, both enhanced with replica exchange, to determine the chloride binding free energy (ΔG_bind) in three TCCs of similar length but different stability (PDB IDs: 2wpy, 4dzk, 1mof). Despite the nearly identical local coordination environment, the computed ΔG_bind values strongly depend on the overall protein structure, with variations in superhelical radius R_0 upon ion removal systematically accompanying the observed binding thermodynamics. In particular, both the metastable TCC 2wpy (a variant of the GCN4 leucine-zipper domain previously shown to be unstable in the absence of chloride) and the synthetic design 4dzk exhibit highly unfavorable binding, suggesting that current biomolecular force fields may not fully capture either the stabilizing role of chloride or the conformational ensemble of the unbound state. By contrast, the calculated ΔG_bind in 1mof, a fragment of the MoMuLV retroviral transmembrane protein, is favorable and is associated with the presence of an additional C-terminal leash domain that modulates the binding-site environment. These results identify TCCs as critical benchmarks for improving the description of anion-protein interactions and the balance between bound and unbound states in future force-field developments.
Title: Chloride Binding in Trimeric Coiled Coils: Free Energy and Structural Determinants from Molecular Simulations. Journal: Journal of Chemical Information and Modeling.
Drug repositioning aims to identify new indications for existing drugs, offering a cost-effective and time-efficient strategy for therapeutic development. Its core challenge lies in accurately predicting potential drug-disease associations (DDAs). However, existing computational approaches often suffer from inadequate drug representation, insufficient modeling of disease semantics, and imbalanced data distributions, which collectively limit predictive accuracy and generalization ability. To address these challenges, we propose an innovative framework, termed XRepDDA, that integrates multimodal feature representation with deep metric learning to improve DDA prediction accuracy and robustness. For drug representation, the SMI-TED pretrained chemical language model encodes SMILES sequences into chemically informative molecular embeddings. For disease representation, a hierarchical semantic graph based on the MeSH ontology is constructed together with a semantic-enhanced graph embedding strategy to capture hierarchical and semantic relationships among diseases. To mitigate class imbalance, we applied the AllKNN adaptive undersampling strategy. The prediction module is built upon an improved ModernNCA architecture, which learns a discriminative embedding space through deep metric learning. Experiments on multiple public benchmark data sets demonstrate that XRepDDA consistently outperforms diverse baseline models, including traditional machine learning, tree-based ensemble, and deep learning methods, achieving AUC and AUPR values of up to 0.9990 and 0.9991, respectively. Furthermore, molecular docking experiments on top-ranked candidate drugs for Alzheimer's disease and stomach neoplasms provide in silico validation of predictive reliability. 
To enhance interpretability, a multilevel explainability framework is established, combining SHAP-based global feature attribution with attention mechanisms and molecular perturbation analyses to identify key features and pharmacophores at the local level. These results support the chemical interpretability and the biological plausibility of the predictions.
Title: XRepDDA: An Interpretable Drug-Disease Association Prediction Framework Leveraging Pretrained Chemical Language Models. Journal: Journal of Chemical Information and Modeling.
Authors: Chenyi Zhang, Yun Zuo, Qiao Ning, Sisi Yuan, Zhaohong Deng, Hongwei Yin, Anjing Zhao
Pub Date: 2026-01-30; DOI: 10.1021/acs.jcim.5c02901
Pub Date: 2026-01-30; DOI: 10.1021/acs.jcim.5c02777
Taras Voitsitskyi, Ihor Koleiev, Roman Stratiichuk, Oleksandr Kot, Roman Kyrylenko, Illia Savchenko, Vladyslav Husak, Semen Yesylevskyy, Sergii Starosyla, Alan Nafiiev
Classical protein-ligand docking has been a cornerstone technique in computational drug discovery for decades but has reached an accuracy and performance plateau. Recently introduced Machine Learning (ML)-based docking methods offer a promising paradigm shift, but their practical adoption is hampered by accuracy-to-speed trade-offs, inadequate benchmarking standards, and questionable chemical validity of predicted poses. In this study, we introduce ArtiDock, an ML-based docking technique optimized for high-throughput virtual screening applications. To evaluate ArtiDock, we developed a dedicated performance and accuracy benchmark for pocket-specific rigid protein-ligand docking, which mimics realistic industrial drug discovery scenarios and is based on the novel PLINDER data set. We demonstrate that ArtiDock is 29–38% more accurate in comparison to leading open-source and commercial classical docking techniques such as AutoDock, Vina, and Glide, while providing a low computational cost. ArtiDock notably excels in challenging docking scenarios involving unbound protein structures and binding sites containing ions and structured water molecules. Additionally, we demonstrated competitive accuracy of our approach at considerably higher throughput compared to a wide range of AI docking and AI cofolding methods using the PoseX benchmark. Our results show that ArtiDock could be considered as a method of choice in high-throughput virtual screening scenarios.
Title: ArtiDock: Accurate Machine Learning Approach to Protein-Ligand Docking Optimized for High-Throughput Virtual Screening. Journal: Journal of Chemical Information and Modeling.
Pub Date: 2026-01-29 DOI: 10.1021/acs.jcim.5c02541
Alexander Meßler,Hilke Bahmann
Platinum (Pt) complexes are highly relevant for medicinal chemistry and homogeneous catalysis. In the development of novel Pt-based chemotherapeutic agents and catalysts, characterization of the compounds using nuclear magnetic resonance (NMR) spectroscopy of the 195Pt nucleus is standard. However, measuring 195Pt-NMR signals can be tedious due to the large chemical shift range and limited resolution. To facilitate experimental measurements by narrowing down the shift range, reliable predictions of the chemical shift are needed. For lighter nuclei such as 1H and 13C in particular, machine learning (ML) methods predict chemical shifts accurately, while analogous models for heavier nuclei are scarce. In this work, we propose Gaussian Process Regression (GPR) models for the prediction of 195Pt chemical shifts. The underlying data set comprises 292 structures, and three different descriptors were used to encode structural and chemical features of the molecules. Based on the prediction uncertainties derived from the posterior variance of the models, a reasonably narrow shift range can be estimated for a given Pt complex. The most robust model yields a mean absolute error (MAE) of 114 ppm on the holdout test set, which is significantly more accurate than relativistic DFT calculations.
{"title":"Uncertainty-Aware Prediction of 195Pt Chemical Shifts from Limited Data.","authors":"Alexander Meßler,Hilke Bahmann","doi":"10.1021/acs.jcim.5c02541","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02541","url":null,"PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"182 1","pages":""},"PeriodicalIF":5.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-29 DOI: 10.1021/acs.jcim.5c02555
Lorena Ruano,Álvaro Pérez-Barcia,Vito F Palmisano,Juan J Nogueira,Marcos Mandado,Nicolás Ramos-Berdullas
The aryl hydrocarbon receptor (AhR) is a ligand-activated transcription factor that mediates biological signals and regulates diverse cellular functions. Of particular concern are the effects triggered by dioxins and dioxin-like compounds (DLCs), whose toxicological outcomes arise through both canonical and noncanonical pathways, leading to the designation of AhR as the "dioxin receptor". However, conventional risk assessment approaches based on toxic equivalency factors (TEFs), which primarily reflect the capacity of these compounds to bind and activate AhR, do not fully account for critical aspects such as environmental concentration and bioavailability, potentially underestimating their true impact. In this work, we present a comparative analysis of polychlorinated dibenzo-p-dioxins (PCDDs) with varying degrees of chlorination, focusing on their interactions with the AhR at the ligand-binding domain and on their permeation abilities across a model lipid membrane. To this end, we combine classical molecular dynamics (CMD) simulations with a hybrid quantum mechanics/molecular mechanics energy decomposition analysis (QM/MM-EDA) framework. This integrated approach enables a molecular-level characterization of receptor binding affinities and membrane permeation efficiencies. Our findings provide novel insights into the mechanisms underlying the relative toxicity of DLCs and highlight the need for integrative assessment strategies that encompass both receptor-ligand interactions and physicochemical behavior in biological environments. Notably, the toxicity of these compounds, as quantified by the pEC50 index, correlates with the membrane permeation barrier rather than with AhR binding affinity, identifying permeation as the key mechanistic step in their toxicological process.
{"title":"Unravelling the Role Played by Non-covalent Interactions in the Action Mechanism of PCDDs within Cells.","authors":"Lorena Ruano,Álvaro Pérez-Barcia,Vito F Palmisano,Juan J Nogueira,Marcos Mandado,Nicolás Ramos-Berdullas","doi":"10.1021/acs.jcim.5c02555","DOIUrl":"https://doi.org/10.1021/acs.jcim.5c02555","url":null,"PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"34 1","pages":""},"PeriodicalIF":5.6,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146073277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}