Journal of Cheminformatics: Latest Articles (Page 2)

Universal feature selection for simultaneous interpretability of multitask datasets.

IF 5.7 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-17 DOI: 10.1186/s13321-025-01096-z

Matt Raymond, Jacob Charles Saldinger, Paolo Elvati, Angela Violi

Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS's general and scalable feature selection algorithm surpasses these limitations by identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while generally maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets, and we expect these results to be broadly useful to manually guided inverse problems. Beyond its current application, BoUTS holds potential for elucidating data-poor systems by leveraging information from similar data-rich systems. Scientific Contribution: BoUTS selects nonlinear, universally informative features across multiple datasets. We identify crucial "universal features" across seven real-world chemistry datasets, which enhance cross-dataset interpretability and selection stability. BoUTS is highly scalable and is applicable to tabular data from many domains, and our results identify connections between seemingly unrelated chemical domains.

Citations: 0

AI derivation and exploration of antibiotic class spaces.

IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-16 DOI: 10.1186/s13321-026-01153-1

Adam Bess, Sean Rowland, Chris Alvin, Supratik Mukhopadhyay

PURPOSE: The rapid evolution of antibiotic-resistant bacteria poses an urgent global health crisis. A key gap in current antibiotic discovery approaches is the absence of automated chemical synthesis methods designed to systematically generate and evaluate compounds within specific antibiotic classes. We address this gap through fragment-based computational experiments that systematically explore antibiotic chemical spaces. METHODS: Our computational methodology consists of three steps: fragmentation of known compounds (eMolFrag), generation of new molecular structures by recombining fragments (eSynth), and filtering of candidates based on desired properties (eFilter). eFilter combines structural analysis, pathway information, and protein targets to predict pharmacokinetic properties and therapeutic efficacy. We conducted three experiments: historical reconstruction of penicillin derivatives, hybrid molecule design combining functional groups from multiple antibiotic classes, and chemical space exploration of recently discovered antibiotics. RESULTS: Starting from Penicillin G and Methicillin, eSynth generated over 1.4 million potential penicillin derivatives. eFilter computationally predicted ampicillin, amoxicillin, and 10 other penicillin derivatives as high-scoring candidates, demonstrating that the pipeline can navigate the chemical space of β-lactam antibiotics. For hybrid molecules, 1.53% showed computational predictions suggesting broad-spectrum activity against penicillin and quinolone targets, with predicted binding scores higher than those of reference antibiotics for all protein targets evaluated. Chemical space exploration successfully generated computational candidates resembling Halicin-like molecules, with the top compound showing a binding score of 13.4 against JNK1. CONCLUSIONS: Our fragment-based pipeline demonstrates the feasibility of systematically exploring antibiotic chemical spaces through computational reconstruction of historical development pathways and generation of hybrid molecules with predicted multi-target binding profiles. All results represent computational predictions requiring experimental validation.

Citations: 0

Proteolysis-targeting Chimera efficacy prediction using a deep-learning-QSP model.

IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-12 DOI: 10.1186/s13321-026-01152-2

Sungwoo Goo, Jina Kim, Soyoung Lee, Sangkeun Jung, Jung-Woo Chae, Jae-Mun Choi, Hwi-Yeol Yun

This study presents an integrated computational modeling framework combining deep learning and Quantitative Systems Pharmacology (QSP) to predict the efficacy of PROTAC (PROteolysis Targeting Chimera) molecules. PROTACs have emerged as promising therapeutics for targeted protein degradation (TPD), offering significant advantages in addressing proteins that traditional small-molecule inhibitors cannot target. However, experimental evaluation of PROTAC efficacy is hindered by extensive variability in molecular configurations, necessitating efficient computational prediction methods. The proposed model integrates binding affinity predictions from DeepCalici, a convolutional neural network-based deep learning model, with a mechanistic QSP Hook model to estimate key pharmacodynamic parameters, notably the half-maximal degradation concentration (DC50) and maximal degradation (Dmax). This study utilized curated experimental data from PROTAC-DB, including experimentally validated DC50 and Dmax values. The dissociation constants (Kd) between PROTAC molecules and their protein targets (POI) or E3 ligases were predicted using DeepCalici and then incorporated into the Hook model. To enhance prediction accuracy, a supplementary deep neural network adjusted the Hook model parameters based on chemical and biochemical features. The integrated modeling approach achieved strong predictive performance for DC50, demonstrating its practical value in prioritizing effective PROTAC candidates. However, the predictions for Dmax were less accurate, likely reflecting variability in the experimental conditions that is not captured in the current dataset. This study highlights the critical importance of comprehensive structural data for accurate modeling of PROTAC efficacy and suggests future improvements using standardized experimental data. Such integrative modeling approaches promise to accelerate the discovery and optimization of PROTAC therapeutics.
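
The bell-shaped "hook effect" that a mechanistic ternary-complex model has to reproduce can be illustrated with a textbook simplification. The sketch below is not the paper's QSP Hook model. It assumes non-cooperative binding, PROTAC in large excess over both proteins, and protein concentrations well below both dissociation constants, so that the ternary-complex level is proportional to L / ((Kd_POI + L)(Kd_E3 + L)). The function names and the Kd values are hypothetical and chosen only for illustration.

```python
import math


def relative_ternary_complex(protac_conc, kd_poi, kd_e3):
    """Relative ternary-complex level (arbitrary units).

    Assumes no cooperativity and PROTAC in excess, as stated in the
    lead-in; this is an illustrative simplification, not the paper's model.
    """
    return protac_conc / ((kd_poi + protac_conc) * (kd_e3 + protac_conc))


def peak_concentration(kd_poi, kd_e3):
    """PROTAC concentration that maximises the expression above.

    Setting the derivative to zero gives L = sqrt(Kd_POI * Kd_E3).
    """
    return math.sqrt(kd_poi * kd_e3)


# Hypothetical dissociation constants in nM (not taken from PROTAC-DB).
KD_POI, KD_E3 = 10.0, 250.0

if __name__ == "__main__":
    for conc in (1.0, 10.0, 50.0, 250.0, 5000.0):
        level = relative_ternary_complex(conc, KD_POI, KD_E3)
        print(f"{conc:8.1f} nM -> {level:.6f}")
```

Under these assumptions the curve rises, peaks at the geometric mean of the two Kd values, and falls again at high PROTAC concentration as binary complexes outcompete the ternary complex.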

Citations: 0

Multimodal graph fusion with statistically guided parsimonious descriptor selection for molecular property prediction.

IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-11 DOI: 10.1186/s13321-025-01140-y

Yoonsuk Jang, Juyeon Lee, Keunhong Jeong, Jaeoh Kim

Graph convolutional networks (GCN) are effective for learning molecular representations, but their reliance on local message passing and simple feature concatenation limits their ability to capture global physicochemical properties. We present KROnecker-product based multimodal fusion with Variable sElection for eXpressive molecular representation learning (KROVEX), a method that integrates graph embeddings with molecular descriptors through a Kronecker product to explicitly model second-order interactions. Informative descriptors are identified using a two-stage procedure that combines iterative sure independence screening with Elastic Net regularization. The proposed approach was evaluated on two benchmark datasets (FreeSolv and ESOL) as well as two self-curated datasets with vapor pressure and aqueous solubility as the target properties. Overall, our method outperformed not only GCN but also fusion-based baselines such as EGCN, D-MPNN, and BAN under both random and scaffold splits. More importantly, the fusion operates at the final embedding level, enabling consistent performance across different GNN backbones (e.g., GAT and GIN). KROVEX achieves state-of-the-art performance on vapor pressure prediction, establishing a new benchmark for this safety-critical property, which is essential for environmental monitoring and industrial process design. Ablation studies further demonstrated that (1) statistically guided descriptor selection yields more informative features than predefined descriptors, and (2) Kronecker-product fusion provides greater improvements than simple concatenation as the number of descriptors increases. These results demonstrate that parsimonious descriptor selection combined with multimodal graph fusion enhances predictive performance and interpretability, providing a generalizable framework for molecular property prediction.
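
The difference between concatenation and Kronecker-product fusion can be seen on two small vectors. The sketch below is a minimal illustration, not the KROVEX implementation: the function names are hypothetical, and the optional `keep_first_order` flag (appending a constant 1 to each vector so the original entries survive in the product) is an assumption that the abstract does not mention.

```python
def concatenate(graph_embedding, descriptors):
    """Simple fusion: the two vectors are placed side by side (length m + n)."""
    return list(graph_embedding) + list(descriptors)


def kronecker_fuse(graph_embedding, descriptors, keep_first_order=False):
    """Kronecker product of two vectors (length m * n).

    Every output entry is a product g_i * d_j, i.e. an explicit
    second-order interaction between one embedding dimension and one
    descriptor. With keep_first_order=True (an assumption for this
    sketch) a constant 1 is appended to both inputs, so the original
    entries also appear in the result.
    """
    g = list(graph_embedding)
    d = list(descriptors)
    if keep_first_order:
        g = g + [1.0]
        d = d + [1.0]
    return [gi * dj for gi in g for dj in d]


if __name__ == "__main__":
    embedding = [1.0, 2.0]          # hypothetical 2-d graph embedding
    selected = [3.0, 4.0, 5.0]      # hypothetical 3 selected descriptors
    print(concatenate(embedding, selected))
    print(kronecker_fuse(embedding, selected))
```

Because the fused length grows as m * n, keeping the descriptor set small matters, which is consistent with the abstract's emphasis on parsimonious descriptor selection.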

Citations: 0

Empowering federated learning for robust compound-protein interaction prediction across heterogeneous cross-pharma domains.

IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-10 DOI: 10.1186/s13321-025-01147-5

Takuto Koyama, Hiroaki Iwata, Ryosuke Kojima, Takao Otsuka, Aki Hasegawa, Seungeon Lee, Hiroshi Ueda, Toshiharu Morimoto, Ryoko Sasaki, Nao Torimoto, Sei Murakami, Manabu Tojo, Teruki Honma, Shigeyuki Matsumoto, Yasushi Okuno

Accurately predicting novel compound-protein interactions (CPIs) is essential for accelerating drug discovery. The generalizability of machine learning-based CPI prediction models relies significantly on the availability and diversity of CPI datasets. To maximize data utility, particularly for highly confidential datasets maintained by industry, federated learning (FL), which integrates multi-site data while preserving privacy, has emerged as a promising approach. Nonetheless, its effectiveness when applied to heterogeneous data from diverse molecular domains, a common real-world scenario, remains unclear, which limits its broader adoption. This study evaluates FL for CPI prediction using datasets spanning multiple chemical and protein domains, providing practical guidance for optimizing the FL approach. Results indicate that the FL model enhanced out-of-domain prediction performance but was surpassed by local models for in-domain data under data heterogeneity. Drawing on these findings, a new strategy was developed to achieve robust performance for in- and out-of-domain tasks: a similarity-guided ensemble (SGE) that combines the global FL model with fine-tuned models based on each client's local data. This method demonstrated effectiveness with real-world industry data, including samples from a public database and 13 pharmaceutical companies. Cumulatively, these findings offer practical guidance for implementing FL in contemporary drug discovery processes. SCIENTIFIC CONTRIBUTION: This study identifies the performance trade-offs caused by heterogeneous data distributions in FL for CPI prediction. To overcome these challenges, we developed a workflow integrating local fine-tuning and an SGE, ensuring robust accuracy for both in-domain and out-of-domain predictions. The effectiveness of this approach was validated using both public datasets and real-world in-house datasets from 13 pharmaceutical companies.
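
The idea of a similarity-guided ensemble can be sketched in a few lines. The abstract does not state how the global and fine-tuned models are weighted, so everything below is an assumption made for illustration: the weight is taken as the maximum Tanimoto similarity between the query compound's fingerprint and the client's local training fingerprints, fingerprints are represented as sets of on-bit indices, and the two models are plain callables. All names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def similarity_guided_prediction(query_bits, local_train_bits,
                                 local_model, global_model):
    """Blend a locally fine-tuned model with the global FL model.

    Assumed rule (not from the paper): the closer the query is to the
    client's own training compounds, the more the fine-tuned local
    model is trusted; otherwise the global model dominates.
    """
    weight = max(
        (tanimoto(query_bits, ref) for ref in local_train_bits),
        default=0.0,
    )
    local_score = local_model(query_bits)
    global_score = global_model(query_bits)
    return weight * local_score + (1.0 - weight) * global_score


if __name__ == "__main__":
    # Stand-in models returning fixed interaction scores (hypothetical).
    local = lambda bits: 0.9
    shared = lambda bits: 0.4
    train = [{1, 2, 3, 4}, {10, 11}]
    print(similarity_guided_prediction({1, 2, 3}, train, local, shared))
```

This weighting reflects the trade-off reported in the abstract: local models do better in-domain, while the global FL model does better out-of-domain.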

Citations: 0

Analysis of cyclohexane, cyclopentane, and benzene conformations in ligands for PDB X-ray structures using the Hill-Reilly approach.

IF 8.6 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-10 DOI: 10.1186/s13321-026-01154-0

Gabriela Bučeková, Viktoriia Doshchenko, Tomáš Svoboda, Jana Porubská, Aliaksei Chareshneu, Tomáš Raček, Vladimír Horský, Radka Svobodová, Ondřej Schindler

Protein structural data are highly valuable for research, and many significant results have been published on their basis. An important part of protein structures is their ligands. The conformation of rings in ligands is crucial for the ligands' scaffold and shape and, therefore, for interactions with their surroundings and the subsequent biological function. For this reason, we developed a workflow to detect conformations of cyclohexane, cyclopentane, and benzene rings. The workflow can process rings originating from ligands that are parts of experimental protein structures deposited in the Protein Data Bank and determined by X-ray crystallography. This fully automatic workflow utilises the Hill-Reilly approach to calculate puckering angles that quantitatively describe ring conformation. The reproducibility of the workflow is guaranteed by storing datasets within Onedata, which enables automatic dataset retrieval and workflow execution. We analysed 128 012 ring structures originating from 25 479 different ligands. We found that cyclohexane ring structures include more than 22 % of unfavourable conformations, cyclopentane ring structures about 5 % and benzene ring structures only 0.01 %. We discovered that energetically unfavourable ring structures can occur in cyclohexane and cyclopentane ligands for sound chemical reasons. Their examination can help us to understand the binding of these ligands, which can be helpful for pharmacology, chemoinformatics, etc. On the other hand, energetically unfavourable ring conformations are often caused by model quality issues. Therefore, their occurrence should motivate researchers to inspect the quality of the protein model and also the ring's fit to the experimental data. Scientific Contribution: Our analysis uncovers the conformational behaviour of cyclohexane, cyclopentane, and benzene rings occurring in ligands that are parts of experimental protein structures deposited in the PDB. This paper's other substantial contribution is presenting the first successful application of the Hill-Reilly approach for cyclohexane, cyclopentane, and benzene rings. Moreover, we provide Hill-Reilly parameters for these ring conformations, which can be used in other analyses.
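
To show what quantifying ring conformation from coordinates means in practice, the sketch below computes the six endocyclic torsion angles of a six-membered ring. This is a simpler descriptor than the one the paper uses: Hill-Reilly puckering angles are defined differently, and this code does not implement them. The idealised coordinates (atoms on a circle with alternating heights) and all function names are hypothetical.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def torsion_deg(p0, p1, p2, p3):
    """Signed torsion angle p0-p1-p2-p3 in degrees (atan2 formulation)."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def ring_torsions(coords):
    """Torsion angle for every run of four consecutive ring atoms."""
    n = len(coords)
    return [torsion_deg(*(coords[(i + k) % n] for k in range(4)))
            for i in range(n)]


def idealised_hexagon(radius, half_height):
    """Six atoms on a circle; z alternates +/- half_height (0 means planar)."""
    return [(radius * math.cos(math.radians(60 * i)),
             radius * math.sin(math.radians(60 * i)),
             half_height if i % 2 == 0 else -half_height)
            for i in range(6)]


if __name__ == "__main__":
    print("planar:", [round(t, 1) for t in ring_torsions(idealised_hexagon(1.4, 0.0))])
    print("chair-like:", [round(t, 1) for t in ring_torsions(idealised_hexagon(1.4, 0.25))])
```

A planar ring such as benzene gives torsions near zero, whereas the alternating-height ring gives torsions of equal magnitude and alternating sign, the signature of a chair-like conformation.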

Citations: 0

Learnable protein representations in computational biology for predicting drug-target affinity.

IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-09 DOI: 10.1186/s13321-025-01145-7

Rachit Kumar,Joseph Romano,Marylyn Ritchie

In this review, we discuss the types of learnable protein representations that have been used in computational biology, with a particular focus on representations used for predicting drug-target affinity. We explore this from multiple perspectives: the source of the protein information, the training paradigms used to generate and apply such representations, and the types of (deep-learning-based) encoding or embedding methods used to generate and operate on them. We focus on drug-target affinity because of its particular relevance and utility in drug development and assessment, and, by examining the current literature from these perspectives, we suggest how the development of drug-target affinity prediction methods can be further improved. This survey thus serves as a resource for researchers seeking to develop methods for predicting drug-target affinity, showing how protein information has been used and how it could be used more effectively to improve such predictions.


Citations: 0

PepGraphormer: an ESM-GAT hybrid deep learning framework for antimicrobial peptide prediction.

IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-05 DOI: 10.1186/s13321-025-01144-8

Changhang Lin,Shuwen Xiong,Jinjin Li,Feifei Cui,Zilong Zhang,Hua Shi,Leyi Wei

The prediction of Antimicrobial Peptides (AMPs) is a critical research area in drug discovery. Traditional methods, which rely on sequence alignment or handcrafted features, often fail to capture complex sequence-function relationships. Recently, Large Language Models (LLMs) like ESM2 have demonstrated remarkable success in extracting deep semantic features from protein sequences. Meanwhile, Graph Neural Networks (GNNs), particularly Graph Attention Networks (GATs), can effectively learn inter-node relationships, specifically capturing peptide-residue compositional links and inter-residue co-occurrence patterns, to aggregate neighborhood information. In this work, we propose PepGraphormer, a novel fusion model that combines the strengths of large-scale pretraining from ESM2 with the structural learning advantages of GATs for AMP prediction. We first construct a heterogeneous graph with peptide sequences and amino acids as nodes, where ESM2 is used to generate high-quality initial embeddings for the peptide sequence nodes. The classifier then fuses the direct predictions from ESM2 with the graph-based predictions from the GAT. The ESM2 and GAT modules are trained jointly, and the embeddings for the nodes in the graph are learned during training. Comparisons with current state-of-the-art models on multiple datasets demonstrate that PepGraphormer achieves excellent accuracy and stability in the AMP prediction task. Further ablation and generalization experiments confirm the effectiveness and robustness of this fusion framework, presenting a new avenue for computationally-driven therapeutic peptide discovery. Scientific contribution: This work proposes PepGraphormer, a novel framework that combines the strengths of a transformer-based large language model (ESM2) and a graph attention network for antimicrobial peptide prediction, without requiring the 3D protein structural information used in previous studies. The model significantly outperforms state-of-the-art methods and various deep learning baselines on multiple AMP benchmark datasets.
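The graph construction and prediction fusion described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_hetero_graph` and `fuse_predictions` are hypothetical, "co-occurrence" is assumed here to mean two residue types appearing in the same peptide, and the fusion rule is assumed to be a simple weighted average of two probabilities. The abstract does not specify any of these details.

```python
from collections import defaultdict
from itertools import combinations


def build_hetero_graph(peptides):
    """Build the two edge types of a peptide/amino-acid graph (illustrative only).

    Returns:
        pep_res_edges: set of (peptide_index, residue) compositional links.
        cooccurrence: dict mapping a sorted residue pair to the number of
            peptides in which both residues appear (assumed definition).
    """
    pep_res_edges = set()
    cooccurrence = defaultdict(int)
    for idx, seq in enumerate(peptides):
        residues = sorted(set(seq))  # distinct residue types in this peptide
        for residue in residues:
            pep_res_edges.add((idx, residue))
        for a, b in combinations(residues, 2):
            cooccurrence[(a, b)] += 1
    return pep_res_edges, dict(cooccurrence)


def fuse_predictions(p_lm, p_graph, weight=0.5):
    """Weighted average of two probability estimates (hypothetical fusion rule)."""
    return weight * p_lm + (1.0 - weight) * p_graph


if __name__ == "__main__":
    edges, cooc = build_hetero_graph(["GIGK", "KKLG"])
    print(len(edges))          # number of peptide-residue links
    print(cooc[("G", "K")])    # peptides containing both G and K
    print(fuse_predictions(0.8, 0.6))
```

In a real model the peptide nodes would carry ESM2 embeddings and the edges would feed a graph attention layer; the sketch only shows how the two edge types named in the abstract could be derived from raw sequences.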



Citations: 0

Structure-free drug-target affinity prediction using protein and molecule language models.

IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-03 DOI: 10.1186/s13321-025-01146-6

Amir Hallaji Bidgoli,Morteza Mahdavi,Hamed Malek

Accurate prediction of drug-target affinity (DTA) is crucial for advancing drug discovery and optimizing experimental processes. Traditional DTA models often rely on handcrafted features or structural data, which can limit their generalizability and scalability. In this study, we propose a novel, sequence-centric approach for DTA prediction that leverages pretrained large language models (LLMs), namely ChemBERTa and ESM2, to encode protein and molecule sequences. These models produce semantically rich embeddings without the need for structural data. We introduce a customized Residual Inception architecture that efficiently integrates these sequence embeddings through multi-scale convolutions and residual connections, significantly improving prediction accuracy. Our method is evaluated on benchmark datasets Davis, KIBA, and BindingDB, achieving state-of-the-art performance with MSE = 0.182 and CI = 0.915 on Davis, MSE = 0.135 and CI = 0.902 on KIBA, and MSE = 0.467 and CI = 0.888 on BindingDB. These results highlight the potential of sequence-based approaches to provide scalable, accurate, and robust solutions for DTA prediction, offering valuable insights into drug-target interactions even in data-sparse settings. SCIENTIFIC CONTRIBUTION: The combination of pretrained language models and a lightweight neural architecture paves the way for more effective and adaptable DTA frameworks in real-world drug discovery applications.
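The two metrics reported above can be computed as in the following sketch. It assumes that MSE is the mean squared error and that CI denotes the concordance index as it is commonly used in DTA benchmarking (the fraction of comparable pairs whose predicted order matches the true order, with tied predictions counted as half); the abstract itself does not define either metric. The function names and the toy values are illustrative only.

```python
def mse(y_true, y_pred):
    """Mean squared error between measured and predicted affinities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def concordance_index(y_true, y_pred):
    """Concordance index over all pairs with different true affinities.

    A pair counts 1.0 if the predictions are ordered like the true values,
    0.5 if the two predictions are tied, and 0.0 otherwise.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal true values are not comparable
            comparable += 1
            true_diff = y_true[i] - y_true[j]
            pred_diff = y_pred[i] - y_pred[j]
            if pred_diff == 0:
                concordant += 0.5
            elif (true_diff > 0) == (pred_diff > 0):
                concordant += 1.0
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    measured = [5.0, 6.0, 7.0, 8.0]    # toy affinity values, not from the paper
    predicted = [5.1, 5.9, 7.2, 7.0]
    print(mse(measured, predicted))
    print(concordance_index(measured, predicted))
```

This pairwise loop is quadratic in the number of samples, which is fine for illustration; large benchmark evaluations would normally use a vectorised or sort-based implementation.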



Citations: 0

Molecular graph-based invariant representation learning with environmental inference and subgraph generation for out-of-distribution generalization.

IF 8.6 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics

Pub Date : 2026-01-02 DOI: 10.1186/s13321-025-01142-w

Hang Zhu,Sisi Yuan,Mingjing Tang,Guifei Zhou,Zhanxuan Hu,Zhaoyang Liu,Jin Li,Jianmin Wang,Chunyan Li

Molecular representation learning (MRL) is a crucial link between machine learning and chemistry. By encoding molecules as numerical vectors, it plays a vital role in molecular property prediction and in complex tasks such as drug discovery. While existing methods perform excellently when training and test data come from the same distribution, their generalization ability is often insufficient under distribution shift. Enhancing model generalization to out-of-distribution (OOD) data remains a significant challenge, as real-world molecular environments are often dynamic and uncertain. To address this issue, we propose a framework called EISG (Integrating Environmental Inference and Subgraph Generation) for molecular representation learning, which aims to improve model performance on OOD data by capturing the invariance of molecular graphs across environments. Specifically, we introduce an unsupervised environmental classification model to identify latent variables generated by different distributions, and we design a subgraph extractor based on information bottleneck theory that extracts invariant representations from molecular graphs that are closely related to the prediction labels. Through new learning objectives, the environmental classifier and the subgraph extractor work in tandem to help the model identify invariant graph representations in different environments, leading to more robust OOD generalization. Experimental results demonstrate that our model exhibits strong generalization capabilities across various OOD settings. Code is available on GitHub.
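One generic way to check whether a model behaves consistently across environments is to compare its error per environment. The sketch below is only a diagnostic under stated assumptions and is not the EISG objective: it assumes each sample already has an environment label (for example, one produced by an environment classifier), and the function names `per_environment_mse` and `risk_mean_and_variance` are hypothetical.

```python
from collections import defaultdict


def per_environment_mse(y_true, y_pred, env_ids):
    """Mean squared error computed separately for each environment label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t, p, env in zip(y_true, y_pred, env_ids):
        sums[env] += (t - p) ** 2
        counts[env] += 1
    return {env: sums[env] / counts[env] for env in sums}


def risk_mean_and_variance(risks):
    """Mean and population variance of the per-environment risks.

    A small variance suggests the model performs similarly in every
    environment; a large one points to environment-specific shortcuts.
    """
    values = list(risks.values())
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


if __name__ == "__main__":
    # Toy values for illustration only; not data from the paper.
    risks = per_environment_mse(
        y_true=[1.0, 2.0, 3.0, 4.0],
        y_pred=[1.0, 2.0, 4.0, 6.0],
        env_ids=[0, 0, 1, 1],
    )
    print(risks)
    print(risk_mean_and_variance(risks))
```

Reporting the worst environment alongside the mean is a common complement to a single aggregate score when the test distribution may differ from the training distribution.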



Citations: 0