
Journal of Cheminformatics: Latest Articles

All-atom protein sequence design using discrete diffusion models.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-01 DOI: 10.1186/s13321-025-01121-1
Amelia Villegas-Morcillo, Gijs J Admiraal, Marcel J T Reinders, Jana M Weber

Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches to protein sequence representation often rely solely on the 20 canonical amino acids, which limits the representation of non-canonical amino acids and of residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence on generation performance of two noise schedules within the diffusion process: uniform (random replacement of tokens) and absorbing (progressive masking). While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the more effective of the two for both representations. Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.
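
The two noise schedules named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it applies a single corruption at a given noise level rather than a full D3PM transition-matrix schedule, and the names `MASK`, `corrupt_uniform`, and `corrupt_absorbing` are hypothetical. Amino acid letters are used as tokens for brevity; the same idea would apply to SELFIES tokens.

```python
import random

MASK = "[MASK]"  # hypothetical name for the absorbing-state token


def corrupt_uniform(tokens, vocab, rate, rng):
    """Uniform noise: resample each token from the vocabulary with probability `rate`."""
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


def corrupt_absorbing(tokens, rate, rng):
    """Absorbing noise: replace each token by MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


rng = random.Random(0)
vocab = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 canonical amino acids
sequence = list("MKTAYIAKQR")         # toy sequence, chosen for illustration

for rate in (0.0, 0.5, 1.0):
    noisy_uniform = "".join(corrupt_uniform(sequence, vocab, rate, rng))
    noisy_absorbing = " ".join(corrupt_absorbing(sequence, rate, rng))
    print(rate, noisy_uniform, noisy_absorbing)
```

At the highest noise level the absorbing process yields a fully masked sequence, whereas the uniform process yields a random sequence over the vocabulary, so the two reverse processes start from different kinds of input.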

Citations: 0
Novel molecule design with POWGAN, a policy-optimized Wasserstein generative adversarial network.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-12-01 DOI: 10.1186/s13321-025-01114-0
Bruno Macedo, Inês Ribeiro Vaz, Tiago Taveira Gomes
Generative artificial intelligence has the potential to open vast new chemical search spaces, yet existing reinforcement-guided generative adversarial networks (GANs) struggle to produce non-fragmented, property-oriented molecules at scale without compromising other properties. To overcome these limitations, we present Policy-Optimised Wasserstein GAN (POWGAN), a graph-based generator that incorporates a dynamically scaled reward into adversarial training. The scaling factor increases when progress stalls, keeping gradients informative while steadily steering the generator towards user-defined objectives. When POWGAN replaces the loss function in a previous MedGAN architecture, with graph connectivity (non-fragmentation) as the target property, the fraction of fully connected quinoline-like molecules reaches 1.00, compared with 0.62 previously, while novelty (0.93) and uniqueness (0.95) are maintained. The resulting model, R-MedGAN, produces more than 12,000 novel quinoline-like molecules, a significant increase over its predecessor under identical experimental conditions. Chemical space visualizations demonstrate that these molecules populate regions not present in the training dataset or MedGAN, confirming genuine scaffold innovation. By achieving a new architecture capable of orienting the generative process towards a reward, our study also showed that this strategy can progress towards drug-likeness properties. Compared to baseline, the proportion of molecules with a Synthetic Accessibility Score (SAS, measured by the Ertl algorithm) between 1 and 6 increased from 8% to 65%, and the proportion with lipophilicity (measured as LogP) between 1.35 and 1.80 increased from 17% to 45%. Our study shows that the R-MedGAN architecture, incorporating the POWGAN loss, also generalizes to models trained on molecular scaffolds other than the quinoline originally tested in MedGAN (R-MedGAN-QNL). For the indole (R-MedGAN-IND) and imidazole (R-MedGAN-IMZ) datasets, connectivity increased from 0.38 and 0.50, respectively, up to 1.00 during training.
This study provides evidence that an adaptive reward-scaling policy in a Wasserstein GAN can simultaneously guide generative training towards a reward by enhancing molecular connectivity, expand generative throughput, preserve diversity, and improve drug-likeness properties. By eliminating the limiting trade-off between property optimisation and sample diversity, POWGAN and its R-MedGAN implementation advance the state of the art in molecule-generating GANs and deploy a robust, scalable platform for high-throughput, goal-directed chemical exploration in early-stage drug discovery. These findings underscore the effectiveness of adaptive reinforcement-driven strategies in reward-oriented generative adversarial networks for molecular discovery. SCIENTIFIC CONTRIBUTION: In this work we introduce POWGAN, a policy-optimized Wasserstein GAN that uses adaptive reward scaling to improve goal-directed molecule generation. Integrated into MedGAN (R-MedGAN), it increases the number of valid, connected, and novel molecules under identical settings while maintaining diversity and drug-likeness. This shows that adaptive reward strategies can jointly enhance molecular topology and property optimization at scale.
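
Two ideas from this abstract lend themselves to a short sketch: scoring a batch of generated molecular graphs by connectivity (non-fragmentation), and raising a reward scaling factor when progress stalls. The abstract does not give the actual update rule, so the `AdaptiveRewardScale` class and its `growth`, `patience`, and `tolerance` parameters are illustrative assumptions, not the POWGAN formulation.

```python
from collections import deque


def is_connected(num_atoms, bonds):
    """Return True if the molecular graph forms a single connected component."""
    if num_atoms == 0:
        return False
    neighbours = {i: set() for i in range(num_atoms)}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, queue = {0}, deque([0])
    while queue:  # breadth-first search from atom 0
        for nxt in neighbours[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == num_atoms


def connectivity_fraction(batch):
    """Fraction of generated graphs in a batch that are non-fragmented."""
    return sum(is_connected(n, bonds) for n, bonds in batch) / len(batch)


class AdaptiveRewardScale:
    """Hypothetical rule: grow the reward weight when the reward stops improving."""

    def __init__(self, initial=1.0, growth=1.5, patience=2, tolerance=1e-3):
        self.scale, self.growth = initial, growth
        self.patience, self.tolerance = patience, tolerance
        self.best, self.stalled = float("-inf"), 0

    def update(self, reward):
        if reward > self.best + self.tolerance:
            self.best, self.stalled = reward, 0   # progress: keep the current scale
        else:
            self.stalled += 1
            if self.stalled >= self.patience:     # stalled: strengthen the reward term
                self.scale *= self.growth
                self.stalled = 0
        return self.scale


# Toy batch: one connected three-atom chain and one molecule split into two fragments.
batch = [(3, [(0, 1), (1, 2)]), (4, [(0, 1), (2, 3)])]
print(connectivity_fraction(batch))

scaler = AdaptiveRewardScale()
for reward in (0.50, 0.50, 0.50, 0.62):
    print(reward, scaler.update(reward))
```

In a training loop, the returned scale would multiply the reward term that is added to the adversarial loss; how the two are combined in POWGAN is not specified in the abstract.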
Citations: 0
How to build machine learning models able to extrapolate from standard to modified peptides.
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-27 DOI: 10.1186/s13321-025-01115-z
Raúl Fernández-Díaz, Rodrigo Ochoa, Thanh Lam Hoang, Vanessa Lopez, Denis C Shields

Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) is more abundant than for modified ones. Thus, we set out to identify whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, choice of similarity function for guiding dataset partitioning and choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.
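
Similarity-based partitioning, as described above, can be sketched as follows. The sketch makes several assumptions that are not taken from the paper: peptides are represented by sets of 2-mers, similarity is the Jaccard index, and whole groups of linked items are assigned to one side of the split. The function names and the toy sequences are hypothetical.

```python
def kmers(seq, k=2):
    """Set of overlapping k-mers, used here as a stand-in molecular representation."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard (Tanimoto) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_split(features, threshold, test_fraction):
    """Link items with similarity >= threshold, then assign whole groups to train or test.

    Because linked items never end up on opposite sides, no test item is
    >= threshold similar to any training item.
    """
    ids = list(features)
    parent = {i: i for i in ids}

    def find(i):  # union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    train, test = [], []
    target = test_fraction * len(ids)
    for group in sorted(groups.values(), key=len):  # smallest groups fill the test set first
        (test if len(test) + len(group) <= target else train).extend(group)
    return train, test


# Toy sequences, invented for illustration.
peptides = {"p1": "ACDEFG", "p2": "ACDEFH", "p3": "KLMNPQ", "p4": "KLMNPR", "p5": "WYWYWY"}
features = {name: kmers(seq) for name, seq in peptides.items()}
train, test = similarity_split(features, threshold=0.5, test_fraction=0.4)
print("train:", train)
print("test:", test)
```

The achieved test fraction can fall short of `test_fraction` because groups are never split; that is the price of guaranteeing dissimilarity between the two subsets.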

Citations: 0
Nipah Virus Inhibitor Knowledgebase (NVIK): a combined evidence approach to prioritise small molecule inhibitors
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-24 DOI: 10.1186/s13321-025-01049-6
Bhupender Singh, Nishi Kumari, Ayush Upadhyay, Bhavini Pahuja, Eugenia Covernton, Kishan Kalia, Kanika Tuteja, Priyanka Rani Paul, Rakesh Kumar, Mayur Sudhakar Zarkar, Anshu Bhardwaj

Nipah Virus (NiV) came into the limelight due to an outbreak in Kerala, India. NiV infection can cause severe respiratory and neurological problems, with a fatality rate of 40–70%. It is a public health concern and has the potential to become a global pandemic. The lack of treatment has restricted containment methods to isolation and surveillance. WHO’s ‘R&D Blueprint list of priority diseases’ (2018) indicates that there is an urgent need for accelerated research and development to address NiV. In the quest for druglike NiV inhibitors (NVIs), a thorough literature search followed by systematic data curation was conducted. The curated NVIs were then rigorously analysed in order to prioritise compounds. Our efforts led to the creation of the Nipah Virus Inhibitor Knowledgebase (NVIK), a well-curated, structured knowledgebase of 220 NVIs with 142 unique small molecule inhibitors. The reported IC50/EC50 values for some of these inhibitors are in the nanomolar range, as low as 0.47 nM. Of the 142 unique small-molecule inhibitors, 124 (87.32%) cleared the PAINS filter. Clustering analysis identified more than 90% of the NVIs as singletons, signifying their diverse structural features. This diverse chemical space can be utilized in numerous ways to develop druglike anti-Nipah molecules. Further, we prioritised the top 10 NVIs based on the robustness of assays, physicochemical properties, and toxicity profiles. All NVI-related information, including structures, physicochemical properties, similarity analysis with FDA-approved drugs and other chemical libraries, and predicted ADMET profiles, is freely accessible at https://datascience.imtech.res.in/anshu/nipah/. The NVIK allows new inhibitors to be submitted as and when they are reported by the community, further enhancing the NVI landscape.

Scientific contribution

The NVIK is a dedicated resource for NiV drug discovery containing manually curated NVIs. The NVIs are structurally mapped with known chemical space to identify their structural diversity and recommend strategies for chemical library expansion. Also, in NVIK a combined evidence-based strategy is used to prioritise these inhibitors.
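
The singleton statistic quoted in the abstract can be illustrated with a small clustering sketch. The abstract does not name the clustering algorithm or the fingerprints used, so the leader-style clustering, the Tanimoto cutoff, and the toy fingerprints (sets of integer feature ids) below are stand-in assumptions. The last line rechecks the reported PAINS pass rate from the counts given in the abstract.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_clusters(fingerprints, cutoff):
    """Assign each item to the first cluster whose leader is >= cutoff similar."""
    clusters = []  # each cluster is a list of names; element 0 is the leader
    for name, fp in fingerprints.items():
        for cluster in clusters:
            if tanimoto(fp, fingerprints[cluster[0]]) >= cutoff:
                cluster.append(name)
                break
        else:  # no existing leader is similar enough: start a new cluster
            clusters.append([name])
    return clusters


def singleton_fraction(clusters):
    """Fraction of all items that sit alone in their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(1 for c in clusters if len(c) == 1) / total


# Toy fingerprints, invented for illustration.
fingerprints = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 5},
    "c": {10, 11},
    "d": {20, 21},
    "e": {30},
}
clusters = leader_clusters(fingerprints, cutoff=0.5)
print(clusters)
print(singleton_fraction(clusters))

# PAINS pass rate from the counts in the abstract: 124 of 142 compounds.
print(round(100 * 124 / 142, 2))
```

A high singleton fraction at a given cutoff means that few compounds share a close structural neighbour, which is the sense in which the abstract calls the chemical space diverse.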

Citations: 0
Beyond performance: how design choices shape chemical language models
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-18 DOI: 10.1186/s13321-025-01099-w
Inken Fender, Jannik Adrian Gut, Thomas Lemmin

Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We evaluated models through fine-tuning on downstream tasks and probing the structure of their latent spaces using probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, substantial differences were observed in the structure and interpretability of internal representations, highlighting that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. These results provide guidance for the development of more chemically grounded and interpretable CLMs.
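
Atomwise tokenization, mentioned above as generally improving interpretability, splits a SMILES string into one token per atom, bond, branch, or ring-closure symbol. The regular expression below is a simplified sketch covering common SMILES syntax; it is not the tokenizer used in the study, and `SMILES_TOKEN` and `atomwise_tokenize` are hypothetical names.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms such as [NH3+] or [C@@H]
    r"|Br|Cl"                 # two-letter atoms, matched before single letters
    r"|[BCNOPSFI]"            # one-letter aliphatic atoms
    r"|[bcnops]"              # aromatic atoms
    r"|%\d{2}"                # two-digit ring-closure labels
    r"|\d"                    # single-digit ring-closure labels
    r"|[()=#\-+\\/:~@?>*$.]"  # bonds, branches, and other symbols
)


def atomwise_tokenize(smiles):
    """Split a SMILES string into atom-level tokens; reject unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return tokens


for smiles in ("CC(=O)Nc1ccc(O)cc1", "C[C@H](N)C(=O)O", "ClCCBr"):
    print(smiles, atomwise_tokenize(smiles))
```

Because each token corresponds to a chemically meaningful unit, embeddings learned over such tokens can be inspected atom by atom, unlike subword schemes that may merge or split atoms arbitrarily.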

Scientific contribution

This study provides a systematic evaluation of how core design choices shape chemical language models. Although different configurations generally yield similar downstream performance, they produce substantial differences in the structure and interpretability of internal representations. For standard prediction tasks, a RoBERTa-based model with atomwise-tokenized SMILES input offers a practical and reliable setup. By clarifying the effects of molecular representation and tokenization strategy, our results offer actionable guidance for developing more interpretable and chemically informed CLMs.
Citations: 0
NPBS Atlas: a comprehensive data resource for exploring the biological sources of natural products
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2025-11-18 DOI: 10.1186/s13321-025-01116-y
Tingjun Xu, Jinfang Dai, Yingyong Li, Junhong Zhou, Yingli Zhao, Weiming Chen, Xiao-Song Xue

Natural products continue to play a pioneering role in drug discovery due to their extraordinary chemical and biological diversity. However, their full therapeutic potential remains largely underutilized, hindered by the fragmented documentation of biological origins in existing data resources. Here, we present the Natural Product and Biological Source Atlas (NPBS Atlas), a data resource covering over 218,000 natural products annotated with comprehensive biological sources, bioactivities, and references. The database, established through systematic text mining and expert manual curation, places special emphasis on curating source organism data, including scientific nomenclature, taxonomic classification, source parts, and the source of Traditional Chinese Medicines. NPBS Atlas represents a significant advance in natural product data resources through its unique content, specialized annotations, and featured data, enabling exploration of nature-derived chemical diversity in its biological context. The web interface of NPBS Atlas is freely available at https://biochemai.cstspace.cn/npbs/.

Scientific contribution

NPBS Atlas documents all major biological sources of natural products, including plants, animals, fungi, and bacteria, with dedicated annotation of marine organisms. It uniquely records the parts of the source organism (such as roots, stems, leaves, and fruits) and Traditional Chinese Medicine applications, featured data that comparable resources lack. It also covers more than 14,000 natural products with bioactivity data (such as cytotoxic, antioxidant, antibacterial, and antitumor activity) that are not available in comparable repositories.
Citations: 0
Predicting the critical micelle concentration of binary surfactant mixtures using machine learning
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-11-12 DOI: 10.1186/s13321-025-01112-2
Aditya Choudhary, Saaketh Desai, Methun Kamruzzaman, Alexander Landera, Koushik Ghosh, Kunal Poorey

Surfactant mixtures play a critical role in industries such as drug delivery, cosmetics, firefighting foams, and lubrication, serving as foundational components of the global economy. Their performance hinges on micelle formation, a self-assembly process governed by the critical micelle concentration (CMC), which enables key functions like solubilization, emulsification, and targeted molecular delivery. However, rapidly and accurately predicting the CMC of mixtures remains a significant challenge due to the chemical diversity of surfactants and the nonlinear interactions between them. Here, we introduce an artificial neural network (ANN)-based machine learning framework to predict the CMC of binary surfactant mixtures. Our workflow leverages cheminformatics-derived molecular descriptors for each surfactant component, which are then aggregated using strategies such as concatenation, arithmetic mean, and harmonic mean. We find that pairing the arithmetic mean strategy with an ANN yields the best performance, effectively capturing complex molecular interactions and enabling dual predictive capabilities: (1) precise interpolation of CMC values at untested mole fractions within known mixtures, and (2) accurate prediction of complete CMC–composition profiles for entirely novel surfactant combinations. SHAP-based interpretability analysis highlights that features such as hydrophobic surface area, electronic topological descriptors, and headgroup basicity drive model predictions, aligning with core principles of surfactant chemistry and reinforcing the mechanistic validity of our model. Overall, this framework accelerates data-driven surfactant design by reducing experimental burden and enabling rapid, rational optimization of formulations across pharmaceuticals, personal care, environmental remediation, and enhanced oil recovery.

Scientific contribution

This study presents a novel machine learning framework that, for the first time, predicts full critical micelle concentration (CMC)–composition profiles for binary surfactant mixtures, including untrained systems. By strategically combining the features of the individual mixture components using the arithmetic mean, our artificial neural network model deciphers nonlinear interactions between chemically distinct surfactants, enabling accurate and generalizable CMC predictions. Beyond performance gains, this framework facilitates rapid and systematic exploration of formulation space via inverse design and high-throughput screening, establishing a powerful foundation for the rational development of next-generation surfactants with applications in energy, environmental remediation, pharmaceuticals, and biomedical science.
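
The three aggregation strategies named in the abstract (concatenation, arithmetic mean, harmonic mean) can be illustrated with a minimal sketch. This is not the authors' code: the function name `aggregate`, the descriptor values, and the equal weighting of the two components are all assumptions made for illustration, and the abstract does not say how the mole fraction enters the model, so it is left out here.

```python
from statistics import harmonic_mean


def aggregate(desc_a, desc_b, strategy="arithmetic"):
    """Combine two equal-length descriptor vectors into one mixture feature vector.

    Hypothetical helper for illustration; equal weighting of both components
    is an assumption, not something stated in the abstract.
    """
    if len(desc_a) != len(desc_b):
        raise ValueError("descriptor vectors must have the same length")
    if strategy == "concatenation":
        # Keeps every descriptor of both components; doubles the feature count.
        return list(desc_a) + list(desc_b)
    if strategy == "arithmetic":
        # Element-wise average; the result does not depend on component order.
        return [(a + b) / 2 for a, b in zip(desc_a, desc_b)]
    if strategy == "harmonic":
        # statistics.harmonic_mean raises StatisticsError on negative values.
        return [harmonic_mean([a, b]) for a, b in zip(desc_a, desc_b)]
    raise ValueError(f"unknown strategy: {strategy}")


# Made-up descriptor values for two surfactants (not taken from the paper).
surfactant_a = [2.0, 4.0, 1.0]
surfactant_b = [6.0, 4.0, 3.0]

for name in ("concatenation", "arithmetic", "harmonic"):
    print(name, aggregate(surfactant_a, surfactant_b, name))
```

One practical difference the sketch makes visible: concatenation depends on which surfactant is listed first, whereas the two mean-based strategies are symmetric in the components.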

Citations: 0
How evaluation choices distort the outcome of generative drug discovery
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-11-12 DOI: 10.1186/s13321-025-01108-y
Rıza Özçelik, Francesca Grisoni

“How to evaluate the de novo designs proposed by a generative model?” Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines challenges both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh perspective, both critical and constructive, on de novo design evaluation. By training chemical language models, we analyze approximately 1 billion molecule designs and discover principles consistent across different neural networks and datasets. We uncover a key confounder: the size of the generated molecular library significantly impacts evaluation outcomes, often leading to misleading model comparisons. We find that increasing the number of designs remedies this, and we propose new, compute-efficient metrics that can be computed at large scale. We also identify critical pitfalls in commonly used metrics, such as uniqueness and distributional similarity, that can distort assessments of generative performance. To address these issues, we propose new and refined strategies for reliable model comparison and design evaluation. Furthermore, when examining molecule selection and sampling strategies, our findings reveal constraints on diversifying the generated libraries and draw new parallels and distinctions between deep learning and drug discovery. We anticipate that our findings will help reshape evaluation pipelines in generative drug discovery, paving the way for more reliable and reproducible generative modeling approaches.

Our work takes a step toward enhancing the robustness and reliability of evaluation practices in generative drug discovery. We systematically analyze current evaluation practices using approximately one billion designs from deep learning models. We find that the number of designs, often an overlooked parameter, can distort scientific outcomes related to distributional similarity and diversity. Moreover, we show that using larger design libraries than are typically adopted helps to avoid this pitfall, and we develop efficient algorithms to enable large-scale studies. We also propose guidelines for prospective molecule selection and uncover inherent constraints in diversifying molecular designs.
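
The library-size confounder can be shown with a toy example. The sketch below is not the paper's pipeline: the "generator" simply draws integer IDs uniformly from a finite pool, and `sample_library`, `pool_size`, and the seed are assumptions chosen for illustration. It only demonstrates why a uniqueness score is not comparable across libraries of different sizes.

```python
import random


def uniqueness(designs):
    """Fraction of distinct items among the generated designs."""
    return len(set(designs)) / len(designs)


def sample_library(n, pool_size=1000, seed=0):
    """Toy generator (assumption): draw n designs uniformly from a finite pool of IDs."""
    rng = random.Random(seed)
    return [rng.randrange(pool_size) for _ in range(n)]


# The generator is identical in every run; only the number of designs changes.
for n in (100, 1_000, 10_000):
    print(n, round(uniqueness(sample_library(n)), 3))
```

Because the pool holds only 1,000 distinct items, a library of 10,000 designs can never score above 0.1, while a library of 100 designs from the same generator scores far higher. Comparing two models at different library sizes would therefore mostly measure the size difference.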

Citations: 0
Enhancing multi-task in vivo toxicity prediction via integrated knowledge transfer of chemical knowledge and in vitro toxicity information
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-11-12 DOI: 10.1186/s13321-025-01110-4
Minsu Park, Yewon Shin, Hyunho Kim, Hojung Nam

The evaluation of potential drug toxicity is a crucial step in early drug development. In vivo toxicity assessment represents a key challenge that must be addressed before advancing to clinical trials. However, traditional in vivo experiments primarily rely on animal models, raising concerns regarding cost, time efficiency, and ethical considerations. To address these challenges, various computational approaches have been developed to support in vivo toxicity evaluations, though these methods often demonstrate limited generalizability due to data scarcity. In this study, we propose MT-Tox, a knowledge transfer-based multi-task learning model specifically designed for in vivo toxicity prediction that overcomes data scarcity. Our model implements a sequential knowledge transfer strategy across three stages: general chemical knowledge pretraining, in vitro toxicological auxiliary training, and in vivo toxicity fine-tuning. This hierarchical approach significantly improves model performance by systematically leveraging information from both chemical structure and toxicity data sources. MT-Tox outperforms baseline models across three in vivo toxicity endpoints: carcinogenicity, drug-induced liver injury (DILI), and genotoxicity. Through ablation studies and attention analyses, we demonstrate that each knowledge transfer technique makes meaningful contributions to the prediction process. Finally, we demonstrate the real-world application of our model as a prediction tool for early-stage drug discovery through comprehensive DrugBank database screening.

Scientific contribution: We propose a knowledge transfer framework that integrates chemical and in vitro toxicological information to enhance in vivo toxicity prediction in low-data regimes. Our model provides dual-level interpretability across chemical and biological domains through an attention mechanism. Moreover, we demonstrate our model’s applicability by screening the DrugBank database, simulating practical toxicity screening scenarios in drug development.

Citations: 0
C2PO: an ML-powered optimizer of the membrane permeability of cyclic peptides through chemical modification
IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date : 2025-11-11 DOI: 10.1186/s13321-025-01109-x
Roy Aerts, Joris Tavernier, Alan Kerstjens, Mazen Ahmad, Jose Carlos Gómez-Tamayo, Gary Tresadern, Hans De Winter

Peptide drug development is currently receiving due attention as a modality between small and large molecules. Therapeutic peptides represent an opportunity to achieve high potency and selectivity and to reach intracellular targets. A new era in the development of therapeutic peptides emerged with the arrival of cyclic peptides, which avoid the limitations of parenteral administration by achieving sufficient oral bioavailability. However, improving the membrane permeability of cyclic peptides remains one of the principal bottlenecks. Here, we introduce a deep learning regression model of cyclic peptide membrane permeability based on publicly available data. The model starts from a chemical structure and goes beyond limited-vocabulary language models, generalizing to monomers outside the training dataset. Moreover, we introduce an efficient estimator2generative wrapper that enables the model to be used for direct molecular optimization of membrane permeability via chemical modification. We name our application C2PO (Cyclic Peptide Permeability Optimizer). Lastly, we demonstrate how a molecule correction tool can be used to limit the presence of unfamiliar chemistry in the generated molecules.

Scientific contribution: We provide an ML-driven optimizer application, named C2PO, that returns structurally modified cyclic peptides with improved membrane permeability, one of the pivotal tasks in drug discovery and development. C2PO is a first-in-class application for improving cyclic peptide permeability, in that it converts an ML model into a generative optimizer of chemical structures. Additionally, through demonstration, we encourage the use of an automated post-correction tool with a chemistry reference library to correct strange chemistry outputs from C2PO, a known issue for ML-generated chemical structures.
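
The abstract does not describe how the estimator2generative wrapper works internally. As a generic illustration of the broader idea of turning a property estimator into an optimizer, the sketch below uses plain greedy hill climbing over single-position substitutions. This is a swapped-in textbook technique, not C2PO's method; `optimize`, `toy_score`, and the monomer labels are hypothetical, and the toy scorer has no chemical meaning.

```python
def optimize(start, candidates, score, max_steps=10):
    """Greedy hill climbing: at each step apply the single substitution that most improves score.

    Generic illustration only; not the algorithm used by C2PO.
    """
    best, best_score = list(start), score(start)
    for _ in range(max_steps):
        improved = False
        best_trial = best
        for i in range(len(best)):
            for monomer in candidates:
                if monomer == best[i]:
                    continue
                # Replace exactly one position of the current best sequence.
                trial = best[:i] + [monomer] + best[i + 1:]
                trial_score = score(trial)
                if trial_score > best_score:
                    best_trial, best_score, improved = trial, trial_score, True
        if not improved:
            break  # local optimum: no single substitution helps
        best = best_trial
    return best, best_score


def toy_score(sequence):
    """Placeholder estimator (assumption): counts occurrences of the label 'X'."""
    return sum(1 for monomer in sequence if monomer == "X")


result, value = optimize(["A", "B", "C"], candidates=["A", "X"], score=toy_score)
print(result, value)
```

In a real setting the `score` callable would be a trained permeability model, and the candidate modifications would be chemically meaningful edits rather than label swaps.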

Citations: 0