
Journal of Cheminformatics: Latest Articles

Enhancing molecular property prediction with auxiliary learning and task-specific adaptation
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-24 | DOI: 10.1186/s13321-024-00880-7
Vishal Dey, Xia Ning

Pretrained Graph Neural Networks have been widely adopted for various molecular property prediction tasks. Despite their ability to encode structural and relational features of molecules, traditional fine-tuning of such pretrained GNNs on the target task can lead to poor generalization. To address this, we explore the adaptation of pretrained GNNs to the target task by jointly training them with multiple auxiliary tasks. This could enable the GNNs to learn both general and task-specific features, which may benefit the target task. However, a major challenge is to determine the relatedness of auxiliary tasks with the target task. To address this, we investigate multiple strategies to measure the relevance of auxiliary tasks and integrate such tasks by adaptively combining task gradients or by learning task weights via bi-level optimization. Additionally, we propose a novel gradient surgery-based approach, Rotation of Conflicting Gradients (RCGrad), that learns to align conflicting auxiliary task gradients through rotation. Our experiments with state-of-the-art pretrained GNNs demonstrate the efficacy of our proposed methods, with improvements of up to 7.7% over fine-tuning. This suggests that incorporating auxiliary tasks along with target task fine-tuning can be an effective way to improve the generalizability of pretrained GNNs for molecular property prediction.

Scientific contribution

We introduce a novel framework for adapting pretrained GNNs to molecular tasks using auxiliary learning to address the critical issue of negative transfer. Leveraging novel gradient surgery techniques such as RCGrad, the proposed adaptation framework represents a significant departure from the dominant pretraining fine-tuning approach for molecular GNNs. Our contributions are significant for drug discovery research, especially for tasks with limited data, filling a notable gap in the efficient adaptation of pretrained models for molecular GNNs.
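To make "conflicting gradients" concrete: an auxiliary-task gradient is usually said to conflict with the target-task gradient when their dot product (equivalently, their cosine similarity) is negative. The sketch below detects such a conflict and resolves it with a projection step in the style of PCGrad, a well-known gradient surgery method. It is not RCGrad, whose learned rotation is not specified in the abstract; the function names and the two-dimensional toy vectors are illustrative assumptions.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; negative values indicate conflicting directions."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def project_out_conflict(aux_grad, target_grad):
    """PCGrad-style projection (an assumption, not the paper's RCGrad).

    If aux_grad conflicts with target_grad, remove the component of
    aux_grad that points against target_grad; otherwise return it unchanged.
    """
    d = dot(aux_grad, target_grad)
    if d >= 0:
        return list(aux_grad)
    scale = d / dot(target_grad, target_grad)
    return [a - scale * t for a, t in zip(aux_grad, target_grad)]


# Toy gradients (hypothetical values).
target = [1.0, 0.0]
aux = [-1.0, 1.0]

print(cosine(aux, target))                # negative, so the two conflict
fixed = project_out_conflict(aux, target)
print(fixed)                              # [0.0, 1.0]
```

After projection the auxiliary gradient is orthogonal to the target gradient, so adding it to the update no longer pushes against the target task. A rotation-based method would instead change the direction of the auxiliary gradient while keeping its magnitude.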

Citations: 0
Estimating the synthetic accessibility of molecules with building block and reaction-aware SAScore
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-23 | DOI: 10.1186/s13321-024-00879-0
Shuan Chen, Yousung Jung

Synthetic accessibility prediction is a task to estimate how easily a given molecule might be synthesizable in the laboratory, playing a crucial role in computer-aided molecular design. Although synthesis planning programs can determine synthesis routes, their slow processing times make them impractical for large-scale molecule screening. On the other hand, existing rapid synthesis accessibility estimation methods offer speed but typically lack integration with actual synthesis routes and building block information. In this work, we introduce BR-SAScore, an enhanced version of SAScore that integrates the available building block information (B) and reaction knowledge (R) from synthesis planning programs into the scoring process. In particular, we differentiate fragments inherent in building blocks and fragments to be derived from synthesis (reactions) when scoring synthetic accessibility. Compared to existing methods, our experimental findings demonstrate that BR-SAScore offers more accurate and precise identification of a molecule's synthetic accessibility by the synthesis planning program with a fast calculation time. Moreover, we illustrate how BR-SAScore provides chemically interpretable results, aligning with the capability of the synthesis planning program embedded with the same reaction knowledge and available building blocks.

Scientific contribution

We introduce BR-SAScore, an extension of SAScore, to estimate the synthetic accessibility of molecules by leveraging known building-block and reactivity information. In our experiments, BR-SAScore predicts molecular synthetic accessibility more accurately than previous methods, including SAScore and deep-learning models, while requiring significantly less computation time. In addition, we show that BR-SAScore is able to precisely identify the chemical fragment contributing to the synthetic infeasibility, holding great potential for future molecule synthesizability optimization.

Citations: 0
piscesCSM: prediction of anticancer synergistic drug combinations
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-19 | DOI: 10.1186/s13321-024-00859-4
Raghad AlJarf, Carlos H. M. Rodrigues, Yoochan Myung, Douglas E. V. Pires, David B. Ascher

While drug combination therapies are of great importance, particularly in cancer treatment, identifying novel synergistic drug combinations has been a challenging venture. Computational methods have emerged in this context as a promising tool for prioritizing drug combinations for further evaluation, though they have presented limited performance, utility, and interpretability. Here, we propose a novel predictive tool, piscesCSM, that leverages graph-based representations to model small molecule chemical structures to accurately predict drug combinations with favourable anticancer synergistic effects against one or multiple cancer cell lines. Leveraging these insights, we developed a general supervised machine learning model to guide the prediction of anticancer synergistic drug combinations in over 30 cell lines. It achieved an area under the receiver operating characteristic curve (AUROC) of up to 0.89 on independent non-redundant blind tests, outperforming state-of-the-art approaches on both large-scale oncology screening data and an independent test set generated by AstraZeneca (with more than a 16% improvement in predictive accuracy). Moreover, by exploring the interpretability of our approach, we found that simple physicochemical properties and graph-based signatures are predictive of chemotherapy synergism. To provide a simple and integrated platform to rapidly screen potential candidate pairs with favourable synergistic anticancer effects, we made piscesCSM freely available online at https://biosig.lab.uq.edu.au/piscescsm/ as a web server and API. We believe that our predictive tool will provide a valuable resource for optimizing and augmenting combinatorial screening libraries to identify effective and safe synergistic anticancer drug combinations.
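For readers unfamiliar with the reported metric: AUROC can be read as the probability that a randomly chosen positive example (here, a synergistic drug pair) receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal pure-Python sketch of that definition follows; the labels and scores are made-up toy values, not piscesCSM outputs, and the function name is an assumption.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 class labels.
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 synergistic pairs, 2 non-synergistic pairs.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(auroc(labels, scores))  # 5 of 6 positive/negative pairs are ordered correctly
```

The quadratic loop is kept for readability; for large screening sets a rank-based implementation such as the one in a standard machine learning library would be the usual choice.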

Scientific contribution

This work proposes piscesCSM, a machine learning-based framework that relies on well-established graph-based representations of small molecules to identify synergistic drug combinations with improved predictive accuracy. Our model, piscesCSM, shows that combining physicochemical properties with graph-based signatures can outperform current architectures on classification prediction tasks. Furthermore, implementing our tool as a web server offers researchers a user-friendly platform to screen for potential synergistic drug combinations with favourable anticancer effects against one or multiple cancer cell lines.
Citations: 0
Reaction rebalancing: a novel approach to curating reaction databases
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-19 | DOI: 10.1186/s13321-024-00875-4
Tieu-Long Phan, Klaus Weinbauer, Thomas Gärtner, Daniel Merkle, Jakob L. Andersen, Rolf Fagerberg, Peter F. Stadler

Purpose

Reaction databases are a key resource for a wide variety of applications in computational chemistry and biochemistry, including Computer-aided Synthesis Planning (CASP) and the large-scale analysis of metabolic networks. The full potential of these resources can only be realized if datasets are accurate and complete. Missing co-reactants and co-products, i.e., unbalanced reactions, however, are the rule rather than the exception. The curation and correction of such incomplete entries is thus an urgent need.

Methods

The SynRBL framework addresses this issue with a dual-strategy: a rule-based method for non-carbon compounds, using atomic symbols and counts for prediction, alongside a Maximum Common Subgraph (MCS)-based technique for carbon compounds, aimed at aligning reactants and products to infer missing entities.
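The idea of using "atomic symbols and counts" to spot missing species can be illustrated with a small element-balance check. This is only a sketch under stated assumptions: the abstract does not specify SynRBL's input representation or rules, so plain molecular formulas (no parentheses, charges, or isotopes) and the helper names below are hypothetical.

```python
import re
from collections import Counter


def count_atoms(formula):
    """Count atoms in a flat molecular formula such as 'C2H4O2'."""
    counts = Counter()
    for symbol, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(num) if num else 1
    return counts


def side_counts(formulas):
    """Total atom counts over all species on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total.update(count_atoms(formula))
    return total


def missing_atoms(reactants, products):
    """Return (atoms missing on the product side, atoms missing on the reactant side)."""
    r, p = side_counts(reactants), side_counts(products)
    # Counter subtraction keeps only positive differences.
    return dict(r - p), dict(p - r)


# Esterification recorded without its co-product:
# acetic acid + ethanol -> ethyl acetate (water is not listed)
from_products, from_reactants = missing_atoms(["C2H4O2", "C2H6O"], ["C4H8O2"])
print(from_products)   # {'H': 2, 'O': 1}
print(from_reactants)  # {}
```

Here the deficit on the product side, two hydrogens and one oxygen, matches a missing water molecule. Mapping such a deficit to a specific compound is where a rule set would come in; deficits that include carbon would, per the abstract, be handled by the MCS-based technique instead.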

Results

The rule-based method exceeded 99% accuracy, while MCS-based accuracy varied from 81.19 to 99.33%, depending on reaction properties. Furthermore, an applicability domain and a machine learning scoring function were devised to quantify prediction confidence. The overall efficacy of this framework was delineated through its success rate and accuracy metrics, which spanned from 89.83 to 99.75% and 90.85 to 99.05%, respectively.

Conclusion

The SynRBL framework offers a novel solution for recalibrating chemical reactions, significantly enhancing reaction completeness. With rigorous validation, it achieved groundbreaking accuracy in reaction rebalancing. This sets the stage for future improvement in particular of atom-atom mapping techniques as well as of downstream tasks such as automated synthesis planning.

Scientific Contribution

SynRBL features a novel computational approach to correcting unbalanced entries in chemical reaction databases. By combining heuristic rules for inferring non-carbon compounds and common subgraph searches to address carbon unbalance, SynRBL successfully addresses most instances of this problem, which affects the majority of data in most large-scale resources. Compared to alternative solutions, SynRBL achieves a dramatic increase in both success rate and accuracy, and provides the first freely available open source solution for this problem.

Citations: 0
Ualign: pushing the limit of template-free retrosynthesis prediction with unsupervised SMILES alignment
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-15 | DOI: 10.1186/s13321-024-00877-2
Kaipeng Zeng, Bo Yang, Xin Zhao, Yu Zhang, Fan Nie, Xiaokang Yang, Yaohui Jin, Yanyan Xu

Motivation

Retrosynthesis planning poses a formidable challenge in the organic chemical industry, particularly in pharmaceuticals. Single-step retrosynthesis prediction, a crucial step in the planning process, has witnessed a surge in interest in recent years due to advancements in AI for science. Various deep learning-based methods have been proposed for this task, incorporating diverse levels of additional chemical knowledge dependency.

Results

This paper introduces UAlign, a template-free graph-to-sequence pipeline for retrosynthesis prediction. By combining graph neural networks and Transformers, our method can more effectively leverage the inherent graph structure of molecules. Based on the fact that the majority of molecule structures remain unchanged during a chemical reaction, we propose a simple yet effective SMILES alignment technique to facilitate the reuse of unchanged structures for reactant generation. Extensive experiments show that our method substantially outperforms state-of-the-art template-free and semi-template-based approaches. Importantly, our template-free method achieves effectiveness comparable to, or even surpassing, that of established powerful template-based methods.

Scientific contribution

We present a novel graph-to-sequence template-free retrosynthesis prediction pipeline that overcomes the limitations of Transformer-based methods in molecular representation learning and insufficient utilization of chemical information. We propose an unsupervised learning mechanism for establishing product-atom correspondence with reactant SMILES tokens, achieving even better results than supervised SMILES alignment methods. Extensive experiments demonstrate that UAlign significantly outperforms state-of-the-art template-free methods and rivals or surpasses template-based approaches, with up to 5% (top-5) and 5.4% (top-10) increased accuracy over the strongest baseline.
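The top-5 and top-10 figures quoted above are top-k accuracies: a test reaction counts as correct if the true reactant set appears among the model's k highest-ranked candidates. A minimal sketch of that metric follows. The function name and the toy SMILES strings are assumptions for illustration; in practice candidate SMILES are usually canonicalized before comparison, whereas here strings are compared literally.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Fraction of examples whose true answer is among the first k predictions.

    ranked_predictions: list of candidate lists, best candidate first.
    ground_truth: list of true answers, aligned with ranked_predictions.
    """
    if len(ranked_predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth must have equal length")
    hits = sum(
        1
        for preds, truth in zip(ranked_predictions, ground_truth)
        if truth in preds[:k]
    )
    return hits / len(ground_truth)


# Toy example (hypothetical reactant SMILES, three test products).
preds = [
    ["CC(=O)O.CCO", "CC(=O)Cl.CCO", "CC(=O)OC(C)=O.CCO"],  # truth ranked 1st
    ["CBr.CN", "CI.CN", "CCl.CN"],                          # truth ranked 3rd
    ["CC=O.CN", "CC(=O)O.CN", "CCO.CN"],                    # truth not listed
]
truth = ["CC(=O)O.CCO", "CCl.CN", "CC(=O)Cl.CN"]

print(top_k_accuracy(preds, truth, 1))  # one of three correct at k=1
print(top_k_accuracy(preds, truth, 3))  # two of three correct at k=3
```

Because top-k accuracy can only grow with k, improvements at larger k indicate better coverage of the candidate list rather than better first guesses.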

Citations: 0
LVPocket: integrated 3D global-local information to protein binding pockets prediction with transfer learning of protein structure classification
IF 7.1 | Zone 2, Chemistry | Q1 CHEMISTRY, MULTIDISCIPLINARY | Pub Date: 2024-07-07 | DOI: 10.1186/s13321-024-00871-8
Ruifeng Zhou, Jing Fan, Sishu Li, Wenjie Zeng, Yilun Chen, Xiaoshan Zheng, Hongyang Chen, Jun Liao

Background

Previous deep learning methods for predicting protein binding pockets mainly employed 3D convolution, yet an abundance of convolution operations may lead the model to excessively prioritize local information, thus overlooking global information. Moreover, it is essential to account for the influence of diverse protein folding structural classes, because proteins in different structural classes exhibit varying biological functions, whereas those within the same structural class share similar functional attributes.

Results

We propose LVPocket, a novel method that synergistically captures both local and global information of protein structure through the integration of Transformer encoders, which help the model achieve better performance in binding pocket prediction. We then tailored prediction models to data from four distinct structural classes of proteins using transfer learning. The four fine-tuned models were trained starting from the baseline LVPocket model, which was itself trained on the sc-PDB dataset. LVPocket exhibits superior performance on three independent datasets compared to current state-of-the-art methods. Additionally, the fine-tuned models outperform the baseline model.

Scientific contribution

We present a novel model structure for predicting protein binding pockets that addresses the problem of relying on extensive convolutional computation while neglecting global information about protein structures. Furthermore, we tackle the impact of different protein folding structures on binding pocket prediction tasks through the application of transfer learning methods.


Citations: 0
Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-05 DOI: 10.1186/s13321-024-00872-7
Kohulan Rajan, Henning Otto Brinkhaus, Achim Zielesny, Christoph Steinbeck

Accurate recognition of hand-drawn chemical structures is crucial for digitising hand-written chemical information in traditional laboratory notebooks or facilitating stylus-based structure entry on tablets or smartphones. However, the inherent variability in hand-drawn structures poses challenges for existing Optical Chemical Structure Recognition (OCSR) software. To address this, we present an enhanced Deep lEarning for Chemical ImagE Recognition (DECIMER) architecture that leverages a combination of Convolutional Neural Networks (CNNs) and Transformers to improve the recognition of hand-drawn chemical structures. The model incorporates an EfficientNetV2 CNN encoder that extracts features from hand-drawn images, followed by a Transformer decoder that converts the extracted features into Simplified Molecular Input Line Entry System (SMILES) strings. Our models were trained using synthetic hand-drawn images generated by RanDepict, a tool for depicting chemical structures with different style elements. A benchmark was performed using a real-world dataset of hand-drawn chemical structures to evaluate the model's performance. The results indicate that our improved DECIMER architecture exhibits a significantly enhanced recognition accuracy compared to other approaches.

Scientific contribution

The new DECIMER model presented here refines our previous work and is currently the only open-source model specifically tailored to recognising hand-drawn chemical structures. The enhanced model handles variations in handwriting style, line thickness, and background noise better, which makes it suitable for real-world applications. The DECIMER hand-drawn structure recognition model and its source code are available as an open-source package under a licence agreement.
Citations: 0
PromptSMILES: prompting for scaffold decoration and fragment linking in chemical language models
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-04 DOI: 10.1186/s13321-024-00866-5
Morgan Thomas, Mazen Ahmad, Gary Tresadern, Gianni de Fabritiis

SMILES-based generative models are amongst the most robust and successful recent methods used to augment drug design. They are typically used for complete de novo generation; however, scaffold decoration and fragment linking are sometimes desirable, and these applications require a different grammar, architecture, and training dataset, and therefore re-training of a new model. In this work, we describe a simple procedure for constrained molecule generation with a SMILES-based generative model that extends its applicability to scaffold decoration and fragment linking by providing SMILES prompts, without the need for re-training. In combination with reinforcement learning, we show that pre-trained, decoder-only models adapt to these applications quickly and can further optimize molecule generation towards a specified objective. We compare the performance of this approach to a variety of orthogonal approaches and show that performance is comparable or better. For convenience, we provide an easy-to-use Python package to facilitate model sampling, which can be found on GitHub and the Python Package Index.

Scientific contribution

This novel method extends an autoregressive chemical language model to scaffold decoration and fragment linking scenarios. This doesn’t require re-training, the use of a bespoke grammar, or curation of a custom dataset, as commonly required by other approaches.

Citations: 0
Application of machine reading comprehension techniques for named entity recognition in materials science
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-07-02 DOI: 10.1186/s13321-024-00874-5
Zihui Huang, Liqiang He, Yuhang Yang, Andi Li, Zhiwen Zhang, Siwei Wu, Yang Wang, Yan He, Xujie Liu

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. The scientific literature contains rich knowledge about materials science, but manually analyzing these papers to find material-related data is a daunting task. In information processing, named entity recognition (NER) plays a crucial role because it can automatically extract materials science entities, which have significant value in tasks such as building knowledge graphs. The sequence labeling methods typically used for named entity recognition in materials science (MatNER) often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose converting the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method can effectively solve the challenge of extracting multiple overlapping entities by transforming it into answering multiple independent questions. Moreover, by integrating prior knowledge from queries, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature. With the MRC approach, state-of-the-art (SOTA) performance was achieved on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in materials science, thereby accelerating the development of the field.

Scientific contribution

We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in the field of materials science by transforming the sequence labeling task into an MRC task. This approach provides robust support for constructing knowledge graphs and other data analysis tasks.
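
The conversion from sequence labeling to MRC described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the entity labels (MAT, DSC, PRO), the query wording, and the helper names are hypothetical and chosen only for illustration. Each entity type gets its own question, so nested or overlapping spans end up in separate, independent examples instead of competing for a single tag per token.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical natural-language queries, one per (hypothetical) entity type.
QUERIES = {
    "MAT": "Which materials are mentioned in the text?",
    "DSC": "Which sample descriptions are mentioned in the text?",
    "PRO": "Which material properties are mentioned in the text?",
}

@dataclass
class MRCExample:
    query: str
    context: List[str]
    spans: List[Tuple[int, int]]  # inclusive (start, end) token indices

def to_mrc_examples(tokens, entities):
    """Build one question-answering example per entity type.

    entities: (label, start, end) triples with inclusive token indices.
    Spans of different types may nest or overlap.
    """
    examples = {}
    for label, query in QUERIES.items():
        spans = sorted((s, e) for lab, s, e in entities if lab == label)
        examples[label] = MRCExample(query, tokens, spans)
    return examples

def answer_texts(example):
    """Return the answer strings that the spans of an example point to."""
    return [" ".join(example.context[s:e + 1]) for s, e in example.spans]

tokens = "The zinc oxide thin film has a wide band gap".split()
# "zinc oxide" (MAT) is nested inside "zinc oxide thin film" (DSC).
entities = [("MAT", 1, 2), ("DSC", 1, 4), ("PRO", 8, 9)]
examples = to_mrc_examples(tokens, entities)
for label, example in examples.items():
    print(label, "|", example.query, "|", answer_texts(example))
```

In a real system, a span-extraction model would be trained to predict the start and end indices for each query. The sketch covers only the data reshaping step.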

Citations: 0
CPSign: conformal prediction for cheminformatics modeling
IF 7.1 Zone 2 (Chemistry) Q1 CHEMISTRY, MULTIDISCIPLINARY Pub Date: 2024-06-28 DOI: 10.1186/s13321-024-00870-9
Staffan Arvidsson McShane, Ulf Norinder, Jonathan Alvarsson, Ernst Ahlberg, Lars Carlsson, Ola Spjuth

Conformal prediction has seen many applications in pharmaceutical science because it can calibrate the outputs of machine learning models and produce valid prediction intervals. Here we present the open-source software CPSign, a complete implementation of conformal prediction for cheminformatics modeling. CPSign implements inductive and transductive conformal prediction for classification and regression, and probabilistic prediction with the Venn-ABERS methodology. The main chemical representation is signatures, but other types of descriptors are also supported. The main modeling methodology is support vector machines (SVMs), but additional modeling methods, e.g. DeepLearning4J models, are supported via an extension mechanism. We also describe features for visualizing results from conformal models, including calibration and efficiency plots, as well as features for publishing predictive models as REST services. We compare CPSign against other common cheminformatics modeling approaches, including random forest and a directed message-passing neural network. The results show that CPSign produces robust predictive performance with comparable predictive efficiency, and with superior runtime and lower hardware requirements compared to neural-network-based models. CPSign has been used in several studies and is in production use in multiple organizations. CPSign works directly with chemical input files and performs descriptor calculation and SVM modeling within the conformal prediction framework, all in a single software package with a low footprint and fast execution time, which makes it a convenient yet flexible package for training, deploying, and predicting on chemical data. CPSign can be downloaded from GitHub at https://github.com/arosbio/cpsign.

Scientific contribution

CPSign provides a single software package that allows users to perform data preprocessing, build models, and make predictions directly on chemical structures, using conformal and probabilistic prediction. New models can be built and evaluated at a high abstraction level without sacrificing flexibility or predictive performance, as shown by a method evaluation against contemporary modeling approaches in which CPSign performs on par with a state-of-the-art deep-learning-based model.
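
To make the idea of inductive conformal prediction for regression concrete, here is a minimal pure-Python sketch. It does not use CPSign or its API: the function names, the toy predictor, and the calibration data are assumptions made for illustration, and absolute residuals are used as the nonconformity score. Given a point predictor and a held-out calibration set, the interval half-width is the ceil((n + 1)(1 - alpha))-th smallest calibration residual, which under exchangeability covers the true value with probability at least 1 - alpha.

```python
import math

def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this confidence level
    return sorted(scores)[k - 1]

def predict_interval(predict, calibration, x_new, alpha):
    """Inductive conformal prediction interval around predict(x_new).

    predict: any fitted point predictor (hypothetical stand-in for an SVM).
    calibration: (x, y) pairs held out from model training.
    """
    scores = [abs(y - predict(x)) for x, y in calibration]
    q = conformal_quantile(scores, alpha)
    y_hat = predict(x_new)
    return y_hat - q, y_hat + q

def toy_model(x):
    # Pretend this was fitted on a separate training set.
    return 2.0 * x

calibration = [(1.0, 2.5), (2.0, 3.0), (3.0, 6.25), (4.0, 8.0)]
low, high = predict_interval(toy_model, calibration, x_new=5.0, alpha=0.25)
print(low, high)
```

With only four calibration points the interval is coarse. In practice the calibration set is much larger, and the width of the intervals is what the abstract refers to as predictive efficiency.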

Citations: 0