Reaction rebalancing: a novel approach to curating reaction databases

Journal of Cheminformatics (IF 7.1; Zone 2, Chemistry; JCR Q1, Chemistry, Multidisciplinary) Pub Date: 2024-07-19 DOI: 10.1186/s13321-024-00875-4
Tieu-Long Phan, Klaus Weinbauer, Thomas Gärtner, Daniel Merkle, Jakob L. Andersen, Rolf Fagerberg, Peter F. Stadler

Abstract

Purpose

Reaction databases are a key resource for a wide variety of applications in computational chemistry and biochemistry, including Computer-aided Synthesis Planning (CASP) and the large-scale analysis of metabolic networks. The full potential of these resources can only be realized if datasets are accurate and complete. Missing co-reactants and co-products, i.e., unbalanced reactions, however, are the rule rather than the exception. The curation and correction of such incomplete entries is thus an urgent need.

Methods

The SynRBL framework addresses this issue with a dual strategy: a rule-based method for non-carbon compounds, which uses atomic symbols and counts for prediction, and a Maximum Common Subgraph (MCS)-based technique for carbon compounds, which aligns reactants and products to infer missing entities.
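
The idea of using atomic symbols and counts can be illustrated with a minimal sketch. It is not SynRBL's code or API: the function names (`element_counts`, `side_counts`, `imbalance`) are hypothetical, and the parser assumes flat molecular formulas without parentheses, charges, or isotopes. It only computes which atoms are unaccounted for on each side of a reaction, which is the input a rule-based step would need before proposing a missing compound.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "C2", "H", "Cl3".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C2H4O2' (assumed format)."""
    counts = Counter()
    for symbol, number in _TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts


def side_counts(formulas) -> Counter:
    """Sum the atom counts over all compounds on one side of a reaction."""
    total = Counter()
    for formula in formulas:
        total.update(element_counts(formula))
    return total


def imbalance(reactants, products):
    """Return (atoms missing from the products, atoms missing from the reactants).

    Counter subtraction keeps only positive differences, so each Counter
    lists the atoms that one side has in excess of the other.
    """
    left = side_counts(reactants)
    right = side_counts(products)
    return left - right, right - left


# Illustrative entry: acetic acid + ethanol -> ethyl acetate, with water omitted.
missing_products, missing_reactants = imbalance(["C2H4O2", "C2H6O"], ["C4H8O2"])
print(dict(missing_products), dict(missing_reactants))
```

In this example the product side lacks two hydrogens and one oxygen, which is consistent with an unrecorded water molecule. Mapping such a difference to a specific compound is where the rules come in, and that step is not shown here.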

Results

The rule-based method exceeded 99% accuracy, while MCS-based accuracy varied from 81.19 to 99.33%, depending on reaction properties. Furthermore, an applicability domain and a machine learning scoring function were devised to quantify prediction confidence. The overall efficacy of this framework was delineated through its success rate and accuracy metrics, which spanned from 89.83 to 99.75% and 90.85 to 99.05%, respectively.
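
The abstract does not define the two metrics, so the following sketch rests on an assumption: success rate is taken as the share of unbalanced reactions for which the method returns a rebalanced result, and accuracy as the share of those returned results that are correct. The function name, the outcome labels, and the sample data are all hypothetical and are not figures from the paper.

```python
def success_rate_and_accuracy(outcomes):
    """Compute (success_rate, accuracy) from per-reaction outcome labels.

    Assumed labels (hypothetical): 'correct', 'incorrect', 'unsolved'.
    - success rate: solved reactions / all reactions
    - accuracy: correct reactions / solved reactions
    """
    outcomes = list(outcomes)
    solved = [o for o in outcomes if o != "unsolved"]
    success_rate = len(solved) / len(outcomes) if outcomes else 0.0
    accuracy = solved.count("correct") / len(solved) if solved else 0.0
    return success_rate, accuracy


# Made-up labels for five reactions, for illustration only.
labels = ["correct", "correct", "correct", "incorrect", "unsolved"]
rate, acc = success_rate_and_accuracy(labels)
print(f"success rate = {rate:.2%}, accuracy = {acc:.2%}")
```

Under this reading the two numbers answer different questions: how often the method produces an answer at all, and how trustworthy the answers are when it does.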

Conclusion

The SynRBL framework offers a novel solution for recalibrating chemical reactions, significantly enhancing reaction completeness. With rigorous validation, it achieved groundbreaking accuracy in reaction rebalancing. This sets the stage for future improvements, in particular of atom-atom mapping techniques and of downstream tasks such as automated synthesis planning.

Scientific Contribution

SynRBL features a novel computational approach to correcting unbalanced entries in chemical reaction databases. By combining heuristic rules for inferring non-carbon compounds with common subgraph searches to address carbon imbalance, SynRBL successfully addresses most instances of this problem, which affects the majority of data in most large-scale resources. Compared to alternative solutions, SynRBL achieves a dramatic increase in both success rate and accuracy, and provides the first freely available open-source solution for this problem.
