CLAIRE: a contrastive learning-based predictor for EC number of chemical reactions
Authors: Zishuo Zeng, Jin Guo, Jiao Jin, Xiaozhou Luo
Journal of Cheminformatics, 17 (1), published 2025-01-07
DOI: 10.1186/s13321-024-00944-8
Article: https://link.springer.com/article/10.1186/s13321-024-00944-8
Abstract
Predicting Enzyme Commission (EC) numbers for chemical reactions enables efficient enzymatic annotation in computer-aided synthesis planning. However, conventional machine learning approaches are hampered by data scarcity and class imbalance. Here, we introduce CLAIRE (Contrastive Learning-based AnnotatIon for Reaction’s EC), a framework that combines contrastive learning, reaction embeddings from a pre-trained language model, and data augmentation to address these limitations. CLAIRE achieved weighted average F1 scores of 0.861 on the testing set (n = 18,816) and 0.911 on an independent dataset (n = 1,040) derived from a yeast metabolic model, outperforming the state-of-the-art model by 3.65-fold and 1.18-fold, respectively. This accuracy positions CLAIRE as a promising tool for retrosynthesis planning, drug fate prediction, and synthetic biology applications. CLAIRE is freely available on GitHub (https://github.com/zishuozeng/CLAIRE).
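The weighted average F1 score reported above is the mean of per-class F1 scores, each weighted by that class's share of the true labels, which makes it a natural summary under class imbalance. The sketch below illustrates the metric only. The function name `weighted_f1` and the toy EC labels are assumptions for illustration and are not taken from the CLAIRE code base or its data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Illustrative helper (hypothetical name), not part of CLAIRE.
    """
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp                    # true samples of this class that were missed
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1      # weight by the class's share of true labels
    return score

# Toy labels, made up for this example only.
y_true = ["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.7.1.1"]
y_pred = ["1.1.1.1", "1.1.1.1", "2.7.1.1", "2.7.1.1"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.767
```

In this toy case the majority class has F1 = 0.8 and the minority class has F1 = 2/3, so the weighted average is 0.75 × 0.8 + 0.25 × 2/3 ≈ 0.767.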
Scientific contribution
This work applies contrastive learning to predict the EC numbers of enzymatic reactions, addressing the challenges of data scarcity and class imbalance. The new model achieves state-of-the-art performance and may facilitate computer-aided synthesis planning.
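As a rough illustration of why contrastive learning can help with scarce and imbalanced classes, the sketch below shows a triplet margin loss, one common contrastive objective, together with nearest-center label assignment in the embedding space. The abstract does not say which loss or inference rule CLAIRE uses, so the triplet form, the function names, the 2-D vectors, and the class centers are all assumptions for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the same-class pair is closer than the different-class
    pair by at least `margin`; positive otherwise.
    Generic contrastive objective, assumed here, not confirmed for CLAIRE."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def nearest_class(query, centers):
    """Assign the label whose (hypothetical) class center is closest."""
    return min(centers, key=lambda label: euclidean(query, centers[label]))

# Toy 2-D embeddings, made up for this example only.
anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_margin_loss(anchor, positive, [3.0, 4.0]))  # 0.0 (well separated)
print(triplet_margin_loss(anchor, positive, [0.0, 1.5]))  # 0.5 (negative too close)

centers = {"1.1.1.1": [0.0, 0.0], "2.7.1.1": [3.0, 4.0]}
print(nearest_class([0.5, 0.5], centers))  # 1.1.1.1
```

Because the objective compares pairs of examples instead of fitting one output per class, a rare class can still be learned from a handful of reactions, as long as their embeddings are pulled together and pushed away from other classes.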
About the journal:
Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling.
Coverage includes, but is not limited to:
chemical information systems, software and databases, and molecular modelling,
chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases,
computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.