Evaluating Predictive Accuracy in Asymmetric Catalysis: A Machine Learning Perspective on Local Reaction Space

ACS Catalysis | IF 13.1 | Zone 1 (Chemistry) | JCR Q1 (Chemistry, Physical) | Pub Date: 2025-03-31 | DOI: 10.1021/acscatal.5c01051

Isaiah O. Betinol, Aleksandra Demchenko, Jolene P. Reid


Abstract

Machine learning (ML) models are increasingly being employed in asymmetric catalysis to predict reaction outcomes and optimize enantioselective processes. Despite the trend of expanding data set sizes to improve model performance, asymmetric catalysis presents unique challenges, including the difficulty of acquiring high-quality experimental data and the often-limited availability of structurally diverse examples. Consequently, rational data set design requires the practitioner to choose whether to collect data that maximizes diversity in the training set or data that maximizes representation around a target prediction. A key challenge in these studies is understanding the role of local reaction space: specifically, how much predictive accuracy is driven by nearest neighbors (structurally and electronically similar data points) and the next-nearest neighbors? This study investigates the predictive power of ML models trained with varying levels of local representation in the reaction space. We provide a framework, a radius-based random forest (RaRF) algorithm, to systematically probe the effects of including diverse reactions dissimilar to a target prediction. We show that when the training set is representative of the target reaction, the gains from increasing data set diversity are modest (typically less than 0.1 kcal/mol in predictive error), increasing to only 0.5 kcal/mol for extrapolative tests, highlighting the need for targeted data set design. Furthermore, these findings hold even for complex architectures and features. Finally, we demonstrate that a targeted, neighborhood-oriented strategy greatly accelerates the identification of predictive models compared to diversity-driven approaches.
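To make the idea of a radius-based random forest more concrete, the following is a minimal sketch of one way such a scheme could work: keep only the training reactions whose feature vectors lie within a chosen radius of the target, then fit a small bagged ensemble of regression trees on that neighborhood. This is not the authors' RaRF implementation. The Euclidean distance metric, the decision to return None when no neighbor falls inside the radius, the tree hyperparameters, every function name, and the toy feature vectors and selectivity values are all assumptions made for illustration.

```python
import math
import random
from statistics import mean


def select_within_radius(X, y, target, radius):
    """Keep training points whose Euclidean distance to the target is <= radius."""
    kept = [(x, t) for x, t in zip(X, y) if math.dist(x, target) <= radius]
    return [x for x, _ in kept], [t for _, t in kept]


def build_tree(X, y, rng, depth=0, max_depth=3, min_leaf=2):
    """Grow a small regression tree; each node splits on one randomly chosen feature."""
    if depth >= max_depth or len(y) < 2 * min_leaf or len(set(y)) == 1:
        return mean(y)  # leaf: average of the training targets that reached it
    f = rng.randrange(len(X[0]))
    best = None
    for thr in sorted({x[f] for x in X})[:-1]:  # skip the max so the right side is non-empty
        left = [t for x, t in zip(X, y) if x[f] <= thr]
        right = [t for x, t in zip(X, y) if x[f] > thr]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        ml, mr = mean(left), mean(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr)
    if best is None:
        return mean(y)  # no admissible split on this feature
    thr = best[1]
    lo = [(x, t) for x, t in zip(X, y) if x[f] <= thr]
    hi = [(x, t) for x, t in zip(X, y) if x[f] > thr]
    return (
        f,
        thr,
        build_tree([x for x, _ in lo], [t for _, t in lo], rng, depth + 1, max_depth, min_leaf),
        build_tree([x for x, _ in hi], [t for _, t in hi], rng, depth + 1, max_depth, min_leaf),
    )


def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return node


def rarf_predict(X, y, target, radius, n_trees=25, seed=0):
    """Train a bagged tree ensemble only on neighbors of the target, then predict it."""
    Xn, yn = select_within_radius(X, y, target, radius)
    if not yn:
        return None  # assumption: decline to predict when the neighborhood is empty
    rng = random.Random(seed)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(yn)) for _ in yn]  # bootstrap resample of the neighborhood
        tree = build_tree([Xn[i] for i in idx], [yn[i] for i in idx], rng)
        preds.append(predict_tree(tree, target))
    return mean(preds)


# Toy data (made up): two descriptors per reaction, selectivity in kcal/mol.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]]
y = [1.0, 1.1, 0.9, 1.2, 2.5, 2.6]
target = [0.1, 0.1]

print(rarf_predict(X, y, target, radius=0.5))   # uses only the four nearby reactions
print(rarf_predict(X, y, target, radius=10.0))  # uses all six reactions
```

Sweeping the radius from small to large and comparing the prediction error at each setting is one way to probe how much dissimilar reactions contribute to a given target prediction.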


Source journal

ACS Catalysis (Chemistry, Physical)

CiteScore: 20.80

Self-citation rate: 6.20%

Articles published: 1253

Review time: 1.5 months

Journal description: ACS Catalysis is an esteemed journal that publishes original research in the fields of heterogeneous catalysis, molecular catalysis, and biocatalysis. It offers broad coverage across diverse areas such as life sciences, organometallics and synthesis, photochemistry and electrochemistry, drug discovery and synthesis, materials science, environmental protection, polymer discovery and synthesis, and energy and fuels. The scope of the journal is to showcase innovative work in various aspects of catalysis. This includes new reactions and novel synthetic approaches utilizing known catalysts, the discovery or modification of new catalysts, elucidation of catalytic mechanisms through cutting-edge investigations, practical enhancements of existing processes, as well as conceptual advances in the field. Contributions to ACS Catalysis can encompass both experimental and theoretical research focused on catalytic molecules, macromolecules, and materials that exhibit catalytic turnover.