Jules Schleinitz, Alba Carretero-Cerdán, Anjali Gurajapu, Yonatan Harnik, Gina Lee, Amitesh Pandey, Anat Milo, Sarah E. Reisman
Journal of the American Chemical Society. Published 2025-02-21. DOI: 10.1021/jacs.4c15902 (https://doi.org/10.1021/jacs.4c15902)
Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates
The development of machine learning models to predict the regioselectivity of C(sp³)–H functionalization reactions is reported. A data set for dioxirane oxidations was curated from the literature and used to generate a model to predict the regioselectivity of C–H oxidation. To assess whether smaller, intentionally designed data sets could provide accuracy on complex targets, a series of acquisition functions were developed to select the most informative molecules for the specific target. Active learning-based acquisition functions that leverage predicted reactivity and model uncertainty were found to outperform those based on molecular and site similarity alone. The use of acquisition functions for data set elaboration significantly reduced the number of data points needed to perform accurate prediction, and it was found that smaller, machine-designed data sets can give accurate predictions when larger, randomly selected data sets fail. Finally, the workflow was experimentally validated on five complex substrates and shown to be applicable to predicting the regioselectivity of arene C–H radical borylation. These studies provide a quantitative alternative to the intuitive extrapolation from “model substrates” that is frequently used to estimate reactivity on complex molecules.
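The abstract does not spell out the acquisition functions themselves, so the following is only a minimal sketch of the general idea of ranking candidate molecules by predicted reactivity plus model uncertainty. It assumes an upper-confidence-bound style score computed from the spread of an ensemble of predictions; the function names, the `beta` weight, and the candidate values are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def ucb_score(ensemble_predictions, beta=1.0):
    """Score a candidate as mean predicted reactivity plus beta times the
    ensemble spread (an assumed, generic upper-confidence-bound form)."""
    return mean(ensemble_predictions) + beta * pstdev(ensemble_predictions)


def select_candidates(candidates, k, beta=1.0):
    """Return the names of the k candidates with the highest score."""
    ranked = sorted(
        candidates,
        key=lambda name: ucb_score(candidates[name], beta),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical ensemble predictions (reactivity on a 0..1 scale) for
# three made-up candidate molecules; these are illustrative values only.
candidates = {
    "mol_A": [0.50, 0.50, 0.50],  # moderate reactivity, models agree
    "mol_B": [0.20, 0.80, 0.50],  # similar mean, models disagree
    "mol_C": [0.10, 0.10, 0.10],  # low reactivity, models agree
}

picked = select_candidates(candidates, k=2)
print(picked)
```

In this toy setup the candidate on which the ensemble disagrees most is ranked first, which is the behavior an uncertainty-aware selection is meant to produce; a similarity-only ranking would ignore that disagreement entirely.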
About the journal:
The Journal of the American Chemical Society (JACS), the flagship journal of the American Chemical Society, was established in 1879 and holds a preeminent position in chemistry and related interdisciplinary sciences. JACS publishes cutting-edge research across a wide range of topics, amounting to approximately 19,000 pages of Articles, Communications, and Perspectives annually. It is published weekly.