Tianjun Xie, Gerhard R Wittreich, Matthew T Curnan, Geun Ho Gu, Kayla N Seals, Justin S Tolbert
Machine-Learning-Enabled Thermochemistry Estimator
Journal of Chemical Information and Modeling, published 2024-12-16. DOI: 10.1021/acs.jcim.4c00989 (https://doi.org/10.1021/acs.jcim.4c00989)
Modeling adsorbates on single-crystal metals is critical in rational catalyst design and in other research that requires detailed thermochemistry. First-principles simulations via density functional theory (DFT) are among the prevalent tools for acquiring such information about surface species. Although highly dependable, DFT calculations often require intensive computational resources and long runtimes. These limitations become particularly pronounced when investigating large sets of complex molecules on heavy noble metals, which restricts our ability to explore these species and their energetics. In this work, we establish a framework that combines molecular encoding, descriptor synthesis, and machine learning to overcome the cost of DFT simulations while estimating thermochemical information efficiently at DFT-level accuracy. Specifically, we translate our training molecules into text-based identifiers through the simplified molecular-input line-entry system (SMILES). We then parametrize our training matrices with sets of short-range descriptors based on group methods, using first nearest neighbors to account for linear contributions. These are coupled with long-range descriptors that characterize second nearest neighbors to account for nonlinear corrections. Finally, we use linear regression and machine learning techniques, such as Gaussian process regression, to fit the linear and nonlinear matrix systems, respectively. To our knowledge, this is the first work that encompasses both first and second nearest neighbors, based on group methods, throughout the featurization, training, and deployment stages. We trained and validated our models with 459 surface species on Pt(111), Ru(0001), and Ir(111) surfaces. The models robustly reproduce the energetics of interest, such as enthalpies, entropies, and heat capacities, at various temperatures. Notably, compared with the classical group method, the mean absolute errors are reduced by at least 48% during training and at least 19% during prediction. With this framework, our machine-learning-enabled thermochemistry estimator makes it practical to study the thermochemistry of complex species on metal catalysts.
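The two-stage idea described in the abstract (a linear fit on first-nearest-neighbor group counts, followed by a Gaussian process correction on second-nearest-neighbor descriptors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the group-count matrix `X`, the long-range descriptors `Z`, the contribution vector `w_true`, the synthetic target, the RBF kernel, and its hyperparameters are all hypothetical, and the linear algebra is written out by hand so the example needs only the standard library.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_linear(X, y, ridge=1e-8):
    """Stage 1: least-squares group contributions (lightly regularized normal equations)."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve(A, b)

def rbf(u, v, length=1.0):
    """Squared-exponential kernel (an assumed choice, not taken from the paper)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / (2 * length ** 2))

def fit_gp(Z, residuals, noise=1e-6):
    """Stage 2: GP regression weights alpha = (K + noise * I)^-1 residuals."""
    K = [[rbf(zi, zj) + (noise if i == j else 0.0)
          for j, zj in enumerate(Z)] for i, zi in enumerate(Z)]
    return solve(K, residuals)

def gp_mean(z, Z, alpha):
    """GP posterior mean at descriptor z."""
    return sum(a * rbf(z, zi) for a, zi in zip(alpha, Z))

def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

# Hypothetical toy data: rows are species, columns are counts of three made-up groups.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [1, 0, 1], [2, 1, 1]]
# Hypothetical second-nearest-neighbor descriptors for the same species.
Z = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 0], [2, 1]]
# Synthetic target: a linear group-additive part plus a nonlinear long-range term.
w_true = [-10.0, -20.0, 5.0]
y = [dot(w_true, x) + 0.8 * z[0] * z[1] - 0.3 * z[0] ** 2 for x, z in zip(X, Z)]

w = fit_linear(X, y)                          # linear group contributions
linear_pred = [dot(w, x) for x in X]
residuals = [t - p for t, p in zip(y, linear_pred)]
alpha = fit_gp(Z, residuals)                  # nonlinear correction on the residuals
two_stage_pred = [p + gp_mean(z, Z, alpha) for p, z in zip(linear_pred, Z)]

mae_linear = mae(linear_pred, y)
mae_two_stage = mae(two_stage_pred, y)
print(f"training MAE, linear only: {mae_linear:.4f}")
print(f"training MAE, linear + GP: {mae_two_stage:.6f}")
```

Because the Gaussian process nearly interpolates its training residuals, the drop in training error here is expected by construction and says nothing about predictive accuracy; a real assessment would need held-out species, as in the validation the abstract reports. In practice a library such as scikit-learn would likely replace the hand-written solver and kernel.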
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.