Abdulelah S. Alshehri, Michael T. Bergman, Fengqi You and Carol K. Hall
Plastic pollution, particularly microplastics (MPs), poses a significant global threat to ecosystems and human health, necessitating innovative remediation strategies. Biocompatible and biodegradable plastic-binding peptides (PBPs) offer a potential solution through targeted adsorption and subsequent MP detection or removal from the environment. A challenge in discovering plastic-binding peptides is the vast combinatorial space of possible peptides (i.e., over 10¹⁵ for 12-mer peptides), which far exceeds the sample sizes typically reachable by experiments or biophysics-based computational methods. One step towards addressing this issue is to train deep learning models on experimental or biophysical datasets, permitting faster and cheaper evaluations of peptides. However, deep learning predictions are not always accurate, which could waste time and money on synthesizing and evaluating false positives. Here, we resolve this issue by combining biophysical modeling data from the Peptide Binder Design (PepBD) algorithm, the predictive power and uncertainty quantification of evidential deep learning, and metaheuristic search methods to identify high-affinity PBPs for several common plastics. Molecular dynamics simulations show that the discovered PBPs have greater median adsorption free energies for polyethylene (5%), polypropylene (18%), and polystyrene (34%) relative to PBPs previously designed by PepBD. The impact of including uncertainty quantification in peptide design is demonstrated by the increasing improvement in the median adsorption free energy with decreasing uncertainty. This robust framework accelerates peptide discovery, paving the way for effective, bio-inspired solutions to MP remediation.
{"title":"Biophysics-guided uncertainty-aware deep learning uncovers high-affinity plastic-binding peptides","authors":"Abdulelah S. Alshehri, Michael T. Bergman, Fengqi You and Carol K. Hall","doi":"10.1039/D4DD00219A","DOIUrl":"10.1039/D4DD00219A","url":null,"abstract":"<p >Plastic pollution, particularly microplastics (MPs), poses a significant global threat to ecosystems and human health, necessitating innovative remediation strategies. Biocompatible and biodegradable plastic-binding peptides (PBPs) offer a potential solution through targeted adsorption and subsequent MP detection or removal from the environment. A challenge in discovering plastic-binding peptides is the vast combinatorial space of possible peptides (<em>i.e.</em>, over 10<small><sup>15</sup></small> for 12-mer peptides), which far exceeds the sample sizes typically reachable by experiments or biophysics-based computational methods. One step towards addressing this issue is to train deep learning models on experimental or biophysical datasets, permitting faster and cheaper evaluations of peptides. However, deep learning predictions are not always accurate, which could waste time and money due to synthesizing and evaluating false positives. Here, we resolve this issue by combining biophysical modeling data from Peptide Binder Design (PepBD) algorithm, the predictive power and uncertainty quantification of evidential deep learning, and metaheuristic search methods to identify high-affinity PBPs for several common plastics. Molecular dynamics simulations show that the discovered PBPs have greater median adsorption free energies for polyethylene (5%), polypropylene (18%), and polystyrene (34%) relative to PBPs previously designed by PepBD. The impact of including uncertainty quantification in peptide design is demonstrated by the increasing improvement in the median adsorption free energy with decreasing uncertainty. 
This robust framework accelerates peptide discovery, paving the way for effective, bio-inspired solutions to MP remediation.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 561-571"},"PeriodicalIF":6.2,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11771220/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143070057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
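The size of the search space quoted in the abstract above (over 10¹⁵ 12-mer peptides) can be checked with a short calculation. This is a minimal sketch, assuming the peptides are built from the 20 canonical amino acids (the abstract does not state the alphabet); `ALPHABET_SIZE` and `count_sequences` are illustrative names and are not part of PepBD.

```python
ALPHABET_SIZE = 20  # assumption: the 20 canonical amino acids


def count_sequences(length: int, alphabet_size: int = ALPHABET_SIZE) -> int:
    """Number of distinct linear sequences of the given length."""
    return alphabet_size ** length


n_12mers = count_sequences(12)
print(f"{n_12mers:.3e}")  # 4.096e+15
```

Under that assumption the count is about 4.1 × 10¹⁵, consistent with the "over 10¹⁵" figure and far beyond what exhaustive experiments or simulations could sample.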
Rocco Cancelliere, Mario Molinara, Antonio Licheri, Antonio Maffucci and Laura Micheli
This research utilises Artificial Intelligence (AI) to enhance electrochemical peak resolution and lower detection limits in voltammetric analysis, focusing on multiplexed analyses of complex real matrices. The study investigated members of the quinone family (hydroquinone, benzoquinone, and catechol), analysed individually and in mixtures, using cyclic and square wave voltammetry. The ferrocyanide/ferricyanide redox couple was included as a standard redox probe to provide a reference for method validation.
{"title":"Artificial intelligence-assisted electrochemical sensors for qualitative and semi-quantitative multiplexed analyses†","authors":"Rocco Cancelliere, Mario Molinara, Antonio Licheri, Antonio Maffucci and Laura Micheli","doi":"10.1039/D4DD00318G","DOIUrl":"https://doi.org/10.1039/D4DD00318G","url":null,"abstract":"<p >This research utilises Artificial Intelligence (AI) to enhance electrochemical peak resolution and lower detection limits in voltammetric analysis, focusing on complex, multiplex real matrices analyses. The study investigated the quinone family, hydroquinone, benzoquinone, and catechol analysed individually and in mixtures using cyclic and square wave voltammetry. The ferrocyanide/ferricyanide redox couple was included as a standard redox probe to provide a reference for method validation.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 338-342"},"PeriodicalIF":6.2,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00318g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alan Aspuru-Guzik, Jason E. Hein and Joshua Schrier
A graphical abstract is available for this content
{"title":"Commit: Mini article for dynamic reporting of incremental improvements to previous scholarly work","authors":"Alan Aspuru-Guzik, Jason E. Hein and Joshua Schrier","doi":"10.1039/D4DD90053G","DOIUrl":"https://doi.org/10.1039/D4DD90053G","url":null,"abstract":"<p >A graphical abstract is available for this content</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 301-302"},"PeriodicalIF":6.2,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd90053g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Christophe Bajan and Guillaume Lambard
The integration of artificial intelligence into various domains is rapidly increasing, with Large Language Models (LLMs) becoming more prevalent in numerous applications. This work is part of a broader project that aims to train an LLM specifically in the field of materials science. To assess the impact of this specialized training, it is essential to establish the baseline performance of existing LLMs in materials science. In this study, we evaluated 15 different LLMs using the MaScQA question answering (Q&A) benchmark. This benchmark comprises questions from the Graduate Aptitude Test in Engineering (GATE), tailored to test models' capabilities in answering questions related to materials science and metallurgical engineering. Our results indicate that closed-source LLMs, such as Claude-3.5-Sonnet and GPT-4o, perform the best with an overall accuracy of ∼84%, while open-source models, such as Llama3-70b and Phi3-14b, top out at ∼56% and ∼43%, respectively. These findings provide a baseline for the raw capabilities of LLMs on Q&A tasks applied to materials science, and emphasise the substantial improvement that could be brought to open-source models via prompt engineering and fine-tuning strategies. We anticipate that this work could push the adoption of LLMs as valuable assistants in materials science, demonstrating their utility in this specialised domain and related sub-domains.
{"title":"Exploring the expertise of large language models in materials science and metallurgical engineering†","authors":"Christophe Bajan and Guillaume Lambard","doi":"10.1039/D4DD00319E","DOIUrl":"https://doi.org/10.1039/D4DD00319E","url":null,"abstract":"<p >The integration of artificial intelligence into various domains is rapidly increasing, with Large Language Models (LLMs) becoming more prevalent in numerous applications. This work is included in an overall project which aims to train an LLM specifically in the field of materials science. To assess the impact of this specialized training, it is essential to establish the baseline performance of existing LLMs in materials science. In this study, we evaluated 15 different LLMs using the MaScQA question answering (Q&A) benchmark. This benchmark comprises questions from the Graduate Aptitude Test in Engineering (GATE), tailored to test models' capabilities in answering questions related to materials science and metallurgical engineering. Our results indicate that closed-source LLMs, such as Claude-3.5-Sonnet and GPT-4o, perform the best with an overall accuracy of ∼84%, while open-source models, such as Llama3-70b and Phi3-14b, top at ∼56% and ∼43%, respectively. These findings provide a baseline for the raw capabilities of LLMs on Q&A tasks applied to materials science, and emphasise the substantial improvement that could be brought to open-source models <em>via</em> prompt engineering and fine-tuning strategies. 
We anticipate that this work could push the adoption of LLMs as valuable assistants in materials science, demonstrating their utilities in this specialised domain and related sub-domains.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 500-512"},"PeriodicalIF":6.2,"publicationDate":"2025-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00319e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emil I. Jaffal, Sangjoon Lee, Danila Shiryaev, Alex Vtorov, Nikhil Kumar Barua, Holger Kleinke and Anton O. Oliynyk
Traditional and non-classical machine learning models for solid-state structure prediction have predominantly relied on compositional features (derived from properties of constituent elements) to predict the existence of a structure and its properties. However, the lack of structural information can be a source of suboptimal property mapping and increased predictive uncertainty. To address this challenge, we have introduced a strategy that generates and combines both compositional and structural features with minimal programming expertise required. Our approach utilizes open-source, interactive Python programs named Composition Analyzer Featurizer (CAF) and Structure Analyzer Featurizer (SAF). CAF generates numerical compositional features from a list of formulae provided in an Excel file, while SAF extracts numerical structural features from a .cif file by generating a supercell. 133 features from CAF and 94 features from SAF are used either individually or in combination to cluster nine structure types in equiatomic AB intermetallics. The performance is comparable to that obtained with features from JARVIS, MAGPIE, mat2vec, and OLED datasets in PLS-DA, SVM, and XGBoost models. Our SAF + CAF features provide a cost-efficient and reliable solution, even with the PLS-DA method, where a significant fraction of the most contributing features is the same as those identified in the more computationally intensive XGBoost models.
{"title":"Composition and structure analyzer/featurizer for explainable machine-learning models to predict solid state structures†","authors":"Emil I. Jaffal, Sangjoon Lee, Danila Shiryaev, Alex Vtorov, Nikhil Kumar Barua, Holger Kleinke and Anton O. Oliynyk","doi":"10.1039/D4DD00332B","DOIUrl":"https://doi.org/10.1039/D4DD00332B","url":null,"abstract":"<p >Traditional and non-classical machine learning models for solid-state structure prediction have predominantly relied on compositional features (derived from properties of constituent elements) to predict the existence of a structure and its properties. However, the lack of structural information can be a source of suboptimal property mapping and increased predictive uncertainty. To address this challenge, we have introduced a strategy that generates and combines both compositional and structural features with minimal programming expertise required. Our approach utilizes open-source, interactive Python programs named Composition Analyzer Featurizer (CAF) and Structure Analyzer Featurizer (SAF). CAF generates numerical compositional features from a list of formulae provided in an Excel file, while SAF extracts numerical structural features from a .cif file by generating a supercell. 133 features from CAF and 94 features from SAF are used either individually or in combination to cluster nine structure types in equiatomic AB intermetallics. The performance is comparable to those with features from JARVIS, MAGPIE, mat2vec, and OLED datasets in PLS-DA, SVM, and XGBoost models. 
Our SAF + CAF features provide a cost-efficient and reliable solution, even with the PLS-DA method, where a significant fraction of the most contributing features is the same as those identified in the more computationally intensive XGBoost models.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 548-560"},"PeriodicalIF":6.2,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00332b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
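To make the idea of compositional features (numbers derived from the properties of the constituent elements) concrete, here is a minimal sketch. It is not CAF's actual interface or feature set: the function names, the two features, and the small element table are assumptions chosen for illustration, and the parser only handles simple formulae such as `NiAl` or `Fe2O3` without brackets.

```python
import re

# Standard atomic numbers for a handful of elements; extend as needed.
ATOMIC_NUMBER = {"Al": 13, "Ti": 22, "Fe": 26, "Co": 27, "Ni": 28}


def parse_formula(formula: str) -> dict:
    """Return {element: atomic fraction} for a simple formula such as 'NiAl'."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}


def composition_features(formula: str) -> dict:
    """Two illustrative features: fraction-weighted mean and range of atomic number."""
    fractions = parse_formula(formula)
    z_values = [ATOMIC_NUMBER[element] for element in fractions]
    mean_z = sum(frac * ATOMIC_NUMBER[element] for element, frac in fractions.items())
    return {"mean_Z": mean_z, "range_Z": max(z_values) - min(z_values)}


print(composition_features("NiAl"))  # {'mean_Z': 20.5, 'range_Z': 15}
```

A real featurizer would compute many such statistics over many elemental properties; the point here is only that each formula maps to a fixed-length numeric vector that a classifier can consume.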
Naruki Yoshikawa, Gun Deniz Akkoc, Sergio Pablo-García, Yang Cao, Han Hao and Alán Aspuru-Guzik
Automation of electrochemical measurements can accelerate the discovery of new electroactive materials. One of the hurdles to automated electrochemical measurement is the pretreatment of electrodes because mechanical polishing is usually conducted manually. Here we investigate the automation of electrochemical measurements using a robotic arm. We demonstrate automated mechanical polishing using a station with a moving polishing pad and evaluate the effect of different polishing patterns. Our automatic method improved the corroded electrodes, and we found the effect of pattern was not significant, which diverges from the current common belief amongst practitioners that a figure eight pattern is best for pretreatment. This research is a step toward automating electrochemistry experiments without human intervention.
{"title":"Does one need to polish electrodes in an eight pattern? Automation provides the answer†","authors":"Naruki Yoshikawa, Gun Deniz Akkoc, Sergio Pablo-García, Yang Cao, Han Hao and Alán Aspuru-Guzik","doi":"10.1039/D4DD00323C","DOIUrl":"https://doi.org/10.1039/D4DD00323C","url":null,"abstract":"<p >Automation of electrochemical measurements can accelerate the discovery of new electroactive materials. One of the hurdles to automated electrochemical measurement is the pretreatment of electrodes because mechanical polishing is usually conducted manually. Here we investigate the automation of electrochemical measurements using a robotic arm. We demonstrate automated mechanical polishing using a station with a moving polishing pad and evaluate the effect of different polishing patterns. Our automatic method improved the corroded electrodes, and we found the effect of pattern was not significant, which diverges from the current common belief amongst practitioners that a figure eight pattern is best for pretreatment. This research is a step toward automating electrochemistry experiments without human intervention.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 326-330"},"PeriodicalIF":6.2,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00323c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ifeanyi J. Onuorah, Miki Bonacci, Muhammad M. Isah, Marcello Mazzani, Roberto De Renzi, Giovanni Pizzi and Pietro Bonfà
Positive muon spin rotation and relaxation spectroscopy is a well-established experimental technique for studying materials. It provides a local probe that generally complements scattering techniques in the study of magnetic systems and represents a valuable alternative for materials that display strong incoherent scattering or neutron absorption. Computational methods can effectively quantify the microscopic interactions underlying the experimentally observed signal, thus substantially boosting the predictive power of this technique. Here, we present an efficient set of algorithms and workflows devoted to the automation of this task. In particular, we adopt the so-called DFT+μ procedure, where the system is characterized in the density functional theory (DFT) framework with the muon modeled as a hydrogen impurity. We devise an automated strategy to obtain candidate muon stopping sites, their dipolar interaction with the nuclei, and hyperfine interactions with the electronic ground state. We validate the implementation on well-studied compounds, showing the effectiveness of our protocol in terms of accuracy and simplicity of use.
{"title":"Automated computational workflows for muon spin spectroscopy","authors":"Ifeanyi J. Onuorah, Miki Bonacci, Muhammad M. Isah, Marcello Mazzani, Roberto De Renzi, Giovanni Pizzi and Pietro Bonfà","doi":"10.1039/D4DD00314D","DOIUrl":"https://doi.org/10.1039/D4DD00314D","url":null,"abstract":"<p >Positive muon spin rotation and relaxation spectroscopy is a well established experimental technique for studying materials. It provides a local probe that generally complements scattering techniques in the study of magnetic systems and represents a valuable alternative for materials that display strong incoherent scattering or neutron absorption. Computational methods can effectively quantify the microscopic interactions underlying the experimentally observed signal, thus substantially boosting the predictive power of this technique. Here, we present an efficient set of algorithms and workflows devoted to the automation of this task. In particular, we adopt the so-called DFT+μ procedure, where the system is characterized in the density functional theory (DFT) framework with the muon modeled as a hydrogen impurity. We devise an automated strategy to obtain candidate muon stopping sites, their dipolar interaction with the nuclei, and hyperfine interactions with the electronic ground state. 
We validate the implementation on well-studied compounds, showing the effectiveness of our protocol in terms of accuracy and simplicity of use.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 523-538"},"PeriodicalIF":6.2,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00314d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erwin Lam, Tanguy Maury, Sebastian Preiss, Yuhui Hou, Hannes Frey, Caterina Barillari and Paco Laveille
Data management and processing are crucial steps to implement streamlined and standardized data workflows for automated and high-throughput laboratories. Electronic laboratory notebooks (ELNs) have proven to be effective to manage data in combination with a laboratory information management system (LIMS) to connect data and inventory. However, streamlined data processing still poses a challenge in an ELN, especially with large datasets. Herein we present a Python library that allows streamlining and automating data management of tabular data generated within a data-driven, automated high-throughput laboratory with a focus on heterogeneous catalysis R&D. This approach speeds up data processing and avoids errors introduced by manual data processing. Through the Python library, raw data from individual instruments related to a project are downloaded from an ELN, merged in a relational database fashion, processed and re-uploaded back to the ELN. Straightforward data merging is especially important, since information stemming from multiple devices needs to be processed together. By providing a configuration file that contains all the data management information, data merging and processing of individual data sources are executed. Establishing streamlined data management workflows allows standardization of data handling and contributes to the implementation and use of open research data following Findable, Accessible, Interoperable and Reusable (FAIR) principles in the field of heterogeneous catalysis.
{"title":"General data management workflow to process tabular data in automated and high-throughput heterogeneous catalysis research†‡","authors":"Erwin Lam, Tanguy Maury, Sebastian Preiss, Yuhui Hou, Hannes Frey, Caterina Barillari and Paco Laveille","doi":"10.1039/D4DD00350K","DOIUrl":"https://doi.org/10.1039/D4DD00350K","url":null,"abstract":"<p >Data management and processing are crucial steps to implement streamlined and standardized data workflows for automated and high-throughput laboratories. Electronic laboratory notebooks (ELNs) have proven to be effective to manage data in combination with a laboratory information management system (LIMS) to connect data and inventory. However, streamlined data processing does still pose a challenge on an ELN especially with large data. Herein we present a Python library that allows streamlining and automating data management of tabular data generated within a data-driven, automated high-throughput laboratory with a focus on heterogeneous catalysis R&D. This approach speeds up data processing and avoids errors introduced by manual data processing. Through the Python library, raw data from individual instruments related to a project are downloaded from an ELN, merged in a relational database fashion, processed and re-uploaded back to the ELN. Straightforward data merging is especially important, since information stemming from multiple devices needs to be processed together. By providing a configuration file that contains all the data management information, data merging and processing of individual data sources is executed. 
Having established streamlined data management workflows allows standardization of data handling and contributes to the implementation and use of open research data following Findable, Accessible, Interoperable and Reusable (FAIR) principles in the field of heterogeneous catalysis.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 539-547"},"PeriodicalIF":6.2,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00350k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
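The abstract above describes merging tabular data from several instruments "in a relational database fashion". The following is a minimal sketch of that idea using Python's built-in `sqlite3` module. It is not the authors' library: the table names, column names (`sample_id`, `temperature_c`, `conversion`), and the choice of an inner join on a shared sample identifier are all assumptions made for illustration.

```python
import sqlite3


def merge_instrument_tables(reactor_rows, gc_rows):
    """Join two hypothetical instrument tables on a shared sample identifier."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE reactor (sample_id TEXT PRIMARY KEY, temperature_c REAL)")
    conn.execute("CREATE TABLE gc (sample_id TEXT PRIMARY KEY, conversion REAL)")
    conn.executemany("INSERT INTO reactor VALUES (?, ?)", reactor_rows)
    conn.executemany("INSERT INTO gc VALUES (?, ?)", gc_rows)
    # Inner join: keep only samples measured on both instruments.
    merged = conn.execute(
        "SELECT r.sample_id, r.temperature_c, g.conversion "
        "FROM reactor AS r JOIN gc AS g ON r.sample_id = g.sample_id "
        "ORDER BY r.sample_id"
    ).fetchall()
    conn.close()
    return merged


reactor = [("S1", 250.0), ("S2", 300.0), ("S3", 350.0)]
gc = [("S1", 0.42), ("S2", 0.57)]
print(merge_instrument_tables(reactor, gc))
# [('S1', 250.0, 0.42), ('S2', 300.0, 0.57)]
```

Sample `S3` is dropped because it has no chromatography record; a left join would keep it with a missing value instead, and which behaviour is appropriate depends on the downstream processing.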
Evgeni Ulanov, Ghulam A. Qadir, Kai Riedmiller, Pascal Friederich and Frauke Gräter
Predicting reaction barriers for arbitrary configurations based on only a limited set of density functional theory (DFT) calculations would render the design of catalysts or the simulation of reactions within complex materials highly efficient. We here propose Gaussian process regression (GPR) as a method of choice if DFT calculations are limited to hundreds or thousands of barrier calculations. For the case of hydrogen atom transfer in proteins, an important reaction in chemistry and biology, we obtain a mean absolute error of 3.23 kcal mol⁻¹ for the range of barriers in the data set using SOAP descriptors and similar values using the marginalized graph kernel. Thus, the two GPR models can robustly estimate reaction barriers within the large chemical and conformational space of proteins. Their predictive power is comparable to a graph neural network-based model, and GPR even outcompetes the latter in the low data regime. We propose GPR as a valuable tool for an approximate but data-efficient model of chemical reactivity in a complex and highly variable environment.
{"title":"Predicting hydrogen atom transfer energy barriers using Gaussian process regression†","authors":"Evgeni Ulanov, Ghulam A. Qadir, Kai Riedmiller, Pascal Friederich and Frauke Gräter","doi":"10.1039/D4DD00174E","DOIUrl":"10.1039/D4DD00174E","url":null,"abstract":"<p >Predicting reaction barriers for arbitrary configurations based on only a limited set of density functional theory (DFT) calculations would render the design of catalysts or the simulation of reactions within complex materials highly efficient. We here propose Gaussian process regression (GPR) as a method of choice if DFT calculations are limited to hundreds or thousands of barrier calculations. For the case of hydrogen atom transfer in proteins, an important reaction in chemistry and biology, we obtain a mean absolute error of 3.23 kcal mol<small><sup>−1</sup></small> for the range of barriers in the data set using SOAP descriptors and similar values using the marginalized graph kernel. Thus, the two GPR models can robustly estimate reaction barriers within the large chemical and conformational space of proteins. Their predictive power is comparable to a graph neural network-based model, and GPR even outcompetes the latter in the low data regime. 
We propose GPR as a valuable tool for an approximate but data-efficient model of chemical reactivity in a complex and highly variable environment.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 513-522"},"PeriodicalIF":6.2,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11747964/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143030366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
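For readers unfamiliar with Gaussian process regression, the sketch below shows the predictive-mean calculation and the mean absolute error metric quoted in the abstract above, in pure Python. It is an illustration under stated assumptions, not the authors' model: it uses a plain RBF kernel on made-up one-dimensional descriptors rather than SOAP descriptors or the marginalized graph kernel, it omits the predictive variance, and all function names and numbers are hypothetical.

```python
import math


def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-0.5 * d2 / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gpr_fit(X, y, noise=1e-6, length_scale=1.0):
    """Return (mean of y, weights alpha) with alpha = (K + noise*I)^-1 (y - mean)."""
    mean_y = sum(y) / len(y)
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    alpha = solve(K, [yi - mean_y for yi in y])
    return mean_y, alpha


def gpr_predict(X_train, model, x_new, length_scale=1.0):
    """Predictive mean at x_new: mean_y + sum_i alpha_i * k(x_i, x_new)."""
    mean_y, alpha = model
    return mean_y + sum(a * rbf(xt, x_new, length_scale)
                        for a, xt in zip(alpha, X_train))


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 1-D descriptors and barrier values (kcal/mol), for illustration only.
X_train = [(0.0,), (1.0,), (2.0,), (3.0,)]
y_train = [10.0, 12.0, 15.0, 11.0]
model = gpr_fit(X_train, y_train)
fitted = [gpr_predict(X_train, model, x) for x in X_train]
```

With a very small noise term the model interpolates the training barriers almost exactly, and far from any training point the prediction falls back to the mean of the training data; in practice the kernel, its length scale, and the noise level would be chosen by fitting to held-out DFT barriers.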
Kangyong Ma
This work utilizes collected and organized instructional data from the field of chemical science to fine-tune mainstream open-source large language models. To objectively evaluate the performance of the fine-tuned models, we have developed an automated scoring system specifically for the chemistry domain, ensuring the accuracy and reliability of the evaluation results. Building on this foundation, we have designed an innovative chemical intelligent assistant system. This system employs the fine-tuned Mistral NeMo model as one of its primary models and features a mechanism for flexibly invoking various advanced models. This design fully considers the rapid iteration characteristics of large language models, ensuring that the system can continuously leverage the latest and most powerful AI capabilities. A major highlight of this system is its deep integration of professional knowledge and requirements from the chemistry field. By incorporating specialized functions such as molecular visualization, SMILES string processing, and chemical literature retrieval, the system significantly enhances its practical value in chemical research and applications. More notably, through carefully designed mechanisms for knowledge accumulation, skill acquisition, performance evaluation, and group collaboration, the system can optimize its professional abilities and interaction quality to a certain extent.
{"title":"AI agents in chemical research: GVIM – an intelligent research assistant system†","authors":"Kangyong Ma","doi":"10.1039/D4DD00398E","DOIUrl":"https://doi.org/10.1039/D4DD00398E","url":null,"abstract":"<p >This work utilizes collected and organized instructional data from the field of chemical science to fine-tune mainstream open-source large language models. To objectively evaluate the performance of the fine-tuned models, we have developed an automated scoring system specifically for the chemistry domain, ensuring the accuracy and reliability of the evaluation results. Building on this foundation, we have designed an innovative chemical intelligent assistant system. This system employs the fine-tuned Mistral NeMo model as one of its primary models and features a mechanism for flexibly invoking various advanced models. This design fully considers the rapid iteration characteristics of large language models, ensuring that the system can continuously leverage the latest and most powerful AI capabilities. A major highlight of this system is its deep integration of professional knowledge and requirements from the chemistry field. By incorporating specialized functions such as molecular visualization, SMILES string processing, and chemical literature retrieval, the system significantly enhances its practical value in chemical research and applications. 
More notably, through carefully designed mechanisms for knowledge accumulation, skill acquisition, performance evaluation, and group collaboration, the system can optimize its professional abilities and interaction quality to a certain extent.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 2","pages":" 355-375"},"PeriodicalIF":6.2,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00398e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143396423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}