Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, which limits their flexibility when the design process must incorporate out-of-domain knowledge or requires comprehensive data analysis. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures, and obtaining new first-principles data (natural vibrational frequencies) via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on the one hand, and their capacity for autonomous collaboration through the dynamic LLM-based multi-agent environment, on the other, unlock the potential of LLMs in addressing multi-objective materials problems and open up new avenues for autonomous materials discovery and design.
ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Alireza Ghafarollahi and Markus J. Buehler. Digital Discovery, published 2024-05-17. DOI: 10.1039/D4DD00013G. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00013g?page=search
Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte and Andrew D. Ellington
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
Mining patents with large language models elucidates the chemical function landscape. Digital Discovery, published 2024-05-07. DOI: 10.1039/D4DD00011K. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00011k?page=search
Kenneth López-Pérez, Taewon D. Kim and Ramón Alain Miranda-Quintana
The quantification of molecular similarity has been present since the beginning of cheminformatics. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to obtain the average similarity of a set of molecules, all the pairwise comparisons need to be computed, so the computational cost scales quadratically with the number of molecules. Here we propose an exact alternative to this problem: iSIM (instant similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented by binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.
iSIM: instant similarity. Digital Discovery, published 2024-05-07. DOI: 10.1039/D4DD00041B. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00041b?page=search
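The iSIM abstract above replaces all pairwise comparisons with a single pass over the set. The following is a minimal sketch of that idea, not the authors' code or API. It assumes binary fingerprints stored as lists of 0/1 and uses the Russell-Rao index (shared on-bits divided by fingerprint length) as the similarity. If `k` molecules have a given bit set, then `comb(k, 2)` pairs share that bit, so the average over all pairs follows from the per-bit column sums alone. The function names and the toy fingerprints are illustrative assumptions.

```python
from itertools import combinations
from math import comb


def pairwise_mean_russell_rao(fps):
    """Quadratic reference: average Russell-Rao similarity over all pairs."""
    m = len(fps[0])
    pairs = list(combinations(fps, 2))
    total = sum(sum(a & b for a, b in zip(x, y)) / m for x, y in pairs)
    return total / len(pairs)


def column_sum_russell_rao(fps):
    """Linear alternative: the same average from per-bit column sums."""
    n, m = len(fps), len(fps[0])
    col_sums = [sum(col) for col in zip(*fps)]
    # comb(k, 2) counts the pairs of molecules that both have this bit set.
    shared_on = sum(comb(k, 2) for k in col_sums)
    return shared_on / (m * comb(n, 2))


# Toy set of four 4-bit fingerprints (hypothetical data).
fps = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 1], [1, 1, 1, 0]]
print(pairwise_mean_russell_rao(fps), column_sum_russell_rao(fps))  # 0.5 0.5
```

The pairwise function touches every pair, while the column-sum function reads each fingerprint once, which is where the linear scaling comes from.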
Correction for ‘Predicting small molecules solubility on endpoint devices using deep ensemble neural networks’ by Mayk Caldas Ramos and Andrew D. White, Digital Discovery, 2024, 3, 786–795, https://doi.org/10.1039/D3DD00217A.
Correction: Predicting small molecules solubility on endpoint devices using deep ensemble neural networks. Mayk Caldas Ramos and Andrew D. White. Digital Discovery, published 2024-05-03. DOI: 10.1039/D4DD90020K. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90020k?page=search
Kangming Li, Brian DeCost, Kamal Choudhary and Jason Hattrick-Simpers
The use of machine learning has become increasingly popular in materials science as data-driven materials discovery becomes the new paradigm. Reproducibility of findings is paramount for promoting transparency and accountability in research and building trust in the scientific community. Here we conduct a reproducibility analysis of the work by K. Choudhary and B. DeCost [npj Comput. Mater., 7, 2021, 185], in which a new graph neural network architecture was developed with improved performance on multiple atomistic prediction tasks. We examine the reproducibility of the model performance on 29 regression tasks and of an ablation analysis of the graph neural network layers. We find that the reproduced results generally exhibit good quantitative agreement with the initial study, despite minor disparities in model performance and training efficiency that may result from factors such as hardware differences and the stochasticity involved in model training and data splits. The ease of conducting these reproducibility experiments confirms the great benefits of the open data and code practices to which the initial work adhered. We also discuss some further enhancements in reproducible practices, such as code and data archiving and providing the data identifiers used in dataset splits.
A reproducibility study of atomistic line graph neural networks for materials property prediction. Digital Discovery, published 2024-04-30. DOI: 10.1039/D4DD00064A. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00064a?page=search
Ryan J. R. Jones, Yungchieh Lai, Dan Guevarra, Kevin Kan, Joel A. Haber and John M. Gregoire
The electrochemical conversion of carbon dioxide to chemicals and fuels is expected to be a key sustainability technology. Electrochemical carbon dioxide reduction technologies are challenged by several factors, including the limited solubility of carbon dioxide in aqueous electrolyte and the difficulty of utilizing polymer electrolytes. These considerations have driven system designs to incorporate gas diffusion electrodes (GDEs) to bring the electrocatalyst in contact with both a gaseous reactant/product stream and a liquid electrolyte. GDE optimization typically results from manual tuning by select experts. Automated preparation and operation of GDE cells could be a watershed for the systematic study of, and ultimately the development of a materials acceleration platform (MAP) for, catalyst discovery and system optimization. Toward this end, we present the automated GDE (AutoGDE) testing system. Given a catalyst-coated GDE, AutoGDE automates the insertion of the GDE into an electrochemical cell, the liquid and gas handling, the quantification of gaseous reaction products via online mass spectroscopy, and the archiving of the liquid electrolyte for subsequent analysis.
Accelerated screening of gas diffusion electrodes for carbon dioxide reduction. Digital Discovery, published 2024-04-30. DOI: 10.1039/D4DD00061G. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00061g?page=search
Alexandra Volokhova, Michał Koziarski, Alex Hernández-García, Cheng-Hao Liu, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Alán Aspuru-Guzik and Yoshua Bengio
Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. In this paper we propose to use GFlowNets for sampling conformations of small molecules from the Boltzmann distribution, as determined by the molecule's energy. The proposed approach can be used in combination with energy estimation methods of different fidelity and discovers a diverse set of low-energy conformations for drug-like molecules. We demonstrate that GFlowNets can reproduce molecular potential energy surfaces by sampling proportionally to the Boltzmann distribution.
Towards equilibrium molecular conformation generation with GFlowNets. Digital Discovery, published 2024-04-29. DOI: 10.1039/D4DD00023D. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00023d?page=search
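The target distribution named in the GFlowNets abstract above assigns each conformation a probability proportional to exp(-E/(RT)). The sketch below only illustrates that target for a handful of discrete conformers; it does not implement a GFlowNet. The relative energies, the kcal/mol units, and the function name are assumptions made for the example.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_probabilities(energies_kcal, temperature_k=298.15):
    """Normalized Boltzmann weights p_i proportional to exp(-E_i / (R T))."""
    beta = 1.0 / (R_KCAL * temperature_k)
    # Shifting by the minimum energy avoids overflow and cancels on normalization.
    e_min = min(energies_kcal)
    weights = [math.exp(-beta * (e - e_min)) for e in energies_kcal]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical relative energies of three conformers, in kcal/mol.
probs = boltzmann_probabilities([0.0, 0.5, 2.0])
print([round(p, 3) for p in probs])
```

A sampler that matches this distribution visits low-energy conformers most often but still returns higher-energy ones at a rate set by the temperature, which is what distinguishes it from a pure energy minimizer.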
Maxime van der Heijden, Gabor Szendrei, Victor de Haas and Antoni Forner-Cuenca
Porous electrodes are performance-defining components in electrochemical devices, such as redox flow batteries, as they govern the electrochemical performance and pumping demands of the reactor. Yet, conventional porous electrodes used in redox flow batteries are not tailored to sustain convection-enhanced electrochemical reactions. Thus, there is a need for electrode optimization to enhance the system performance. In this work, we present an optimization framework to carry out the bottom-up design of porous electrodes by coupling a genetic algorithm with a pore network modeling framework. We introduce geometrical versatility by adding a pore merging and splitting function, study the impact of various optimization parameters, geometrical definitions, and objective functions, and incorporate conventional electrode and flow field designs. Moreover, we show the need for optimizing geometries for specific reactor architectures and operating conditions to design next-generation electrodes, by analyzing the genetic algorithm optimization for starting geometries with diverse morphologies (cubic and a tomography-extracted commercial electrode), flow field designs (flow-through and interdigitated), and redox chemistries (VO²⁺/VO₂⁺ and TEMPO/TEMPO⁺). We found that for kinetically sluggish electrolytes with high ionic conductivity, electrodes with numerous small pores and high internal surface area provide enhanced performance, whereas for kinetically facile electrolytes with low ionic conductivity, low through-plane tortuosity and high hydraulic conductance are desired. The computational tool developed in this work can be further expanded to the design of high-performance electrode materials for a broad range of operating conditions, electrolyte chemistries, reactor designs, and electrochemical technologies.
A versatile optimization framework for porous electrode design. Digital Discovery, published 2024-04-25. DOI: 10.1039/D3DD00247K. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00247k?page=search
Marco Anselmi, Greg Slabaugh, Rachel Crespo-Otero and Devis Di Tommaso
Graph Neural Networks (GNNs) have revolutionized material property prediction by learning directly from the structural information of molecules and materials. However, conventional GNN models rely solely on local atomic interactions, such as bond lengths and angles, neglecting crucial long-range electrostatic forces that affect certain properties. To address this, we introduce the Molecular Graph Transformer (MGT), a novel GNN architecture that combines local attention mechanisms with message passing on both bond graphs and their line graphs, explicitly capturing long-range interactions. Benchmarking on MatBench and Quantum MOF (QMOF) datasets demonstrates that MGT's improved understanding of electrostatic interactions significantly enhances the prediction accuracy of properties like exfoliation energy and refractive index, while maintaining state-of-the-art performance on all other properties. This breakthrough paves the way for the development of highly accurate and efficient materials design tools across diverse applications.
Molecular graph transformer: stepping beyond ALIGNN into long-range interactions. Digital Discovery, published 2024-04-23. DOI: 10.1039/D4DD00014E. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00014e?page=search
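The MGT abstract above mentions message passing on bond graphs and on their line graphs. As a small illustration of what a line graph is, and not a piece of the MGT or ALIGNN code, the sketch below builds one from a list of bonds: every bond becomes a node, and two bonds are connected when they share an atom, so each line-graph edge corresponds to a bond angle. The atom indices and the function name are hypothetical.

```python
from itertools import combinations


def line_graph(bonds):
    """Return line-graph edges as pairs of bond indices that share an atom."""
    edges = []
    for (i, a), (j, b) in combinations(enumerate(bonds), 2):
        if set(a) & set(b):
            edges.append((i, j))
    return edges


# Water with atoms O=0, H=1, H=2: the two O-H bonds share the oxygen,
# so the line graph has one edge, corresponding to the H-O-H angle.
print(line_graph([(0, 1), (0, 2)]))  # [(0, 1)]
```

Passing messages on this second graph is what lets a model see angular information in addition to bond lengths.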
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and few observations of negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use the two learning strategies of adapting the base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive–negative (PN) classification approach, where there is access to both positive and negative examples.
Learning peptide properties with positive examples only. Mehrad Ansari and Andrew D. White. Digital Discovery, published 2024-04-19. DOI: 10.1039/D3DD00218G. PDF: https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00218g?page=search
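The reliable negative identification strategy named in the peptide abstract above is commonly run in two steps: first pick the unlabeled examples that look least like the known positives, then train an ordinary positive-negative classifier on the positives and those picks. The sketch below shows the two steps with a nearest-centroid model standing in for the deep learning models of the paper. The 2-D feature vectors, the `fraction` parameter, and all function names are assumptions made for illustration.

```python
def centroid(points):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reliable_negatives(positives, unlabeled, fraction=0.5):
    """Step 1: keep the unlabeled points farthest from the positive centroid."""
    c_pos = centroid(positives)
    ranked = sorted(unlabeled, key=lambda u: sq_dist(u, c_pos), reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]


def fit_two_centroid_classifier(positives, negatives):
    """Step 2: a positive-negative classifier; returns True for 'positive'."""
    c_pos, c_neg = centroid(positives), centroid(negatives)
    return lambda x: sq_dist(x, c_pos) < sq_dist(x, c_neg)


# Hypothetical 2-D embeddings of peptide sequences.
positives = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
unlabeled = [[1.1, 1.0], [5.0, 5.0], [4.8, 5.2], [1.0, 0.8]]
negatives = reliable_negatives(positives, unlabeled)
classify = fit_two_centroid_classifier(positives, negatives)
print(classify([1.05, 0.95]), classify([4.9, 5.1]))  # True False
```

The unlabeled points that sit close to the positives are left out of training rather than forced into the negative class, which is the main point of the strategy.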