Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining using tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, the encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study on several key issues of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, despite being pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation to fine-tune for reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful yet not essential to leverage the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across different models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient greedy decoding strategy is very competitive, while only marginal gains can be achieved from more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction. Jiayun Pang, Ivan Vulić. Faraday Discussions, published 2024-08-19. DOI: 10.1039/d4fd00104d, https://doi.org/10.1039/d4fd00104d
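The Top-1 and Top-5 accuracy reported in the abstract above can be illustrated with a minimal sketch. It assumes that each test reaction comes with a ranked list of predicted product SMILES strings and that a prediction counts as correct on an exact string match. The function name `top_k_accuracy` and the toy data are hypothetical and are not taken from the paper; in practice, SMILES strings are often canonicalised before comparison, a step that is omitted here.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Fraction of examples whose target is among the first k ranked predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(predictions, targets) if target in ranked[:k]
    )
    return hits / len(targets)


# Toy data (hypothetical): three reactions, three ranked candidates each.
ranked_predictions = [
    ["CCO", "CCOC", "CC=O"],            # target is ranked first
    ["CC(=O)O", "CC(=O)OC", "CCO"],     # target is ranked second
    ["c1ccccc1", "C1CCCCC1", "c1ccncc1"],  # target is missing
]
targets = ["CCO", "CC(=O)OC", "CCN"]

top1 = top_k_accuracy(ranked_predictions, targets, k=1)
top3 = top_k_accuracy(ranked_predictions, targets, k=3)
print(f"Top-1: {top1:.3f}, Top-3: {top3:.3f}")
```

Greedy decoding returns a single candidate per reaction, so it can only be scored at Top-1; beam search or sampling is needed to produce the longer ranked lists used for Top-5.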
Grayson Huldin, Junming Huang, Julius Reitemeier, Kaiyu Fu
The transition to a personalized point-of-care model in medicine will fundamentally change the way medicine is practiced, leading to better patient care. Electrochemical biosensors based on structure-switching aptamers can contribute to this medical revolution due to the feasibility and convenience of selecting aptamers for specific targets. Recent studies have reported that nanostructured electrodes can enhance the signals of aptamer-based biosensors. However, miniaturized systems and body fluid environments pose challenges such as signal-to-noise ratio reduction and biofouling. To address these issues, researchers have proposed various electrode coating materials, including zwitterionic materials, biocompatible polymers, and hybrid membranes. Nafion, a commonly used ion exchange membrane, is known for its excellent permselectivity and anti-biofouling properties, making it a suitable choice for biosensor systems. However, the performance and mechanism of Nafion-coated aptamer-based biosensor systems have not been thoroughly studied. In this work, we present a Nafion-coated gold nanoporous electrode, which excludes Nafion from the nanoporous structures and allows the aptamers immobilized inside the nanopores to freely detect chosen targets. The nanopore electrode is formed by a sputtering and dealloying process, resulting in pore sizes of tens of nanometers. The biosensor is optimized by adjusting the electrochemical measurement parameters, aptamer density, Nafion thickness, and nanopore size. Furthermore, we propose an explanation for the unusual signaling behavior of the aptamers confined within the nanoporous structures. This work provides a generalizable platform to investigate membrane-coated aptamer-based biosensors.
Nafion Coated Nanopore Electrode for Improving Electrochemical Aptamer-Based Biosensing. Grayson Huldin, Junming Huang, Julius Reitemeier, Kaiyu Fu. Faraday Discussions, published 2024-08-14. DOI: 10.1039/d4fd00144c, https://doi.org/10.1039/d4fd00144c
Synthesis of predicted materials is the key and final step needed to realize a vision of computationally-accelerated materials discovery. Because so many materials have been previously synthesized, one would anticipate that text-mining synthesis recipes from the literature would yield a valuable dataset to train machine learning models that can predict synthesis recipes for new materials. Between 2016 and 2019, the corresponding author (Wenhao Sun) participated in efforts to text-mine 31,782 solid-state synthesis recipes and 35,675 solution-based synthesis recipes from the literature. Here, we characterize these datasets and show that they do not satisfy the “4 Vs” of data science: volume, veracity, variety, and velocity. For this reason, we believe that machine-learned regression or classification models built from these datasets will have limited utility in guiding the predictive synthesis of novel materials. On the other hand, these large datasets provided an opportunity to identify anomalous synthesis recipes, which in fact did inspire new hypotheses on how materials form that we later validated by experiment. Our case study here urges a re-evaluation of how to extract the most value from large historical materials science datasets.
A critical reflection on attempts to machine-learn materials synthesis insights from text-mined literature recipes. Wenhao Sun, Nicholas David. Faraday Discussions, published 2024-08-13. DOI: 10.1039/d4fd00112e, https://doi.org/10.1039/d4fd00112e
Harveen Kaur, Flaviano Della Pia, Ilyes Batatia, Xavier R. Advincula, Benjamin X. Shi, Jinggang Lan, Gábor Csányi, Angelos Michaelides, Venkat Kapil
Calculating sublimation enthalpies of molecular crystal polymorphs is relevant to a wide range of technological applications. However, predicting these quantities at first-principles accuracy – even with the aid of machine learning potentials – is a challenge that requires sub-kJ/mol accuracy in the potential energy surface and finite-temperature sampling. We present an accurate and data-efficient protocol for training machine learning interatomic potentials by fine-tuning the foundational MACE-MP-0 model and showcase its capabilities on sublimation enthalpies and physical properties of ice polymorphs. Our approach requires only a few tens of training structures to achieve sub-kJ/mol accuracy in the sublimation enthalpies and sub-1% error in densities at finite temperature and pressure. Exploiting this data efficiency, we perform preliminary NPT simulations of hexagonal ice at the random phase approximation level and demonstrate good agreement with experiments. Our results show promise for finite-temperature modelling of molecular crystals with the accuracy of correlated electronic structure theory methods.
Data-efficient fine-tuning of foundational models for first-principles quality sublimation enthalpies. Harveen Kaur, Flaviano Della Pia, Ilyes Batatia, Xavier R. Advincula, Benjamin X. Shi, Jinggang Lan, Gábor Csányi, Angelos Michaelides, Venkat Kapil. Faraday Discussions, published 2024-08-09. DOI: 10.1039/d4fd00107a, https://doi.org/10.1039/d4fd00107a
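For orientation, the sublimation enthalpy targeted in the abstract above is conventionally the molar enthalpy difference between the vapour and the crystal at the same temperature. The expression below is a minimal sketch of that definition, assuming the vapour behaves as an ideal gas; that assumption and the symbol names are illustrative and are not taken from the paper.

```latex
\Delta H_{\mathrm{sub}}(T)
  = H_{\mathrm{gas}}(T) - H_{\mathrm{cryst}}(T, P)
  = \bigl[\, U_{\mathrm{gas}}(T) + RT \,\bigr]
  - \bigl[\, U_{\mathrm{cryst}}(T, P) + P\, V_{\mathrm{cryst}}(T, P) \,\bigr]
```

Here $U$ is the molar internal energy, $V_{\mathrm{cryst}}$ the molar volume of the crystal, and $RT$ replaces $PV$ for one mole of ideal gas. The crystal terms are the ones that require finite-temperature sampling at constant pressure.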
DaVante Cain, Ethan Cao, Ivan Vlassiouk, Tilman E Schäffer, Zuzanna Siwy
There has been great interest in nanopores as the basis for sensors and templates for preparation of biomimetic channels as well as model systems to understand transport properties at the nanoscale. The presence of surface charges on the pore walls has been shown to induce ion selectivity as well as enhance ionic conductance compared to uncharged pores. Here, using three-dimensional continuum modeling, we examine the role of the length of charged nanopores as well as applied voltage in controlling ion selectivity and ionic conductance of single nanopores and small nanopore arrays. First, we present conditions where the ion current and ion selectivity of nanopores with homogeneous surface charges remain unchanged even if the pore length decreases by a factor of 6. This length-independent conductance is explained through the effect of ion concentration polarization (ICP) that modifies local ionic concentrations not only at the pore entrances but also in the pore in a voltage-dependent manner. We describe how voltage controls ion selectivity of nanopores with different lengths and present conditions under which charged nanopores conduct less current than uncharged pores of the same geometrical characteristics. The manuscript provides different measures of the extent of the depletion zone induced by ICP in single pores and nanopore arrays including systems with ionic diodes. The modeling shown here will help design selective nanopores for a variety of applications where single nanopores and nanopore arrays are used.
Ion Concentration Polarization Causes a Nearly Pore-Length-Independent Conductance of Nanopores. DaVante Cain, Ethan Cao, Ivan Vlassiouk, Tilman E Schäffer, Zuzanna Siwy. Faraday Discussions, published 2024-08-08. DOI: 10.1039/d4fd00148f, https://doi.org/10.1039/d4fd00148f
Data driven methods have transformed the prospects of the computational chemical sciences, with machine learned interatomic potentials (MLIPs) speeding up calculations by several orders of magnitude. I reflect on theory driven, as opposed to data driven, discovery based on ab initio random structure searching (AIRSS), and then introduce two new methods which exploit machine learning acceleration. I show how long high throughput anneals, between direct structural relaxation, enabled by ephemeral data derived potentials (EDDPs), can be incorporated into AIRSS to bias the sampling of challenging systems towards low energy configurations. Hot AIRSS (hot-AIRSS) preserves the parallel advantage of random search, while allowing much more complex systems to be tackled. This is demonstrated through searches for complex boron structures in large unit cells. I then show how low energy carbon structures can be directly generated from a single, experimentally determined, diamond structure. In an extension of the generation of random sensible structures, candidates are stochastically generated and then optimised to minimise the difference between the EDDP environment vector and that of the reference diamond structure. The distance-based cost function is captured in an actively learned EDDP. Graphite, small nanotubes and caged, fullerene-like structures emerge from searches using this potential, along with a rich variety of tetrahedral framework structures. Using the same approach, the pyrope, Mg3Al2(SiO4)3, garnet structure is recovered from a low energy AIRSS structure generated in a smaller unit cell with a different chemical composition. The relationship of this approach to modern diffusion model based generative methods is discussed.
Beyond theory driven discovery: introducing hot random search and datum derived structures. Chris J. Pickard. Faraday Discussions, published 2024-08-06. DOI: 10.1039/d4fd00134f, https://doi.org/10.1039/d4fd00134f
Krzysztof Maziarz, Austin Tripp, Guoqing Liu, Megan Stanley, Shufang Xie, Piotr Gainski, Philipp Seidl, Marwin Segler
Automated Synthesis Planning has recently re-emerged as a research area at the intersection of chemistry and machine learning. Despite the appearance of steady progress, we argue that imperfect benchmarks and inconsistent comparisons mask systematic shortcomings of existing techniques, and unnecessarily hamper progress. To remedy this, we present a synthesis planning library with an extensive benchmarking framework, called Syntheseus, which promotes best practice by default, enabling consistent, meaningful evaluation of single-step and multi-step synthesis planning algorithms. We demonstrate the capabilities of Syntheseus by re-evaluating several previous retrosynthesis algorithms, and find that the ranking of state-of-the-art models changes in controlled evaluation experiments. We end with guidance for future work in this area, and call on the community to engage in the discussion on how to improve benchmarks for synthesis planning.
Re-evaluating Retrosynthesis Algorithms with Syntheseus. Krzysztof Maziarz, Austin Tripp, Guoqing Liu, Megan Stanley, Shufang Xie, Piotr Gainski, Philipp Seidl, Marwin Segler. Faraday Discussions, published 2024-08-05. DOI: 10.1039/d4fd00093e, https://doi.org/10.1039/d4fd00093e
Veronika Jurásková, Gers Tusha, Hanwen Zhang, Lars V Schäfer, Fernanda Duarte
Metal ions are irreplaceable in many areas of chemistry, including (bio)catalysis, self-assembly and charge transfer processes. Yet, modelling their structural and dynamic properties in diverse chemical environments remains challenging for both force fields and ab initio methods. Here, we introduce a strategy to train machine learning potentials (MLPs) using MACE, an equivariant message-passing neural network, for metal-ligand complexes in explicit solvents. We explore the structure and ligand exchange dynamics of Mg2+ in water and Pd2+ in acetonitrile as two illustrative model systems. The trained potentials accurately reproduce equilibrium structures of the complexes in solution, including different coordination numbers and geometries. Furthermore, the MLPs can model structural changes between metal ions and ligands in the first coordination shell, and reproduce the free energy barriers for the corresponding ligand exchange. The strategy presented here provides a computationally efficient approach to model metal ions in solution, paving the way for modelling larger and more diverse metal complexes relevant to biomolecules and supramolecular assemblies.
Modelling ligand exchange in metal complexes with machine learning potentials. Veronika Jurásková, Gers Tusha, Hanwen Zhang, Lars V Schäfer, Fernanda Duarte. Faraday Discussions, published 2024-08-03. DOI: 10.1039/d4fd00140k, https://doi.org/10.1039/d4fd00140k
Arya Changiarath Sivadasan, Aayush Arya, Vasileios A. Xenidis, Jan Padeken, Lukas S. Stelzl
Elucidating how protein sequence determines the properties of disordered proteins and their phase-separated condensates is a great challenge in computational chemistry, biology, and biophysics. Quantitative molecular dynamics simulations and derived free energy values can in principle capture how a sequence encodes the chemical and biological properties of a protein. These calculations are, however, computationally demanding, even after reducing the representation by coarse-graining; exploring the large spaces of potentially relevant sequences remains a formidable task. We employ an "active learning" scheme introduced by Yang et al. (bioRxiv 2022.08.05.502972) to reduce the number of labelled examples needed from simulations, where a neural network-based model suggests the most useful examples for the next training cycle. Applying this Bayesian Optimisation framework, we determine properties of protein sequences with coarse-grained molecular dynamics, which enables the network to establish sequence-property relationships for disordered proteins, their self-interactions, and their interactions in phase-separated condensates. We show how iterative training with second virial coefficients derived from the simulations of disordered protein sequences leads to a rapid improvement in predicting peptide self-interactions. We employ this Bayesian approach to efficiently search for new sequences that bind to condensates of the disordered C-terminal domain (CTD) of RNA Polymerase II, by simulating molecular recognition of peptides to phase-separated condensates in coarse-grained molecular dynamics. By searching for protein sequences which prefer to self-interact rather than interact with another protein sequence, we are able to shape the morphology of protein condensates and design multiphasic protein condensates.
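The second virial coefficient used above as a measure of self-interaction can be illustrated with a small numerical sketch. It assumes an isotropic pair potential U(r) in closed form and evaluates B2 = -2π ∫ (exp(-U(r)/kT) - 1) r² dr with a midpoint rule. The abstract derives the coefficients from coarse-grained simulations instead, so the function names, potentials, and parameters below are hypothetical stand-ins rather than the authors' procedure.

```python
import math
from typing import Callable


def second_virial(
    potential: Callable[[float], float],
    kT: float,
    r_max: float,
    n_steps: int = 50_000,
) -> float:
    """B2 = -2*pi * integral_0^r_max (exp(-U(r)/kT) - 1) r^2 dr (midpoint rule)."""
    dr = r_max / n_steps
    integral = 0.0
    for i in range(n_steps):
        r = (i + 0.5) * dr  # midpoints avoid evaluating the potential at r = 0
        mayer_f = math.exp(-potential(r) / kT) - 1.0
        integral += mayer_f * r * r * dr
    return -2.0 * math.pi * integral


def hard_sphere(r: float, sigma: float = 1.0) -> float:
    """Purely repulsive reference potential with a known analytic B2."""
    return math.inf if r < sigma else 0.0


def lennard_jones(r: float, epsilon: float = 1.0, sigma: float = 1.0) -> float:
    """Toy attractive potential in reduced units (illustrative only)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


b2_hard_sphere = second_virial(hard_sphere, kT=1.0, r_max=5.0)
b2_lj_cold = second_virial(lennard_jones, kT=1.0, r_max=10.0)
b2_lj_hot = second_virial(lennard_jones, kT=10.0, r_max=10.0)
print(b2_hard_sphere, b2_lj_cold, b2_lj_hot)
```

With this sign convention, a negative B2 indicates net attraction between two chains and a positive value indicates net repulsion. That sign is what makes B2 a convenient scalar label for an active-learning loop over sequences.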
{"title":"Sequence determinants of protein phase separation and recognition by protein phase-separated condensates through molecular dynamics and active learning","authors":"Arya Changiarath Sivadasan, Aayush Arya, Vasileios A. Xenidis, Jan Padeken, Lukas S. Stelzl","doi":"10.1039/d4fd00099d","DOIUrl":"https://doi.org/10.1039/d4fd00099d","url":null,"PeriodicalId":76,"journal":{"name":"Faraday Discussions","volume":"41 1","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141884834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Despite significant advances in nanopore sequencing and sensing of nucleic acids, protein detection remains challenging because of the complexity of intrinsic protein properties (e.g., net charge, polarity, molecular conformation and dimension) and of the measurement environment (e.g., biofluids). The result is unsatisfactory electrical signal resolution for protein detection, with poor accessibility, selectivity and sensitivity. An electroanalytical approach that delivers easily detectable and readable protein signals, suited to the practical application, is therefore strongly desired. Herein, a nanopore detection approach that couples a molecular sandwich with a DNAzyme catalytic reaction was designed. In particular, the approach exploits the easy-to-use Mg²⁺-catalyzed DNAzyme (10-23), which digests nucleic acids, for efficient detection of antigen proteins. The strategy begins with the formation of a molecular sandwich of capture antibody, antigen and detection antibody, which efficiently entraps the target protein (here, HIV p24 antigen as an example) and is immobilized on the surface of magnetic beads. The DNAzyme is then linked to the detection antibody via the biotin-streptavidin interaction. In the presence of Mg²⁺, the DNAzyme catalytic reaction is triggered to digest nucleic acid substrates and release unique cleavage fragments. These fragments act as reporters: in the nanopore, they provide easily detectable nucleic acid signals in place of complicated, hard-to-obtain protein signals. Notably, experimental validation confirms the stability and sensitivity of detection for the target antigen relative to other antigen proteins, and demonstrates detection in human serum at very low concentration (LoD ~1.24 pM). This DNAzyme-coupled nanopore electroanalytical approach represents an advance in protein detection and may benefit in vitro testing of protein biomarkers for disease diagnosis and prognosis assessment.
{"title":"Molecular sandwich-based DNAzyme catalytic reaction towards transducing efficient nanopore electrical detection for antigen proteins","authors":"Lebing Wang, Shou Zhou, Yunjiao Wang, Yan Wang, Jing Li, Xiaohan Chen, Daming Zhou, Liyuan Liang, Bohua Yin, Youwen Zhang, Liang Wang","doi":"10.1039/d4fd00146j","DOIUrl":"https://doi.org/10.1039/d4fd00146j","url":null,"PeriodicalId":76,"journal":{"name":"Faraday Discussions","volume":"95 1","pages":""},"PeriodicalIF":3.4,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141884833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}