Transferability of datasets between Machine-Learning Interaction Potentials

Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi
{"title":"机器学习交互潜力之间数据集的可转移性","authors":"Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi","doi":"arxiv-2409.05590","DOIUrl":null,"url":null,"abstract":"With the emergence of Foundational Machine Learning Interatomic Potential\n(FMLIP) models trained on extensive datasets, transferring data between\ndifferent ML architectures has become increasingly important. In this work, we\nexamine the extent to which training data optimised for one machine-learning\nforcefield algorithm may be re-used to train different models, aiming to\naccelerate FMLIP fine-tuning and to reduce the need for costly iterative\ntraining. As a test case, we train models of an organic liquid mixture that is\ncommonly used as a solvent in rechargeable battery electrolytes, making it an\nimportant target for reactive MLIP development. We assess model performance by\nanalysing the properties of molecular dynamics trajectories, showing that this\nis a more stringent test than comparing prediction errors for fixed datasets.\nWe consider several types of training data, and several popular MLIPs - notably\nthe recent MACE architecture, a message-passing neural network designed for\nhigh efficiency and smoothness. We demonstrate that simple training sets\nconstructed without any ab initio dynamics are sufficient to produce stable\nmodels of molecular liquids. For simple neural-network architectures, further\niterative training is required to capture thermodynamic and kinetic properties\ncorrectly, but MACE performs well with extremely limited datsets. We find that\nconfigurations designed by human intuition to correct systematic model\ndeficiencies transfer effectively between algorithms, but active-learned data\nthat are generated by one MLIP do not typically benefit a different algorithm.\nFinally, we show that any training data which improve model performance also\nimprove its ability to generalise to similar unseen molecules. This suggests\nthat trajectory failure modes are connected with chemical structure rather than\nbeing entirely system-specific.","PeriodicalId":501304,"journal":{"name":"arXiv - PHYS - Chemical Physics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Transferability of datasets between Machine-Learning Interaction Potentials\",\"authors\":\"Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi\",\"doi\":\"arxiv-2409.05590\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the emergence of Foundational Machine Learning Interatomic Potential\\n(FMLIP) models trained on extensive datasets, transferring data between\\ndifferent ML architectures has become increasingly important. In this work, we\\nexamine the extent to which training data optimised for one machine-learning\\nforcefield algorithm may be re-used to train different models, aiming to\\naccelerate FMLIP fine-tuning and to reduce the need for costly iterative\\ntraining. As a test case, we train models of an organic liquid mixture that is\\ncommonly used as a solvent in rechargeable battery electrolytes, making it an\\nimportant target for reactive MLIP development. 
We assess model performance by\\nanalysing the properties of molecular dynamics trajectories, showing that this\\nis a more stringent test than comparing prediction errors for fixed datasets.\\nWe consider several types of training data, and several popular MLIPs - notably\\nthe recent MACE architecture, a message-passing neural network designed for\\nhigh efficiency and smoothness. We demonstrate that simple training sets\\nconstructed without any ab initio dynamics are sufficient to produce stable\\nmodels of molecular liquids. For simple neural-network architectures, further\\niterative training is required to capture thermodynamic and kinetic properties\\ncorrectly, but MACE performs well with extremely limited datsets. We find that\\nconfigurations designed by human intuition to correct systematic model\\ndeficiencies transfer effectively between algorithms, but active-learned data\\nthat are generated by one MLIP do not typically benefit a different algorithm.\\nFinally, we show that any training data which improve model performance also\\nimprove its ability to generalise to similar unseen molecules. This suggests\\nthat trajectory failure modes are connected with chemical structure rather than\\nbeing entirely system-specific.\",\"PeriodicalId\":501304,\"journal\":{\"name\":\"arXiv - PHYS - Chemical Physics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Chemical Physics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05590\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Chemical Physics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05590","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

With the emergence of Foundational Machine Learning Interatomic Potential (FMLIP) models trained on extensive datasets, transferring data between different ML architectures has become increasingly important. In this work, we examine the extent to which training data optimised for one machine-learning forcefield algorithm may be re-used to train different models, aiming to accelerate FMLIP fine-tuning and to reduce the need for costly iterative training. As a test case, we train models of an organic liquid mixture that is commonly used as a solvent in rechargeable battery electrolytes, making it an important target for reactive MLIP development. We assess model performance by analysing the properties of molecular dynamics trajectories, showing that this is a more stringent test than comparing prediction errors for fixed datasets. We consider several types of training data, and several popular MLIPs, notably the recent MACE architecture, a message-passing neural network designed for high efficiency and smoothness. We demonstrate that simple training sets constructed without any ab initio dynamics are sufficient to produce stable models of molecular liquids. For simple neural-network architectures, further iterative training is required to capture thermodynamic and kinetic properties correctly, but MACE performs well with extremely limited datasets. We find that configurations designed by human intuition to correct systematic model deficiencies transfer effectively between algorithms, but active-learned data that are generated by one MLIP do not typically benefit a different algorithm. Finally, we show that any training data which improve model performance also improve its ability to generalise to similar unseen molecules. This suggests that trajectory failure modes are connected with chemical structure rather than being entirely system-specific.
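To make the trajectory-based evaluation concrete, the sketch below shows one way such a test could be set up with ASE and the published MACE calculator interface: run short molecular dynamics with the trained potential and inspect a structural observable such as a radial distribution function, rather than only energy and force errors on a fixed test set. This is an illustrative sketch, not the authors' code; the file names, model path, and simulation parameters are assumptions.

```python
# Illustrative sketch: trajectory-based evaluation of a trained MLIP.
# File names, model path, and MD parameters are assumptions, not taken
# from the paper.
from ase import units
from ase.io import read
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.geometry.analysis import Analysis
from mace.calculators import MACECalculator

atoms = read("electrolyte_liquid.xyz")  # hypothetical periodic solvent box
atoms.calc = MACECalculator(model_paths="mace_model.model", device="cuda")

# Thermalise and run 10 ps of Langevin dynamics at 300 K.
MaxwellBoltzmannDistribution(atoms, temperature_K=300.0)
dyn = Langevin(atoms, timestep=1.0 * units.fs,
               temperature_K=300.0, friction=0.01 / units.fs)

frames = []
dyn.attach(lambda: frames.append(atoms.copy()), interval=100)
dyn.run(10_000)

# A structural observable averaged over the sampled frames. A model with
# small errors on a fixed test set can still produce a visibly wrong RDF
# or an unstable trajectory, which is why this is the more stringent test.
rdfs = Analysis(frames).get_rdf(rmax=6.0, nbins=120)
```

In a workflow like the one the abstract describes, the resulting observables would be compared against an ab initio or experimental reference, so that trajectory failure modes show up even when fixed-dataset prediction errors look small.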
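The claim that simple training sets built without any ab initio dynamics suffice can likewise be illustrated. One common recipe, assumed here for illustration since the paper's exact protocol may differ, is to perturb relaxed molecular geometries with random displacements and send each resulting frame for a single-point ab initio calculation, avoiding expensive ab initio MD entirely:

```python
# Illustrative sketch: generating training candidates without ab initio MD.
# File names and displacement magnitudes are assumptions; each frame would
# subsequently be labelled with single-point DFT energies and forces.
import numpy as np
from ase.io import read, write

molecule = read("solvent_molecule.xyz")  # hypothetical relaxed geometry

rng = np.random.default_rng(0)
frames = []
for i in range(200):
    atoms = molecule.copy()
    # Vary the perturbation strength so the set spans near-equilibrium
    # and moderately distorted configurations.
    atoms.rattle(stdev=rng.uniform(0.02, 0.15), seed=i)
    frames.append(atoms)

write("training_candidates.xyz", frames)
```

Because configurations of this kind are generated independently of any particular MLIP, they are consistent with the paper's finding that human-designed data transfer well between architectures, in contrast to active-learned configurations harvested from one model's own trajectories.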