Transferability of datasets between Machine-Learning Interaction Potentials

Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi
{"title":"机器学习交互潜力之间数据集的可转移性","authors":"Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi","doi":"arxiv-2409.05590","DOIUrl":null,"url":null,"abstract":"With the emergence of Foundational Machine Learning Interatomic Potential\n(FMLIP) models trained on extensive datasets, transferring data between\ndifferent ML architectures has become increasingly important. In this work, we\nexamine the extent to which training data optimised for one machine-learning\nforcefield algorithm may be re-used to train different models, aiming to\naccelerate FMLIP fine-tuning and to reduce the need for costly iterative\ntraining. As a test case, we train models of an organic liquid mixture that is\ncommonly used as a solvent in rechargeable battery electrolytes, making it an\nimportant target for reactive MLIP development. We assess model performance by\nanalysing the properties of molecular dynamics trajectories, showing that this\nis a more stringent test than comparing prediction errors for fixed datasets.\nWe consider several types of training data, and several popular MLIPs - notably\nthe recent MACE architecture, a message-passing neural network designed for\nhigh efficiency and smoothness. We demonstrate that simple training sets\nconstructed without any ab initio dynamics are sufficient to produce stable\nmodels of molecular liquids. For simple neural-network architectures, further\niterative training is required to capture thermodynamic and kinetic properties\ncorrectly, but MACE performs well with extremely limited datsets. We find that\nconfigurations designed by human intuition to correct systematic model\ndeficiencies transfer effectively between algorithms, but active-learned data\nthat are generated by one MLIP do not typically benefit a different algorithm.\nFinally, we show that any training data which improve model performance also\nimprove its ability to generalise to similar unseen molecules. This suggests\nthat trajectory failure modes are connected with chemical structure rather than\nbeing entirely system-specific.","PeriodicalId":501304,"journal":{"name":"arXiv - PHYS - Chemical Physics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Transferability of datasets between Machine-Learning Interaction Potentials\",\"authors\":\"Samuel P. Niblett, Panagiotis Kourtis, Ioan-Bogdan Magdău, Clare P. Grey, Gábor Csányi\",\"doi\":\"arxiv-2409.05590\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the emergence of Foundational Machine Learning Interatomic Potential\\n(FMLIP) models trained on extensive datasets, transferring data between\\ndifferent ML architectures has become increasingly important. In this work, we\\nexamine the extent to which training data optimised for one machine-learning\\nforcefield algorithm may be re-used to train different models, aiming to\\naccelerate FMLIP fine-tuning and to reduce the need for costly iterative\\ntraining. As a test case, we train models of an organic liquid mixture that is\\ncommonly used as a solvent in rechargeable battery electrolytes, making it an\\nimportant target for reactive MLIP development. 
We assess model performance by\\nanalysing the properties of molecular dynamics trajectories, showing that this\\nis a more stringent test than comparing prediction errors for fixed datasets.\\nWe consider several types of training data, and several popular MLIPs - notably\\nthe recent MACE architecture, a message-passing neural network designed for\\nhigh efficiency and smoothness. We demonstrate that simple training sets\\nconstructed without any ab initio dynamics are sufficient to produce stable\\nmodels of molecular liquids. For simple neural-network architectures, further\\niterative training is required to capture thermodynamic and kinetic properties\\ncorrectly, but MACE performs well with extremely limited datsets. We find that\\nconfigurations designed by human intuition to correct systematic model\\ndeficiencies transfer effectively between algorithms, but active-learned data\\nthat are generated by one MLIP do not typically benefit a different algorithm.\\nFinally, we show that any training data which improve model performance also\\nimprove its ability to generalise to similar unseen molecules. This suggests\\nthat trajectory failure modes are connected with chemical structure rather than\\nbeing entirely system-specific.\",\"PeriodicalId\":501304,\"journal\":{\"name\":\"arXiv - PHYS - Chemical Physics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - PHYS - Chemical Physics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.05590\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Chemical Physics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05590","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

With the emergence of Foundational Machine Learning Interatomic Potential (FMLIP) models trained on extensive datasets, transferring data between different ML architectures has become increasingly important. In this work, we examine the extent to which training data optimised for one machine-learning forcefield algorithm may be re-used to train different models, aiming to accelerate FMLIP fine-tuning and to reduce the need for costly iterative training. As a test case, we train models of an organic liquid mixture that is commonly used as a solvent in rechargeable battery electrolytes, making it an important target for reactive MLIP development. We assess model performance by analysing the properties of molecular dynamics trajectories, showing that this is a more stringent test than comparing prediction errors for fixed datasets. We consider several types of training data, and several popular MLIPs, notably the recent MACE architecture, a message-passing neural network designed for high efficiency and smoothness. We demonstrate that simple training sets constructed without any ab initio dynamics are sufficient to produce stable models of molecular liquids. For simple neural-network architectures, further iterative training is required to capture thermodynamic and kinetic properties correctly, but MACE performs well with extremely limited datasets. We find that configurations designed by human intuition to correct systematic model deficiencies transfer effectively between algorithms, but active-learned data that are generated by one MLIP do not typically benefit a different algorithm. Finally, we show that any training data which improve model performance also improve its ability to generalise to similar unseen molecules. This suggests that trajectory failure modes are connected with chemical structure rather than being entirely system-specific.
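To make the trajectory-based evaluation concrete, the sketch below shows one way such a test could be set up with ASE and the published MACE calculator interface: run short molecular dynamics with the trained potential and inspect a structural observable such as a radial distribution function, rather than only energy and force errors on a fixed test set. This is an illustrative sketch, not the authors' code; the file names, model path, and simulation parameters are assumptions.

```python
# Illustrative sketch: trajectory-based evaluation of a trained MLIP.
# File names, model path, and MD parameters are assumptions, not taken
# from the paper.
from ase import units
from ase.io import read
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.geometry.analysis import Analysis
from mace.calculators import MACECalculator

atoms = read("electrolyte_liquid.xyz")  # hypothetical periodic solvent box
atoms.calc = MACECalculator(model_paths="mace_model.model", device="cuda")

# Thermalise and run 10 ps of Langevin dynamics at 300 K.
MaxwellBoltzmannDistribution(atoms, temperature_K=300.0)
dyn = Langevin(atoms, timestep=1.0 * units.fs,
               temperature_K=300.0, friction=0.01 / units.fs)

frames = []
dyn.attach(lambda: frames.append(atoms.copy()), interval=100)
dyn.run(10_000)

# A structural observable averaged over the sampled frames. A model with
# small errors on a fixed test set can still produce a visibly wrong RDF
# or an unstable trajectory, which is why this is the more stringent test.
rdfs = Analysis(frames).get_rdf(rmax=6.0, nbins=120)
```

In a workflow like the one the abstract describes, the resulting observables would be compared against an ab initio or experimental reference, so that trajectory failure modes show up even when fixed-dataset prediction errors look small.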
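The claim that simple training sets built without any ab initio dynamics suffice can likewise be illustrated. One common recipe, assumed here for illustration since the paper's exact protocol may differ, is to perturb relaxed molecular geometries with random displacements and send each resulting frame for a single-point ab initio calculation, avoiding expensive ab initio MD entirely:

```python
# Illustrative sketch: generating training candidates without ab initio MD.
# File names and displacement magnitudes are assumptions; each frame would
# subsequently be labelled with single-point DFT energies and forces.
import numpy as np
from ase.io import read, write

molecule = read("solvent_molecule.xyz")  # hypothetical relaxed geometry

rng = np.random.default_rng(0)
frames = []
for i in range(200):
    atoms = molecule.copy()
    # Vary the perturbation strength so the set spans near-equilibrium
    # and moderately distorted configurations.
    atoms.rattle(stdev=rng.uniform(0.02, 0.15), seed=i)
    frames.append(atoms)

write("training_candidates.xyz", frames)
```

Because configurations of this kind are generated independently of any particular MLIP, they are consistent with the paper's finding that human-designed data transfer well between architectures, in contrast to active-learned configurations harvested from one model's own trajectories.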