Improving machine-learning models in materials science through large datasets

Impact factor: 10 | Zone 2 (Materials Science) | JCR Q1 (Materials Science, Multidisciplinary) | Materials Today Physics | Pub Date: 2024-09-25 | DOI: 10.1016/j.mtphys.2024.101560
Jonathan Schmidt, Tiago F.T. Cerqueira, Aldo H. Romero, Antoine Loew, Fabian Jäger, Hai-Chen Wang, Silvana Botti, Miguel A.L. Marques
Materials Today Physics, volume 48, Article 101560. Full text: https://www.sciencedirect.com/science/article/pii/S2542529324002360
Citations: 0

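Abstract

The accuracy of a machine learning model is limited by the quality and quantity of the data available for training and validation. This problem is especially acute in materials science, which lacks large, high-quality, and consistent datasets. Here we present alexandria, an open database containing more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use these data to train machine learning models that reproduce seven different properties, using both composition-based models and crystal-graph neural networks. In most cases the model error decreases monotonically as the training data grows, although some graph networks appear to saturate at large training set sizes. Differences in training behavior can be correlated with the statistical distributions of the different properties. We also observe that graph networks, which have access to detailed geometrical information, generally yield more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimized with these force fields are of very high quality, but the accuracy of the energies is still insufficient. In addition, we observe some instabilities in regions of chemical space that are undersampled in the training sets of these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.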

Improving machine-learning models in materials science through large datasets
The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph networks, which have access to detailed geometrical information, generally yield more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are of very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.
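The abstract's claim that model error decreases monotonically with training data, with some graph networks saturating, is usually examined through a learning curve. The sketch below is only an illustration of that idea and is not taken from the paper: it assumes the error follows a power law, error ≈ a · N^(-b), which is a common modelling assumption rather than something the abstract states. The function names and all numbers are hypothetical, synthetic values chosen for the example.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of errors ~ a * sizes**(-b) in log-log space.

    Returns (a, b). Assumes all sizes and errors are positive.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def is_monotonically_decreasing(errors):
    """True if every error is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(errors, errors[1:]))


def local_exponents(sizes, errors):
    """Decay exponent between consecutive points; values near 0 hint at saturation."""
    pairs = list(zip(sizes, errors))
    return [
        -(math.log(e2) - math.log(e1)) / (math.log(n2) - math.log(n1))
        for (n1, e1), (n2, e2) in zip(pairs, pairs[1:])
    ]


# Synthetic learning curve (NOT data from the paper): exact power law with a=0.5, b=0.25.
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [0.5 * n ** -0.25 for n in sizes]

a, b = fit_power_law(sizes, errors)
print(f"a = {a:.3f}, b = {b:.3f}")
print("monotonic:", is_monotonically_decreasing(errors))

# Synthetic saturating curve (also hypothetical): the last step shows no improvement.
saturating = [0.10, 0.05, 0.04, 0.04]
print("local exponents:", local_exponents(sizes, saturating))
```

In this sketch, a fitted exponent b that stays roughly constant across sizes corresponds to steady improvement with more data, while local exponents that fall towards zero at the largest sizes correspond to the saturation the abstract mentions for some graph networks.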
Source journal: Materials Today Physics
Subject area: Materials Science, General Materials Science
CiteScore: 14.00
Self-citation rate: 7.80%
Articles published: 284
Review time: 15 days
Journal description: Materials Today Physics is a multi-disciplinary journal focused on the physics of materials, encompassing both physical properties and materials synthesis. Operating at the interface of physics and materials science, the journal covers one of the largest and most dynamic fields within physical science. Forefront research in materials physics is driving advancements in new materials, uncovering new physics, and fostering novel applications at an unprecedented pace.