Improving machine-learning models in materials science through large datasets

Materials Today Physics (IF 10; CAS Zone 2, Materials Science; JCR Q1, Materials Science, Multidisciplinary) | Pub Date: 2024-09-25 | DOI: 10.1016/j.mtphys.2024.101560

Jonathan Schmidt, Tiago F.T. Cerqueira, Aldo H. Romero, Antoine Loew, Fabian Jäger, Hai-Chen Wang, Silvana Botti, Miguel A.L. Marques

Materials Today Physics, volume 48, Article 101560 (Journal Article). https://www.sciencedirect.com/science/article/pii/S2542529324002360

Citations: 0

Abstract

The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use this data to train machine learning models to reproduce seven different properties using both composition-based models and crystal-graph neural networks. In the majority of cases, the error of the models decreases monotonically with the amount of training data, although some graph networks seem to saturate for large training set sizes. Differences in the training can be correlated with the statistical distribution of the different properties. We also observe that graph networks, which have access to detailed geometrical information, generally yield more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are of very high quality, but unfortunately the accuracy of the energies is still lacking. Furthermore, we observe some instabilities for regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.
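The statement that model error decreases monotonically with the amount of training data, with some networks saturating, is a statement about learning curves. The following is a minimal sketch of how such a curve could be examined, assuming a simple power-law form error ≈ a · N^(-b). The function names, the power-law form, and the sample numbers are illustrative assumptions; none of them come from the paper, and the numbers are placeholders rather than reported results.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(size).

    Returns (a, b) for the assumed model error ~ a * size**(-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope


def is_monotonically_decreasing(errors):
    """True if every error is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(errors, errors[1:]))


# Placeholder values for illustration only; NOT results from the paper.
sizes = [1_000, 10_000, 100_000, 1_000_000]
errors = [0.40, 0.20, 0.10, 0.05]

a, b = fit_power_law(sizes, errors)
print(f"prefactor a = {a:.3f}, exponent b = {b:.3f}")
print("monotonic decrease:", is_monotonically_decreasing(errors))
```

Under this assumed form, a curve that flattens at large N (a fitted exponent that shrinks when only the largest training sets are used) would be one sign of the saturation the abstract describes for some graph networks.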

Source journal

Materials Today Physics (Materials Science: General Materials Science)

CiteScore: 14.00
Self-citation rate: 7.80%
Articles published: 284
Review time: 15 days

About the journal: Materials Today Physics is a multi-disciplinary journal focused on the physics of materials, encompassing both physical properties and materials synthesis. Operating at the interface of physics and materials science, this journal covers one of the largest and most dynamic fields within physical science. Forefront research in materials physics is driving advancements in new materials, uncovering new physics, and fostering novel applications at an unprecedented pace.