Improving machine-learning models in materials science through large datasets
Jonathan Schmidt, Tiago F.T. Cerqueira, Aldo H. Romero, Antoine Loew, Fabian Jäger, Hai-Chen Wang, Silvana Botti, Miguel A.L. Marques
Materials Today Physics, published 2024-09-25
DOI: 10.1016/j.mtphys.2024.101560
URL: https://www.sciencedirect.com/science/article/pii/S2542529324002360
Abstract
The accuracy of a machine learning model is limited by the quality and quantity of the data available for its training and validation. This problem is particularly challenging in materials science, where large, high-quality, and consistent datasets are scarce. Here we present alexandria, an open database of more than 5 million density-functional theory calculations for periodic three-, two-, and one-dimensional compounds. We use these data to train machine learning models to reproduce seven different properties, using both composition-based models and crystal-graph neural networks. In the majority of cases, the model error decreases monotonically with the size of the training set, although some graph networks seem to saturate for large training sets. Differences in training behaviour can be correlated with the statistical distribution of the different properties. We also observe that graph networks, which have access to detailed geometrical information, generally yield more accurate models than simple composition-based methods. Finally, we assess several universal machine learning interatomic potentials. Crystal geometries optimised with these force fields are of very high quality, but the accuracy of the energies is still lacking. Furthermore, we observe some instabilities in regions of chemical space that are undersampled in the training sets used for these models. This study highlights the potential of large-scale, high-quality datasets to improve machine learning models in materials science.
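The abstract's statement that model error falls monotonically with training-set size, with some models saturating, can be checked on any learning curve with a few lines of code. The sketch below is illustrative only and is not the authors' analysis: it assumes the error roughly follows a power law, error ≈ a · N^(−b), which is a common working assumption that the abstract does not state. The function names and the data are hypothetical.

```python
import numpy as np

def fit_learning_curve(train_sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(N); returns (a, b).

    The power-law form is an assumption made for this sketch.
    """
    log_n = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    slope, intercept = np.polyfit(log_n, log_e, 1)
    return float(np.exp(intercept)), float(-slope)

def is_monotonically_decreasing(errors):
    """True if every error is strictly smaller than the previous one."""
    return bool(np.all(np.diff(np.asarray(errors, dtype=float)) < 0))

def local_exponents(train_sizes, errors):
    """Decay exponent between consecutive points; values near 0 suggest saturation."""
    log_n = np.log(np.asarray(train_sizes, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    return -np.diff(log_e) / np.diff(log_n)

# Synthetic data for illustration only (not taken from the paper).
sizes = np.array([1e3, 1e4, 1e5, 1e6])
errors = 2.0 * sizes ** -0.25
a, b = fit_learning_curve(sizes, errors)
exponents = local_exponents(sizes, errors)
```

For a curve that saturates, the values returned by `local_exponents` shrink towards zero at large training-set sizes, while a single global fit would hide that change.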
About the journal:
Materials Today Physics is a multi-disciplinary journal focused on the physics of materials, encompassing both the physical properties and materials synthesis. Operating at the interface of physics and materials science, this journal covers one of the largest and most dynamic fields within physical science. The forefront research in materials physics is driving advancements in new materials, uncovering new physics, and fostering novel applications at an unprecedented pace.