Regressive machine learning (ML) has emerged as an alternative method for theoretically evaluating materials properties. Most ML-based studies of materials properties, covering dataset handling, feature spaces, transformers and estimators, have been reported without questioning the ML foundations of these components or how they interact with the outcomes, because of the availability of homogeneous datasets, standardized training conditions and technical implementation challenges. In this paper, a database of defect impurities in 2D materials is used to assess the impact of the ML workflow's components on the training-inference process. By investigating descriptor engineering (vectorized matrix properties) and model algorithms (statistical tree-based models and artificial neural networks, ANNs) on two sub-datasets derived from the database (interstitial, "int", and adsorbate, "ads", impurities), we report a comprehensive study of the ML-based prediction of the formation energy of impurities in 2D materials. Quantitatively, for the statistical models, descriptor engineering yielded training errors lower than 1.4 eV and 1.1 eV for the int and ads datasets, respectively. For the ANN models, these values are 2.1 eV and 1.3 eV. For all models, the prediction errors on unseen data (test sets) were lower than those obtained without descriptor engineering. However, overfitting remains visible, although it is less pronounced for the ads dataset than for the int dataset. This finding reveals the impact of dataset characteristics on the performance of the ML ecosystem involving data engineering and model algorithms. Beyond the search for the best performance in regressive ML prediction of 2D materials properties, our work demonstrates a full-scale study of the ML process, from data engineering to model evaluation and selection, which makes it possible to benchmark criteria for further ML assessment in terms of training, models and prediction.
Our results could serve as a reference for further work on ML-led prediction in the physics of materials.
