{"title":"MatFold: systematic insights into materials discovery models' performance through standardized cross-validation protocols†","authors":"Matthew D. Witman and Peter Schindler","doi":"10.1039/D4DD00250D","DOIUrl":null,"url":null,"abstract":"<p >Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This can be particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among several benefits, this enables systematic insights into model generalizability, improvability, and uncertainty, provides benchmarks for fair comparison between competing models with access to differing quantities of data, and systematically reduces possible data leakage through increasingly strict splitting protocols. Performing thorough CV investigations across increasingly strict chemical/structural splitting criteria, local <em>vs.</em> global property prediction tasks, small <em>vs.</em> large datasets, and structure <em>vs.</em> compositional model architectures, some common threads are observed; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. For this we provide a general-purpose, featurization-agnostic toolkit, MatFold, to automate reproducible construction of these CV splits and encourage further community use in model benchmarking.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 625-635"},"PeriodicalIF":6.2000,"publicationDate":"2024-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00250d?page=search","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital discovery","FirstCategoryId":"1085","ListUrlMain":"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00250d","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
Abstract
Machine learning (ML) models in the materials sciences that are validated by overly simplistic cross-validation (CV) protocols can yield biased performance estimates for downstream modeling or materials screening tasks. This is particularly counterproductive for applications where the time and cost of failed validation efforts (experimental synthesis, characterization, and testing) are consequential. We propose a set of standardized and increasingly difficult splitting protocols for chemically and structurally motivated CV that can be followed to validate any ML model for materials discovery. Among other benefits, this enables systematic insights into model generalizability, improvability, and uncertainty; provides benchmarks for fair comparison between competing models with access to differing quantities of data; and systematically reduces possible data leakage through increasingly strict splitting protocols. Performing thorough CV investigations across increasingly strict chemical/structural splitting criteria, local vs. global property prediction tasks, small vs. large datasets, and structure- vs. composition-based model architectures, we observe some common threads; however, several marked differences exist across these exemplars, indicating the need for comprehensive analysis to fully understand each model's generalization accuracy and potential for materials discovery. To this end, we provide a general-purpose, featurization-agnostic toolkit, MatFold, to automate reproducible construction of these CV splits, and we encourage further community use in model benchmarking.
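To make the idea of a chemically motivated CV split concrete, the sketch below groups a toy dataset by the set of elements in each formula so that entire chemical systems are held out of training, which is the kind of leakage control the abstract describes. This is an illustrative assumption, not MatFold's actual API: the helper `element_set`, the toy formulas and targets, and the use of scikit-learn's GroupKFold are all stand-ins chosen for clarity.

```python
# Illustrative sketch (not MatFold's API): a chemistry-grouped CV split.
# Grouping by the element set of each formula approximates one of the
# stricter splitting criteria described above -- every test fold contains
# chemical systems never seen during training.
import re
from sklearn.model_selection import GroupKFold

def element_set(formula: str) -> str:
    """Canonical key of the elements in a formula, e.g. 'Fe2O3' -> 'Fe-O'."""
    elements = sorted(set(re.findall(r"[A-Z][a-z]?", formula)))
    return "-".join(elements)

# Hypothetical toy dataset: formulas and a target property (values are made up).
formulas = ["Fe2O3", "FeO", "TiO2", "SrTiO3", "NaCl", "KCl", "MgO", "CaO"]
targets = [0.31, 0.28, 3.05, 3.20, 5.10, 4.90, 7.80, 7.10]

groups = [element_set(f) for f in formulas]

# GroupKFold guarantees that all entries sharing a chemical system land in the
# same fold, so no train/test leakage occurs at the chosen grouping level.
gkf = GroupKFold(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(gkf.split(formulas, targets, groups=groups)):
    held_out = sorted({groups[i] for i in test_idx})
    print(f"fold {fold}: held-out chemical systems = {held_out}")
```

Stricter protocols of the kind proposed in the paper would tighten the grouping key further (for example, holding out whole element families or structure prototypes rather than individual chemical systems), which is what MatFold automates in a reproducible, featurization-agnostic way.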