A study of the role of data and model uncertainty in active learning

IF 3.3; Zone 3, Materials Science; Q2 MATERIALS SCIENCE, MULTIDISCIPLINARY; Computational Materials Science; Pub Date: 2025-01-31; Epub Date: 2024-11-07; DOI: 10.1016/j.commatsci.2024.113512

Yahao Li , Errui Jiang , Ziqi Ni , Wudi Li , Ming Huang , Fengyuan Zhao , Fengqi Liu , Yicong Ye , Shuxin Bai

Computational Materials Science, Volume 247, Article 113512. https://www.sciencedirect.com/science/article/pii/S092702562400733X

Citations: 0

Abstract

Uncertainty-based active learning strategies have demonstrated significant advantages in small-data research in the materials domain. This study explores the separate effects of model uncertainty and data uncertainty on the performance of active learning strategies, focusing specifically on the number of iterations required to identify the optimal samples. For model uncertainty, three acquisition functions are compared: the predicted value strategy (PV), the ranking of predicted value strategy (PR), and the expected improvement strategy (EI). Among these, the active learning model using PR requires the fewest iterations on average (1.75). For data uncertainty, we evaluate the number of active learning iterations using Gaussian process models that incorporate the uncertainty of the observations, and using noise samples that take into account the uncertainty of the input features. The results indicate that, when the uncertainty of the observations is considered in the model, the iteration counts of the three strategies converge to similar values at the optimal weighting (1.75 for EI, 1.21 for PV, and 1.18 for PR). In contrast, adding noise samples to the dataset alongside the original samples severely degrades the efficiency of active learning recommendations. Our findings aim to offer guidance for exploring more favorable acquisition functions and methods for active learning strategies.
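As an illustration of how two of the acquisition functions named in the abstract can rank candidates differently, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a maximization target, uses the standard closed-form expected improvement for a Gaussian predictive distribution, and takes the predictive means and standard deviations as given (in practice they would come from a surrogate such as a Gaussian process). All function names and the candidate values are hypothetical. The PR strategy is left out because the abstract does not define its ranking rule.

```python
import math


def normal_pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mean, std, best_so_far):
    """Closed-form EI for maximization under a Gaussian prediction."""
    if std <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    return (mean - best_so_far) * normal_cdf(z) + std * normal_pdf(z)


def select_by_predicted_value(means):
    """PV-style choice (assumed): index of the largest predicted mean."""
    return max(range(len(means)), key=lambda i: means[i])


def select_by_expected_improvement(means, stds, best_so_far):
    """EI choice: index of the candidate with the largest EI score."""
    scores = [expected_improvement(m, s, best_so_far)
              for m, s in zip(means, stds)]
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical candidate pool: predictive mean and standard deviation.
means = [1.0, 1.2, 0.9]
stds = [0.05, 0.05, 0.8]
best_so_far = 1.1  # best observed target value so far (made up)

pv_choice = select_by_predicted_value(means)
ei_choice = select_by_expected_improvement(means, stds, best_so_far)
print("PV picks candidate", pv_choice)
print("EI picks candidate", ei_choice)
```

With these made-up numbers, the PV rule picks the candidate with the highest predicted mean (index 1), while EI prefers the uncertain candidate (index 2), because its large predictive standard deviation leaves more room for improvement over the current best.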




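Source journal

Computational Materials Science (Engineering and Technology; Materials Science: Multidisciplinary)

CiteScore: 6.50

Self-citation rate: 6.10%

Articles published: 665

Review time: 26 days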

Journal description: The goal of Computational Materials Science is to report on results that provide new or unique insights into, or significantly expand our understanding of, the properties of materials or phenomena associated with their design, synthesis, processing, characterization, and utilization. To be relevant to the journal, the results should be applied or applicable to specific material systems that are discussed within the submission.