Active learning with biased non-response to label requests

Data Mining and Knowledge Discovery. IF 4.3; Region 3 (Computer Science); JCR Q2 (Computer Science, Artificial Intelligence). Pub Date: 2024-05-25. DOI: 10.1007/s10618-024-01026-x

Thomas S. Robinson, Niek Tax, Richard Mudd, Ido Guy



Abstract

Active learning can improve the efficiency of training prediction models by identifying the most informative new labels to acquire. However, non-response to label requests can reduce active learning's effectiveness in real-world contexts. We conceptualise this degradation by considering the type of non-response present in the data, demonstrating that biased non-response is particularly detrimental to model performance. We argue that biased non-response is likely in contexts where the labelling process, by nature, relies on user interactions. To mitigate the impact of biased non-response, we propose a cost-based correction to the sampling strategy, the Upper Confidence Bound of the Expected Utility (UCB-EU), that can plausibly be applied to any active learning algorithm. Through experiments, we demonstrate that our method successfully reduces the harm from labelling non-response in many settings. However, we also characterise settings where the non-response bias in the annotations remains detrimental under UCB-EU for specific sampling methods and data generating processes. Finally, we evaluate our method on a real-world dataset from an e-commerce platform. We show that UCB-EU yields substantial performance improvements to conversion models that are trained on clicked impressions. More generally, this research both clarifies the interplay between types of non-response and model improvements via active learning, and provides a practical, easy-to-implement correction that mitigates model degradation.
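The abstract does not give the UCB-EU formula, so the following is only an illustrative sketch of the general idea and not the authors' implementation. It assumes that the expected utility of a label request is the sampler's informativeness score multiplied by the probability that the request is answered, and that this probability is estimated per group of items with a standard upper-confidence-bound bonus. The grouping, the function names, and the exploration constant `c` are all hypothetical.

```python
import math

def ucb_response_rate(responses, requests, total_requests, c=1.0):
    """Optimistic estimate of the chance that a label request is answered.

    Assumption: a UCB1-style bonus on the observed response rate.
    """
    if requests == 0:
        return 1.0  # never queried: be fully optimistic
    mean = responses / requests
    bonus = c * math.sqrt(math.log(max(total_requests, 2)) / requests)
    return min(1.0, mean + bonus)

def ucb_expected_utility(informativeness, responses, requests, total_requests, c=1.0):
    """Informativeness discounted by the optimistic response probability."""
    return informativeness * ucb_response_rate(responses, requests, total_requests, c)

def select(candidates, stats, total_requests, c=1.0):
    """Pick the candidate with the highest corrected score.

    candidates: list of (item_id, group, informativeness)
    stats: hypothetical mapping group -> (responses, requests)
    """
    def score(candidate):
        _, group, informativeness = candidate
        responses, requests = stats.get(group, (0, 0))
        return ucb_expected_utility(informativeness, responses, requests, total_requests, c)
    return max(candidates, key=score)[0]

# Toy data: item "a" looks more informative, but its group rarely responds.
candidates = [("a", "g1", 0.9), ("b", "g2", 0.6)]
stats = {"g1": (1, 50), "g2": (45, 50)}

uncorrected_choice = max(candidates, key=lambda cand: cand[2])[0]
corrected_choice = select(candidates, stats, total_requests=100)
print(uncorrected_choice, corrected_choice)
```

In this toy setting the uncorrected sampler requests item "a", while the corrected score prefers item "b", whose group has responded far more often. Because the correction only rescales an existing acquisition score, it can be layered on top of different samplers, which fits the abstract's claim that the method can plausibly be applied to any active learning algorithm.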




Source journal

Data Mining and Knowledge Discovery (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 10.40

Self-citation rate: 4.20%

Review time: 10 months

Journal description: Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.