Binary quantification and dataset shift: an experimental investigation

IF 2.8 · Zone 3, Computer Science · JCR Q2, Computer Science, Artificial Intelligence · Data Mining and Knowledge Discovery · Pub Date: 2024-03-18 · DOI: 10.1007/s10618-024-01014-1

Pablo González, Alejandro Moreo, Fabrizio Sebastiani



Abstract

Quantification is the supervised learning task that consists of training predictors of the class prevalence values of sets of unlabelled data, and is of special interest when the labelled data on which the predictor has been trained and the unlabelled data are not IID, i.e., suffer from dataset shift. To date, quantification methods have mostly been tested only on a special case of dataset shift, i.e., prior probability shift; the relationship between quantification and other types of dataset shift remains, by and large, unexplored. In this work we carry out an experimental analysis of how current quantification algorithms behave under different types of dataset shift, in order to identify limitations of current approaches and hopefully pave the way for the development of more broadly applicable methods. We do this by proposing a fine-grained taxonomy of types of dataset shift, by establishing protocols for the generation of datasets affected by these types of shift, and by testing existing quantification methods on the datasets thus generated. One finding that results from this investigation is that many existing quantification methods that had been found robust to prior probability shift are not necessarily robust to other types of dataset shift. A second finding is that no existing quantification method seems to be robust enough to deal with all the types of dataset shift we simulate in our experiments. The code needed to reproduce all our experiments is publicly available at https://github.com/pglez82/quant_datasetshift.
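To make the task concrete, the following is a minimal sketch of binary quantification under prior probability shift, the one type of shift the abstract says has been studied most. It is not taken from the paper or its repository. The synthetic Gaussian data, the threshold classifier, and all function names are assumptions made for illustration. The two estimators shown, Classify and Count and Adjusted Classify and Count, are standard baselines from the quantification literature and may or may not match the methods the authors evaluate.

```python
import random


def sample(n, prevalence, rng):
    """Draw n labelled points (x, y).

    Assumption for illustration: P(x | y) is a fixed unit-variance Gaussian
    centred at +1 (positives) or -1 (negatives). Only P(y = 1) = prevalence
    changes between samples, which is what prior probability shift means.
    """
    data = []
    for _ in range(n):
        y = 1 if rng.random() < prevalence else 0
        x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
        data.append((x, y))
    return data


def classify(x):
    """Toy hard classifier (hypothetical): predict positive when x > 0."""
    return 1 if x > 0.0 else 0


def classify_and_count(xs):
    """Classify and Count: the fraction of items predicted positive."""
    return sum(classify(x) for x in xs) / len(xs)


def estimate_rates(labelled):
    """Estimate the classifier's true and false positive rates on labelled data."""
    pos = [x for x, y in labelled if y == 1]
    neg = [x for x, y in labelled if y == 0]
    tpr = sum(classify(x) for x in pos) / len(pos)
    fpr = sum(classify(x) for x in neg) / len(neg)
    return tpr, fpr


def adjusted_classify_and_count(xs, tpr, fpr):
    """Adjusted Classify and Count: correct the raw count using tpr and fpr.

    The correction assumes tpr and fpr are the same on the unlabelled data as
    on the labelled data, which holds under prior probability shift but need
    not hold under other types of shift.
    """
    cc = classify_and_count(xs)
    if tpr == fpr:  # degenerate classifier: no correction is possible
        return cc
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))


rng = random.Random(0)
train = sample(5000, 0.5, rng)   # labelled data, balanced classes
test = sample(5000, 0.9, rng)    # unlabelled data, shifted prevalence
tpr, fpr = estimate_rates(train)

test_xs = [x for x, _ in test]
true_prev = sum(y for _, y in test) / len(test)
cc_est = classify_and_count(test_xs)
acc_est = adjusted_classify_and_count(test_xs, tpr, fpr)

print(f"true prevalence: {true_prev:.3f}")
print(f"CC estimate:     {cc_est:.3f}")
print(f"ACC estimate:    {acc_est:.3f}")
```

In this toy setting the adjusted estimate tracks the true prevalence more closely than the raw count, because the shift leaves the class-conditional distributions untouched. The sketch says nothing about shifts that change those distributions, which are the cases the paper investigates.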


