Fighting selection bias in statistical learning: application to visual recognition from biased image databases
Authors: Stephan Clémençon, Pierre Laforgue, Robin Vogel
Journal: Journal of Nonparametric Statistics (impact factor 0.8; JCR Q3, Statistics & Probability)
Publication date: 2023-09-19
Publication type: Journal Article
DOI: 10.1080/10485252.2023.2259011 (https://doi.org/10.1080/10485252.2023.2259011)
Citation count: 0
Abstract
In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven performances on different population segments has highlighted the representativeness issues induced by a naive aggregation of the datasets. In this paper, we show how biasing models can remedy these problems. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition is that the supports of the biased distributions must partly overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low-dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach.

Keywords: sampling bias; selection effect; visual recognition; reliable statistical learning

Disclosure statement: No potential conflict of interest was reported by the author(s).

Funding: This work was partially supported by the research chair ‘Good In Tech: Rethinking innovation and technology as drivers of a better world for and by humans’, under the auspices of the ‘Fondation du Risque’ and in partnership with the Institut Mines-Télécom, Sciences Po, Afnor, Ag2r La Mondiale, CGI France, Danone and Sycomore.
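The reweighting idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a single biased sample and a bias function known up to a constant, and it uses plain self-normalized inverse weighting. The function name `debiased_mean`, the toy sample and the weights `omega` are hypothetical and chosen only for illustration.

```python
import numpy as np


def debiased_mean(values, bias_weights):
    """Self-normalized inverse-bias-weighted mean (illustrative sketch).

    values:       observations drawn from a biased distribution
    bias_weights: omega(x_i) > 0, the assumed-known relative chance that
                  x_i is selected under the biasing mechanism
    """
    values = np.asarray(values, dtype=float)
    omega = np.asarray(bias_weights, dtype=float)
    if values.shape != omega.shape:
        raise ValueError("values and bias_weights must have the same shape")
    if np.any(omega <= 0):
        # A zero weight would mean part of the target support is never observed.
        raise ValueError("bias weights must be strictly positive")
    weights = 1.0 / omega          # undo the over-representation
    weights /= weights.sum()       # normalize so the weights sum to one
    return float(np.dot(weights, values))


# Toy setting (assumption): the target is uniform on {0, 1}, but the value 1
# is selected three times as often as the value 0.
sample = np.array([0.0, 1.0, 1.0, 1.0])
omega = np.where(sample == 1.0, 3.0, 1.0)

naive = float(sample.mean())            # pulled toward the over-sampled value
debiased = debiased_mean(sample, omega)  # close to the target mean of 0.5

print(f"naive mean: {naive:.3f}, reweighted mean: {debiased:.3f}")
```

The strict positivity check mirrors, in this simplified setting, the support condition stated in the abstract: reweighting can only correct for regions of the target distribution that the biased data actually cover.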
About the journal:
Journal of Nonparametric Statistics provides a medium for the publication of research and survey work in nonparametric statistics and related areas. The scope includes, but is not limited to, the following topics:
Nonparametric modeling,
Nonparametric function estimation,
Rank and other robust and distribution-free procedures,
Resampling methods,
Lack-of-fit testing,
Multivariate analysis,
Inference with high-dimensional data,
Dimension reduction and variable selection,
Methods for errors in variables, missing, censored, and other incomplete data structures,
Inference of stochastic processes,
Sample surveys,
Time series analysis,
Longitudinal and functional data analysis,
Nonparametric Bayes methods and decision procedures,
Semiparametric models and procedures,
Statistical methods for imaging and tomography,
Statistical inverse problems,
Financial statistics and econometrics,
Bioinformatics and comparative genomics,
Statistical algorithms and machine learning.
Both the theory and applications of nonparametric statistics are covered in the journal. Research applying nonparametric methods to medicine, engineering, technology, science and humanities is welcomed, provided the novelty and quality level are of the highest order.
Authors are encouraged to submit supplementary technical arguments, computer code, data analysed in the paper or any additional information for online publication along with the published paper.