Double Trouble? Impact and Detection of Duplicates in Face Image Datasets

Torsten Schlett, C. Rathgeb, Juan E. Tapia, Christoph Busch
{"title":"Double Trouble? Impact and Detection of Duplicates in Face Image Datasets","authors":"Torsten Schlett, C. Rathgeb, Juan E. Tapia, Christoph Busch","doi":"10.5220/0012422500003654","DOIUrl":null,"url":null,"abstract":"Various face image datasets intended for facial biometrics research were created via web-scraping, i.e. the collection of images publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The approach is extended through the use of face image preprocessing. Additional steps based on face recognition and face image quality assessment models reduce false positives, and facilitate the deduplication of the face images both for intra- and inter-subject duplicate sets. The presented approach is applied to five datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset, with hundreds to hundreds of thousands of duplicates for all except LFW. Face recognition and quality assessment experiments indicate a minor impact on the results through the duplicate removal. The final deduplication data is publicly available.","PeriodicalId":410036,"journal":{"name":"International Conference on Pattern Recognition Applications and Methods","volume":"68 12","pages":"801-808"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Pattern Recognition Applications and Methods","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5220/0012422500003654","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Various face image datasets intended for facial biometrics research were created via web-scraping, i.e. the collection of images publicly available on the internet. This work presents an approach to detect both exactly and nearly identical face image duplicates, using file and image hashes. The approach is extended through the use of face image preprocessing. Additional steps based on face recognition and face image quality assessment models reduce false positives, and facilitate the deduplication of the face images both for intra- and inter-subject duplicate sets. The presented approach is applied to five datasets, namely LFW, TinyFace, Adience, CASIA-WebFace, and C-MS-Celeb (a cleaned MS-Celeb-1M variant). Duplicates are detected within every dataset, with hundreds to hundreds of thousands of duplicates for all except LFW. Face recognition and quality assessment experiments indicate a minor impact on the results through the duplicate removal. The final deduplication data is publicly available.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
双重麻烦?人脸图像数据集中重复图像的影响与检测
用于人脸生物识别研究的各种人脸图像数据集是通过网络抓取(即收集互联网上公开的图像)创建的。这项研究提出了一种利用文件和图像哈希值检测完全相同和几乎完全相同的重复人脸图像的方法。该方法通过使用人脸图像预处理进行扩展。基于人脸识别和人脸图像质量评估模型的附加步骤可减少误报,并有助于重复删除主体内和主体间重复集的人脸图像。所介绍的方法适用于五个数据集,即 LFW、TinyFace、Adience、CASIA-WebFace 和 C-MS-Celeb(经过清理的 MS-Celeb-1M 变体)。每个数据集中都能检测到重复数据,除 LFW 外,其他数据集中都有数百至数十万个重复数据。人脸识别和质量评估实验表明,重复删除对结果的影响很小。重复数据删除的最终结果已公开。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
PatchSVD: A Non-Uniform SVD-Based Image Compression Algorithm On Spectrogram Analysis in a Multiple Classifier Fusion Framework for Power Grid Classification Using Electric Network Frequency Semantic Properties of cosine based bias scores for word embeddings Double Trouble? Impact and Detection of Duplicates in Face Image Datasets Detecting Brain Tumors through Multimodal Neural Networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1