IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou
arXiv - CS - Multimedia, 2024-08-03, arXiv:2408.01690
Citations: 0
Abstract
Effective fraud detection and analysis of government-issued identity
documents, such as passports, driver's licenses, and identity cards, are
essential in thwarting identity theft and bolstering security on online
platforms. The training of accurate fraud detection and analysis tools depends
on the availability of extensive identity document datasets. However, current
publicly available benchmark datasets for identity document analysis, including
MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a
limited number of samples, cover too few varieties of fraud patterns, and
seldom include alterations to critical personally identifying fields such as
portrait images, which limits their utility for training models that detect
realistic fraud while preserving privacy. In response to these shortcomings,
we introduce a new benchmark dataset, IDNet, designed to advance
privacy-preserving fraud detection.
The IDNet dataset comprises 837,060 images of synthetically generated identity
documents, totaling approximately 490 gigabytes, categorized into 20 types from
10 U.S. states and 10 European countries. We evaluate the utility of the
dataset and present use cases, illustrating how it can help train
privacy-preserving fraud detection methods, generate camera and video
captures of identity documents, and test schema unification and other
identity document management functionalities.