The detection of duplicates in document image databases
D. Doermann, Huiping Li, O. Kia
Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997-08-18
DOI: 10.1109/ICDAR.1997.619863 (https://doi.org/10.1109/ICDAR.1997.619863)
Citations: 67
Abstract
We propose and implement a method for detecting duplicate documents in very large image databases. The method is based on a robust "signature" extracted from each document image, which is used to index into a table of previously processed documents. The approach has a number of advantages over OCR and other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and test its scalability, we have developed a simulator that allows us to change system parameters and examine performance for millions of document signatures. A complete system is implemented and evaluated on a test collection of technical articles and memos.
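
The signature-indexed table mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the signature is computed, so `toy_signature` below is a purely hypothetical stand-in (coarsely quantised per-line character counts), and `SignatureIndex` and `check_and_add` are illustrative names, not the authors' implementation. The sketch shows only the general idea: a new document is flagged as a possible duplicate when its signature already appears in the table.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Hashable


class SignatureIndex:
    """Table of previously processed documents, keyed by signature."""

    def __init__(self) -> None:
        # signature -> IDs of documents already seen with that signature
        self._table = defaultdict(list)

    def check_and_add(self, doc_id: str, signature: Hashable) -> list[str]:
        """Return IDs of earlier documents sharing this signature, then record doc_id."""
        matches = list(self._table[signature])
        self._table[signature].append(doc_id)
        return matches


def toy_signature(line_lengths: list[int], bucket: int = 4) -> tuple[int, ...]:
    """Hypothetical signature: per-line character counts quantised into buckets.

    Quantisation gives a little tolerance to small measurement differences;
    it is an assumption for illustration, not the paper's signature.
    """
    return tuple(n // bucket for n in line_lengths)


if __name__ == "__main__":
    index = SignatureIndex()
    print(index.check_and_add("memo-001", toy_signature([40, 41, 38])))
    # A rescan with slightly different measured line lengths
    print(index.check_and_add("memo-002", toy_signature([41, 42, 39])))
```

Because the lookup is a single hash-table access per document, the cost of checking a new document does not grow with the number of documents already indexed, which fits the abstract's emphasis on speed. Values that straddle a bucket boundary would still be missed by this toy quantisation.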