The detection of duplicates in document image databases

Proceedings of the Fourth International Conference on Document Analysis and Recognition. Pub Date: 1997-08-18. DOI: 10.1109/ICDAR.1997.619863

D. Doermann, Huiping Li, O. Kia

Citations: 67

Abstract

We propose and implement a method for detecting duplicate documents in very large image databases. The method is based on a robust "signature" extracted from each document image, which is used to index into a table of previously processed documents. The approach has a number of advantages over OCR or other recognition-based methods, including speed and robustness to imaging distortions. To justify the approach and test its scalability, we have developed a simulator that allows us to change system parameters and examine performance for millions of document signatures. A complete system is implemented and tested on a collection of technical articles and memos.
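
The abstract does not say how the signature is computed, so the following is only a minimal sketch of the general idea of using a signature as a key into a table of previously processed documents. The `toy_signature` function (quantized ink density in horizontal bands of a binary image), the `SignatureIndex` class, and the parameters `bands` and `levels` are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Image = Sequence[Sequence[int]]  # binary image: 1 = ink, 0 = background
Signature = Tuple[int, ...]


def toy_signature(image: Image, bands: int = 4, levels: int = 4) -> Signature:
    """Hypothetical signature: quantized ink density per horizontal band.

    This is NOT the paper's signature; it only stands in for a compact key
    that tolerates small pixel-level differences between two scans.
    """
    rows = len(image)
    sig = []
    for b in range(bands):
        band = image[b * rows // bands:(b + 1) * rows // bands]
        total = sum(len(row) for row in band)
        ink = sum(sum(row) for row in band)
        density = ink / total if total else 0.0
        # Coarse quantization: nearby densities fall into the same level.
        sig.append(min(int(density * levels), levels - 1))
    return tuple(sig)


class SignatureIndex:
    """Table mapping each signature to the documents already seen with it."""

    def __init__(self) -> None:
        self._table: Dict[Signature, List[str]] = defaultdict(list)

    def add(self, doc_id: str, image: Image) -> List[str]:
        """Store the document and return earlier ids sharing its signature."""
        sig = toy_signature(image)
        candidates = list(self._table[sig])
        self._table[sig].append(doc_id)
        return candidates
```

In this sketch, adding a page and then a copy of it with one extra ink pixel returns the first page's id as a duplicate candidate, because both images quantize to the same key. A real system would need a signature designed for the distortions it expects, and a way to handle near-misses at quantization boundaries.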
