Building a test collection for complex document information processing

Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

Pub Date: 2006-08-06

DOI: 10.1145/1148170.1148307

D. Lewis, G. Agam, S. Argamon, O. Frieder, D. Grossman, J. Heard

Citations: 230

Abstract

Research and development of information access technology for scanned paper documents has been hampered by the lack of public test collections of realistic scope and complexity. As part of a project to create a prototype system for search and mining of masses of document images, we are assembling a 1.5 terabyte dataset to support evaluation of both end-to-end complex document information processing (CDIP) tasks (e.g., text retrieval and data mining) and component technologies such as optical character recognition (OCR), document structure analysis, signature matching, and authorship attribution.
