Abstract: Digitization and Search: A Non-Traditional Use of HPC

2012 SC Companion: High Performance Computing, Networking Storage and Analysis. Pub Date: 2012-12-01. DOI: 10.1109/SC.Companion.2012.259

Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry

Citations: 1

Abstract

We describe our efforts to provide automated search of handwritten content in digitized document archives. To carry out the search we use a computer vision technique called word spotting. A form of content-based image retrieval, word spotting avoids the still-difficult task of directly recognizing text: the user searches with a query image containing handwritten text, and the system ranks a database of images by how similar their content looks to the query. Making this search capability available on an archive requires three computationally expensive pre-processing steps. We augment this automated portion of the process with a passive crowdsourcing element that mines queries from the system's users to improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information.
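The query-by-image ranking described above can be illustrated with a minimal sketch. The abstract does not say which image features or distance measure the authors use, nor does it name the three pre-processing steps, so everything below is an assumption made for illustration: the column ink profile feature, the Euclidean distance, and the function names `column_profile`, `build_index`, and `rank` are all hypothetical. The only point it demonstrates is the general shape of word spotting: features are computed for the archive ahead of time, and a query image is then compared against them to produce a ranking, with no text recognition involved.

```python
from typing import Dict, List, Sequence, Tuple

# A binarized word image: rows of pixels, 1 = ink, 0 = background.
Image = Sequence[Sequence[int]]


def column_profile(image: Image, bins: int = 8) -> List[float]:
    """Assumed feature: fraction of ink pixels in each of `bins` vertical slices."""
    height = len(image)
    width = len(image[0])
    profile = []
    for b in range(bins):
        start = b * width // bins
        # Guarantee each slice covers at least one column.
        end = min(width, max(start + 1, (b + 1) * width // bins))
        ink = sum(image[r][c] for r in range(height) for c in range(start, end))
        profile.append(ink / (height * (end - start)))
    return profile


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def build_index(archive: Dict[str, Image], bins: int = 8) -> Dict[str, List[float]]:
    """Compute features for every archive image once, ahead of any query."""
    return {name: column_profile(img, bins) for name, img in archive.items()}


def rank(query: Image, index: Dict[str, List[float]], bins: int = 8) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, most similar archive image first."""
    q = column_profile(query, bins)
    scored = [(name, distance(q, feat)) for name, feat in index.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Toy archive of 2x4 "word images" (hypothetical data).
archive = {
    "left_heavy": [[1, 1, 0, 0], [1, 1, 0, 0]],
    "right_heavy": [[0, 0, 1, 1], [0, 0, 1, 1]],
    "blank": [[0, 0, 0, 0], [0, 0, 0, 0]],
}
index = build_index(archive, bins=2)
query = [[1, 1, 0, 0], [1, 0, 0, 0]]
results = rank(query, index, bins=2)
for name, dist in results:
    print(f"{name}: {dist:.3f}")
```

Separating `build_index` from `rank` reflects why the expensive work can be done once per archive rather than once per query; a real system would need far richer features and an index that scales to billions of entries.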
