A framework to access handwritten information within large digitized paper collections

Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry
{"title":"A framework to access handwritten information within large digitized paper collections","authors":"Liana Diesendruck, Luigi Marini, R. Kooper, M. Kejriwal, Kenton McHenry","doi":"10.1109/eScience.2012.6404434","DOIUrl":null,"url":null,"abstract":"We describe our efforts with the National Archives and Records Administration (NARA) to provide a form of automated search of handwritten content within large digitized document archives. With a growing push towards the digitization of paper archives there is an imminent need to develop tools capable of searching the resulting unstructured image data as data from such collections offer valuable historical records that can be mined for information pertinent to a number of fields from the geosciences to the humanities. To carry out the search, we use a Computer Vision technique called Word Spotting. A form of content based image retrieval, it avoids the still difficult task of directly recognizing the text by allowing a user to search using a query image containing handwritten text and ranking a database of images in terms of those that contain more similar looking content. In order to make this search capability available on an archive, three computationally expensive pre-processing steps are required. We describe these steps, the open source framework we have developed, and how it can be used not only on the recently released 1940 Census data containing nearly 4 million high resolution scanned forms, but also on other collections of forms. With a growing demand to digitize our wealth of paper archives we see this type of automated search as a low cost scalable alternative to the costly manual transcription that would otherwise be required.","PeriodicalId":6364,"journal":{"name":"2012 IEEE 8th International Conference on E-Science","volume":"20 1","pages":"1-10"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 8th International Conference on E-Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/eScience.2012.6404434","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

We describe our efforts with the National Archives and Records Administration (NARA) to provide a form of automated search of handwritten content within large digitized document archives. With a growing push towards the digitization of paper archives, there is an imminent need to develop tools capable of searching the resulting unstructured image data, as such collections offer valuable historical records that can be mined for information pertinent to a number of fields, from the geosciences to the humanities. To carry out the search, we use a Computer Vision technique called Word Spotting. A form of content-based image retrieval, it avoids the still difficult task of directly recognizing the text by allowing a user to search with a query image containing handwritten text and ranking a database of images in terms of those that contain the most similar-looking content. In order to make this search capability available on an archive, three computationally expensive pre-processing steps are required. We describe these steps, the open source framework we have developed, and how it can be used not only on the recently released 1940 Census data, containing nearly 4 million high-resolution scanned forms, but also on other collections of forms. With a growing demand to digitize our wealth of paper archives, we see this type of automated search as a low-cost, scalable alternative to the costly manual transcription that would otherwise be required.
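The abstract frames Word Spotting as content-based image retrieval: a query image of a handwritten word is compared against a database of word images, which are then ranked by visual similarity rather than by recognized text. The following minimal sketch illustrates only that ranking idea; the column-wise ink-projection features, fixed-width resampling, and Euclidean distance used here are illustrative assumptions and not the actual descriptors, matching method, or pre-processing steps of the framework described in the paper.

```python
import numpy as np

def profile_features(word_img, width=128):
    """Illustrative descriptor (assumption, not the paper's method):
    resample a binarized word image to a fixed number of columns and
    take the column-wise fraction of ink pixels (ink=1, background=0)."""
    h, w = word_img.shape
    cols = np.linspace(0, w - 1, width).astype(int)  # nearest-neighbor column resampling
    resampled = word_img[:, cols]
    return resampled.mean(axis=0)  # one value per column

def rank_by_similarity(query_img, database_imgs):
    """Rank database word images by Euclidean distance between their
    profile features and the query's; smaller distance ranks higher."""
    q = profile_features(query_img)
    dists = [np.linalg.norm(profile_features(img) - q) for img in database_imgs]
    return np.argsort(dists)  # indices of database images, best match first

if __name__ == "__main__":
    # Toy stand-ins for segmented word images (random binary arrays),
    # purely to show the ranking call; real input would be scanned form cells.
    rng = np.random.default_rng(0)
    query = (rng.random((40, 90)) > 0.7).astype(float)
    db = [(rng.random((40, int(rng.integers(60, 120)))) > 0.7).astype(float)
          for _ in range(5)]
    print("Ranking (best match first):", rank_by_similarity(query, db))
```

In a real deployment the expensive work is in the pre-processing (segmenting forms into word cells and extracting features for every cell ahead of time) so that, at query time, only the feature comparison and ranking step shown above needs to run.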