Digitizing and parsing semi-structured historical administrative documents from the G.I. Bill mortgage guarantee program

Journal of Documentation | IF 2.0 | Zone 3, Management | JCR Q2, Information Science & Library Science | Pub Date: 2023-07-31 | DOI: 10.1108/jd-03-2023-0055

Sara Lafia, David A. Bleckley, J. T. Alexander

Abstract

Purpose
Many libraries and archives maintain collections of research documents, such as administrative records, with paper-based formats that limit the documents' access to in-person use. Digitization transforms paper-based collections into more accessible and analyzable formats. As collections are digitized, there is an opportunity to incorporate deep learning techniques, such as Document Image Analysis (DIA), into workflows to increase the usability of information extracted from archival documents. This paper describes the authors' approach using digital scanning, optical character recognition (OCR) and deep learning to create a digital archive of administrative records related to the mortgage guarantee program of the Servicemen's Readjustment Act of 1944, also known as the G.I. Bill.

Design/methodology/approach
The authors used a collection of 25,744 semi-structured paper-based records from the administration of G.I. Bill Mortgages from 1946 to 1954 to develop a digitization and processing workflow. These records include the name and city of the mortgagor, the amount of the mortgage, the location of the Reconstruction Finance Corporation agent, one or more identification numbers and the name and location of the bank handling the loan. The authors extracted structured information from these scanned historical records in order to create a tabular data file and link them to other authoritative individual-level data sources.

Findings
The authors compared the flexible character accuracy of five OCR methods. The authors then compared the character error rate (CER) of three text extraction approaches (regular expressions, DIA and named entity recognition (NER)). The authors were able to obtain the highest quality structured text output using DIA with the Layout Parser toolkit by post-processing with regular expressions. Through this project, the authors demonstrate how DIA can improve the digitization of administrative records to automatically produce a structured data resource for researchers and the public.

Originality/value
The authors' workflow is readily transferable to other archival digitization projects. Through the use of digital scanning, OCR and DIA processes, the authors created the first digital microdata file of administrative records related to the G.I. Bill mortgage guarantee program available to researchers and the general public. These records offer research insights into the lives of veterans who benefited from loans, the impacts on the communities built by the loans and the institutions that implemented them.
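The abstract does not say how the character error rate was computed. As a minimal sketch, assuming the common definition (the character-level edit distance between extracted text and a ground-truth transcription, divided by the length of the ground truth), the metric could be calculated as follows. The function names and the sample strings are hypothetical illustrations, not the authors' code or data.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from the empty prefix of ref
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example: two OCR confusions (O read as 0, I read as l).
    print(cer("JOHN SMITH", "J0HN SMlTH"))
```

Under this definition a lower value is better, and a value above 1.0 is possible when the extracted text is much longer than the reference.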

Source journal

Journal of Documentation (Information Science & Library Science)

CiteScore: 4.20

Self-citation rate: 14.30%

About the journal: The scope of the Journal of Documentation is broadly information sciences, encompassing all of the academic and professional disciplines which deal with recorded information. These include, but are certainly not limited to:
■ Information science, librarianship and related disciplines
■ Information and knowledge management
■ Information and knowledge organisation
■ Information seeking and retrieval, and human information behaviour
■ Information and digital literacies