Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

Impact factor: 2.1 · Zone 3 (Computer Science) · JCR Q3 (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS)
ACM Journal on Computing and Cultural Heritage · Pub Date: 2023-06-30 · DOI: 10.1145/3606705
Mariana Dias, C. Lopes
Citation count: 0

Abstract

Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize Levenshtein edit distance and maximize the number of words correctly identified with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods’ parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays’ covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance.
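
The two objectives named in the abstract can be illustrated with a minimal sketch in Python. It is not the authors' implementation. It assumes that "words correctly identified" means a bag-of-words overlap between the OCR output and a ground-truth transcription (the abstract does not give the exact definition), and it shows only the non-dominated comparison that NSGA-II relies on, not the full genetic algorithm. The function names levenshtein, correct_words, dominates, and pareto_front are hypothetical.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_words(ocr_text: str, truth: str) -> int:
    """Assumed metric: words shared by OCR output and ground truth, with multiplicity."""
    overlap = Counter(ocr_text.split()) & Counter(truth.split())
    return sum(overlap.values())


def dominates(p, q) -> bool:
    """p and q are (edit_distance, correct_words) pairs.

    p dominates q if it is no worse on both objectives (lower distance,
    higher word count) and strictly better on at least one.
    """
    no_worse = p[0] <= q[0] and p[1] >= q[1]
    strictly_better = p[0] < q[0] or p[1] > q[1]
    return no_worse and strictly_better


def pareto_front(scores):
    """Keep the score pairs that no other pair dominates."""
    return [p for p in scores if not any(dominates(q, p) for q in scores)]


if __name__ == "__main__":
    truth = "the quick brown fox"
    # Hypothetical OCR outputs from three pre-processing parameter settings.
    outputs = ["the qu1ck brown fox", "tne quick brovvn f0x", "the quick brown fox"]
    scores = [(levenshtein(o, truth), correct_words(o, truth)) for o in outputs]
    print(scores)
    print(pareto_front(scores))
```

In a tuning loop, each candidate parameter setting for a pre-processing method would be scored this way against the ground truth, and only the non-dominated settings would be carried forward.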
Source journal
ACM Journal on Computing and Cultural Heritage (Arts and Humanities-Conservation)
CiteScore: 4.60
Self-citation rate: 8.30%
Articles published: 90
About the journal: ACM Journal on Computing and Cultural Heritage (JOCCH) publishes papers of significant and lasting value in all areas relating to the use of information and communication technologies (ICT) in support of Cultural Heritage. The journal encourages the submission of manuscripts that demonstrate innovative use of technology for the discovery, analysis, interpretation and presentation of cultural material, as well as manuscripts that illustrate applications in the Cultural Heritage sector that challenge the computational technologies and suggest new research opportunities in computer science.
Latest articles from the journal
Heritage Iconographic Content Structuring: from Automatic Linking to Visual Validation
Digitising the Deep Past: Machine Learning for Rock Art Motif Classification in an Educational Citizen Science Application
Interpretable Clusters for Representing Citizens’ Sense of Belonging through Interaction with Cultural Heritage
Classification of Impressionist and Pointillist paintings based on their brushstrokes characteristics
ZoAM GameBot: a Journey to the Lost Computational World in the Amazonia