OCR with Tesseract, Amazon Textract, and Google Document AI: a benchmarking experiment

Journal of Computational Social Science (IF 2.0; JCR Q2, Social Sciences, Mathematical Methods), pp. 861-882. Pub Date: 2021-06-24. DOI: 10.31235/osf.io/6zfvs
Thomas Hegghammer
Citations: 26

Abstract

Optical Character Recognition (OCR) can open up understudied historical documents to computational analysis, but the accuracy of OCR software varies. This article reports a benchmarking experiment comparing the performance of Tesseract, Amazon Textract, and Google Document AI on images of English and Arabic text. English-language book scans (n = 322) and Arabic-language article scans (n = 100) were replicated 43 times with different types of artificial noise for a corpus of 18,568 documents, generating 51,304 process requests. Document AI delivered the best results, and the server-based processors (Textract and Document AI) performed substantially better than Tesseract, especially on noisy documents. Accuracy for English was considerably higher than for Arabic. Specifying the relative performance of three leading OCR products and the differential effects of commonly found noise types can help scholars identify better OCR solutions for their research needs. The test materials have been preserved in the openly available “Noisy OCR Dataset” (NOD) for reuse in future benchmarking studies.
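
The corpus figures in the abstract can be checked with a little arithmetic. The sketch below rests on two assumptions that the abstract does not state: that each scan exists in 44 versions (the original plus its 43 noisy replications), and that English documents were sent to all three engines while Arabic documents were sent to only two. These are simply the assumptions under which both reported totals come out exactly; all names in the code are illustrative.

```python
# Sanity check of the corpus arithmetic reported in the abstract.
# Scan counts and the number of replications come from the abstract.
SCANS = {"english": 322, "arabic": 100}
NOISY_REPLICATIONS = 43

# Assumption: the corpus holds the original scan plus its 43 noisy copies.
VERSIONS_PER_SCAN = 1 + NOISY_REPLICATIONS

# Assumption (not stated in the abstract): English was processed by all
# three engines, Arabic by only two of them.
ENGINES_PER_LANGUAGE = {"english": 3, "arabic": 2}


def corpus_size() -> int:
    """Total number of document images across both languages."""
    return sum(n * VERSIONS_PER_SCAN for n in SCANS.values())


def total_requests() -> int:
    """Total OCR process requests, one per document per engine."""
    return sum(
        n * VERSIONS_PER_SCAN * ENGINES_PER_LANGUAGE[lang]
        for lang, n in SCANS.items()
    )


if __name__ == "__main__":
    print(corpus_size())     # 18568, matching the reported corpus
    print(total_requests())  # 51304, matching the reported requests
```

Under these assumptions, (322 + 100) × 44 = 18,568 documents, and 14,168 × 3 + 4,400 × 2 = 51,304 requests.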
Source journal: Journal of Computational Social Science (Social Sciences, Mathematical Methods). CiteScore: 6.20. Self-citation rate: 6.20%. Articles published: 30.