Automated compilation of Urdu poetry handwritten image datasets for optical character recognition

MethodsX (IF 1.6, JCR Q2, Multidisciplinary Sciences) · Pub Date: 2024-12-21 · DOI: 10.1016/j.mex.2024.103130
Irtaza Ijaz, Abdallah Namoun, Nasser Aljohani, Meshari Huwaytim Alanazi, Mohammad N. Alanazi, Junaid Shuja, Mohammad Ali Humayun
MethodsX, volume 14, Article 103130. Full text: https://www.sciencedirect.com/science/article/pii/S2215016124005818 (PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11743332/pdf/)

Abstract

Optical character recognition (OCR) is vital for converting printed material into a digital format that can be conveniently reused. A significant amount of work has been done on OCR for well-resourced languages like English. However, Urdu, although spoken by a large community, remains poorly served by OCR because of scarce resources and the complexity and diversity of its handwritten script. A major hindrance to developing OCR for low-resource languages like Urdu is the lack of extensive datasets. Such datasets can, however, be built from old handwritten books whose reference text is available online. This study presents a method that exploits this resource by automatically processing handwritten Urdu poetry books whose corresponding text is available online. The page images are segmented at the sentence level using automated neighborhood-connected component analysis, followed by manual adjustment. The corresponding Unicode text for each image is obtained by web scraping followed by text similarity analysis. The resulting sample dataset comprises 2888 purely handwritten Urdu text images, each with a Unicode transcription, drawn from the poetry of Mirza Ghalib and Allama Iqbal, arguably the two most influential poets in Urdu.
  • The method automates OCR dataset creation by segmenting handwritten text images and scraping corresponding text from the web for alignment.
  • Handwritten images are segmented into sentences using a resource-efficient neighborhood-connected component analysis approach.
  • Possible text samples are scraped from the web, and the corresponding labels are aligned with images based on the minimum edit distance between the scraped text and the predictions by an OCR engine.
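
The alignment step in the last highlight can be illustrated with a minimal sketch. It assumes that each segmented image already has a (possibly noisy) OCR prediction and that the scraped web text has been split into candidate lines. The function names `edit_distance` and `align_label` are hypothetical and chosen for illustration only; the abstract does not state which OCR engine or which edit-distance variant the authors use, so plain Levenshtein distance over Unicode code points is assumed here.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def align_label(ocr_prediction: str, candidates: List[str]) -> Tuple[str, int]:
    """Return the scraped candidate closest to the OCR prediction and its distance.

    Ties are broken in favour of the earliest candidate (an assumption).
    """
    if not candidates:
        raise ValueError("no candidate lines to align against")
    scored = [(edit_distance(ocr_prediction, c), idx)
              for idx, c in enumerate(candidates)]
    best_distance, best_idx = min(scored)
    return candidates[best_idx], best_distance


if __name__ == "__main__":
    # Two lines of a well-known Ghalib couplet stand in for scraped candidates.
    candidates = ["دل ناداں تجھے ہوا کیا ہے", "آخر اس درد کی دوا کیا ہے"]
    # Simulate an imperfect OCR prediction by dropping one character.
    noisy = candidates[0].replace("ں", "", 1)
    label, distance = align_label(noisy, candidates)
    print(label, distance)
```

In this toy run the first candidate is selected at distance 1, since the simulated prediction differs from it by a single deleted character. A practical pipeline would likely also normalize the Unicode text before comparison and reject matches whose distance exceeds a threshold, leaving those images for manual review; both choices are assumptions rather than details given in the abstract.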
