Lines segmentation and word extraction of Arabic handwritten text

Asmae Lamsaf, M. A. Kerroum, S. Boulaknadel, Y. Fakhri
Published in: Proceedings of the 3rd International Conference on Smart City Applications, 2018-10-10
DOI: 10.1145/3286606.3286831 (https://doi.org/10.1145/3286606.3286831)
Citations: 3

Abstract

Words are often a succession of sub-words (characters, connected components) separated by spaces. In Arabic handwriting, these spaces fall into two types: the first type separates two connected components of the same word (within-word), and the second type separates two connected components belonging to two different words (between-words). Our work is concerned with the second type. Spaces in Arabic handwriting do not follow any rule, because each person has their own writing style, which makes segmentation into words more difficult. Word extraction is based on classifying the detected spaces and extracting the between-word spaces in order to segment the text into words. In this paper, we present a method that computes a threshold for each line: the threshold is not fixed across the document; instead, each line is associated with its own space-classification threshold. Before the text image is segmented into words, it must be segmented into lines so that our method can be applied to each line of text. To extract the lines, preprocessing is applied to the text images before the proposed line segmentation method is applied. Our system is evaluated on the benchmark Arabic Handwriting Database for text recognition (AHDB), and the experimental results are very promising: we achieved a word extraction success rate of 87.9%.
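
The idea of a per-line threshold for classifying spaces can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the threshold or the line segmentation is computed, so this sketch assumes (hypothetically) that lines are separated by ink-free rows, that gaps are ink-free column runs rather than gaps between true connected components, and that the threshold of a line is the mean width of its gaps. The names runs, segment_lines and extract_words are made up for illustration.

```python
from statistics import mean


def runs(flags):
    """Return (start, end) index pairs, end exclusive, for each run of True values."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out


def segment_lines(image):
    """Split a binary image (list of rows, 1 = ink) into text lines.

    ASSUMPTION: lines are separated by at least one ink-free row.
    """
    return [image[top:bottom] for top, bottom in runs([any(row) for row in image])]


def extract_words(line):
    """Return the (start_col, end_col) span, end exclusive, of each word in one line."""
    width = len(line[0])
    has_ink = [any(row[col] for row in line) for col in range(width)]
    parts = runs(has_ink)  # ink runs along the x axis (simplified sub-words)
    if len(parts) < 2:
        return parts
    gaps = [parts[i + 1][0] - parts[i][1] for i in range(len(parts) - 1)]
    threshold = mean(gaps)  # ASSUMPTION: per-line threshold = mean gap width
    words, start = [], parts[0][0]
    for i, gap in enumerate(gaps):
        if gap > threshold:  # between-words space: close the current word
            words.append((start, parts[i][1]))
            start = parts[i + 1][0]
    words.append((start, parts[-1][1]))
    return words


# Toy image: two text lines separated by a blank row.
rows = [
    "1101100001101000",  # gaps of 1, 4, 1 columns -> threshold 2
    "0000000000000000",
    "1100011000000011",  # gaps of 3, 7 columns    -> threshold 5
]
image = [[int(ch) for ch in row] for row in rows]

for line in segment_lines(image):
    print(extract_words(line))
# [(0, 5), (9, 13)]
# [(0, 7), (14, 16)]
```

In this toy input a 4-column gap is a between-words space on the first line, while a 3-column gap on the second line stays within a word because that line is written more loosely. A single document-wide threshold could not separate both lines this way, which is the motivation for a per-line threshold. Note that the mean is only one possible choice: for example, a line whose gaps are all equal would be returned as a single word.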