ResneSt-Transformer: Joint attention segmentation-free for end-to-end handwriting paragraph recognition model

Array (IF 2.3; JCR Q2, Computer Science, Theory & Methods). Pub Date: 2023-09-01. DOI: 10.1016/j.array.2023.100300
Mohammed Hamdan, Mohamed Cheriet
Citations: 0

Abstract


Offline handwritten text recognition (HTR) typically relies on segmented text-line images for training and transcription. However, acquiring line-level positions and transcripts is challenging and time-consuming, and automatic line segmentation algorithms are prone to errors that impede the recognition phase. To address these issues, we introduce a solution that integrates vision and language models using efficient split-attention and multi-head attention neural networks, referred to as joint attention (ResneSt-Transformer), for end-to-end recognition of handwritten paragraphs. The proposed one-stage, segmentation-free pipeline uses joint attention mechanisms to process paragraph images in an end-to-end trainable manner. The pipeline comprises three modules, with the output of each serving as the input to the next. First, a feature extraction module uses a CNN with a split attention mechanism (ResneSt50). Second, an encoder module with four transformer layers generates robust representations of the entire paragraph image. Third, a decoder module with six transformer layers constructs weighted masks. The encoder and decoder modules incorporate multi-head self-attention and positional encoding, enabling the model to concentrate on specific feature maps at the current time step. By combining joint attention with a segmentation-free approach, the network calculates split attention weights on the visual representation, which facilitates implicit line segmentation. This strategy is a substantial step toward end-to-end transcription of entire paragraphs. Experiments on paragraph-level benchmark datasets, including the RIMES, IAM, and READ 2016 test sets, demonstrate competitive results compared to recent paragraph-level models while maintaining reduced complexity. The code and pre-trained models are available in our GitHub repository here: HTTPS link.
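
To make the attention step described in the abstract more concrete, the following is a minimal, standard-library-only sketch of how positional encoding and scaled dot-product attention let a decoder step weight flattened feature-map positions. It is not the authors' implementation: the sinusoidal encoding variant, the single attention head, the toy feature values, and the names `sinusoidal_encoding`, `softmax`, and `attend` are all assumptions made for illustration. The abstract specifies only that multi-head self-attention and positional encoding are used.

```python
import math


def sinusoidal_encoding(num_positions, dim):
    """Fixed sinusoidal positional encoding (assumed variant, not confirmed by the abstract)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of channels (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector (one head, one step)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context


# Hypothetical toy "feature map": 6 flattened positions with 4 channels each.
features = [[0.1 * (p + c) for c in range(4)] for p in range(6)]
pos = sinusoidal_encoding(6, 4)

# Add position information so attention can tell locations apart.
memory = [[f + e for f, e in zip(frow, erow)] for frow, erow in zip(features, pos)]

# Stand-in for the decoder state at the current time step (assumption).
query = memory[2]
weights, context = attend(query, memory, memory)

print([round(w, 3) for w in weights])
```

The `weights` list is a probability distribution over feature-map positions, which is the sense in which attention can act as a soft mask over the paragraph image. A multi-head version would run several such attention computations in parallel on separate learned projections and concatenate the resulting context vectors.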
