
Latest publications from the 2019 International Conference on Document Analysis and Recognition (ICDAR)

A Synthetic Recipe for OCR
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00143
David Etter, Stephen Rawls, Cameron Carpenter, Gregory Sell
Synthetic data generation for optical character recognition (OCR) promises unlimited training data at zero annotation cost. With enough fonts and seed text, we should be able to generate data to train a model that approaches or exceeds the performance obtained with real annotated data. Unfortunately, this is not always the reality. Unconstrained image settings, such as internet memes, scanned web pages, or newspapers, present diverse scripts, fonts, layouts, and complex backgrounds, which cause models trained with synthetic data to break down. In this work, we investigate the synthetic image generation problem on a large multilingual set of unconstrained document images. Our work presents a comprehensive evaluation of the impact of synthetic data attributes on model performance. The results provide a recipe for synthetic data generation that will help guide future research.
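As a rough illustration of the basic recipe (seed text rendered with a chosen font over a simple background, plus mild degradation), the sketch below uses Pillow; `font_path` and `seed_text` are assumed inputs, and this is not the authors' generation pipeline.

```python
# A minimal sketch of synthetic text-line rendering for OCR training, assuming
# Pillow is available and a TrueType font path is supplied by the caller.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter

def render_line(seed_text: str, font_path: str, height: int = 48) -> Image.Image:
    font = ImageFont.truetype(font_path, size=int(height * 0.7))
    # Estimate text width with a throwaway canvas.
    probe = ImageDraw.Draw(Image.new("L", (1, 1)))
    left, top, right, bottom = probe.textbbox((0, 0), seed_text, font=font)
    width = (right - left) + 20
    # Random light background, dark ink, and optional blur as degradation.
    bg = random.randint(200, 255)
    img = Image.new("L", (width, height), color=bg)
    draw = ImageDraw.Draw(img)
    draw.text((10, (height - (bottom - top)) // 2), seed_text,
              fill=random.randint(0, 60), font=font)
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.3, 1.0)))
    return img

# Usage: render_line("hello world", "/path/to/font.ttf").save("sample.png")
```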
Citations: 16
ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00252
Chee-Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, Chuanming Fang, Shuaitao Zhang, Junyu Han, Errui Ding, Jingtuo Liu, Dimosthenis Karatzas, Chee Seng Chan, Lianwen Jin
This paper reports the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT, which consists of three major challenges: i) scene text detection, ii) scene text recognition, and iii) scene text spotting. A total of 78 submissions from 46 unique teams/individuals were received for this competition. The top-performing score of each task is as follows: i) T1 - 82.65%, ii) T2.1 - 74.3%, iii) T2.2 - 85.32%, iv) T3.1 - 53.86%, and v) T3.2 - 54.91%. Apart from the results, this paper also details the ArT dataset, the task descriptions, evaluation metrics and participants' methods. The dataset, the evaluation kit and the results are publicly available on the challenge website.
Citations: 124
A Comparative Study of Attention-Based Encoder-Decoder Approaches to Natural Scene Text Recognition
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00151
Fu'ze Cong, Wenping Hu, Qiang Huo, Li Guo
Attention-based encoder-decoder approaches have shown promising results in scene text recognition. In the literature, models with different encoders, decoders and attention mechanisms have been proposed and compared on isolated word recognition tasks, where the models are trained on either synthetic word images or a small set of real-world images. In this paper, we investigate different components of the attention-based framework and compare its performance with a CNN-DBLSTM-CTC based approach on large-scale real-world scene text sentence recognition tasks. We train character models using more than 1.6M real-world text lines and compare their performance on test sets collected from a variety of real-world scenarios. Our results show that (1) attention on a two-dimensional feature map can yield better performance than attention on a one-dimensional one, and an RNN-based decoder performs better than a CNN-based one; (2) attention-based approaches can achieve higher recognition accuracy than CNN-DBLSTM-CTC based approaches on isolated word recognition tasks, but perform worse on sentence recognition tasks; (3) it is more effective and efficient for CNN-DBLSTM-CTC based approaches to leverage an explicit language model to boost recognition accuracy.
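A minimal sketch of attention over a two-dimensional CNN feature map, one of the components compared above, is given below; the additive attention form, layer sizes, and shapes are illustrative assumptions rather than the paper's exact configuration.

```python
# Additive attention over a 2D feature map: the decoder state attends over all
# H*W positions of the CNN output instead of a collapsed 1D sequence.
import torch
import torch.nn as nn

class Attention2D(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feat_map: torch.Tensor, dec_state: torch.Tensor):
        # feat_map: (B, C, H, W) from the CNN encoder; dec_state: (B, hidden_dim).
        b, c, h, w = feat_map.shape
        feats = feat_map.flatten(2).transpose(1, 2)          # (B, H*W, C)
        energy = self.score(torch.tanh(
            self.proj_feat(feats) + self.proj_state(dec_state).unsqueeze(1)))
        alpha = torch.softmax(energy, dim=1)                 # weights over all H*W positions
        context = (alpha * feats).sum(dim=1)                 # (B, C) glimpse for this decode step
        return context, alpha.view(b, h, w)

# Usage: ctx, alpha = Attention2D(512, 256)(torch.randn(2, 512, 8, 32), torch.randn(2, 256))
```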
Citations: 11
Deep Splitting and Merging for Table Structure Decomposition
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00027
Chris Tensmeyer, Vlad I. Morariu, Brian L. Price, Scott D. Cohen, Tony R. Martinez
Given the large variety and complexity of tables, table structure extraction is a challenging task in automated document analysis systems. We present a pair of novel deep learning models (Split and Merge models) that, given an input image, 1) predict the basic table grid pattern and 2) predict which grid elements should be merged to recover cells that span multiple rows or columns. We propose projection pooling as a novel component of the Split model and grid pooling as a novel part of the Merge model. While most Fully Convolutional Networks rely on local evidence, these unique pooling regions allow our models to take advantage of the global table structure. We achieve state-of-the-art performance on the public ICDAR 2013 Table Competition dataset of PDF documents. On a much larger private dataset, which we used to train the models, we significantly outperform both a state-of-the-art deep model and a major commercial software system.
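The sketch below illustrates one plausible reading of projection pooling: features are averaged along one spatial axis and broadcast back, so every position sees row-wide or column-wide evidence. It is an illustrative interpretation, not the authors' exact layer.

```python
# Projection pooling: pool a feature map along one axis and broadcast the result
# back, giving each pixel access to global row (or column) statistics.
import torch

def projection_pool(feat: torch.Tensor, axis: str = "row") -> torch.Tensor:
    # feat: (B, C, H, W). "row" pools over the width, "column" over the height.
    if axis == "row":
        pooled = feat.mean(dim=3, keepdim=True)   # (B, C, H, 1)
    else:
        pooled = feat.mean(dim=2, keepdim=True)   # (B, C, 1, W)
    return pooled.expand_as(feat)                 # broadcast back to (B, C, H, W)

# Usage: row_context = projection_pool(torch.randn(1, 64, 100, 80), axis="row")
```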
Citations: 55
Improving Text Recognition using Optical and Language Model Writer Adaptation
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00190
Yann Soullard, Wassim Swaileh, Pierrick Tranouez, T. Paquet, Clément Chatelain
State-of-the-art methods for handwriting text recognition are based on deep learning approaches and language modeling that require large data sets during training. In practice, there are some applications in which the system processes mono-writer documents and would thus benefit from being trained on examples from that writer. However, it is not common to have numerous examples coming from just one writer. In this paper, we propose an approach to adapt both the optical model and the language model to a particular writer, starting from a generic system trained on large data sets with a variety of examples. We show the benefits of both optical and language model writer adaptation. Our approach reaches competitive results on the READ 2018 data set, which is dedicated to model adaptation to particular writers.
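A minimal sketch of the optical-model adaptation step is shown below: a generic recognizer is fine-tuned on a small set of writer-specific labeled lines with a reduced learning rate. The model, data loader, and CTC objective are placeholder assumptions; the paper's procedure, including language model adaptation, is more involved.

```python
# Fine-tune a pretrained optical model on one writer's labeled lines; the model is
# assumed to return (T, B, num_classes) log-probabilities suitable for CTC.
import torch

def adapt_to_writer(model, writer_loader, epochs: int = 5, lr: float = 1e-5):
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    ctc = torch.nn.CTCLoss(blank=0, zero_infinity=True)
    for _ in range(epochs):
        for images, targets, input_lengths, target_lengths in writer_loader:
            optimizer.zero_grad()
            log_probs = model(images)
            loss = ctc(log_probs, targets, input_lengths, target_lengths)
            loss.backward()
            optimizer.step()
    return model
```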
Citations: 13
TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00029
Shubham Paliwal, D. Vishwanath, R. Rahul, Monika Sharma, L. Vig
With the widespread use of mobile phones and scanners to photograph and upload documents, the need for extracting the information trapped in unstructured document images such as retail receipts, insurance claim forms and financial invoices is becoming more acute. A major hurdle to this objective is that these images often contain information in the form of tables, and extracting data from tabular sub-images presents a unique set of challenges. This includes accurate detection of the tabular region within an image, and subsequently detecting and extracting information from the rows and columns of the detected table. While some progress has been made in table detection, extracting the table contents is still a challenge since this involves more fine-grained table structure (rows and columns) recognition. Prior approaches have attempted to solve the table detection and structure recognition problems independently using two separate models. In this paper, we propose TableNet: a novel end-to-end deep learning model for both table detection and structure recognition. The model exploits the interdependence between the twin tasks of table detection and table structure recognition to segment out the table and column regions. This is followed by semantic rule-based row extraction from the identified tabular sub-regions. The proposed model and extraction approach were evaluated on the publicly available ICDAR 2013 and Marmot Table datasets, obtaining state-of-the-art results. Additionally, we demonstrate that feeding additional semantic features further improves model performance and that the model exhibits transfer learning across datasets. Another contribution of this paper is to provide additional table structure annotations for the Marmot data, which currently only has annotations for table detection.
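The sketch below illustrates the shared-encoder, twin-decoder idea: one convolutional encoder feeding two segmentation heads, one for the table mask and one for the column mask. Layer sizes and the upsampling scheme are illustrative assumptions; TableNet itself builds on a pretrained backbone.

```python
# One shared encoder, two per-pixel segmentation heads (table mask, column mask).
import torch
import torch.nn as nn

class TwinHeadSegmenter(nn.Module):
    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        def head():
            return nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(feat_ch, 1, 1),   # per-pixel logit
            )
        self.table_head, self.column_head = head(), head()

    def forward(self, x):
        shared = self.encoder(x)
        return self.table_head(shared), self.column_head(shared)

# Usage: table_logits, column_logits = TwinHeadSegmenter()(torch.randn(1, 3, 256, 256))
```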
Citations: 107
An Attention-Based End-to-End Model for Multiple Text Lines Recognition in Japanese Historical Documents
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00106
N. Ly, C. Nguyen, M. Nakagawa
This paper presents an attention-based convolutional sequence-to-sequence (ACseq2seq) model for recognizing an input image of multiple text lines from Japanese historical documents without explicit segmentation of lines. The recognition system has three main parts: a feature extractor using a Convolutional Neural Network (CNN) to extract a feature sequence from an input image; an encoder employing bidirectional Long Short-Term Memory (BLSTM) to encode the feature sequence; and a decoder using a unidirectional LSTM with the attention mechanism to generate the final target text based on the attended pertinent features. We also introduce a residual LSTM network between the attention vector and the softmax layer in the decoder. The system can be trained end-to-end with a standard cross-entropy loss function. In the experiments, we evaluate the performance of the ACseq2seq model on the anomalously deformed Kana datasets from the PRMU contest. The results show that our proposed model achieves higher recognition accuracy than state-of-the-art recognition methods on the anomalously deformed Kana datasets.
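A minimal sketch of the first two stages (CNN feature extractor followed by a BLSTM encoder) is given below; the attentional LSTM decoder is omitted for brevity. Channel sizes and the height-collapsing step are illustrative assumptions, not the paper's configuration.

```python
# CNN feature extractor followed by a bidirectional LSTM encoder over the
# horizontal feature sequence.
import torch
import torch.nn as nn

class CNNBLSTMEncoder(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.blstm = nn.LSTM(128, hidden, bidirectional=True, batch_first=True)

    def forward(self, images):
        # images: (B, 1, H, W) -> feature sequence along the horizontal axis.
        f = self.cnn(images)                 # (B, 128, H/4, W/4)
        f = f.mean(dim=2).transpose(1, 2)    # collapse height -> (B, W/4, 128)
        return self.blstm(f)[0]              # (B, W/4, 2*hidden)

# Usage: enc = CNNBLSTMEncoder()(torch.randn(2, 1, 64, 512))
```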
Citations: 12
Training Full-Page Handwritten Text Recognition Models without Annotated Line Breaks
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00011
Chris Tensmeyer, Curtis Wigington
Training Handwritten Text Recognition (HTR) models typically requires large amounts of labeled data, which often are line or page images with corresponding line-level ground truth (GT) transcriptions. Many digital collections have page-level transcriptions for each image, but the transcription is unformatted, i.e., line breaks are not annotated. Can we train line-based HTR models using such data? In this work, we present a novel alignment technique for segmenting page-level GT text into text lines during HTR model training. This text segmentation problem is formulated as an optimization problem that minimizes the cost of aligning predicted lines with the GT text. Using both simulated and HTR model predictions, we show that the alignment method identifies line breaks accurately, even when the predicted lines have high character error rates (CER). We removed the GT line breaks from the ICDAR-2017 READ dataset and trained an HTR model using the proposed alignment method to predict line breaks on-the-fly. This model achieves a CER comparable to the same model trained with the GT line breaks. Additionally, we downloaded an online digital collection of 50K English journal pages (not curated for HTR research) whose transcriptions do not contain line breaks, and achieved 11% CER.
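The sketch below is a simplified, greedy variant of the alignment idea: the unformatted page-level GT text is cut into lines by choosing, for each predicted line, the split point that minimizes edit distance within a small search window. The full method solves a global optimization; the greedy strategy and window size here are assumptions.

```python
# Greedy segmentation of page-level GT text into lines guided by predicted lines.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align_lines(gt_text: str, predicted_lines: list[str], window: int = 15) -> list[str]:
    gt_lines, pos = [], 0
    for k, pred in enumerate(predicted_lines):
        if k == len(predicted_lines) - 1:      # last line takes the remainder
            gt_lines.append(gt_text[pos:])
            break
        best_cut, best_cost = pos, None
        # Search cut points near the predicted line's length.
        for cut in range(max(pos, pos + len(pred) - window),
                         min(len(gt_text), pos + len(pred) + window) + 1):
            cost = edit_distance(gt_text[pos:cut], pred)
            if best_cost is None or cost < best_cost:
                best_cost, best_cut = cost, cut
        gt_lines.append(gt_text[pos:best_cut])
        pos = best_cut
    return gt_lines
```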
Citations: 13
Instance Aware Document Image Segmentation using Label Pyramid Networks and Deep Watershed Transformation
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00088
Xiaohui Li, Fei Yin, Tao Xue, Long Liu, J. Ogier, Cheng-Lin Liu
Segmentation of complex document images remains a challenge due to the large variability of layouts and image degradation. In this paper, we propose a method to segment complex document images based on a Label Pyramid Network (LPN) and the Deep Watershed Transform (DWT). The method can segment document images into instance-aware regions including text lines, text regions, figures, tables, etc. The backbone of the LPN can be any type of Fully Convolutional Network (FCN), and in training, label map pyramids on training images are provided to exploit the hierarchical boundary information of regions efficiently through multi-task learning. The label map pyramid is derived from the region class label map by distance transformation and multi-level thresholding. In segmentation, the outputs of the multiple tasks of the LPN are summed into a single probability map, on which the watershed transformation is carried out to segment the document image into instance-aware regions. In experiments on four public databases, our method is demonstrated to be effective and superior, yielding state-of-the-art performance for text line segmentation, baseline detection and region segmentation.
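The sketch below illustrates one way to build a label map pyramid from a binary region mask via distance transformation and multi-level thresholding, following the description above; the number of levels and the thresholds are illustrative assumptions, not the authors' settings.

```python
# Build progressively "shrunken" masks from a region mask: each level keeps only
# pixels sufficiently far from the region boundary, encoding boundary hierarchy.
import numpy as np
from scipy.ndimage import distance_transform_edt

def label_pyramid(region_mask: np.ndarray, levels: int = 3) -> list[np.ndarray]:
    # region_mask: binary (H, W) mask of one region class.
    dist = distance_transform_edt(region_mask)
    if dist.max() == 0:
        return [region_mask.astype(np.uint8)] * levels
    dist = dist / dist.max()
    thresholds = np.linspace(0.0, 0.6, levels)
    return [(dist > t).astype(np.uint8) for t in thresholds]
```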
Citations: 9
Residual BiRNN Based Seq2Seq Model with Transition Probability Matrix for Online Handwritten Mathematical Expression Recognition
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00107
Zelin Hong, Ning You, J. Tan, Ning Bi
In this paper, we present a Seq2Seq model for online handwritten mathematical expression recognition (OHMER), which consists of two major parts: an encoder based on a residual bidirectional RNN (BiRNN) that takes handwritten traces as the input, and a decoder augmented with a transition probability matrix that generates LaTeX notation. We employ residual connections in the BiRNN layers to improve feature extraction. A Markovian transition probability matrix is introduced in the decoder, and long-term information can be used at each decoding step through the joint probability. Furthermore, we analyze the impact of the novel encoder and the transition probability matrix through several specific instances. Experimental results on the CROHME 2014 and CROHME 2016 competition tasks show that our model outperforms the previous state-of-the-art single model while only using the official training dataset.
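A minimal sketch of biasing each decoding step with a transition probability matrix is given below: the decoder's per-symbol log-probabilities are combined with the log transition probability from the previously emitted symbol. Greedy decoding and the interpolation weight are assumptions; the paper integrates the matrix through a joint probability during decoding.

```python
# Combine decoder log-probs with Markovian transition log-probs at each step.
import numpy as np

def greedy_decode_with_transitions(step_log_probs: np.ndarray,
                                   log_trans: np.ndarray,
                                   start_symbol: int,
                                   lam: float = 0.3) -> list[int]:
    # step_log_probs: (T, V) decoder log-probs; log_trans: (V, V) log P(next | prev).
    prev, out = start_symbol, []
    for t in range(step_log_probs.shape[0]):
        joint = step_log_probs[t] + lam * log_trans[prev]   # joint score over the vocabulary
        prev = int(np.argmax(joint))
        out.append(prev)
    return out
```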
Citations: 10