A Synthetic Recipe for OCR
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00143
David Etter, Stephen Rawls, Cameron Carpenter, Gregory Sell
Synthetic data generation for optical character recognition (OCR) promises unlimited training data at zero annotation cost. With enough fonts and seed text, we should be able to generate data to train a model that approaches or exceeds the performance of one trained on real annotated data. Unfortunately, this is not always the reality. Unconstrained image settings, such as internet memes, scanned web pages, or newspapers, present diverse scripts, fonts, layouts, and complex backgrounds, which cause models trained on synthetic data to break down. In this work, we investigate the synthetic image generation problem on a large multilingual set of unconstrained document images. Our work presents a comprehensive evaluation of the impact of synthetic data attributes on model performance. The results provide a recipe for synthetic data generation that will help guide future research.
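For readers unfamiliar with this kind of pipeline, the sketch below shows one minimal way to render synthetic training lines from seed text, random fonts, and background crops. The font and background directories, sizes, and padding are illustrative assumptions; the paper's generator and its attribute controls are not reproduced here.

```python
# Minimal sketch of a synthetic text-line generator: render seed text with a
# randomly chosen font onto a crop of a real background image.
import random
from pathlib import Path

from PIL import Image, ImageDraw, ImageFont  # Pillow >= 8.0 for textbbox

def render_line(text, font_paths, background_paths, height=48):
    """Render one text-line image with a random font and background crop."""
    font = ImageFont.truetype(random.choice(font_paths), size=height - 10)
    # Measure the rendered text to size the canvas.
    dummy = ImageDraw.Draw(Image.new("RGB", (1, 1)))
    x0, y0, x1, y1 = dummy.textbbox((0, 0), text, font=font)
    width = (x1 - x0) + 20
    # A random background crop adds the clutter the model must learn to ignore.
    bg = Image.open(random.choice(background_paths)).convert("RGB")
    bg = bg.resize((max(width, bg.width), max(height, bg.height)))
    canvas = bg.crop((0, 0, width, height))
    draw = ImageDraw.Draw(canvas)
    draw.text((10, 5 - y0), text, font=font, fill=(0, 0, 0))
    return canvas, text  # image plus its transcription label

if __name__ == "__main__":
    fonts = [str(p) for p in Path("fonts").glob("*.ttf")]            # assumed layout
    backgrounds = [str(p) for p in Path("backgrounds").glob("*.jpg")]
    img, label = render_line("synthetic OCR training line", fonts, backgrounds)
    img.save("sample.png")
```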
ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text - RRC-ArT
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00252
Chee-Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, Chuanming Fang, Shuaitao Zhang, Junyu Han, Errui Ding, Jingtuo Liu, Dimosthenis Karatzas, Chee Seng Chan, Lianwen Jin
This paper reports on the ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT), which consists of three major challenges: i) scene text detection, ii) scene text recognition, and iii) scene text spotting. A total of 78 submissions from 46 unique teams/individuals were received for this competition. The top-performing score of each task is as follows: i) T1 - 82.65%, ii) T2.1 - 74.3%, iii) T2.2 - 85.32%, iv) T3.1 - 53.86%, and v) T3.2 - 54.91%. Apart from the results, this paper also details the ArT dataset, the task descriptions, the evaluation metrics, and the participants' methods. The dataset, the evaluation kit, and the results are publicly available at the challenge website.
A Comparative Study of Attention-Based Encoder-Decoder Approaches to Natural Scene Text Recognition
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00151
Fu'ze Cong, Wenping Hu, Qiang Huo, Li Guo
Attention-based encoder-decoder approaches have shown promising results in scene text recognition. In the literature, models with different encoders, decoders, and attention mechanisms have been proposed and compared on isolated word recognition tasks, where the models are trained on either synthetic word images or a small set of real-world images. In this paper, we investigate different components of the attention-based framework and compare its performance with a CNN-DBLSTM-CTC based approach on large-scale real-world scene text sentence recognition tasks. We train character models using more than 1.6M real-world text lines and compare their performance on test sets collected from a variety of real-world scenarios. Our results show that (1) attention over a two-dimensional feature map can yield better performance than over a one-dimensional one, and an RNN-based decoder performs better than a CNN-based one; (2) attention-based approaches can achieve higher recognition accuracy than CNN-DBLSTM-CTC based approaches on isolated word recognition tasks, but perform worse on sentence recognition tasks; (3) it is more effective and efficient for CNN-DBLSTM-CTC based approaches to leverage an explicit language model to boost recognition accuracy.
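As a toy illustration of the first finding, the NumPy sketch below contrasts attention over a one-dimensional feature sequence (the CNN map collapsed over height) with attention over the full two-dimensional map. The additive-attention form and all dimensions are generic assumptions, not the authors' exact architecture.

```python
import numpy as np

def additive_attention(query, keys):
    """Additive (Bahdanau-style) attention; keys has shape (N, d)."""
    scores = np.tanh(keys + query) @ np.ones(keys.shape[1])  # (N,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys  # context vector, shape (d,)

d = 32
query = np.random.randn(d)
feat_2d = np.random.randn(8, 25, d)     # (height, width, channels) CNN feature map

# 1-D attention: the map is first collapsed to a width-long sequence.
feat_1d = feat_2d.mean(axis=0)          # (width, channels)
ctx_1d = additive_attention(query, feat_1d)

# 2-D attention: every spatial position keeps its own key, so tall or skewed
# characters are not averaged away before the decoder sees them.
ctx_2d = additive_attention(query, feat_2d.reshape(-1, d))

print(ctx_1d.shape, ctx_2d.shape)       # (32,) (32,)
```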
Deep Splitting and Merging for Table Structure Decomposition
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00027
Chris Tensmeyer, Vlad I. Morariu, Brian L. Price, Scott D. Cohen, Tony R. Martinez
Given the large variety and complexity of tables, table structure extraction is a challenging task in automated document analysis systems. We present a pair of novel deep learning models (the Split and Merge models) that, given an input image, 1) predict the basic table grid pattern and 2) predict which grid elements should be merged to recover cells that span multiple rows or columns. We propose projection pooling as a novel component of the Split model and grid pooling as a novel part of the Merge model. While most Fully Convolutional Networks rely on local evidence, these unique pooling regions allow our models to take advantage of the global table structure. We achieve state-of-the-art performance on the public ICDAR 2013 Table Competition dataset of PDF documents. On a much larger private dataset, which we used to train the models, we significantly outperform both a state-of-the-art deep model and a major commercial software system.
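The following NumPy sketch illustrates the projection-pooling idea named above: each position in a feature map is replaced by the average of its entire row or column, so local predictions can draw on evidence spanning the whole table. The shapes and the concatenation with local features are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def row_projection_pool(features):
    """features: (H, W, C). Average over the width and broadcast back."""
    row_means = features.mean(axis=1, keepdims=True)   # (H, 1, C)
    return np.broadcast_to(row_means, features.shape)  # (H, W, C)

def col_projection_pool(features):
    """Average over the height and broadcast back to every row."""
    col_means = features.mean(axis=0, keepdims=True)   # (1, W, C)
    return np.broadcast_to(col_means, features.shape)

feat = np.random.randn(64, 128, 16)
# A Split-style model could concatenate such pooled maps with the local
# features before predicting row/column separators.
augmented = np.concatenate(
    [feat, row_projection_pool(feat), col_projection_pool(feat)], axis=-1
)
print(augmented.shape)  # (64, 128, 48)
```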
Improving Text Recognition using Optical and Language Model Writer Adaptation
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00190
Yann Soullard, Wassim Swaileh, Pierrick Tranouez, T. Paquet, Clément Chatelain
State-of-the-art methods for handwritten text recognition are based on deep learning approaches and language modeling that require large data sets during training. In practice, there are some applications where the system processes mono-writer documents and would thus benefit from being trained on examples from that writer. However, it is not common to have numerous examples coming from just one writer. In this paper, we propose an approach to adapt both the optical model and the language model to a particular writer, starting from a generic system trained on large data sets with a variety of examples. We show the benefits of both optical and language model writer adaptation. Our approach reaches competitive results on the READ 2018 data set, which is dedicated to model adaptation to particular writers.
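As one concrete (and simplified) reading of the language-model side of such adaptation, the sketch below interpolates a generic character bigram model with one estimated on the writer's few transcribed lines. The bigram order and interpolation weight are assumptions; the paper's actual adaptation scheme may differ.

```python
from collections import Counter, defaultdict

def bigram_probs(lines):
    """Estimate P(current char | previous char) from a list of text lines."""
    counts, totals = defaultdict(Counter), Counter()
    for line in lines:
        for prev, cur in zip(" " + line, line):
            counts[prev][cur] += 1
            totals[prev] += 1
    return {p: {c: n / totals[p] for c, n in cs.items()} for p, cs in counts.items()}

def interpolate(generic, writer, lam=0.3):
    """P_adapted(c|p) = lam * P_writer(c|p) + (1 - lam) * P_generic(c|p)."""
    adapted = {}
    for prev in set(generic) | set(writer):
        chars = set(generic.get(prev, {})) | set(writer.get(prev, {}))
        adapted[prev] = {
            c: lam * writer.get(prev, {}).get(c, 0.0)
            + (1 - lam) * generic.get(prev, {}).get(c, 0.0)
            for c in chars
        }
    return adapted

generic_lm = bigram_probs(["the quick brown fox", "handwriting recognition"])
writer_lm = bigram_probs(["thee olde handes"])   # the writer's few transcribed lines
adapted_lm = interpolate(generic_lm, writer_lm)
```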
TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00029
Shubham Paliwal, D. Vishwanath, R. Rahul, Monika Sharma, L. Vig
With the widespread use of mobile phones and scanners to photograph and upload documents, the need to extract the information trapped in unstructured document images such as retail receipts, insurance claim forms, and financial invoices is becoming more acute. A major hurdle to this objective is that these images often contain information in the form of tables, and extracting data from tabular sub-images presents a unique set of challenges. These include accurately detecting the tabular region within an image and subsequently detecting and extracting information from the rows and columns of the detected table. While some progress has been made in table detection, extracting the table contents is still a challenge, since this involves more fine-grained recognition of table structure (rows and columns). Prior approaches have attempted to solve the table detection and structure recognition problems independently using two separate models. In this paper, we propose TableNet: a novel end-to-end deep learning model for both table detection and structure recognition. The model exploits the interdependence between the twin tasks of table detection and table structure recognition to segment out the table and column regions. This is followed by semantic rule-based row extraction from the identified tabular sub-regions. The proposed model and extraction approach were evaluated on the publicly available ICDAR 2013 and Marmot Table datasets, obtaining state-of-the-art results. Additionally, we demonstrate that feeding additional semantic features further improves model performance and that the model exhibits transfer learning across datasets. Another contribution of this paper is to provide additional table structure annotations for the Marmot data, which currently only has annotations for table detection.
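To make the "semantic rule-based row extraction" step concrete, the sketch below groups OCR word boxes that fall inside a detected table into rows by vertical overlap. The input box format and the overlap threshold are hypothetical; TableNet's own rules are not specified here.

```python
def group_rows(word_boxes, min_overlap=0.5):
    """word_boxes: list of (x0, y0, x1, y1, text), already inside the table."""
    rows = []
    for box in sorted(word_boxes, key=lambda b: b[1]):        # top-to-bottom
        x0, y0, x1, y1, text = box
        placed = False
        for row in rows:
            ry0, ry1 = row["y0"], row["y1"]
            overlap = max(0, min(y1, ry1) - max(y0, ry0))
            if overlap / max(1e-6, min(y1 - y0, ry1 - ry0)) >= min_overlap:
                row["words"].append(box)
                row["y0"], row["y1"] = min(ry0, y0), max(ry1, y1)
                placed = True
                break
        if not placed:
            rows.append({"y0": y0, "y1": y1, "words": [box]})
    # Sort the words of each row left-to-right to recover the cell order.
    return [sorted(r["words"], key=lambda b: b[0]) for r in rows]

cells = group_rows([(10, 5, 60, 20, "Name"), (80, 6, 140, 21, "Amount"),
                    (10, 30, 60, 45, "Tea"), (80, 31, 140, 46, "3.50")])
print([[w[4] for w in row] for row in cells])  # [['Name', 'Amount'], ['Tea', '3.50']]
```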
An Attention-Based End-to-End Model for Multiple Text Lines Recognition in Japanese Historical Documents
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00106
N. Ly, C. Nguyen, M. Nakagawa
This paper presents an attention-based convolutional sequence-to-sequence (ACseq2seq) model for recognizing an input image of multiple text lines from Japanese historical documents without explicit segmentation of lines. The recognition system has three main parts: a feature extractor using a Convolutional Neural Network (CNN) to extract a feature sequence from the input image; an encoder employing bidirectional Long Short-Term Memory (BLSTM) to encode the feature sequence; and a decoder using a unidirectional LSTM with an attention mechanism to generate the final target text based on the attended pertinent features. We also introduce a residual LSTM network between the attention vector and the softmax layer in the decoder. The system can be trained end-to-end with a standard cross-entropy loss function. In the experiments, we evaluate the performance of the ACseq2seq model on the anomalously deformed Kana datasets from the PRMU contest. The results show that our proposed model achieves higher recognition accuracy than state-of-the-art recognition methods on these datasets.
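The toy NumPy sketch below illustrates the idea of a residual connection between the attention (context) vector and the softmax layer: the context both feeds a simplified recurrent update and skips around it. The cell, shapes, and initialization are assumptions, not the authors' exact layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 64, 100
W_r = rng.normal(scale=0.1, size=(d, d))     # recurrent transform of the context
W_out = rng.normal(scale=0.1, size=(vocab, d))

def decode_step(ctx, state):
    """One decoding step: recurrent update plus a residual skip of ctx."""
    state = np.tanh(W_r @ ctx + state)       # simplified recurrent cell
    residual = state + ctx                   # residual connection around the cell
    logits = W_out @ residual
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), state

ctx = rng.normal(size=d)                     # attention-weighted encoder features
probs, state = decode_step(ctx, np.zeros(d))
print(probs.shape, probs.sum())              # (100,) 1.0 (up to float rounding)
```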
Training Full-Page Handwritten Text Recognition Models without Annotated Line Breaks
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00011
Chris Tensmeyer, Curtis Wigington
Training Handwritten Text Recognition (HTR) models typically requires large amounts of labeled data, which often consist of line or page images with corresponding line-level ground truth (GT) transcriptions. Many digital collections have page-level transcriptions for each image, but the transcription is unformatted, i.e., line breaks are not annotated. Can we train line-based HTR models using such data? In this work, we present a novel alignment technique for segmenting page-level GT text into text lines during HTR model training. This text segmentation problem is formulated as an optimization problem that minimizes the cost of aligning predicted lines with the GT text. Using both simulated and HTR model predictions, we show that the alignment method identifies line breaks accurately, even when the predicted lines have high character error rates (CER). We removed the GT line breaks from the ICDAR-2017 READ dataset and trained an HTR model using the proposed alignment method to predict line breaks on the fly. This model achieves CER comparable to that of the same model trained with the GT line breaks. Additionally, we downloaded an online digital collection of 50K English journal pages (not curated for HTR research) whose transcriptions do not contain line breaks, and achieve 11% CER.
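A hedged sketch of the stated objective follows: split the unformatted page-level GT into as many consecutive segments as there are predicted lines so that the total edit distance to the predictions is minimal. This brute-force dynamic program only illustrates the optimization problem; the paper's alignment algorithm is presumably more efficient.

```python
def levenshtein(a, b):
    """Standard edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def align_lines(page_text, predicted_lines):
    """Split page_text into len(predicted_lines) segments with minimal total cost."""
    n, k = len(page_text), len(predicted_lines)
    INF = float("inf")
    # best[i][m] = min cost of matching the first m predictions to page_text[:i]
    best = [[INF] * (k + 1) for _ in range(n + 1)]
    back = [[0] * (k + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for m in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(i):
                if best[j][m - 1] == INF:
                    continue
                cost = best[j][m - 1] + levenshtein(page_text[j:i], predicted_lines[m - 1])
                if cost < best[i][m]:
                    best[i][m], back[i][m] = cost, j
    # Recover the segment boundaries, i.e. the inferred line breaks.
    cuts, i = [], n
    for m in range(k, 0, -1):
        cuts.append((back[i][m], i))
        i = back[i][m]
    return [page_text[a:b] for a, b in reversed(cuts)]

gt = "the quick brown fox jumps over the lazy dog"
preds = ["the quick brwn fox", "jumps ovr the lazy dog"]  # noisy HTR output
print(align_lines(gt, preds))
```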
Instance Aware Document Image Segmentation using Label Pyramid Networks and Deep Watershed Transformation
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00088
Xiaohui Li, Fei Yin, Tao Xue, Long Liu, J. Ogier, Cheng-Lin Liu
Segmentation of complex document images remains a challenge due to the large variability of layouts and image degradation. In this paper, we propose a method to segment complex document images based on a Label Pyramid Network (LPN) and the Deep Watershed Transform (DWT). The method can segment document images into instance-aware regions including text lines, text regions, figures, tables, etc. The backbone of the LPN can be any type of Fully Convolutional Network (FCN), and in training, label map pyramids of the training images are provided to efficiently exploit the hierarchical boundary information of regions through multi-task learning. The label map pyramid is derived from the region class label map by distance transformation and multi-level thresholding. At segmentation time, the outputs of the multiple LPN tasks are summed into a single probability map, on which the watershed transformation is carried out to segment the document image into instance-aware regions. In experiments on four public databases, our method proves effective, yielding state-of-the-art performance for text line segmentation, baseline detection, and region segmentation.
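The sketch below shows one plausible construction of a label map pyramid from a binary region mask via distance transformation and multi-level thresholding, as the abstract describes. The number of levels and the threshold values are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt  # pip install scipy

def label_pyramid(region_mask, levels=3):
    """region_mask: binary (H, W) array, 1 inside a region instance."""
    dist = distance_transform_edt(region_mask)        # distance to the region boundary
    if dist.max() > 0:
        dist = dist / dist.max()                      # normalise to [0, 1]
    # Each level keeps only pixels deeper inside the region than a threshold,
    # giving progressively eroded targets whose sum approximates a watershed
    # energy map at inference time.
    thresholds = np.linspace(0.0, 0.7, levels)
    return [(dist > t).astype(np.uint8) for t in thresholds]

mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 8:56] = 1                                 # one rectangular region
pyramid = label_pyramid(mask)
energy = sum(level.astype(np.float32) for level in pyramid)
print([lvl.sum() for lvl in pyramid], energy.max())
```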
Residual BiRNN Based Seq2Seq Model with Transition Probability Matrix for Online Handwritten Mathematical Expression Recognition
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00107
Zelin Hong, Ning You, J. Tan, Ning Bi
In this paper, we present a Seq2Seq model for online handwritten mathematical expression recognition (OHMER), which consists of two major parts: a residual bidirectional RNN (BiRNN) based encoder that takes handwritten traces as input, and a decoder, augmented with a transition probability matrix, that generates LaTeX notation. We employ residual connections in the BiRNN layers to improve feature extraction. A Markovian transition probability matrix is introduced in the decoder so that long-term information can be used at each decoding step through the joint probability. Furthermore, we analyze the impact of the novel encoder and the transition probability matrix through several specific instances. Experimental results on the CROHME 2014 and CROHME 2016 competition tasks show that our model outperforms the previous state-of-the-art single model while using only the official training dataset.
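The toy sketch below shows one way a Markov transition probability matrix can be folded into decoding: the decoder's softmax output is re-weighted by the transition probability from the previously emitted LaTeX symbol and renormalized. The tiny vocabulary, matrix values, and simple product rule are assumptions used only to illustrate the joint-probability idea.

```python
import numpy as np

vocab = ["x", "+", "1", "^", "{", "}"]
V = len(vocab)

# Transition matrix T[i, j] = P(next = vocab[j] | previous = vocab[i]); in
# practice this would be estimated from LaTeX ground-truth sequences.
T = np.full((V, V), 1.0 / V)
T[vocab.index("^"), vocab.index("{")] = 0.9   # '^' is almost always followed by '{'
T[vocab.index("^")] /= T[vocab.index("^")].sum()

def rescore(decoder_probs, prev_symbol):
    """Combine the decoder output with the transition prior and renormalise."""
    joint = decoder_probs * T[vocab.index(prev_symbol)]
    return joint / joint.sum()

decoder_probs = np.array([0.25, 0.05, 0.30, 0.05, 0.30, 0.05])  # ambiguous step
print(vocab[int(np.argmax(rescore(decoder_probs, "^")))])       # '{'
```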