
Latest publications from the 2019 International Conference on Document Analysis and Recognition (ICDAR)

ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00244
Zheng Huang, Kai Chen, Jianhua He, X. Bai, Dimosthenis Karatzas, Shijian Lu, C. V. Jawahar
The ICDAR 2019 Challenge on "Scanned receipts OCR and key information extraction" (SROIE) covers important aspects related to the automated analysis of scanned receipts. The SROIE tasks play a key role in many document analysis systems and hold significant commercial potential. Although a lot of work has been published over the years on administrative document analysis, the community has advanced relatively slowly, as most datasets have been kept private. One of the key contributions of SROIE to the document analysis community is to offer a first, standardized dataset of 1000 whole scanned receipt images and annotations, as well as an evaluation procedure for such tasks. The Challenge is structured around three tasks, namely Scanned Receipt Text Localization (Task 1), Scanned Receipt OCR (Task 2) and Key Information Extraction from Scanned Receipts (Task 3). The competition opened on 10th February, 2019 and closed on 5th May, 2019. We received 29, 24 and 18 valid submissions for the three competition tasks, respectively. This report presents the competition datasets, defines the tasks and the evaluation protocols, offers detailed submission statistics, and analyzes the performance of the submitted methods. While the tasks of text localization and recognition seem to be relatively easy to tackle, it is interesting to observe the variety of ideas and approaches proposed for the information extraction task. Judging from the submissions' performance, we believe there is still margin for improving information extraction results, although the current dataset would have to grow substantially in future editions. Given the success of the SROIE competition, evidenced by the wide interest generated and the healthy number of submissions from academia, research institutes and industry across different countries, we consider that the SROIE competition can evolve into a useful resource for the community, drawing further attention and promoting research and development efforts in this field.
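The abstract mentions an evaluation procedure but does not spell it out. A common way to score key information extraction (Task 3) is field-level exact-match F1 over the extracted key fields; the sketch below illustrates that metric only. The `FIELDS` list, the `field_f1` helper and the toy receipts are hypothetical stand-ins, not the official SROIE evaluation script.

```python
from typing import Dict, List

# Hypothetical per-receipt annotations: field name -> extracted string.
FIELDS = ["company", "date", "address", "total"]  # assumed SROIE key fields

def field_f1(preds: List[Dict[str, str]], gts: List[Dict[str, str]]) -> float:
    """Exact-match F1 over all predicted/ground-truth fields."""
    tp = fp = fn = 0
    for pred, gt in zip(preds, gts):
        for key in FIELDS:
            p, g = pred.get(key), gt.get(key)
            if p is not None and g is not None and p.strip() == g.strip():
                tp += 1
            else:
                if p is not None:
                    fp += 1  # extracted but wrong or spurious
                if g is not None:
                    fn += 1  # annotated but missed or wrong
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Example: one receipt, three of four fields extracted correctly.
pred = [{"company": "ACME SDN BHD", "date": "01/02/2019", "total": "12.30"}]
gt = [{"company": "ACME SDN BHD", "date": "01/02/2019", "address": "JALAN 1", "total": "12.30"}]
print(round(field_f1(pred, gt), 3))  # 0.857
```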
Citations: 152
Fast Text/non-Text Image Classification with Knowledge Distillation
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00234
Miao Zhao, Rui-Qi Wang, Fei Yin, Xu-Yao Zhang, Lin-Lin Huang, J. Ogier
How to efficiently judge whether a natural image contains text is an important problem: text detection and recognition algorithms are usually time-consuming, and it is unnecessary to run them on images that do not contain any text. In this paper, we investigate this problem from two perspectives: speed and accuracy. First, to achieve the high speed needed to efficiently filter large numbers of images, especially on CPU, we propose using a small and shallow convolutional neural network, in which features from different layers are adaptively pooled to fixed sizes to overcome the difficulties caused by multiple scales and varied locations. Although this achieves high speed, its accuracy is not satisfactory due to the limited capacity of the small network. Therefore, our second contribution is to use knowledge distillation to improve the accuracy of the small network, by constructing a larger and deeper neural network as a teacher network to instruct the learning process of the small network. With these two strategies, we achieve both high speed and high accuracy for filtering scene text images. Experimental results on a benchmark dataset show the effectiveness of our method: the teacher network yields state-of-the-art performance, and the distilled small network achieves high performance while maintaining high speed, running 176 times faster on CPU and 3.8 times faster on GPU than a compared benchmark method.
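The abstract does not give the distillation objective itself. A standard choice for distilling a large teacher into a small text/non-text classifier is Hinton-style soft-target distillation, sketched below in PyTorch; the temperature `T`, mixing weight `alpha` and the two-class toy setup are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Soft-target KL term (scaled by T^2) plus the usual hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: a batch of 8 images, 2 classes (text / non-text).
student_logits = torch.randn(8, 2, requires_grad=True)
teacher_logits = torch.randn(8, 2)          # teacher is frozen in practice
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```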
Citations: 4
An Attention-Based End-to-End Model for Multiple Text Lines Recognition in Japanese Historical Documents
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00106
N. Ly, C. Nguyen, M. Nakagawa
This paper presents an attention-based convolutional sequence to sequence (ACseq2seq) model for recognizing an input image of multiple text lines from Japanese historical documents without explicit segmentation of lines. The recognition system has three main parts: a feature extractor using Convolutional Neural Network (CNN) to extract a feature sequence from an input image; an encoder employing bidirectional Long Short-Term Memory (BLSTM) to encode the feature sequence; and a decoder using a unidirectional LSTM with the attention mechanism to generate the final target text based on the attended pertinent features. We also introduce a residual LSTM network between the attention vector and softmax layer in the decoder. The system can be trained end-to-end by a standard cross-entropy loss function. In the experiment, we evaluate the performance of the ACseq2seq model on the anomalously deformed Kana datasets in the PRMU contest. The results of the experiments show that our proposed model achieves higher recognition accuracy than the state-of-the-art recognition methods on the anomalously deformed Kana datasets.
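The abstract specifies a CNN feature extractor, a BLSTM encoder and an attention-equipped LSTM decoder, but not the attention formulation. A minimal sketch of one common choice, additive (Bahdanau-style) attention, is given below; the dimensions and the `AdditiveAttention` module are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score encoder states against the decoder state."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (B, T, enc_dim), dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=1)                 # (B, T)
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)   # (B, enc_dim)
        return context, weights

# Toy shapes: a CNN+BLSTM encoder output of length 50, 256-dim; a 256-dim decoder LSTM state.
attn = AdditiveAttention(enc_dim=256, dec_dim=256, attn_dim=128)
context, weights = attn(torch.randn(2, 50, 256), torch.randn(2, 256))
```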
Citations: 12
Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00034
Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri
Texts in Indic languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent fusion using conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units derived from texts in such languages correlate highly with the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have been used previously with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short-Term Memory (LSTM) based OCR correction models, which have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values derived from the training data. This formulation results in faster convergence and better accuracy of the error correction model on four languages of varying complexity. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and further observe large gains in F-scores as well as word-level accuracy.
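The first encoding described above feeds sub-word frequency values into the correction model. The sketch below shows one plausible way to derive such features from boundary-marked character n-grams (the same sub-word scheme fastText uses); the n-gram range, the toy corpus and the helper names are illustrative, not the authors' code.

```python
from collections import Counter
from typing import List

def char_ngrams(word: str, n_min: int = 2, n_max: int = 4) -> List[str]:
    """Character n-grams of a word, with boundary markers (as fastText does)."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def ngram_frequency_features(word: str, corpus_counts: Counter) -> List[int]:
    """Frequency-derived signal per n-gram of an OCR token, looked up in a clean corpus."""
    return [corpus_counts.get(g, 0) for g in char_ngrams(word)]

# Toy corpus vocabulary (stand-in for a clean language corpus).
corpus_counts = Counter()
for w in ["संस्कृत", "भाषा", "भारत"]:          # illustrative tokens, not the paper's data
    corpus_counts.update(char_ngrams(w))

print(ngram_frequency_features("भाष", corpus_counts))
```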
Citations: 1
Selective Super-Resolution for Scene Text Images
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00071
Ryo Nakao, Brian Kenji Iwana, S. Uchida
In this paper, we address super-resolution of images containing scene text. Specifically, we propose Super-Resolution Convolutional Neural Networks (SRCNNs) constructed to tackle issues associated with characters and text. We demonstrate that standard SRCNNs trained for general object super-resolution are not sufficient and that the proposed method is viable for creating a robust model for text. To do so, we analyze the characteristics of SRCNNs through quantitative and qualitative evaluations on scene text data. In addition, we analyze the correlation between layers using Singular Vector Canonical Correlation Analysis (SVCCA) and compare the filters of each SRCNN using t-SNE. Furthermore, in order to create a unified super-resolution model specialized for both text and objects, we combine SRCNNs trained on the different data types with Content-wise Network Fusion (CNF). We integrate the SRCNN trained on character images with the SRCNN trained on general object images, and verify the accuracy improvement on scene images that include text. We also examine how each SRCNN affects super-resolution images after fusion.
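For reference, the SRCNN building block mentioned above follows the classic three-layer design (9-1-5 kernels on a bicubic-upscaled input). The sketch below uses the original SRCNN defaults, which may differ from the exact configuration trained in the paper.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """The classic 3-layer SRCNN (9-1-5 kernels) operating on a bicubic-upscaled input."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x):
        return self.body(x)

# Two copies could be trained, one on character crops and one on general objects,
# and then fused (the paper uses Content-wise Network Fusion; the fusion is not sketched here).
model_text = SRCNN()
lr = torch.randn(1, 1, 64, 256)      # a bicubic-upscaled grayscale text-line crop
sr = model_text(lr)                  # same spatial size, sharpened
```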
Citations: 3
Field Typing for Improved Recognition on Heterogeneous Handwritten Forms
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00084
Ciprian Tomoiaga, Paul Feng, M. Salzmann, PA Jayet
Offline handwriting recognition has undergone continuous progress over the past decades. However, existing methods are typically benchmarked on free-form text datasets that are biased towards good-quality images and handwriting styles, and homogeneous content. In this paper, we show that state-of-the-art algorithms, employing long short-term memory (LSTM) layers, do not readily generalize to real-world structured documents, such as forms, due to their highly heterogeneous and out-of-vocabulary content, and to the inherent ambiguities of this content. To address this, we propose to leverage the content type within an LSTM-based architecture. Furthermore, we introduce a procedure to generate synthetic data to train this architecture without requiring expensive manual annotations. We demonstrate the effectiveness of our approach at transcribing text on a challenging, real-world dataset of European Accident Statements.
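The abstract says the content type is leveraged within the LSTM-based architecture but does not say how. One simple way to condition a CRNN-style recognizer on a field type is to concatenate a learned type embedding with the visual features at every timestep, as in the hedged sketch below; the module name, dimensions and CTC-style head are all hypothetical.

```python
import torch
import torch.nn as nn

class TypedLineRecognizer(nn.Module):
    """CRNN-style recognizer whose LSTM input is augmented with a field-type embedding."""
    def __init__(self, feat_dim=256, n_types=8, type_dim=32, hidden=256, n_classes=100):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, type_dim)   # e.g. date, name, plate number...
        self.lstm = nn.LSTM(feat_dim + type_dim, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)      # per-timestep logits (e.g. for CTC)

    def forward(self, visual_feats, field_type):
        # visual_feats: (B, T, feat_dim) from a CNN backbone; field_type: (B,) integer labels
        t = self.type_emb(field_type).unsqueeze(1).expand(-1, visual_feats.size(1), -1)
        out, _ = self.lstm(torch.cat([visual_feats, t], dim=-1))
        return self.head(out)

logits = TypedLineRecognizer()(torch.randn(4, 40, 256), torch.tensor([0, 2, 2, 5]))
```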
Citations: 2
A Handwritten Chinese Text Recognizer Applying Multi-level Multimodal Fusion Network
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00235
Yuhuan Xiu, Qingqing Wang, Hongjian Zhan, Man Lan, Yue Lu
Handwritten Chinese text recognition (HCTR) has received extensive attention from the pattern recognition community in the past decades. Most existing deep learning methods consist of two stages, i.e., training a text recognition network on the basis of visual information, followed by incorporating language constraints through various language models. Therefore, the inherent linguistic semantic information is often neglected when designing the recognition network. To tackle this problem, in this work we propose a novel multi-level multimodal fusion network and properly embed it into an attention-based LSTM so that both the visual information and the linguistic semantic information can be fully leveraged when predicting sequential outputs from the feature vectors. Experimental results on the ICDAR-2013 competition dataset demonstrate results comparable to state-of-the-art approaches.
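The fusion mechanism itself is not described in the abstract. Purely as an illustration of fusing a visual feature with a linguistic (semantic) feature, the sketch below uses a learned gate; this is a generic pattern, not the paper's multi-level fusion network.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gate that mixes a visual feature with a linguistic (e.g. character-embedding) feature."""
    def __init__(self, vis_dim, lang_dim, out_dim):
        super().__init__()
        self.proj_v = nn.Linear(vis_dim, out_dim)
        self.proj_l = nn.Linear(lang_dim, out_dim)
        self.gate = nn.Linear(vis_dim + lang_dim, out_dim)

    def forward(self, v, l):
        g = torch.sigmoid(self.gate(torch.cat([v, l], dim=-1)))
        return g * self.proj_v(v) + (1 - g) * self.proj_l(l)

# Toy dimensions: 512-dim visual feature, 256-dim linguistic feature.
fused = GatedFusion(512, 256, 512)(torch.randn(4, 512), torch.randn(4, 256))
```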
Citations: 7
LPGA: Line-of-Sight Parsing with Graph-Based Attention for Math Formula Recognition
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00109
Mahshad Mahdavi, M. Condon, Kenny Davila
We present a model for recognizing typeset math formula images from connected components or symbols. In our approach, connected components are used to construct a line-of-sight (LOS) graph. The graph is used both to reduce the search space for formula structure interpretations, and to guide a classification attention model using separate channels for inputs and their local visual context. For classification, we used visual densities with Random Forests for initial development, and then converted this to a Convolutional Neural Network (CNN) with a second branch to capture context for each input image. Formula structure is extracted as a directed spanning tree from a weighted LOS graph using Edmonds' algorithm. We obtain strong results for formulas without grids or matrices in the InftyCDB-2 dataset (90.89% from components, 93.5% from symbols). Using tools from the CROHME handwritten formula recognition competitions, we were able to compile all symbol and structure recognition errors for analysis. Our data and source code are publicly available.
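To make the structure-extraction step concrete: given a weighted, directed LOS graph whose edge weights come from the relation classifier, Edmonds' algorithm returns the maximum-weight spanning arborescence. The toy graph, symbols and weights below are invented for illustration; networkx provides the algorithm.

```python
import networkx as nx

# Hypothetical symbols of the formula "x^2 + 1" with LOS edges scored by a relation
# classifier (higher weight = more likely parent->child relation).
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("x", "2", 0.9),   # x --superscript--> 2
    ("x", "+", 0.8),   # x --right--> +
    ("+", "1", 0.85),  # + --right--> 1
    ("2", "+", 0.3),   # lower-scored competing hypotheses
    ("+", "x", 0.2),
    ("1", "+", 0.1),
])

# Edmonds' algorithm: the maximum-weight spanning arborescence is the formula's tree structure.
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))   # [('+', '1'), ('x', '+'), ('x', '2')]
```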
Citations: 15
GARN: A Novel Generative Adversarial Recognition Network for End-to-End Scene Character Recognition
Pub Date : 2019-09-01 DOI: 10.1109/ICDAR.2019.00115
Hao Kong, Dongqi Tang, Xi Meng, Tong Lu
Deep neural networks have shown powerful ability in scene character recognition tasks; however, in real-life applications it is often hard to find a large number of high-quality scene character images for training these networks. In this paper, we propose a novel end-to-end network named the Generative Adversarial Recognition Network (GARN) for accurate natural scene character recognition. The proposed GARN consists of a generation part and a classification part. The generation part produces diverse, realistic samples to help the classifier overcome the overfitting problem, while in the classification part a multinomial classifier is trained along with the generator in the form of a game to achieve better character recognition performance. That is, the proposed GARN augments scene character data with its generation part and recognizes scene characters with its classification part. It is trained in an adversarial way to improve recognition performance. The experimental results on benchmark datasets and the comparisons with state-of-the-art methods show the effectiveness of the proposed GARN in scene character recognition.
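The abstract does not detail GARN's losses or architecture. The sketch below shows one generic way to couple a conditional generator with a (K+1)-way character classifier in an adversarial loop, where the extra class marks generated samples; every component, dimension and loss here is an assumption, not GARN itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

K, Z = 10, 64                                    # character classes, noise dimension
G = nn.Sequential(nn.Linear(Z + K, 256), nn.ReLU(), nn.Linear(256, 32 * 32))   # toy generator
C = nn.Sequential(nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, K + 1))   # toy classifier
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_c = torch.optim.Adam(C.parameters(), lr=2e-4)

real_x = torch.rand(16, 32 * 32)                 # pretend batch of character crops
real_y = torch.randint(0, K, (16,))

for step in range(2):                            # a couple of toy iterations
    # Classifier step: real samples keep their label, generated ones get the extra class K.
    z = torch.cat([torch.randn(16, Z), F.one_hot(real_y, K).float()], dim=1)
    fake_x = G(z).detach()
    loss_c = F.cross_entropy(C(real_x), real_y) + \
             F.cross_entropy(C(fake_x), torch.full((16,), K, dtype=torch.long))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Generator step: fool the classifier into the conditioned character class.
    fake_x = G(torch.cat([torch.randn(16, Z), F.one_hot(real_y, K).float()], dim=1))
    loss_g = F.cross_entropy(C(fake_x), real_y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```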
Citations: 4
Deep Visual Template-Free Form Parsing
Pub Date : 2019-09-01 DOI: 10.1109/icdar.2019.00030
Brian L. Davis, B. Morse, Scott D. Cohen, Brian L. Price, Chris Tensmeyer
Automatic, template-free extraction of information from form images is challenging due to the variety of form layouts. This is even more challenging for historical forms due to noise and degradation. A crucial part of the extraction process is associating input text with pre-printed labels. We present a learned, template-free solution to detecting pre-printed text and input text/handwriting and predicting pair-wise relationships between them. While previous approaches to this problem have been focused on clean images and clear layouts, we show our approach is effective in the domain of noisy, degraded, and varied form images. We introduce a new dataset of historical form images (late 1800s, early 1900s) for training and validating our approach. Our method uses a convolutional network to detect pre-printed text and input text lines. We pool features from the detection network to classify possible relationships in a language-agnostic way. We show that our proposed pairing method outperforms heuristic rules and that visual features are critical to obtaining high accuracy.
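The abstract states that features pooled from the detection network are used to classify pairwise relationships. The sketch below shows one plausible shape for such a pair classifier, combining the pooled features of two detections with their normalized box geometry; the dimensions and the geometry encoding are assumptions.

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Scores whether a detected pre-printed label and an input-text line belong together,
    from pooled detector features plus simple box geometry."""
    def __init__(self, feat_dim=256, geom_dim=8, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + geom_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_a, feat_b, boxes_a, boxes_b):
        # Geometry: normalized box coordinates of both candidates (x1, y1, x2, y2).
        geom = torch.cat([boxes_a, boxes_b], dim=-1)
        return self.mlp(torch.cat([feat_a, feat_b, geom], dim=-1)).squeeze(-1)  # pairing logit

pairer = PairClassifier()
logits = pairer(torch.randn(5, 256), torch.randn(5, 256), torch.rand(5, 4), torch.rand(5, 4))
probs = torch.sigmoid(logits)   # probability that each candidate pair is a true label/value pair
```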
Citations: 28