Selective Super-Resolution for Scene Text Images
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00071
Ryo Nakao, Brian Kenji Iwana, S. Uchida
In this paper, we enhance super-resolution for images containing scene text. Specifically, we propose Super-Resolution Convolutional Neural Networks (SRCNNs) constructed to tackle issues associated with characters and text. We demonstrate that a standard SRCNN trained for general object super-resolution is not sufficient, and that the proposed method is viable for creating a robust model for text. To do so, we analyze the characteristics of SRCNNs through quantitative and qualitative evaluations on scene text data. In addition, we analyze the correlation between layers using Singular Vector Canonical Correlation Analysis (SVCCA) and compare the filters of each SRCNN using t-SNE. Furthermore, in order to create a unified super-resolution model specialized for both text and objects, we use SRCNNs trained on the different data types combined with Content-wise Network Fusion (CNF). We integrate the SRCNN trained on character images with the SRCNN trained on general object images, and verify the accuracy improvement on scene images that include text. We also examine how each SRCNN affects the super-resolved images after fusion.
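A minimal sketch of how the Content-wise Network Fusion step might look, assuming the classic three-layer SRCNN (9-1-5 kernels) for both branches and a 1×1 convolution as the fusion layer; the paper does not specify these details, so layer sizes, the fusion operator, and all names below are illustrative assumptions (PyTorch):

```python
# Minimal sketch of Content-wise Network Fusion (CNF) over two SRCNN branches.
# Layer sizes follow the classic 3-layer SRCNN (9-1-5); the fusion layer and
# training details are assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),           nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=5, padding=2),
        )

    def forward(self, x):          # x: bicubically upscaled low-res image
        return self.body(x)

class CNFSuperResolver(nn.Module):
    """Fuses a text-specialized SRCNN and an object-specialized SRCNN."""
    def __init__(self, srcnn_text, srcnn_object):
        super().__init__()
        self.srcnn_text = srcnn_text      # pre-trained on character/text crops
        self.srcnn_object = srcnn_object  # pre-trained on general object images
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)  # content-wise weighting

    def forward(self, x):
        y_text = self.srcnn_text(x)
        y_obj = self.srcnn_object(x)
        return self.fuse(torch.cat([y_text, y_obj], dim=1))

if __name__ == "__main__":
    model = CNFSuperResolver(SRCNN(), SRCNN())
    lr_up = torch.rand(1, 1, 64, 64)   # dummy grayscale patch
    print(model(lr_up).shape)          # torch.Size([1, 1, 64, 64])
```

In such a setup the two branches would be pre-trained separately on character crops and on general object images, with the fusion layer (or the full stack) then fine-tuned on mixed scene-text data.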
{"title":"Selective Super-Resolution for Scene Text Images","authors":"Ryo Nakao, Brian Kenji Iwana, S. Uchida","doi":"10.1109/ICDAR.2019.00071","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00071","url":null,"abstract":"In this paper, we realize the enhancement of super-resolution using images with scene text. Specifically, this paper proposes the use of Super-Resolution Convolutional Neural Networks (SRCNN) which are constructed to tackle issues associated with characters and text. We demonstrate that standard SRCNNs trained for general object super-resolution is not sufficient and that the proposed method is a viable method in creating a robust model for text. To do so, we analyze the characteristics of SRCNNs through quantitative and qualitative evaluations with scene text data. In addition, analysis using the correlation between layers by Singular Vector Canonical Correlation Analysis (SVCCA) and comparison of filters of each SRCNN using t-SNE is performed. Furthermore, in order to create a unified super-resolution model specialized for both text and objects, a model using SRCNNs trained with the different data types and Content-wise Network Fusion (CNF) is used. We integrate the SRCNN trained for character images and then SRCNN trained for general object images, and verify the accuracy improvement of scene images which include text. We also examine how each SRCNN affects super-resolution images after fusion.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126436414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field Typing for Improved Recognition on Heterogeneous Handwritten Forms
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00084
Ciprian Tomoiaga, Paul Feng, M. Salzmann, PA Jayet
Offline handwriting recognition has undergone continuous progress over the past decades. However, existing methods are typically benchmarked on free-form text datasets that are biased towards good-quality images and handwriting styles, and homogeneous content. In this paper, we show that state-of-the-art algorithms employing long short-term memory (LSTM) layers do not readily generalize to real-world structured documents, such as forms, due to their highly heterogeneous, out-of-vocabulary content and its inherent ambiguities. To address this, we propose to leverage the content type within an LSTM-based architecture. Furthermore, we introduce a procedure for generating synthetic data to train this architecture without requiring expensive manual annotations. We demonstrate the effectiveness of our approach at transcribing text on a challenging, real-world dataset of European Accident Statements.
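The abstract does not detail where the content type enters the network; a minimal sketch, assuming a learned type embedding concatenated to the visual feature sequence before a bidirectional LSTM with a CTC-style output head, could look as follows (all layer sizes and names are illustrative assumptions):

```python
# Illustrative sketch of conditioning an LSTM-based line recognizer on a field
# ("content") type, e.g. date, name, or address. The embedding concatenation
# point and all layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

class TypedLineRecognizer(nn.Module):
    def __init__(self, feat_dim=128, n_types=8, type_dim=16,
                 hidden=256, n_classes=100):
        super().__init__()
        self.type_emb = nn.Embedding(n_types, type_dim)
        self.blstm = nn.LSTM(feat_dim + type_dim, hidden,
                             num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_classes)  # CTC logits per time step

    def forward(self, feats, field_type):
        # feats: (B, T, feat_dim) visual features from a CNN column encoder
        # field_type: (B,) integer id of the expected content type
        t = self.type_emb(field_type)                       # (B, type_dim)
        t = t.unsqueeze(1).expand(-1, feats.size(1), -1)    # broadcast over time
        out, _ = self.blstm(torch.cat([feats, t], dim=-1))
        return self.head(out).log_softmax(-1)               # for nn.CTCLoss

if __name__ == "__main__":
    model = TypedLineRecognizer()
    logits = model(torch.rand(4, 50, 128), torch.tensor([0, 1, 2, 3]))
    print(logits.shape)  # torch.Size([4, 50, 100])
```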
{"title":"Field Typing for Improved Recognition on Heterogeneous Handwritten Forms","authors":"Ciprian Tomoiaga, Paul Feng, M. Salzmann, PA Jayet","doi":"10.1109/ICDAR.2019.00084","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00084","url":null,"abstract":"Offline handwriting recognition has undergone continuous progress over the past decades. However, existing methods are typically benchmarked on free-form text datasets that are biased towards good-quality images and handwriting styles, and homogeneous content. In this paper, we show that state-of-the-art algorithms, employing long short-term memory (LSTM) layers, do not readily generalize to real-world structured documents, such as forms, due to their highly heterogeneous and out-of-vocabulary content, and to the inherent ambiguities of this content. To address this, we propose to leverage the content type within an LSTM-based architecture. Furthermore, we introduce a procedure to generate synthetic data to train this architecture without requiring expensive manual annotations. We demonstrate the effectiveness of our approach at transcribing text on a challenging, real-world dataset of European Accident Statements.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125815823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00034
Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri
Texts in Indic languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent word fusion governed by conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units derived from texts in such languages correlate highly with the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have previously been used with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short-Term Memory (LSTM) based OCR correction models, which have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values derived from the training data. This formulation results in faster convergence and better accuracy of the error correction model on four different languages of varying complexity. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and observe a further large gain in F-scores as well as word-level accuracy.
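As a rough illustration of the second encoding, the snippet below trains fastText embeddings on sub-word units, here taken to be overlapping character n-grams with word-boundary markers; the actual unit extraction, corpus, and hyper-parameters used in the paper are not specified, so everything below is an assumption (gensim):

```python
# Sketch of training fastText embeddings on sub-word units (here: overlapping
# character n-grams of OCR lines), one of the two encodings described above.
# The n-gram size, corpus handling, and hyper-parameters are assumptions.
from gensim.models import FastText

def subword_units(line, n=3):
    """Split a line into overlapping character n-grams, keeping word bounds."""
    units = []
    for word in line.split():
        word = f"<{word}>"                       # mark word boundaries
        units += [word[i:i + n] for i in range(len(word) - n + 1)]
    return units

# Tiny illustrative corpus; in practice this would be the OCR output plus a
# clean language corpus for the target Indic language.
corpus_lines = ["this is ocr text", "language text goes here"]
corpus = [subword_units(line) for line in corpus_lines]

model = FastText(sentences=corpus, vector_size=64, window=5,
                 min_count=1, min_n=2, max_n=4, epochs=20)

# Embedding for one sub-word unit, to be concatenated with the character-level
# inputs of the LSTM correction model.
print(model.wv["<th"].shape)   # (64,)
```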
{"title":"Sub-Word Embeddings for OCR Corrections in Highly Fusional Indic Languages","authors":"Rohit Saluja, Mayur Punjabi, Mark J. Carman, Ganesh Ramakrishnan, P. Chaudhuri","doi":"10.1109/ICDAR.2019.00034","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00034","url":null,"abstract":"Texts in Indic Languages contain a large proportion of out-of-vocabulary (OOV) words due to frequent fusion using conjoining rules (of which there are around 4000 in Sanskrit). OCR errors further accentuate this complexity for the error correction systems. Variations of sub-word units such as n-grams, possibly encapsulating the context, can be extracted from the OCR text as well as the language text individually. Some of the sub-word units that are derived from the texts in such languages highly correlate to the word conjoining rules. Signals such as frequency values (on a corpus) associated with such sub-word units have been used previously with log-linear classifiers for detecting errors in Indic OCR texts. We explore two different encodings to capture such signals and augment the input to Long Short Term Memory (LSTM) based OCR correction models, that have proven useful in the past for jointly learning the language as well as OCR-specific confusions. The first type of encoding makes direct use of sub-word unit frequency values, derived from the training data. The formulation results in faster convergence and better accuracy values of the error correction model on four different languages with varying complexities. The second type of encoding makes use of trainable sub-word embeddings. We introduce a new procedure for training fastText embeddings on the sub-word units and further observe a large gain in F-Scores, as well as word-level accuracy values.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126290142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Blind Source Separation Based Framework for Multispectral Document Images Binarization
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00237
Abderrahmane Rahiche, A. Bakhta, M. Cheriet
In this paper, we propose a novel Blind Source Separation (BSS) based framework for multispectral (MS) document image binarization. The framework takes advantage of the multidimensional data representation of MS images and uses Graph regularized Non-negative Matrix Factorization (GNMF) to decompose MS document images into their constituent components, i.e., foreground (text, ink), background (paper, parchment), degradation information, etc. The proposed framework is validated on two different real-world datasets of manuscript images, showing a high capability of dealing with variable numbers of bands regardless of the acquisition protocol, different types of degradation, and illumination non-uniformity, while outperforming the results reported in the state of the art. Although the focus is on binary separation (i.e., foreground/background), the proposed framework is also used to decompose document images into separate background, text, and degradation components, which allows full source separation and enables further analysis and characterization of each component. A comparative study is performed against Independent Component Analysis (ICA) and Principal Component Analysis (PCA). Our framework is also validated on a third dataset of MS images of natural objects to demonstrate its generalizability beyond document samples.
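For intuition, the sketch below shows the matrix-factorization view of this decomposition on a dummy multispectral stack. scikit-learn only ships plain NMF, so the graph-regularization term that distinguishes GNMF is omitted, and the band count, number of components, and the way a "text" component is binarized are illustrative assumptions:

```python
# Simplified sketch of the matrix-factorization view of MS binarization.
# scikit-learn provides only plain NMF, so the graph-regularization term of
# GNMF is omitted here; band count, component count, and the way the text
# component is picked are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

# Dummy multispectral stack: H x W pixels, B spectral bands (non-negative).
H, W, B = 64, 64, 8
ms_image = np.random.rand(H, W, B)

V = ms_image.reshape(-1, B)            # (pixels x bands) data matrix
nmf = NMF(n_components=3, init="nndsvda", max_iter=500, random_state=0)
A = nmf.fit_transform(V)               # abundance of each component per pixel
S = nmf.components_                    # spectral signature of each component

# Reshape the per-pixel abundances back into component maps
# (e.g. foreground ink, background support, degradation).
component_maps = A.reshape(H, W, 3)

# Crude binarization of one component map with a global threshold.
text_map = component_maps[..., 0]
binary = text_map > text_map.mean()
print(binary.shape, binary.dtype)
```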
{"title":"Blind Source Separation Based Framework for Multispectral Document Images Binarization","authors":"Abderrahmane Rahiche, A. Bakhta, M. Cheriet","doi":"10.1109/ICDAR.2019.00237","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00237","url":null,"abstract":"In this paper, we propose a novel Blind Source Separation (BSS) based framework for multispectral (MS) document images binarization. This framework takes advantage of the multidimensional data representation of MS images and makes use of the Graph regularized Non-negative Matrix Factorization (GNMF) to decompose MS document images into their different constituting components, i.e., foreground (text, ink), background (paper, parchment), degradation information, etc. The proposed framework is validated on two different real-world data sets of manuscript images showing a high capability of dealing with: variable numbers of bands regardless of the acquisition protocol, different types of degradations, and illumination non-uniformity while outperforming the results reported in the state-of-the-art. Although the focus was put on the binary separation (i.e., foreground/background), the proposed framework is also used for the decomposition of document images into different components, i.e., background, text, and degradation, which allows full sources separation, whereby further analysis and characterization of each component can be possible. A comparative study is performed using Independent Component Analysis (ICA) and Principal Component Analysis (PCA) methods. Our framework is also validated on another third dataset of MS images of natural objects to demonstrate its generalizability beyond document samples.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114442377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Visual Template-Free Form Parsing
Pub Date: 2019-09-01 | DOI: 10.1109/icdar.2019.00030
Brian L. Davis, B. Morse, Scott D. Cohen, Brian L. Price, Chris Tensmeyer
Automatic, template-free extraction of information from form images is challenging due to the variety of form layouts. This is even more challenging for historical forms due to noise and degradation. A crucial part of the extraction process is associating input text with pre-printed labels. We present a learned, template-free solution for detecting pre-printed text and input text/handwriting and predicting pair-wise relationships between them. While previous approaches to this problem have focused on clean images and clear layouts, we show that our approach is effective on noisy, degraded, and varied form images. We introduce a new dataset of historical form images (late 1800s, early 1900s) for training and validating our approach. Our method uses a convolutional network to detect pre-printed text and input text lines. We pool features from the detection network to classify possible relationships in a language-agnostic way. We show that our proposed pairing method outperforms heuristic rules and that visual features are critical to obtaining high accuracy.
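A hedged sketch of the pairing step: pool backbone features for a candidate (pre-printed label, input text) box pair with RoIAlign and score the pair with a small MLP. The feature dimensions, pooling size, and classifier shape are assumptions, not the authors' exact design (PyTorch/torchvision):

```python
# Sketch of the language-agnostic pairing step: pool detector features for a
# candidate (pre-printed label, input text) box pair and classify whether they
# are related. Feature dimensions, pooling size, and the MLP are assumptions.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class PairClassifier(nn.Module):
    def __init__(self, channels=256, pool=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels * pool * pool, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 1),               # logit: "these two boxes pair up"
        )
        self.pool = pool

    def forward(self, fmap, boxes_a, boxes_b, spatial_scale=1.0):
        # fmap: (1, C, H, W) feature map from the detection backbone
        # boxes_*: (K, 4) boxes in (x1, y1, x2, y2) image coordinates
        idx = torch.zeros(boxes_a.size(0), 1)            # single-image batch
        ra = roi_align(fmap, torch.cat([idx, boxes_a], 1),
                       self.pool, spatial_scale=spatial_scale)
        rb = roi_align(fmap, torch.cat([idx, boxes_b], 1),
                       self.pool, spatial_scale=spatial_scale)
        pair = torch.cat([ra.flatten(1), rb.flatten(1)], dim=1)
        return self.mlp(pair).squeeze(1)

if __name__ == "__main__":
    clf = PairClassifier()
    fmap = torch.rand(1, 256, 64, 64)
    labels = torch.tensor([[4.0, 4.0, 20.0, 12.0]])
    inputs = torch.tensor([[24.0, 4.0, 60.0, 12.0]])
    print(clf(fmap, labels, inputs).shape)   # torch.Size([1])
```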
{"title":"Deep Visual Template-Free Form Parsing","authors":"Brian L. Davis, B. Morse, Scott D. Cohen, Brian L. Price, Chris Tensmeyer","doi":"10.1109/icdar.2019.00030","DOIUrl":"https://doi.org/10.1109/icdar.2019.00030","url":null,"abstract":"Automatic, template-free extraction of information from form images is challenging due to the variety of form layouts. This is even more challenging for historical forms due to noise and degradation. A crucial part of the extraction process is associating input text with pre-printed labels. We present a learned, template-free solution to detecting pre-printed text and input text/handwriting and predicting pair-wise relationships between them. While previous approaches to this problem have been focused on clean images and clear layouts, we show our approach is effective in the domain of noisy, degraded, and varied form images. We introduce a new dataset of historical form images (late 1800s, early 1900s) for training and validating our approach. Our method uses a convolutional network to detect pre-printed text and input text lines. We pool features from the detection network to classify possible relationships in a language-agnostic way. We show that our proposed pairing method outperforms heuristic rules and that visual features are critical to obtaining high accuracy.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"18 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127824650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LPGA: Line-of-Sight Parsing with Graph-Based Attention for Math Formula Recognition
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00109
Mahshad Mahdavi, M. Condon, Kenny Davila
We present a model for recognizing typeset math formula images from connected components or symbols. In our approach, connected components are used to construct a line-of-sight (LOS) graph. The graph is used both to reduce the search space for formula structure interpretations, and to guide a classification attention model using separate channels for inputs and their local visual context. For classification, we used visual densities with Random Forests for initial development, and then converted this to a Convolutional Neural Network (CNN) with a second branch to capture context for each input image. Formula structure is extracted as a directed spanning tree from a weighted LOS graph using Edmonds' algorithm. We obtain strong results for formulas without grids or matrices in the InftyCDB-2 dataset (90.89% from components, 93.5% from symbols). Using tools from the CROHME handwritten formula recognition competitions, we were able to compile all symbol and structure recognition errors for analysis. Our data and source code are publicly available.
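The tree-extraction step can be illustrated with networkx, whose Edmonds implementation returns a maximum-weight spanning arborescence of a directed graph. The symbols and edge scores below are toy values standing in for the relationship classifier's outputs:

```python
# Sketch of the structure step: score directed symbol-pair relationships,
# place them as weighted edges on the line-of-sight graph, and extract the
# formula tree with Edmonds' algorithm. The symbols and scores are toy values.
import networkx as nx

# Toy LOS graph for "x^2 + 1": nodes are recognized symbols, directed edges
# carry the classifier's confidence that the parent-child relation holds.
G = nx.DiGraph()
scored_edges = [
    ("x", "2", 0.9),   # "2" is superscript of "x"
    ("2", "x", 0.1),
    ("x", "+", 0.8),   # "+" is to the right of "x"
    ("+", "x", 0.2),
    ("+", "1", 0.85),
    ("1", "+", 0.15),
]
G.add_weighted_edges_from(scored_edges)

# Edmonds' algorithm: maximum-weight directed spanning tree (arborescence).
tree = nx.maximum_spanning_arborescence(G, attr="weight")
print(sorted(tree.edges()))   # [('+', '1'), ('x', '+'), ('x', '2')]
```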
{"title":"LPGA: Line-of-Sight Parsing with Graph-Based Attention for Math Formula Recognition","authors":"Mahshad Mahdavi, M. Condon, Kenny Davila","doi":"10.1109/ICDAR.2019.00109","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00109","url":null,"abstract":"We present a model for recognizing typeset math formula images from connected components or symbols. In our approach, connected components are used to construct a line-of-sight (LOS) graph. The graph is used both to reduce the search space for formula structure interpretations, and to guide a classification attention model using separate channels for inputs and their local visual context. For classification, we used visual densities with Random Forests for initial development, and then converted this to a Convolutional Neural Network (CNN) with a second branch to capture context for each input image. Formula structure is extracted as a directed spanning tree from a weighted LOS graph using Edmonds' algorithm. We obtain strong results for formulas without grids or matrices in the InftyCDB-2 dataset (90.89% from components, 93.5% from symbols). Using tools from the CROHME handwritten formula recognition competitions, we were able to compile all symbol and structure recognition errors for analysis. Our data and source code are publicly available.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130149470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Do You Need More Data? The DeepSignDB On-Line Handwritten Signature Biometric Database
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00185
Rubén Tolosana, R. Vera-Rodríguez, Julian Fierrez, A. Morales, J. Ortega-Garcia
Data have become one of the most valuable assets in this new era, where deep learning seems to overtake traditional approaches. However, in some tasks, such as the verification of handwritten signatures, publicly available data are scarce, which makes it difficult to test the real limits of deep learning. In addition to the lack of public data, it is not easy to evaluate the improvements of novel approaches over the state of the art, as different experimental protocols and conditions are usually considered for different signature databases. To tackle these problems, the main contribution of this study is twofold: i) we present and describe the new DeepSignDB on-line handwritten signature biometric public database, and ii) we propose a standard experimental protocol and benchmark for the research community to perform a fair comparison of novel approaches with the state of the art. The DeepSignDB database is obtained by combining some of the most popular on-line signature databases with a novel dataset not previously published. It comprises more than 70K signatures acquired using both stylus and finger inputs from a total of 1526 users. Two acquisition scenarios are considered, office and mobile, with a total of 8 different devices. Additionally, different types of impostors and numbers of acquisition sessions are considered across the database. The DeepSignDB database and benchmark results are available on GitHub.
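As an illustration of the kind of verification metric such a benchmark typically reports, the snippet below computes an Equal Error Rate from genuine and impostor comparison scores; the scores are synthetic toy values and this is not the official DeepSignDB protocol:

```python
# Illustrative sketch of a standard verification metric: the Equal Error Rate
# (EER) of a signature verifier, computed from genuine/impostor comparison
# scores. Toy scores only; not the official DeepSignDB protocol.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 = genuine pair, 0 = impostor pair; higher score = more genuine."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))      # operating point where FPR ~= FNR
    return (fpr[idx] + fnr[idx]) / 2

rng = np.random.default_rng(0)
genuine = rng.normal(0.7, 0.15, 500)           # toy genuine comparison scores
impostor = rng.normal(0.4, 0.15, 500)          # toy skilled-forgery scores
labels = np.r_[np.ones(500), np.zeros(500)]
scores = np.r_[genuine, impostor]
print(f"EER: {100 * equal_error_rate(labels, scores):.2f}%")
```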
{"title":"Do You Need More Data? The DeepSignDB On-Line Handwritten Signature Biometric Database","authors":"Rubén Tolosana, R. Vera-Rodríguez, Julian Fierrez, A. Morales, J. Ortega-Garcia","doi":"10.1109/ICDAR.2019.00185","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00185","url":null,"abstract":"Data have become one of the most valuable things in this new era where deep learning technology seems to overcome traditional approaches. However, in some tasks, such as the verification of handwritten signatures, the amount of publicly available data is scarce, what makes difficult to test the real limits of deep learning. In addition to the lack of public data, it is not easy to evaluate the improvements of novel approaches compared with the state of the art as different experimental protocols and conditions are usually considered for different signature databases. To tackle all these mentioned problems, the main contribution of this study is twofold: i) we present and describe the new DeepSignDB on-line handwritten signature biometric public database, and ii) we propose a standard experimental protocol and benchmark to be used for the research community in order to perform a fair comparison of novel approaches with the state of the art. The DeepSignDB database is obtained through the combination of some of the most popular on-line signature databases, and a novel dataset not presented yet. It comprises more than 70K signatures acquired using both stylus and finger inputs from a total of 1526 users. Two acquisition scenarios are considered, office and mobile, with a total of 8 different devices. Additionally, different types of impostors and number of acquisition sessions are considered along the database. The DeepSignDB and benchmark results are available in GitHub.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123879340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rethinking Semantic Segmentation for Table Structure Recognition in Documents
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00225
Shoaib Ahmed Siddiqui, Pervaiz Iqbal Khan, A. Dengel, Sheraz Ahmed
Based on recent advancements in the domain of semantic segmentation, Fully-Convolutional Networks (FCNs) have been successfully applied to the task of table structure recognition in the past. We analyze the efficacy of semantic segmentation networks for this purpose and simplify the problem by proposing prediction tiling based on the consistency assumption that holds for tabular structures. For an image of dimensions H × W, we predict a single column for the rows (ŷ_row ∊ ℝ^H) and a single row for the columns (ŷ_col ∊ ℝ^W). We use a dual-headed architecture in which the initial feature maps (from the encoder-decoder model) are shared, while the last two layers generate class-specific (row/column) predictions. This allows us to generate predictions for both rows and columns simultaneously using a single model, whereas previous methods relied on two separate models for inference. With the proposed method, we achieve state-of-the-art results on the ICDAR-13 image-based table structure recognition dataset with an average F-Measure of 92.39% (91.90% and 92.88% F-Measure for rows and columns, respectively). The obtained results advocate that constraining the problem space of FCNs by imposing valid constraints can lead to significant performance gains.
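A minimal sketch of the dual-headed, prediction-tiling idea: a shared (here heavily simplified) encoder-decoder, one head that collapses the width axis into a per-row prediction, and one that collapses the height axis into a per-column prediction. Layer sizes and the use of mean pooling for the collapse are assumptions:

```python
# Minimal sketch of a dual-headed FCN for prediction tiling: shared features,
# one head per axis. The shared stack stands in for the real encoder-decoder.
import torch
import torch.nn as nn

class DualHeadedTableFCN(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.shared = nn.Sequential(                # stand-in encoder-decoder
            nn.Conv2d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.row_head = nn.Conv2d(ch, 1, 1)         # row-separator logits
        self.col_head = nn.Conv2d(ch, 1, 1)         # column-separator logits

    def forward(self, x):                           # x: (B, 1, H, W)
        f = self.shared(x)
        y_row = self.row_head(f).mean(dim=3)        # tile over width  -> (B, 1, H)
        y_col = self.col_head(f).mean(dim=2)        # tile over height -> (B, 1, W)
        return y_row.squeeze(1), y_col.squeeze(1)   # (B, H), (B, W)

if __name__ == "__main__":
    model = DualHeadedTableFCN()
    y_row, y_col = model(torch.rand(2, 1, 128, 96))
    print(y_row.shape, y_col.shape)   # torch.Size([2, 128]) torch.Size([2, 96])
```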
{"title":"Rethinking Semantic Segmentation for Table Structure Recognition in Documents","authors":"Shoaib Ahmed Siddiqui, Pervaiz Iqbal Khan, A. Dengel, Sheraz Ahmed","doi":"10.1109/ICDAR.2019.00225","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00225","url":null,"abstract":"Based on the recent advancements in the domain of semantic segmentation, Fully-Convolutional Networks (FCN) have been successfully applied for the task of table structure recognition in the past. We analyze the efficacy of semantic segmentation networks for this purpose and simplify the problem by proposing prediction tiling based on the consistency assumption which holds for tabular structures. For an image of dimensions H × W, we predict a single column for the rows (ŷ_row ∊ H) and a predict a single row for the columns (ŷ_row ∊ W). We use a dual-headed architecture where initial feature maps (from the encoder-decoder model) are shared while the last two layers generate class specific (row/column) predictions. This allows us to generate predictions using a single model for both rows and columns simultaneously, where previous methods relied on two separate models for inference. With the proposed method, we were able to achieve state-of-the-art results on ICDAR-13 image-based table structure recognition dataset with an average F-Measure of 92.39% (91.90% and 92.88% F-Measure for rows and columns respectively). With the proposed method, we were able to achieve state-of-the-art results on ICDAR-13. The obtained results advocate that constraining the problem space in the case of FCN by imposing valid constraints can lead to significant performance gains.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124011225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deformation Classification of Drawings for Assessment of Visual-Motor Perceptual Maturity
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00155
Momina Moetesum, I. Siddiqi, N. Vincent
Sketches and drawings are popularly employed in clinical psychology to assess visual-motor and perceptual development in children and adolescents. Responses drawn by subjects are mostly characterized by a high degree of deformation, which indicates the presence of various visual, perceptual, and motor disorders. Classification of deformations is a challenging task due to the complex and extensive rule representation involved. In this study, we propose a novel technique to model clinical manifestations using Deep Convolutional Neural Networks (DCNNs). Drawn responses to nine templates used for assessing the perceptual orientation of individuals are employed as training samples. A number of defined deviations scored in each template are then modeled by fine-tuning a pre-trained DCNN architecture. The performance of the proposed technique is evaluated on samples from 106 children. The experimental results show that pre-trained DCNNs can model and classify a number of deformations across multiple shapes with considerable success, although some deformations are represented more reliably than others. Overall, the promising classification results substantiate the effectiveness of the proposed technique.
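The fine-tuning setup can be sketched as follows, assuming an ImageNet pre-trained ResNet-18 backbone with a replaced classification head; the backbone, the number of deformation classes, and the freezing policy are illustrative assumptions, not the authors' exact configuration:

```python
# Sketch of fine-tuning a pre-trained CNN for deformation classification:
# replace the classifier head with one output per deformation class and train
# only the later layers. Backbone and freezing policy are assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_DEFORMATION_CLASSES = 5          # hypothetical number of scored deviations

model = models.resnet18(weights="DEFAULT")      # ImageNet pre-trained backbone
for p in model.parameters():                    # freeze early features
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_DEFORMATION_CLASSES)  # new head

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One toy training step on a dummy batch of drawn-response images.
images = torch.rand(8, 3, 224, 224)
targets = torch.randint(0, NUM_DEFORMATION_CLASSES, (8,))
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.3f}")
```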
{"title":"Deformation Classification of Drawings for Assessment of Visual-Motor Perceptual Maturity","authors":"Momina Moetesum, I. Siddiqi, N. Vincent","doi":"10.1109/ICDAR.2019.00155","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00155","url":null,"abstract":"Sketches and drawings are popularly employed in clinical psychology to assess the visual-motor and perceptual development in children and adolescents. Drawn responses by subjects are mostly characterized by high degree of deformations that indicates presence of various visual, perceptual and motor disorders. Classification of deformations is a challenging task due to complex and extensive rule representation. In this study, we propose a novel technique to model clinical manifestations using Deep Convolutional Neural Networks (DCNNs). Drawn responses of nine templates used for assessment of perceptual orientation of individuals are employed as training samples. A number of defined deviations scored in each template are then modeled by applying fine tuning on a pre-trained DCNN architecture. Performance of the proposed technique is evaluated on samples of 106 children. Results of experiments show that pre-trained DCNNs can model and classify a number of deformations across multiple shapes with considerable success. Nevertheless some deformations are represented more reliably than the others. Overall promising classification results are observed that substantiate the effectiveness of our proposed technique.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"2005 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128824004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-Synthetic Data Augmentation of Scanned Historical Documents
Pub Date: 2019-09-01 | DOI: 10.1109/ICDAR.2019.00051
Romain Karpinski, A. Belaïd
This paper proposes a new, fully automatic method for generating semi-synthetic images of historical documents to increase the number of training samples in small datasets. The method extracts and mixes background-only images (BOIs) with text-only images (TOIs) taken from two different sources to create semi-synthetic images. The TOIs are extracted with the help of a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing TOI pixels using an inpainting method. Finally, a TOI can be efficiently integrated into a BOI in the gradient domain, thus creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents close to real ones, with different backgrounds, to highlight the content. Experiments are conducted on the public HisDB dataset, which contains few labeled images. We show that the proposed method improves the performance of semantic segmentation and baseline extraction tasks.
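The pipeline can be approximated with standard OpenCV operations, as sketched below: Otsu binarization for the text mask, inpainting to rebuild a background-only image, and Poisson (seamless) cloning for the gradient-domain integration. Thresholding parameters, the blending mode, and the file names are placeholders and assumptions:

```python
# Sketch of the augmentation pipeline using standard OpenCV operations:
# binarize to get the text mask (TOI), inpaint to rebuild a background-only
# image (BOI), then blend a TOI into a BOI in the gradient domain with Poisson
# (seamless) cloning. Parameters and file names are placeholders.
import cv2
import numpy as np

src = cv2.imread("page_a.png")                   # provides the text (TOI)
dst = cv2.imread("page_b.png")                   # provides the background (BOI)

# 1) Binary text mask of the source page (text assumed darker than background).
gray = cv2.cvtColor(src, cv2.COLOR_BGR2GRAY)
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
mask = cv2.dilate(mask, np.ones((3, 3), np.uint8), iterations=1)
mask[:10, :] = mask[-10:, :] = mask[:, :10] = mask[:, -10:] = 0  # keep off borders

# 2) Background-only image of the destination page: remove its own text
#    by inpainting over its text mask.
gray_dst = cv2.cvtColor(dst, cv2.COLOR_BGR2GRAY)
_, mask_dst = cv2.threshold(gray_dst, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
background = cv2.inpaint(dst, mask_dst, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

# 3) Gradient-domain integration of the source text into the new background
#    (assumes pages of equal size).
center = (background.shape[1] // 2, background.shape[0] // 2)
semi_synthetic = cv2.seamlessClone(src, background, mask, center,
                                   cv2.MIXED_CLONE)
cv2.imwrite("semi_synthetic.png", semi_synthetic)
```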
{"title":"Semi-Synthetic Data Augmentation of Scanned Historical Documents","authors":"Romain Karpinski, A. Belaïd","doi":"10.1109/ICDAR.2019.00051","DOIUrl":"https://doi.org/10.1109/ICDAR.2019.00051","url":null,"abstract":"This paper proposes a fully automatic new method for generating semi-synthetic images of historical documents to increase the number of training samples in small datasets. This method extracts and mixes background only images (BOI) with text only images (TOI) issued from two different sources to create semi-synthetic images. The TOIs are extracted with the help of a binary mask obtained by binarizing the image. The BOIs are reconstructed from the original image by replacing TOI pixels using an inpainting method. Finally, a TOI can be efficiently integrated in a BOI using the gradient domain, thus creating a new semi-synthetic image. The idea behind this technique is to automatically obtain documents close to real ones with different backgrounds to highlight the content. Experiments are conducted on the public HisDB dataset which contains few labeled images. We show that the proposed method improves the performance results of a semantic segmentation and baseline extraction task.","PeriodicalId":325437,"journal":{"name":"2019 International Conference on Document Analysis and Recognition (ICDAR)","volume":"21 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116374623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}