Automatic Handwritten Character Segmentation for Paleographical Character Shape Analysis
Théodore Bluche, D. Stutzmann, Christopher Kermorvant
DOI: 10.1109/DAS.2016.74

Written texts are both physical objects (signs, shapes and graphical systems) and abstract objects (ideas), whose meanings and social connotations evolve through time. To study this dual nature of texts, palaeographers need to analyse large-scale corpora at the finest granularity, such as character shape. This goal can only be reached through an automatic segmentation process. In this paper, we present a method, based on Handwritten Text Recognition, to automatically align images of digitized manuscripts with texts from scholarly editions, at the levels of page, column, line, word, and character. It has been successfully applied to two datasets of medieval manuscripts, which are now almost fully segmented at character level. The quality of the word and character segmentations is evaluated, and further palaeographical analyses are presented.
OCR Error Correction Using Character Correction and Feature-Based Word Classification
Ido Kissos, N. Dershowitz
DOI: 10.1109/DAS.2016.44

This paper explores the use of a learned classifier for post-OCR text correction. Experiments with the Arabic language show that this approach, which integrates a weighted confusion matrix and a shallow language model, corrects the vast majority of segmentation and recognition errors, the most frequent types of error in our dataset.
Semi-automatic Text and Graphics Extraction of Manga Using Eye Tracking Information
Christophe Rigaud, Thanh Nam Le, J. Burie, J. Ogier, Shoya Ishimaru, M. Iwata, K. Kise
DOI: 10.1109/DAS.2016.72

The popularity of storing, distributing and reading comic books electronically has made comics analysis an interesting research problem. Various studies have aimed at understanding their layout structure and graphic content. However, the results are still far from universally applicable, largely due to the huge variety in expression styles and page arrangements, especially in manga (Japanese comics). In this paper, we propose a comic image analysis approach using eye-tracking data recorded during manga reading sessions. Since humans are extremely capable of interpreting structured drawing content, and show different reading behaviors depending on the nature of that content, their eye movements follow distinguishable patterns over text and graphic regions. Eye-gaze data can therefore add rich information to the understanding of manga content. Experimental results show that fixations and saccades indeed form consistent patterns among readers and can be used for manga textual and graphical analysis.
Word Spotting in Historical Document Collections with Online-Handwritten Queries
Christian Wieprecht, Leonard Rothacker, G. Fink
DOI: 10.1109/DAS.2016.41

Pen-based systems are becoming more and more important due to the growing availability of touch-sensitive devices in various forms and sizes. Their interfaces offer the possibility to interact with a system directly by natural handwriting. In contrast to other input modalities, there is no need to switch to special modes such as software keyboards. In this paper we propose a new method for querying digital archives of historical documents. Word images are retrieved with respect to search terms that users write on a pen-based system by hand. The captured trajectory is used as the query, an approach we call query-by-online-trajectory word spotting. By using attribute embeddings for both online-trajectory and visual features, word images are retrieved based on their distance to the query in a common subspace. The system is therefore robust, as no explicit transcription of queries or word images is required. We evaluate our approach in writer-dependent as well as writer-independent scenarios, with highly accurate retrieval results in the former case and compelling results in the latter. Our performance is very competitive in comparison to related methods from the literature.
Document Image Quality Assessment Using Discriminative Sparse Representation
Xujun Peng, Huaigu Cao, P. Natarajan
DOI: 10.1109/DAS.2016.24

The goal of document image quality assessment (DIQA) is to build a computational model which can predict the degree of degradation of document images. Based on the estimated quality scores, immediate feedback can be provided by document processing and analysis systems, which helps to maintain, organize, recognize and retrieve the information in document images. Recently, bag-of-visual-words (BoV) based approaches have gained increasing attention from researchers for the task of quality assessment, but how to use BoV to represent images more accurately is still a challenging problem. In this paper, we propose a sparse representation based method to estimate a document image's quality with respect to OCR capability. Unlike conventional sparse representation approaches, we introduce the target quality scores into the training phase of the sparse representation. The proposed method improves the discriminability of the system and ensures that the obtained codebook is more suitable for our assessment task. Experimental results on a public dataset show that the proposed method outperforms other hand-crafted and BoV based DIQA approaches.
Quality Prediction System for Large-Scale Digitisation Workflows
C. Clausner, S. Pletschacher, A. Antonacopoulos
DOI: 10.1109/DAS.2016.82

The feasibility of large-scale OCR projects can so far only be assessed by running pilot studies on subsets of the target document collections and measuring the success of different workflows based on precise ground truth, which can be very costly to produce in the required volume. The premise of this paper is that, as an alternative, quality prediction may be used to approximate the success of a given OCR workflow. A new system is thus presented where a classifier is trained using metadata, image and layout features in combination with measured success rates (based on minimal ground truth). Subsequently, only document images are required as input for the numeric prediction of the quality score (no ground truth required). This way, the system can be applied to any number of similar (unseen) documents in order to assess their suitability for being processed using the particular workflow. The usefulness of the system has been validated using a realistic dataset of historical newspaper pages.
Text Extraction in Document Images: Highlight on Using Corner Points
Vikas Yadav, N. Ragot
DOI: 10.1109/DAS.2016.67

In past years, text extraction from document images has been widely studied in the general context of Document Image Analysis (DIA), and especially in the framework of layout analysis. Many existing techniques rely on complex processes based on preprocessing, image transforms, or component/edge extraction and their analysis. At the same time, text extraction in videos has received increased interest, and the use of corner or key points has proven to be very effective. Since very few studies have examined the use of corner points for text extraction in document images, we propose in this paper to evaluate the possibilities of this kind of approach for DIA. To do so, we designed a very simple technique based on FAST key points. A first stage divides the image into blocks and computes the density of points inside each one; the densest blocks are kept as text blocks. Then, the connectivity of blocks is checked in order to group them into complete text blocks. This technique has been evaluated on different kinds of images: different languages (Telugu, Arabic, French), handwritten as well as typewritten text, skewed documents, and images at different resolutions with different kinds and amounts of noise (deformations, ink dots, bleed-through, acquisition artifacts such as blur and low resolution), etc. Even with fixed parameters across all these kinds of document images, precision and recall are close to or higher than 90%, which makes this basic method already effective. Consequently, even if the proposed approach is not a theoretical breakthrough, it highlights that accurate text extraction can be achieved without a complex approach. Moreover, this approach could easily be improved to be more precise, robust, and useful for more complex layout analysis.
Election Tally Sheets Processing System
J. I. Toledo, A. Fornés, Jordi Cucurull-Juan, J. Lladós
DOI: 10.1109/DAS.2016.37

In paper-based elections, manual tallies at polling-station level produce myriads of documents. These documents share a common form-like structure and a reduced vocabulary worldwide. On the other hand, each tally sheet is filled in by a different writer, and different scripts are used in different countries. We present a complete document analysis system for electoral tally sheet processing, combining state-of-the-art techniques with a new handwriting recognition subprocess based on unsupervised feature discovery with Variational Autoencoders and sequence classification with BLSTM neural networks. The whole system is designed to be script-independent and allows a fast and reliable results consolidation process with reduced operational cost.
Keyword Spotting in Handwritten Documents Using Projections of Oriented Gradients
George Retsinas, G. Louloudis, N. Stamatopoulos, B. Gatos
DOI: 10.1109/DAS.2016.61

In this paper, we present a novel approach for segmentation-based handwritten keyword spotting. The proposed approach relies upon the extraction of a simple yet efficient descriptor based on projections of oriented gradients. To this end, a global and a local word image descriptor, together with their combination, are proposed. Retrieval is performed using the Euclidean distance between the descriptors of a query image and the segmented word images. The proposed methods have been evaluated on the dataset of the ICFHR 2014 Competition on Handwritten Keyword Spotting. Experimental results demonstrate the efficiency of the proposed methods compared to several state-of-the-art techniques.
Making Europe's Historical Newspapers Searchable
Clemens Neudecker, A. Antonacopoulos
DOI: 10.1109/DAS.2016.83

This paper provides a rare glimpse into the overall approach to refinement, i.e. the enrichment of scanned historical newspapers with text and layout recognition, in the Europeana Newspapers project. Within three years, the project processed more than 10 million pages of historical newspapers from 12 national and major libraries to produce the largest open-access, fully searchable text collection of digital historical newspapers in Europe. Along the way, a wide variety of legal, logistical, technical and other challenges were encountered. After introducing the background issues in newspaper digitization in Europe, the paper discusses the technical aspects of refinement in greater detail. It explains which decisions were taken in the design of the large-scale processing workflow to address these challenges, what results were produced, and which best practices were identified.