2016 12th IAPR Workshop on Document Analysis Systems (DAS)最新文献

英文中文

Handwritten and Machine-Printed Text Discrimination Using a Template Matching Approach 基于模板匹配方法的手写体和机器打印文本识别

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-06-13 DOI: 10.1109/DAS.2016.22

Mehryar Emambakhsh, Yulan He, I. Nabney

We propose a novel template matching approach for the discrimination of handwritten and machine-printed text. We first pre-process the scanned document images by performing denoising, circles/lines exclusion and word-block level segmentation. We then align and match characters in a flexible sized gallery with the segmented regions, using parallelised normalised cross-correlation. The experimental results over the Pattern Recognition & Image Analysis Research Lab-Natural History Museum (PRImA-NHM) dataset show remarkably high robustness of the algorithm in classifying cluttered, occluded and noisy samples, in addition to those with significant high missing data. The algorithm, which gives 84.0% classification rate with false positive rate 0.16 over the dataset, does not require training samples and generates compelling results as opposed to the training-based approaches, which have used the same benchmark.

我们提出了一种新的模板匹配方法来区分手写体和机器打印文本。我们首先对扫描的文档图像进行预处理，进行去噪、圈/线排除和词块级分割。然后，我们在一个灵活大小的画廊中与分割的区域对齐和匹配字符，使用并行规范化的相互关联。在模式识别与图像分析研究实验室-自然历史博物馆(PRImA-NHM)数据集上的实验结果表明，除了数据缺失率较高的样本外，该算法在分类混乱、遮挡和噪声样本方面具有非常高的鲁棒性。该算法在数据集上给出了84.0%的分类率和0.16的误报率，不需要训练样本，与使用相同基准的基于训练的方法相反，它产生了令人信服的结果。

引用次数: 5

General Pattern Run-Length Transform for Writer Identification 用于写器识别的通用模式运行长度变换

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-06-10 DOI: 10.1109/DAS.2016.42

Sheng He, Lambert Schomaker

In this paper we present a novel textural-based feature for writer identification: the General Pattern Run-Length Transform (GPRLT), which is the histogram of the run-length of any complex patterns. The GPRLT can be computed on the binary images (GPRLT bin) or on the gray scale images (GPRLT gray) without using any binarization or segmentation methods. Experimental results show that the GPRLT gray achieves even higher performance than the GPRLT bin for writer identification. The writer identification performance on the challenging CERUG-EN data set demonstrates that the proposed methods outperform state-of-the-art algorithms. Our source code and data set are available on www.ai.rug.nl/~sheng/dflib.

在本文中，我们提出了一种新的基于纹理的作家识别特征:通用模式游程变换(GPRLT)，它是任何复杂模式的游程直方图。GPRLT既可以在二值图像(GPRLT bin)上计算，也可以在灰度图像(GPRLT gray)上计算，无需使用任何二值化或分割方法。实验结果表明，GPRLT灰度比GPRLT bin在作者识别方面取得了更高的性能。在具有挑战性的CERUG-EN数据集上的作者识别性能表明，所提出的方法优于最先进的算法。我们的源代码和数据集可在www.ai.rug.nl/~sheng/dflib上获得。

引用次数: 15

Automatic Synthesis of Historical Arabic Text for Word-Spotting 自动合成历史阿拉伯语文本的单词定位

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.9

M. Kassis, Jihad El-Sana

We present a novel framework for automatic and efficient synthesis of historical handwritten Arabic text. The main purpose of this framework is to assist word spotting and keyword searching in handwritten historical documents. The proposed framework consists of two main procedures: building a letter connectivity map and synthesizing words. A letter connectivity map includes multiple instances of the various shape of each letter, since a letter in Arabic usually has multiple shapes depends in its position in the word. Each map represents one writer and encodes the specific handwriting style. The letter connectivity map is used to guide the synthesis of any Arabic continuous subword, word, or sentence. The proposed framework automatically generates the letter connectivity map annotation from a several pages historical pages previously annotated. Once the letter connectivity map is available our framework can synthesis the pictorial representation of any Arabic word or sentence from their text representation. The writing style of the synthesized text resembles the writing style of the input pages. The synthesized words can be used in word-spotting and many other historical document processing applications. The proposed approach provides an intuitive and easy-to-use framework to search for a keyword in the rest of the manuscript. Our experimental study shows that our approach enables accurate results in word spotting algorithms.

我们提出了一个新的框架自动和有效的合成历史手写体阿拉伯语文本。该框架的主要目的是帮助手写体历史文献中的单词识别和关键字搜索。该框架包括两个主要步骤:建立字母连接图和合成单词。字母连接图包括每个字母的各种形状的多个实例，因为阿拉伯语中的字母通常根据其在单词中的位置有多种形状。每个地图代表一个写作者，并编码特定的书写风格。字母连接图用于指导任何阿拉伯语连续子词、单词或句子的合成。提出的框架自动从先前注释的几个历史页面生成字母连通性映射注释。一旦字母连接图可用，我们的框架就可以从文本表示中合成任何阿拉伯单词或句子的图形表示。合成文本的写作风格类似于输入页面的写作风格。合成词可用于单词识别和许多其他历史文档处理应用程序。提出的方法提供了一个直观和易于使用的框架来搜索手稿的其余部分的关键字。我们的实验研究表明，我们的方法可以在单词识别算法中获得准确的结果。

{"title":"Automatic Synthesis of Historical Arabic Text for Word-Spotting","authors":"M. Kassis, Jihad El-Sana","doi":"10.1109/DAS.2016.9","DOIUrl":"https://doi.org/10.1109/DAS.2016.9","url":null,"abstract":"We present a novel framework for automatic and efficient synthesis of historical handwritten Arabic text. The main purpose of this framework is to assist word spotting and keyword searching in handwritten historical documents. The proposed framework consists of two main procedures: building a letter connectivity map and synthesizing words. A letter connectivity map includes multiple instances of the various shape of each letter, since a letter in Arabic usually has multiple shapes depends in its position in the word. Each map represents one writer and encodes the specific handwriting style. The letter connectivity map is used to guide the synthesis of any Arabic continuous subword, word, or sentence. The proposed framework automatically generates the letter connectivity map annotation from a several pages historical pages previously annotated. Once the letter connectivity map is available our framework can synthesis the pictorial representation of any Arabic word or sentence from their text representation. The writing style of the synthesized text resembles the writing style of the input pages. The synthesized words can be used in word-spotting and many other historical document processing applications. The proposed approach provides an intuitive and easy-to-use framework to search for a keyword in the rest of the manuscript. Our experimental study shows that our approach enables accurate results in word spotting algorithms.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"38 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120981193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Fuzzy Integral for Combining SVM-Based Handwritten Soft-Biometrics Prediction 模糊积分结合svm手写软生物特征预测

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.27

Nesrine Bouadjenek, H. Nemmour, Y. Chibani

This work addresses soft-biometrics prediction from handwriting analysis, which aims to predict the writer's gender, age range and handedness. Three SVM predictors associated each to a specific data feature are developed and subsequently combined to aggregate a robust prediction. For the combination step, Sugeno's Fuzzy Integral is proposed. Experiments are conducted on public Arabic and English handwriting datasets. The performance assessment is carried out comparatively to individual systems as well as to max and average rules, using independent and blended corpuses. The results obtained demon-strated the usefulness of the Fuzzy Integral, which provides a gain of more than 4% over individual systems as well as other combination rules. Moreover, with respect to the state of the art methods, the proposed approach seems to be much more relevant.

这项工作涉及笔迹分析的软生物识别预测，旨在预测作者的性别、年龄范围和惯用手。开发了三个与特定数据特征相关的支持向量机预测器，并随后将其组合在一起以聚合一个鲁棒预测。对于组合步骤，提出了Sugeno模糊积分法。在公开的阿拉伯语和英语手写数据集上进行了实验。采用独立语料库和混合语料库对单个系统进行性能评估，对最大规则和平均规则进行性能评估。得到的结果证明了模糊积分的有用性，它比单个系统以及其他组合规则提供了4%以上的增益。此外，就目前最先进的方法而言，拟议的方法似乎更为相关。

引用次数: 8

Combination of Structural and Factual Descriptors for Document Stream Segmentation 结构化描述符和事实描述符的组合用于文档流分割

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.21

Romain Karpinski, A. Belaïd

This paper extends a previous work being done by [4]. Having no information about the document separation in the flow, the system operates progressively by examining successive pairs of pages looking for continuity or rupture descriptors. Four document levels have been introduced to better extract those descriptors and reduce the ambiguity in their extraction: records, technical documents, fundamental documents and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To reinforce the descriptor interest and focus the system on equivalent descriptors in the pairs, the descriptors are accompanied by their context. The extraction of the context is facilitated by the determination of the physical and logical structure in the pages. Contextual rules based on these descriptors are employed for the determination of either a continuity, a rupture or an uncertainty between the pairs. To overcome the problem of information emptiness in the current page, a logbook is used to gather the descriptors in all the previous pages of the record and a buffer allows to delay the comparison. These latter points were added to the previous work that widely reinforce the current system increasing its precision of more than 6%.

本文扩展了[4]之前所做的工作。在流程中没有关于文档分离的信息，系统通过检查连续的页面对寻找连续性或断裂描述符来逐步操作。为了更好地提取这些描述符并减少其提取中的歧义，采用了四个文件级别:记录、技术文件、基本文件和案例。在每个级别上，首先提取结构描述符和事实描述符，然后在页面或文档对之间进行比较。为了加强对描述符的兴趣并将系统集中在对中的等效描述符上，描述符伴随着它们的上下文。通过确定页面中的物理和逻辑结构，可以方便地提取上下文。基于这些描述符的上下文规则用于确定对之间的连续性、断裂或不确定性。为了克服当前页中信息空白的问题，使用日志簿来收集记录的所有先前页面中的描述符，并使用缓冲区来延迟比较。后两点被添加到先前的工作中，广泛加强了当前系统，使其精度提高了6%以上。

{"title":"Combination of Structural and Factual Descriptors for Document Stream Segmentation","authors":"Romain Karpinski, A. Belaïd","doi":"10.1109/DAS.2016.21","DOIUrl":"https://doi.org/10.1109/DAS.2016.21","url":null,"abstract":"This paper extends a previous work being done by [4]. Having no information about the document separation in the flow, the system operates progressively by examining successive pairs of pages looking for continuity or rupture descriptors. Four document levels have been introduced to better extract those descriptors and reduce the ambiguity in their extraction: records, technical documents, fundamental documents and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To reinforce the descriptor interest and focus the system on equivalent descriptors in the pairs, the descriptors are accompanied by their context. The extraction of the context is facilitated by the determination of the physical and logical structure in the pages. Contextual rules based on these descriptors are employed for the determination of either a continuity, a rupture or an uncertainty between the pairs. To overcome the problem of information emptiness in the current page, a logbook is used to gather the descriptors in all the previous pages of the record and a buffer allows to delay the comparison. These latter points were added to the previous work that widely reinforce the current system increasing its precision of more than 6%.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"208 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121633129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

Recognition-Based Approach of Numeral Extraction in Handwritten Chemistry Documents Using Contextual Knowledge 基于识别的基于上下文知识的手写化学文献数字提取方法

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.54

N. Ghanmi, A. Belaïd

This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method is composed of two main modules. Firstly, a structural analysis based on connected component (CC) dimensions and positions identifies some special symbols and clusters other CCs into three groups: fragment of characters, isolated characters or connected characters. Then, a specific processing is performed on each group of CCs. The fragment of characters are merged with the nearest character or string using geometric relationship based rules. The characters are sent to a recognition module to identify the numeral components. For the connected characters, the final decision on the string nature (numeric or non-numeric) is made based on a global score computed on the full string using the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at table row level is conducted in order to correct eventual errors. The experimental tests are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The obtained results show the effectiveness of the proposed system in extracting amount fields.

本文提出了一种利用上下文信息和句法信息识别化学文献表区域中数量字段的完整方法。该方法主要由两个模块组成。首先，根据连通成分的尺寸和位置进行结构分析，识别出一些特殊的符号，并将其他连通成分分为三组:字符片段、孤立字符和连通字符。然后，对每组cc执行特定的处理。使用基于几何关系的规则将字符片段与最近的字符或字符串合并。这些字符被送到一个识别模块来识别数字成分。对于连接字符，字符串性质(数字或非数字)的最终决定是基于使用高度正则性及其分割片段的识别概率在完整字符串上计算的全局分数。最后，在表行级别进行简单的语法验证，以纠正最终的错误。实验测试是在我们的工业合作伙伴eNovalys提供的真实化学文件上进行的。实验结果表明了该系统在萃取量领域的有效性。

{"title":"Recognition-Based Approach of Numeral Extraction in Handwritten Chemistry Documents Using Contextual Knowledge","authors":"N. Ghanmi, A. Belaïd","doi":"10.1109/DAS.2016.54","DOIUrl":"https://doi.org/10.1109/DAS.2016.54","url":null,"abstract":"This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method is composed of two main modules. Firstly, a structural analysis based on connected component (CC) dimensions and positions identifies some special symbols and clusters other CCs into three groups: fragment of characters, isolated characters or connected characters. Then, a specific processing is performed on each group of CCs. The fragment of characters are merged with the nearest character or string using geometric relationship based rules. The characters are sent to a recognition module to identify the numeral components. For the connected characters, the final decision on the string nature (numeric or non-numeric) is made based on a global score computed on the full string using the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at table row level is conducted in order to correct eventual errors. The experimental tests are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The obtained results show the effectiveness of the proposed system in extracting amount fields.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131323771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters ocoract:一个基于孤立字符训练的序列学习OCR系统

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.51

A. Ul-Hasan, S. S. Bukhari, A. Dengel

Digitizing historical documents is crucial in preserving the literary heritage. With the availability of low cost capturing devices, libraries and institutes all over the world have old literature preserved in the form of scanned documents. However, searching through these scanned images is still a tedious job as one is unable to search through them. Contemporary machine learning approaches have been applied successfully to recognize text in both printed and handwriting form, however, these approaches require a lot of transcribed training data in order to obtain satisfactory performance. Transcribing the documents manually is a laborious and costly task, requiring many man-hours and language-specific expertise. This paper presents a generic iterative training framework to address this issue. The proposed framework is not only applicable to historical documents, but for present-day documents as well, where manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, we have used a segmentation-based OCR method to train on individual symbols and then use the semi-corrected recognized text lines as the ground-truth data for segmentation-free sequence learning, which learns to correct the errors in the ground-truth by incorporating context-aware processing. The proposed approach is applied to a collection of 15th century Latin documents. The iterative procedure using segmentation-free OCR was able to reduce the initial character error of about 23% (obtained from segmentation-based OCR) to less than 7% in few iterations.

数字化历史文献对保护文学遗产至关重要。随着低成本捕获设备的可用性，世界各地的图书馆和研究所都以扫描文件的形式保存了旧文献。然而，搜索这些扫描图像仍然是一项繁琐的工作，因为人们无法搜索它们。现代机器学习方法已经成功地应用于识别印刷和手写形式的文本，然而，这些方法需要大量的转录训练数据才能获得令人满意的性能。手动抄写文件是一项费力而昂贵的任务，需要大量的工时和特定语言的专业知识。本文提出了一个通用的迭代训练框架来解决这个问题。所建议的框架不仅适用于历史文档，也适用于无法获得人工转录的训练数据的当前文档。该方法从最小可用信息开始，迭代地修正训练和泛化误差。具体来说，我们使用基于分割的OCR方法对单个符号进行训练，然后使用半校正的识别文本行作为无分割序列学习的基础真值数据，该序列学习通过结合上下文感知处理来纠正基础真值中的错误。所提出的方法适用于15世纪拉丁文文献的集合。使用无分割OCR的迭代过程能够在几次迭代中将大约23%的初始字符误差(来自基于分割的OCR)降低到7%以下。

{"title":"OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters","authors":"A. Ul-Hasan, S. S. Bukhari, A. Dengel","doi":"10.1109/DAS.2016.51","DOIUrl":"https://doi.org/10.1109/DAS.2016.51","url":null,"abstract":"Digitizing historical documents is crucial in preserving the literary heritage. With the availability of low cost capturing devices, libraries and institutes all over the world have old literature preserved in the form of scanned documents. However, searching through these scanned images is still a tedious job as one is unable to search through them. Contemporary machine learning approaches have been applied successfully to recognize text in both printed and handwriting form, however, these approaches require a lot of transcribed training data in order to obtain satisfactory performance. Transcribing the documents manually is a laborious and costly task, requiring many man-hours and language-specific expertise. This paper presents a generic iterative training framework to address this issue. The proposed framework is not only applicable to historical documents, but for present-day documents as well, where manually transcribed training data is unavailable. Starting with the minimal information available, the proposed approach iteratively corrects the training and generalization errors. Specifically, we have used a segmentation-based OCR method to train on individual symbols and then use the semi-corrected recognized text lines as the ground-truth data for segmentation-free sequence learning, which learns to correct the errors in the ground-truth by incorporating context-aware processing. The proposed approach is applied to a collection of 15th century Latin documents. The iterative procedure using segmentation-free OCR was able to reduce the initial character error of about 23% (obtained from segmentation-based OCR) to less than 7% in few iterations.","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115239773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 20

QATIP -- An Optical Character Recognition System for Arabic Heritage Collections in Libraries 图书馆阿拉伯文化遗产光学字符识别系统QATIP

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.81

Felix Stahlberg, S. Vogel

Nowadays, commercial optical character recognition (OCR) software achieves very high accuracy on high-quality scans of modern Arabic documents. However, a large fraction of Arabic heritage collections in libraries is usually more challenging - e.g. consisting of typewritten documents, early prints, and historical manuscripts. In this paper, we present our end-user oriented QATIP system for OCR in such documents. The recognition is based on the Kaldi toolkit and sophisticated text image normalization. This paper contains two main contributions: First, we describe the QATIP interface for libraries which consists of both a graphical user interface for adding and monitoring jobs and a web API for automated access. Second, we suggest novel approaches for language modelling and ligature modelling for continuous Arabic OCR. We test our QATIP system on an early print and a historical manuscript and report substantial improvements - e.g. 12.6% character error rate with QATIP compared to 51.8% with the best OCR product in our experimental setup (Tesseract).

如今，商用光学字符识别(OCR)软件在现代阿拉伯语文档的高质量扫描上实现了非常高的准确性。然而，图书馆中的大部分阿拉伯文化遗产收藏通常更具挑战性-例如，由打字文件，早期印刷品和历史手稿组成。在本文中，我们提出了一个面向终端用户的QATIP系统，用于这些文档的OCR。该识别基于Kaldi工具包和复杂的文本图像规范化。本文包含两个主要贡献:首先，我们描述了库的QATIP接口，该接口由用于添加和监视作业的图形用户界面和用于自动访问的web API组成。其次，我们提出了针对连续阿拉伯语OCR的语言建模和结扎建模的新方法。我们在早期印刷品和历史手稿上测试了我们的QATIP系统，并报告了实质性的改进-例如，QATIP的字符错误率为12.6%，而我们实验设置(Tesseract)中最好的OCR产品的错误率为51.8%。

引用次数: 10

Complete System for Text Line Extraction Using Convolutional Neural Networks and Watershed Transform 基于卷积神经网络和分水岭变换的文本行提取系统

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.58

Joan Pastor-Pellicer, Muhammad Zeshan Afzal, M. Liwicki, María José Castro Bleda

We present a novel Convolutional Neural Network based method for the extraction of text lines, which consists of an initial Layout Analysis followed by the estimation of the Main Body Area (i.e., the text area between the baseline and the corpus line) for each text line. Finally, a region-based method using watershed transform is performed on the map of the Main Body Area for extracting the resulting lines. We have evaluated the new system on the IAM-HisDB, a publicly available dataset containing historical documents, outperforming existing learning-based text line extraction methods, which consider the problem as pixel labelling problem into text and non-text regions.

我们提出了一种新颖的基于卷积神经网络的文本行提取方法，该方法包括初始布局分析，然后估计每个文本行的主体区域(即基线和语料库行之间的文本区域)。最后，对主体区域的地图进行基于区域的分水岭变换提取结果线。我们在IAM-HisDB(一个包含历史文档的公开可用数据集)上评估了新系统，优于现有的基于学习的文本行提取方法，这些方法将问题视为文本和非文本区域的像素标记问题。

引用次数: 34

Interactive Definition and Tuning of One-Class Classifiers for Document Image Classification 用于文档图像分类的单类分类器的交互定义和调优

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

Pub Date : 2016-04-11 DOI: 10.1109/DAS.2016.46

Nathalie Girard, Roger Trullo, Sabine Barrat, N. Ragot, Jean-Yves Ramel

With mass of data, document image classification systems have to face new trends like being able to process heterogeneous data streams efficiently. Generally, when processing data streams, few knowledge is available about the content of the possible streams. Furthermore, as getting labelled data is costly, the classification model has to be learned from few available labelled examples. To handle such specific context, we think that combining one-class classifiers could be a very interesting alternative to quickly define and tune classification systems dedicated to different document streams. The main interest of one-class classifiers is that no interdependence occurs between each classifier model allowing easy removal, addition or modification of classes of documents. Such reconfiguration will not have any impact on the other classifiers. It is also noticeable that each classifier can use a different set of features compared to the other to handle the same class or even different classes. In return, as only one class is well-specified during the learning step, one-class classifiers have to be defined carefully to obtain good performances. It is more difficult to select the representative training examples and the discriminative features with only positive examples. To overcome these difficulties, we have defined a complete framework offering different methods that can help a system designer to define and tune one-class classifier models. The aims are to make easier the selection of good training examples and of suitable features depending on the class to recognize into the document stream. For that purpose, the proposed methods compute different measures to evaluate the relevance of the available features and training examples. Moreover, a visualization of the decision space according to selected examples and features is proposed to help such a choice and, an automatic tuning is proposed for the parameters of the models according to the class to recognize when a validation stream is available. The pertinence of the proposed framework is illustrated on two different use cases (a real data stream and a public data set).

随着海量数据的出现，文档图像分类系统必须面对高效处理异构数据流的新趋势。通常，在处理数据流时，很少有关于可能流内容的知识可用。此外，由于获得标记数据的成本很高，分类模型必须从很少可用的标记示例中学习。为了处理这种特定的上下文，我们认为组合单类分类器可能是一种非常有趣的替代方法，可以快速定义和调优专用于不同文档流的分类系统。单类分类器的主要优点是每个分类器模型之间没有相互依赖关系，从而可以轻松地删除、添加或修改文档的类。这样的重新配置不会对其他分类器产生任何影响。同样值得注意的是，与其他分类器相比，每个分类器可以使用一组不同的特征来处理相同甚至不同的类。反过来，由于在学习步骤中只有一个类被很好地指定，因此必须仔细定义一个类分类器以获得良好的性能。在只有正例的情况下，选择具有代表性的训练样例和判别特征更加困难。为了克服这些困难，我们定义了一个完整的框架，提供了不同的方法，可以帮助系统设计人员定义和调优单类分类器模型。目的是为了更容易地选择好的训练示例和合适的特征，这取决于要识别到文档流中的类。为此，提出的方法计算不同的度量来评估可用特征和训练示例的相关性。此外，提出了一种根据所选示例和特征对决策空间进行可视化的方法来帮助进行选择，并提出了一种根据类别对模型参数进行自动调整的方法，以识别何时有验证流可用。提出的框架的相关性通过两个不同的用例(真实数据流和公共数据集)来说明。

{"title":"Interactive Definition and Tuning of One-Class Classifiers for Document Image Classification","authors":"Nathalie Girard, Roger Trullo, Sabine Barrat, N. Ragot, Jean-Yves Ramel","doi":"10.1109/DAS.2016.46","DOIUrl":"https://doi.org/10.1109/DAS.2016.46","url":null,"abstract":"With mass of data, document image classification systems have to face new trends like being able to process heterogeneous data streams efficiently. Generally, when processing data streams, few knowledge is available about the content of the possible streams. Furthermore, as getting labelled data is costly, the classification model has to be learned from few available labelled examples. To handle such specific context, we think that combining one-class classifiers could be a very interesting alternative to quickly define and tune classification systems dedicated to different document streams. The main interest of one-class classifiers is that no interdependence occurs between each classifier model allowing easy removal, addition or modification of classes of documents. Such reconfiguration will not have any impact on the other classifiers. It is also noticeable that each classifier can use a different set of features compared to the other to handle the same class or even different classes. In return, as only one class is well-specified during the learning step, one-class classifiers have to be defined carefully to obtain good performances. It is more difficult to select the representative training examples and the discriminative features with only positive examples. To overcome these difficulties, we have defined a complete framework offering different methods that can help a system designer to define and tune one-class classifier models. The aims are to make easier the selection of good training examples and of suitable features depending on the class to recognize into the document stream. For that purpose, the proposed methods compute different measures to evaluate the relevance of the available features and training examples. Moreover, a visualization of the decision space according to selected examples and features is proposed to help such a choice and, an automatic tuning is proposed for the parameters of the models according to the class to recognize when a validation stream is available. The pertinence of the proposed framework is illustrated on two different use cases (a real data stream and a public data set).","PeriodicalId":197359,"journal":{"name":"2016 12th IAPR Workshop on Document Analysis Systems (DAS)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127101200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

2016 12th IAPR Workshop on Document Analysis Systems (DAS)

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀