International Journal on Document Analysis and Recognition: Latest Articles

A survey on artificial intelligence-based approaches for personality analysis from handwritten documents

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-08-27 DOI: 10.1007/s10032-024-00496-5

Suparna Saha Biswas, Himadri Mukherjee, Ankita Dhar, Obaidullah Sk Md, Kaushik Roy

Human personality is a blend of different traits and virtues. Its modeling is challenging due to its inherent complexity. There are numerous cues for predicting personality, and handwriting is one of them, because it is distinctive to a large extent and varies at the individual level. The allied field of science that deals with the analysis of handwriting for understanding personality is known as graphology. Researchers have identified various features of handwriting that can reveal the personality traits of an individual. Several attempts have been made to model personality from handwriting in different languages, but significant advancement is required for commercialization. In this paper, we present the reported aspects of handwriting, techniques for processing handwritten documents, and evaluation measures for personality identification, to map out the field and aid further research.

Citations: 0

In-domain versus out-of-domain transfer learning for document layout analysis

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-08-19 DOI: 10.1007/s10032-024-00497-4

Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, Emanuela Colombi

Data availability is a major concern in the field of document analysis, especially for tasks that require highly precise ground truths on which to train deep learning models. A notable example is document layout analysis in handwritten documents, which requires pixel-precise segmentation maps to highlight the different layout components of each document page. These segmentation maps are typically very time-consuming to produce and require a high degree of domain knowledge, as they are intrinsically characterized by the content of the text. For this reason, in the present work we explore the effects of different initialization strategies for deep learning models employed for this type of task, relying on both in-domain and cross-domain datasets for their pre-training. To test the employed models, we use two publicly available datasets that are heterogeneous in both their structure and the languages of the contained documents. We show that a combination of cross-domain and in-domain transfer learning leads to the best overall performance of the models and speeds up their convergence.

Citations: 0

Deep learning-based modified-EAST scene text detector: insights from a novel multiscript dataset

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-31 DOI: 10.1007/s10032-024-00491-w

Shilpa Mahajan, Rajneesh Rani, Aman Kamboj

The field of computer vision has seen significant transformation with the emergence and advancement of deep learning models. Deep learning has had a significant impact on scene text detection, a vital and active area in computer vision. Numerous scientific, industrial, and academic procedures make use of text analysis. Natural scene text detection is more difficult than document image text detection owing to variations in font, size, style, brightness, etc. In this study we put forward a new dataset, the National Institute of Technology Jalandhar-Text Detection dataset (NITJ-TD), for various text analysis tasks including text detection, text segmentation, script identification, and text recognition; its images were gathered in an unrestricted setting. We also present a deep learning model that seeks to identify the text’s location within the image. The system consists of a modified EAST to pinpoint the exact region of interest (ROI) in the image and non-maximum suppression (NMS) to choose the best match and prevent repeated predictions. To improve the model’s performance, an enhancement module is added to the baseline Efficient and Accurate Scene Text detector (EAST). The suggested approach is compared in terms of word-level text detection in the image. Several pre-trained models are used to assess the detected text words at various Intersection over Union (IoU) values. We used our NITJ-TD dataset, which is made up of 1500 photos gathered from various North Indian sites. Punjabi, English, and Hindi scripts can be seen in the images. We also examined results on the ICDAR-2013 benchmark dataset. Our approach performed better on both the proposed dataset and the benchmark dataset.
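
The two generic building blocks named in this abstract, IoU and NMS, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses axis-aligned boxes and plain greedy NMS, whereas EAST-style detectors typically work with rotated boxes or quadrilaterals. The `Box` type, the function names, and the example boxes are hypothetical.

```python
from typing import List, Tuple

# (x1, y1, x2, y2) with x1 < x2 and y1 < y2; axis-aligned boxes are an assumption.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two boxes cover the same word.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU = 81/119
```

Raising `iou_threshold` keeps more overlapping boxes; lowering it suppresses more aggressively, which is the usual trade-off when words sit close together in a scene.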

Citations: 0

Towards fully automated processing and analysis of construction diagrams: AI-powered symbol detection

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-25 DOI: 10.1007/s10032-024-00492-9

Laura Jamieson, Carlos Francisco Moreno-Garcia, Eyad Elyan

Construction drawings are frequently stored in undigitised formats, and consequently their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff, where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention; however, construction drawings have received considerably less interest than other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.
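
For readers unfamiliar with the reported metric, mean average precision can be sketched as follows. This is one common definition (area under the interpolated precision-recall curve, averaged over classes); the abstract does not state which variant or IoU threshold the authors used, and the class names and detection flags below are hypothetical.

```python
from typing import Dict, List, Tuple


def average_precision(is_tp: List[bool], num_gt: int) -> float:
    """AP for one class. `is_tp` flags each detection as true/false positive,
    with detections already sorted by descending confidence."""
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    precisions: List[float] = []
    recalls: List[float] = []
    for hit in is_tp:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the interpolated curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class: Dict[str, Tuple[List[bool], int]]) -> float:
    """Mean of per-class AP values."""
    aps = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical symbol classes with two ground-truth instances each.
detections = {
    "valve": ([True, False, True], 2),
    "pump": ([True, True], 2),
}
print(mean_average_precision(detections))
```

In this toy input the single false positive ranked second for "valve" lowers that class's AP to 5/6, while "pump" scores 1.0.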

Citations: 0

GAN-based text line segmentation method for challenging handwritten documents

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-21 DOI: 10.1007/s10032-024-00488-5

İbrahim Özşeker, Ali Alper Demir, Ufuk Özkaya

Text line segmentation (TLS) is an essential step of end-to-end document analysis systems. The main purpose of this step is to extract the individual text lines of handwritten documents with high accuracy. Handwritten and historical documents mostly contain touching and overlapping characters, heavy diacritics, and footnotes and side notes added over the years. In this work, we present a new TLS method based on generative adversarial networks (GAN). The TLS problem is tackled as an image-to-image translation problem, and the GAN model was trained to learn the spatial information between the individual text lines and their corresponding masks including the text lines. To evaluate the segmentation performance of the proposed GAN model, two challenging datasets, VML-AHTE and VML-MOC, were used. According to the qualitative and quantitative results, the proposed GAN model achieved the best segmentation accuracy on the VML-MOC dataset and showed competitive performance on the VML-AHTE dataset.

Citations: 0

Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-17 DOI: 10.1007/s10032-024-00490-x

Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman

Classification of palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization are not only laborious, time-consuming, and costly but also subject to the inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis of various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19, and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Gradient (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT), and Logistic Regression (LR), are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of the various algorithms. Results demonstrate that CNN embeddings and DCT features emerged as the superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The results show that this combination of DCT with ViT outperforms all other methods, with a training F1 score of 96% and a test F1 score of 90%.
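
As a rough illustration of what DCT features are, the sketch below computes an unnormalized 2-D DCT-II of a small image block and keeps the low-frequency coefficients as a feature vector. Everything here is an assumption made for illustration: the block size, the choice of the top-left `k x k` coefficients, and the function names. The abstract does not describe how the authors extract DCT features or how they feed them to the ViT.

```python
import math
from typing import List


def dct_1d(signal: List[float]) -> List[float]:
    """Unnormalized DCT-II: X_k = sum_i x_i * cos(pi / N * (i + 0.5) * k)."""
    n = len(signal)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(signal))
        for k in range(n)
    ]


def dct_2d(block: List[List[float]]) -> List[List[float]]:
    """Separable 2-D DCT-II: transform every row, then every column."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(list(col)) for col in zip(*row_pass)]
    # col_pass is indexed [horizontal freq][vertical freq]; transpose it back.
    return [list(row) for row in zip(*col_pass)]


def low_frequency_features(block: List[List[float]], k: int) -> List[float]:
    """Flatten the top-left k x k coefficients, which hold the coarse structure."""
    coeffs = dct_2d(block)
    return [coeffs[u][v] for u in range(k) for v in range(k)]


# A perfectly flat 4x4 patch: all energy lands in the DC coefficient.
flat_patch = [[2.0] * 4 for _ in range(4)]
features = low_frequency_features(flat_patch, 2)
print(features)
```

For the flat patch the first feature is 32 (the sum of all pixels) and the rest are numerically zero, which is why low-frequency DCT coefficients are a compact summary of smooth image content. In practice a library routine such as SciPy's `scipy.fft.dctn` would replace the pure-Python loops.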

Citations: 0

End-to-end semi-supervised approach with modulated object queries for table detection in documents

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-10 DOI: 10.1007/s10032-024-00471-0

Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use anchor generation and non-maximum suppression in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopt a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLayNet with 30% label data, marking a 7.4 and 7.6 point improvement over previous semi-supervised table detection approaches, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.
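
The difference between the two assignment schemes mentioned in this abstract can be pictured with a toy example. The sketch below only contrasts generic one-to-one and one-to-many matching on an IoU matrix; it is not the authors' matching strategy, whose details the abstract does not give. One-to-one matching is done here by brute force as a stand-in for the Hungarian algorithm, and the function names, threshold, and matrix values are hypothetical.

```python
from itertools import permutations
from typing import Dict, List


def one_to_one(iou_matrix: List[List[float]]) -> Dict[int, List[int]]:
    """Give every ground-truth table exactly one distinct prediction so that
    the total IoU is maximal. Brute force, so only suitable for tiny inputs;
    assumes at least as many predictions (rows) as ground truths (columns)."""
    n_pred, n_gt = len(iou_matrix), len(iou_matrix[0])
    best, best_total = None, -1.0
    for perm in permutations(range(n_pred), n_gt):
        total = sum(iou_matrix[p][g] for g, p in enumerate(perm))
        if total > best_total:
            best, best_total = perm, total
    return {g: [p] for g, p in enumerate(best)}


def one_to_many(iou_matrix: List[List[float]], threshold: float = 0.5) -> Dict[int, List[int]]:
    """Give every ground-truth table all predictions whose IoU reaches the threshold."""
    n_gt = len(iou_matrix[0])
    return {
        g: [p for p in range(len(iou_matrix)) if iou_matrix[p][g] >= threshold]
        for g in range(n_gt)
    }


# Rows: 4 predicted boxes (object queries). Columns: 2 ground-truth tables.
iou_matrix = [
    [0.9, 0.0],
    [0.7, 0.1],
    [0.1, 0.8],
    [0.0, 0.6],
]
print(one_to_one(iou_matrix))   # {0: [0], 1: [2]}
print(one_to_many(iou_matrix))  # {0: [0, 1], 1: [2, 3]}
```

One-to-one matching supervises a single query per table, while one-to-many also uses the reasonable near-duplicates (predictions 1 and 3 here), which gives more positive training signal per image.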

Citations: 0

ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-07-05 DOI: 10.1007/s10032-024-00486-7

Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi

Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation. On the synthetic USPTO benchmark, our born-digital parser obtains a recognition rate of 98.4% (1% higher than previous models) and our relatively simple neural parser for raster images obtains a rate of 85% using less training data than existing neural approaches (thousands vs. millions of molecules).
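
The idea of comparing molecular graphs directly and compiling errors automatically can be pictured as a diff over labeled atoms and bonds. The sketch below is a simplified illustration, not ChemScraper's evaluation protocol: it assumes that predicted and ground-truth atoms already share node ids (a real evaluation would need a matching step first), and the data layout, error names, and example molecule are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

Atoms = Dict[int, str]              # node id -> element symbol
Bonds = Dict[Tuple[int, int], int]  # (smaller id, larger id) -> bond order


def compile_errors(pred_atoms: Atoms, pred_bonds: Bonds,
                   true_atoms: Atoms, true_bonds: Bonds) -> Counter:
    """Count structural differences between a predicted and a reference graph."""
    errors: Counter = Counter()
    for node, label in true_atoms.items():
        if node not in pred_atoms:
            errors["missing_atom"] += 1
        elif pred_atoms[node] != label:
            errors["wrong_atom_label"] += 1
    for node in pred_atoms:
        if node not in true_atoms:
            errors["extra_atom"] += 1
    for bond, order in true_bonds.items():
        if bond not in pred_bonds:
            errors["missing_bond"] += 1
        elif pred_bonds[bond] != order:
            errors["wrong_bond_order"] += 1
    for bond in pred_bonds:
        if bond not in true_bonds:
            errors["extra_bond"] += 1
    return errors


# Reference: a C-C-O skeleton (hydrogens omitted).
true_atoms = {0: "C", 1: "C", 2: "O"}
true_bonds = {(0, 1): 1, (1, 2): 1}
# Hypothetical parser output: one wrong bond order plus a spurious atom and bond.
pred_atoms = {0: "C", 1: "C", 2: "O", 3: "N"}
pred_bonds = {(0, 1): 1, (1, 2): 2, (2, 3): 1}

errors = compile_errors(pred_atoms, pred_bonds, true_atoms, true_bonds)
exact_match = not errors  # an empty Counter means the graphs agree
print(dict(errors), exact_match)
```

Unlike a single pass/fail string comparison, the counter says which kind of mistake occurred, so errors can be aggregated per type across a whole benchmark.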

Citations: 0

Exploring recursive neural networks for compact handwritten text recognition models

IF 2.3 Zone 4 (Computer Science) Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-06-27 DOI: 10.1007/s10032-024-00481-y

Enrique Mas-Candela, Jorge Calvo-Zaragoza

This paper addresses the challenge of deploying recognition models in specific scenarios in which memory size is relevant, such as in low-cost devices or browser-based applications. We specifically focus on developing memory-efficient approaches for Handwritten Text Recognition (HTR) by leveraging recursive networks. These networks reuse learned weights across successive layers, thus enabling the maintenance of depth, a critical factor associated with model accuracy, without an increase in memory footprint. We apply neural recursion techniques to models typically used in HTR that contain convolutional and recurrent layers. We additionally study the impact of kernel scaling, which allows the activations of these recursive layers to be modified for greater expressiveness with little cost to memory. Our experiments on various HTR benchmarks demonstrate that recursive networks are, indeed, a good alternative. It is noteworthy that these recursive networks not only preserve but in some instances also enhance accuracy, making them a promising solution for memory-efficient HTR applications. This research establishes the utility of recursive networks in addressing memory constraints in HTR models. Their ability to sustain or improve accuracy while being memory-efficient positions them as a promising solution for practical deployment, especially in contexts where memory size is a critical consideration, such as low-cost devices and browser-based applications.
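
The memory argument behind weight reuse can be made concrete with a small parameter count. The sketch below is a toy, not the authors' architecture: it uses a square dense layer instead of the convolutional and recurrent layers found in HTR models, and it assumes that "scaling" means one learnable scale per output unit and per recursion step, which the abstract does not specify. All names and sizes are hypothetical.

```python
from typing import List


def dense_params(width: int) -> int:
    """Weights plus biases of one width x width dense layer."""
    return width * width + width


def stacked_params(width: int, depth: int) -> int:
    """Conventional network: every layer has its own weights."""
    return depth * dense_params(width)


def recursive_params(width: int, depth: int, scaled: bool = True) -> int:
    """Recursive network: one layer reused `depth` times, optionally with a
    small per-step scale vector (assumed form of scaling)."""
    scales = depth * width if scaled else 0
    return dense_params(width) + scales


def recursive_forward(x: List[float], weight: List[List[float]],
                      bias: List[float], scales: List[List[float]]) -> List[float]:
    """Apply the same weight and bias once per entry in `scales`, with ReLU."""
    for step_scale in scales:
        x = [
            max(0.0, s * (sum(w * v for w, v in zip(row, x)) + b))
            for row, b, s in zip(weight, bias, step_scale)
        ]
    return x


print(stacked_params(256, 4))           # 263168
print(recursive_params(256, 4, False))  # 65792
print(recursive_params(256, 4))         # 66816
```

With these toy sizes, four reused steps cost about a quarter of the parameters of four independent layers, and the per-step scales add only 1024 values. That mirrors the abstract's point that depth can be kept while the memory footprint stays nearly constant.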

Citations: 0

DocXclassifier: towards a robust and interpretable deep neural network for document image classification

IF 2.3 Zone 4 Computer Science Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal on Document Analysis and Recognition

Pub Date : 2024-06-25 DOI: 10.1007/s10032-024-00483-w

Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed

Model interpretability and robustness are becoming increasingly critical today for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become increasingly common in business workflows, there is a pressing need today to enhance interpretability and robustness for the task of document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models for this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve upon both aspects and introduce two inherently interpretable deep document classifiers, DocXClassifier and DocXClassifierFPN, both of which not only achieve significant performance improvements over existing approaches but also hold the capability to simultaneously generate feature importance maps while making their predictions. Our approach involves integrating a convolutional neural network (ConvNet) backbone with an attention mechanism to perform weighted aggregation of features based on their importance to the class, enabling the generation of interpretable importance maps. Additionally, we propose integrating Feature Pyramid Networks with the attention mechanism to significantly enhance the resolution of the interpretability maps, especially for pyramidal ConvNet architectures. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.19% and 95.71%, respectively. Additionally, it sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.29%. 
In addition, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 different types of novel data distortions, while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier presents a promising step toward the practical deployment of DL models for document classification tasks.
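
The attention-based weighted aggregation described in the abstract can be sketched in a few lines. This is a generic illustration rather than the DocXClassifier architecture: the dot-product scoring, the `query` vector, the toy 2x2 feature grid, and the function names are all assumptions made for the example. It shows how the same softmax weights that pool the features can also be read back as a per-location importance map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Pool per-location feature vectors with attention weights.

    features: list of feature vectors, one per spatial location.
    query: scoring vector (hypothetical; a trained model would learn it).
    Returns (pooled feature vector, per-location importance weights).
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# A toy 2x2 feature grid, flattened row by row (values are made up).
features = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1], [0.0, 0.3]]
query = [1.0, 1.0]

pooled, weights = attention_pool(features, query)

# Reshape the weights back to the grid to view them as an importance map.
importance_map = [weights[0:2], weights[2:4]]
```

In this toy grid the second location has the largest score, so it dominates both the pooled vector and the importance map.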

Citations: 0