Pub Date : 2024-08-27DOI: 10.1007/s10032-024-00496-5
Suparna Saha Biswas, Himadri Mukherjee, Ankita Dhar, Obaidullah Sk Md, Kaushik Roy
Human personality is a blend of different traits and virtues. It’s modeling is challenging due to its inherent complexity. There are multitudinous cues to predict personality and handwriting is one of them. This is because it is distinctive to a large extent and varies at the individual level. The allied field of science which deals with the analysis of handwriting for understanding personality is known as Graphology. Researchers have discovered disparate features of handwriting that can reveal the personality traits of an individual. Several attempts have been made to model personality from handwriting in different languages but significant advancement is required for commercialization. In this paper, we present the reported aspects of handwriting, techniques for processing handwritten documents and evaluation measures for personality identification to draw a horizon and aid in further advancement of research in this field.
{"title":"A survey on artificial intelligence-based approaches for personality analysis from handwritten documents","authors":"Suparna Saha Biswas, Himadri Mukherjee, Ankita Dhar, Obaidullah Sk Md, Kaushik Roy","doi":"10.1007/s10032-024-00496-5","DOIUrl":"https://doi.org/10.1007/s10032-024-00496-5","url":null,"abstract":"<p>Human personality is a blend of different traits and virtues. It’s modeling is challenging due to its inherent complexity. There are multitudinous cues to predict personality and handwriting is one of them. This is because it is distinctive to a large extent and varies at the individual level. The allied field of science which deals with the analysis of handwriting for understanding personality is known as Graphology. Researchers have discovered disparate features of handwriting that can reveal the personality traits of an individual. Several attempts have been made to model personality from handwriting in different languages but significant advancement is required for commercialization. In this paper, we present the reported aspects of handwriting, techniques for processing handwritten documents and evaluation measures for personality identification to draw a horizon and aid in further advancement of research in this field.\u0000</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"6 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142213547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-19DOI: 10.1007/s10032-024-00497-4
Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, Emanuela Colombi
Data availability is a big concern in the field of document analysis, especially when working on tasks that require a high degree of precision when it comes to the definition of the ground truths on which to train deep learning models. A notable example is represented by the task of document layout analysis in handwritten documents, which requires pixel-precise segmentation maps to highlight the different layout components of each document page. These segmentation maps are typically very time-consuming and require a high degree of domain knowledge to be defined, as they are intrinsically characterized by the content of the text. For this reason in the present work, we explore the effects of different initialization strategies for deep learning models employed for this type of task by relying on both in-domain and cross-domain datasets for their pre-training. To test the employed models we use two publicly available datasets with heterogeneous characteristics both regarding their structure as well as the languages of the contained documents. We show how a combination of cross-domain and in-domain transfer learning approaches leads to the best overall performance of the models, as well as speeding up their convergence process.
{"title":"In-domain versus out-of-domain transfer learning for document layout analysis","authors":"Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, Emanuela Colombi","doi":"10.1007/s10032-024-00497-4","DOIUrl":"https://doi.org/10.1007/s10032-024-00497-4","url":null,"abstract":"<p>Data availability is a big concern in the field of document analysis, especially when working on tasks that require a high degree of precision when it comes to the definition of the ground truths on which to train deep learning models. A notable example is represented by the task of document layout analysis in handwritten documents, which requires pixel-precise segmentation maps to highlight the different layout components of each document page. These segmentation maps are typically very time-consuming and require a high degree of domain knowledge to be defined, as they are intrinsically characterized by the content of the text. For this reason in the present work, we explore the effects of different initialization strategies for deep learning models employed for this type of task by relying on both in-domain and cross-domain datasets for their pre-training. To test the employed models we use two publicly available datasets with heterogeneous characteristics both regarding their structure as well as the languages of the contained documents. We show how a combination of cross-domain and in-domain transfer learning approaches leads to the best overall performance of the models, as well as speeding up their convergence process.\u0000</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"64 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142213548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-31DOI: 10.1007/s10032-024-00491-w
Shilpa Mahajan, Rajneesh Rani, Aman Kamboj
The field of computer vision has seen significant transformation with the emergence and advancement of deep learning models. Deep learning waves have a significant impact on scene text detection, a vital and active area in computer vision. Numerous scientific, industrial, and academic procedures make use of text analysis. Natural scene text detection is more difficult than document image text detection owing to variations in font, size, style, brightness, etc. The National Institute of Technology Jalandhar-Text Detection dataset (NITJ-TD) is a new dataset that we have put forward in this study for various text analysis tasks including text detection, text segmentation, script identification, text recognition, etc. a deep learning model that seeks to identify the text’s location within the image,which are gathered in an unrestricted setting. The system consists of an NMS to choose the best match and prevent repeated predictions, and a modified EAST to pinpoint the exact ROI in the image. To improve the model’s performance, an enhancement module is added to the fundamental Efficient and Accurate Scene Text detector (EAST). The suggested approach is contrasted in terms of text word detection in the image. Several pre-trained models are used to assign the text word to various intersections over Union (IoU) values. We made use of our NITJ-TD dataset, which is made up of 1500 photos that were gathered from various North Indian sites. Punjabi, English, and Hindi scripts can be seen on the images. We also examined the outcomes of the ICDAR-2013 benchmark dataset. On both the suggested dataset and the benchmarked dataset, our approach performed better.
{"title":"Deep learning-based modified-EAST scene text detector: insights from a novel multiscript dataset","authors":"Shilpa Mahajan, Rajneesh Rani, Aman Kamboj","doi":"10.1007/s10032-024-00491-w","DOIUrl":"https://doi.org/10.1007/s10032-024-00491-w","url":null,"abstract":"<p>The field of computer vision has seen significant transformation with the emergence and advancement of deep learning models. Deep learning waves have a significant impact on scene text detection, a vital and active area in computer vision. Numerous scientific, industrial, and academic procedures make use of text analysis. Natural scene text detection is more difficult than document image text detection owing to variations in font, size, style, brightness, etc. The National Institute of Technology Jalandhar-Text Detection dataset (NITJ-TD) is a new dataset that we have put forward in this study for various text analysis tasks including text detection, text segmentation, script identification, text recognition, etc. a deep learning model that seeks to identify the text’s location within the image,which are gathered in an unrestricted setting. The system consists of an NMS to choose the best match and prevent repeated predictions, and a modified EAST to pinpoint the exact ROI in the image. To improve the model’s performance, an enhancement module is added to the fundamental Efficient and Accurate Scene Text detector (EAST). The suggested approach is contrasted in terms of text word detection in the image. Several pre-trained models are used to assign the text word to various intersections over Union (IoU) values. We made use of our NITJ-TD dataset, which is made up of 1500 photos that were gathered from various North Indian sites. Punjabi, English, and Hindi scripts can be seen on the images. We also examined the outcomes of the ICDAR-2013 benchmark dataset. On both the suggested dataset and the benchmarked dataset, our approach performed better.\u0000</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"50 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-25DOI: 10.1007/s10032-024-00492-9
Laura Jamieson, Carlos Francisco Moreno-Garcia, Eyad Elyan
Construction drawings are frequently stored in undigitised formats and consequently, their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention, however construction drawings have received considerably less interest compared to other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.
{"title":"Towards fully automated processing and analysis of construction diagrams: AI-powered symbol detection","authors":"Laura Jamieson, Carlos Francisco Moreno-Garcia, Eyad Elyan","doi":"10.1007/s10032-024-00492-9","DOIUrl":"https://doi.org/10.1007/s10032-024-00492-9","url":null,"abstract":"<p>Construction drawings are frequently stored in undigitised formats and consequently, their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention, however construction drawings have received considerably less interest compared to other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.\u0000</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"8 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-21DOI: 10.1007/s10032-024-00488-5
İbrahim Özşeker, Ali Alper Demir, Ufuk Özkaya
Text line segmentation (TLS) is an essential step of the end-to-end document analysis systems. The main purpose of this step is to extract the individual text lines of any handwritten documents with high accuracy. Handwritten and historical documents mostly contain touching and overlapping characters, heavy diacritics, footnotes and side notes added over the years. In this work, we present a new TLS method based on generative adversarial networks (GAN). TLS problem is tackled as an image-to-image translation problem and the GAN model was trained to learn the spatial information between the individual text lines and their corresponding masks including the text lines. To evaluate the segmentation performance of the proposed GAN model, two challenging datasets, VML-AHTE and VML-MOC, were used. According to the qualitative and quantitative results, the proposed GAN model achieved the best segmentation accuracy on the VML-MOC dataset and showed competitive performance on the VML-AHTE dataset.
文本行分割(TLS)是端到端文档分析系统的一个基本步骤。这一步骤的主要目的是高精度地提取任何手写文档中的单个文本行。手写文档和历史文献大多包含触摸和重叠字符、大量的变音符号、脚注和多年来添加的旁注。在这项工作中,我们提出了一种基于生成式对抗网络(GAN)的新 TLS 方法。TLS 问题是作为图像到图像的翻译问题来处理的,GAN 模型经过训练,可以学习单个文本行和包括文本行在内的相应掩码之间的空间信息。为了评估所提出的 GAN 模型的分割性能,我们使用了两个具有挑战性的数据集:VML-AHTE 和 VML-MOC。根据定性和定量结果,所提出的 GAN 模型在 VML-MOC 数据集上达到了最佳分割精度,在 VML-AHTE 数据集上表现出了竞争力。
{"title":"GAN-based text line segmentation method for challenging handwritten documents","authors":"İbrahim Özşeker, Ali Alper Demir, Ufuk Özkaya","doi":"10.1007/s10032-024-00488-5","DOIUrl":"https://doi.org/10.1007/s10032-024-00488-5","url":null,"abstract":"<p>Text line segmentation (TLS) is an essential step of the end-to-end document analysis systems. The main purpose of this step is to extract the individual text lines of any handwritten documents with high accuracy. Handwritten and historical documents mostly contain touching and overlapping characters, heavy diacritics, footnotes and side notes added over the years. In this work, we present a new TLS method based on generative adversarial networks (GAN). TLS problem is tackled as an image-to-image translation problem and the GAN model was trained to learn the spatial information between the individual text lines and their corresponding masks including the text lines. To evaluate the segmentation performance of the proposed GAN model, two challenging datasets, VML-AHTE and VML-MOC, were used. According to the qualitative and quantitative results, the proposed GAN model achieved the best segmentation accuracy on the VML-MOC dataset and showed competitive performance on the VML-AHTE dataset.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"40 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141739162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-17DOI: 10.1007/s10032-024-00490-x
Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman
Classification of Palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization is not only laborious, time-consuming and costly but also subject to inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis between various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19 and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Gradient (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely, k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR) are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of various algorithms. Results demonstrate that CNN embeddings and DCT features have emerged as superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The result illustrates that this incorporation of DCT with ViT outperforms all other methods with 96% train F1 score and a test F1 score of 90%.
将棕榈叶图像分为不同的质量类别是实现这些遗产文件数字化的重要一步。人工检查和分类不仅费力、费时、费钱,而且会受到检查人员偏见和错误的影响。本研究旨在将棕榈叶文献图像自动分类为三种不同的视觉质量类别。本研究对各种结构和统计特征以及分类器与深度神经网络进行了比较分析。使用了 VGG16、VGG19 和 ResNet152v2 架构以及自定义 CNN 模型,并从传统方法中选择了离散余弦变换 (DCT)、灰度共现矩阵 (GLCM)、Tamura 和梯度直方图 (HOG)。根据这些提取的特征,对各种分类器,即 k-Nearest Neighbors (k-NN)、multi-layer perceptron (MLP)、Support Vector Machines (SVM)、Decision Tree (DT) 和 Logistic Regression (LR) 进行了训练和评估。准确率、精确度、召回率和 F1 分数被用作评估各种算法的性能指标。结果表明,CNN 嵌入和 DCT 特征是最优秀的特征。基于这些发现,我们将 DCT 与视觉变换器 (ViT) 集成到文档分类任务中。结果表明,DCT 与 ViT 的结合优于所有其他方法,训练 F1 得分为 96%,测试 F1 得分为 90%。
{"title":"Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer","authors":"Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman","doi":"10.1007/s10032-024-00490-x","DOIUrl":"https://doi.org/10.1007/s10032-024-00490-x","url":null,"abstract":"<p>Classification of Palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization is not only laborious, time-consuming and costly but also subject to inspector’s biases and errors. This study aims to automate the classification of palm leaf document images into three different visual quality categories. A comparative analysis between various structural and statistical features and classifiers against deep neural networks is performed. VGG16, VGG19 and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Gradient (HOG) are chosen from the traditional methods. Based on these extracted features, various classifiers, namely, k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT) and Logistic Regression (LR) are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of various algorithms. Results demonstrate that CNN embeddings and DCT features have emerged as superior features. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. The result illustrates that this incorporation of DCT with ViT outperforms all other methods with 96% train F1 score and a test F1 score of 90%.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"49 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141739164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-10DOI: 10.1007/s10032-024-00471-0
Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal
Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use the anchor generation process and non-maximum suppression in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopted a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICADR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLaynet with 30% label data, marking a 7.4 and 7.6 point improvement over previous semi-supervised table detection approach, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.
{"title":"End-to-end semi-supervised approach with modulated object queries for table detection in documents","authors":"Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal","doi":"10.1007/s10032-024-00471-0","DOIUrl":"https://doi.org/10.1007/s10032-024-00471-0","url":null,"abstract":"<p>Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use the anchor generation process and non-maximum suppression in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopted a one-to-one match strategy that provides noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICADR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLaynet with 30% label data, marking a 7.4 and 7.6 point improvement over previous semi-supervised table detection approach, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"25 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141587124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-07-05DOI: 10.1007/s10032-024-00486-7
Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi
Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation. On the synthetic USPTO benchmark, our born-digital parser obtains a recognition rate of 98.4% (1% higher than previous models) and our relatively simple neural parser for raster images obtains a rate of 85% using less training data than existing neural approaches (thousands vs. millions of molecules).
大多数分子图解析器都能从光栅图像(如 PNG)中恢复化学结构。然而,许多 PDF 文件都包含一些命令,明确给出了字符、线条和多边形的位置和形状。我们提出了一种新的解析器,使用这些天生的数字 PDF 基元作为输入。该解析模型快速准确,无需 GPU、光学字符识别 (OCR) 或矢量化。我们使用解析器对光栅图像进行注释,然后训练一个新的多任务神经网络来识别光栅图像中的分子。我们使用 SMILES 和标准基准对我们的解析器进行了评估,同时还采用了直接比较分子图的新型评估协议,该协议支持自动错误编译,并能揭示基于 SMILES 的评估所遗漏的错误。在合成的美国专利商标局基准上,我们的天生数字解析器获得了 98.4% 的识别率(比以前的模型高 1%),而我们相对简单的光栅图像神经解析器获得了 85% 的识别率,使用的训练数据比现有的神经方法要少(数千个分子比数百万个分子)。
{"title":"ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing","authors":"Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi","doi":"10.1007/s10032-024-00486-7","DOIUrl":"https://doi.org/10.1007/s10032-024-00486-7","url":null,"abstract":"<p>Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation. On the synthetic USPTO benchmark, our born-digital parser obtains a recognition rate of 98.4% (1% higher than previous models) and our relatively simple neural parser for raster images obtains a rate of 85% using less training data than existing neural approaches (thousands vs. millions of molecules).</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"5 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141569989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-27DOI: 10.1007/s10032-024-00481-y
Enrique Mas-Candela, Jorge Calvo-Zaragoza
This paper addresses the challenge of deploying recognition models in specific scenarios in which memory size is relevant, such as in low-cost devices or browser-based applications. We specifically focus on developing memory-efficient approaches for Handwritten Text Recognition (HTR) by leveraging recursive networks. These networks reuse learned weights across successive layers, thus enabling the maintenance of depth, a critical factor associated with model accuracy, without an increase in memory footprint. We apply neural recursion techniques to models typically used in HTR that contain convolutional and recurrent layers. We additionally study the impact of kernel scaling, which allows the activations of these recursive layers to be modified for greater expressiveness with little cost to memory. Our experiments on various HTR benchmarks demonstrate that recursive networks are, indeed, a good alternative. It is noteworthy that these recursive networks not only preserve but in some instances also enhance accuracy, making them a promising solution for memory-efficient HTR applications. This research establishes the utility of recursive networks in addressing memory constraints in HTR models. Their ability to sustain or improve accuracy while being memory-efficient positions them as a promising solution for practical deployment, especially in contexts where memory size is a critical consideration, such as low-cost devices and browser-based applications.
{"title":"Exploring recursive neural networks for compact handwritten text recognition models","authors":"Enrique Mas-Candela, Jorge Calvo-Zaragoza","doi":"10.1007/s10032-024-00481-y","DOIUrl":"https://doi.org/10.1007/s10032-024-00481-y","url":null,"abstract":"<p>This paper addresses the challenge of deploying recognition models in specific scenarios in which memory size is relevant, such as in low-cost devices or browser-based applications. We specifically focus on developing memory-efficient approaches for Handwritten Text Recognition (HTR) by leveraging recursive networks. These networks reuse learned weights across successive layers, thus enabling the maintenance of depth, a critical factor associated with model accuracy, without an increase in memory footprint. We apply neural recursion techniques to models typically used in HTR that contain convolutional and recurrent layers. We additionally study the impact of kernel scaling, which allows the activations of these recursive layers to be modified for greater expressiveness with little cost to memory. Our experiments on various HTR benchmarks demonstrate that recursive networks are, indeed, a good alternative. It is noteworthy that these recursive networks not only preserve but in some instances also enhance accuracy, making them a promising solution for memory-efficient HTR applications. This research establishes the utility of recursive networks in addressing memory constraints in HTR models. Their ability to sustain or improve accuracy while being memory-efficient positions them as a promising solution for practical deployment, especially in contexts where memory size is a critical consideration, such as low-cost devices and browser-based applications.\u0000</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"48 14 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141502856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-06-25DOI: 10.1007/s10032-024-00483-w
Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
Model interpretability and robustness are becoming increasingly critical today for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become increasingly common in business workflows, there is a pressing need today to enhance interpretability and robustness for the task of document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models for this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve upon both aspects and introduce two inherently interpretable deep document classifiers, DocXClassifier and DocXClassifierFPN, both of which not only achieve significant performance improvements over existing approaches but also hold the capability to simultaneously generate feature importance maps while making their predictions. Our approach involves integrating a convolutional neural network (ConvNet) backbone with an attention mechanism to perform weighted aggregation of features based on their importance to the class, enabling the generation of interpretable importance maps. Additionally, we propose integrating Feature Pyramid Networks with the attention mechanism to significantly enhance the resolution of the interpretability maps, especially for pyramidal ConvNet architectures. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.19% and 95.71%, respectively. Additionally, it sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.29%. In addition, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 different types of novel data distortions, while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier presents a promising step toward the practical deployment of DL models for document classification tasks.
{"title":"DocXclassifier: towards a robust and interpretable deep neural network for document image classification","authors":"Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed","doi":"10.1007/s10032-024-00483-w","DOIUrl":"https://doi.org/10.1007/s10032-024-00483-w","url":null,"abstract":"<p>Model interpretability and robustness are becoming increasingly critical today for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become increasingly common in business workflows, there is a pressing need today to enhance interpretability and robustness for the task of document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models for this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve upon both aspects and introduce two inherently interpretable deep document classifiers, DocXClassifier and DocXClassifierFPN, both of which not only achieve significant performance improvements over existing approaches but also hold the capability to simultaneously generate feature importance maps while making their predictions. Our approach involves integrating a convolutional neural network (ConvNet) backbone with an attention mechanism to perform weighted aggregation of features based on their importance to the class, enabling the generation of interpretable importance maps. Additionally, we propose integrating Feature Pyramid Networks with the attention mechanism to significantly enhance the resolution of the interpretability maps, especially for pyramidal ConvNet architectures. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.19% and 95.71%, respectively. Additionally, it sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.29%. In addition, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 different types of novel data distortions, while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier presents a promising step toward the practical deployment of DL models for document classification tasks.</p>","PeriodicalId":50277,"journal":{"name":"International Journal on Document Analysis and Recognition","volume":"140 1","pages":""},"PeriodicalIF":2.3,"publicationDate":"2024-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141502859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}