
Latest Publications: International Journal on Document Analysis and Recognition

Tabular context-aware optical character recognition and tabular data reconstruction for historical records.
IF 2.5 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-01 | Epub Date: 2025-07-01 | DOI: 10.1007/s10032-025-00543-9
Loitongbam Gyanendro Singh, Stuart E Middleton

Digitizing historical tabular records is essential for preserving and analyzing valuable data across various fields, but it presents challenges due to complex layouts, mixed text types, and degraded document quality. This paper introduces a comprehensive framework to address these issues through three key contributions. First, it presents UoS_Data_Rescue, a novel dataset of 1,113 historical logbooks with over 594,000 annotated text cells, designed to handle the complexities of handwritten entries, aging artifacts, and intricate layouts. Second, it proposes a novel context-aware text extraction approach (TrOCR-ctx) to reduce cascading errors during table digitization. Third, it proposes an enhanced end-to-end OCR pipeline that integrates TrOCR-ctx with ByT5, combining OCR and post-OCR correction in a unified training framework. This framework enables the system to produce both the raw OCR output and a corrected version in a single pass, improving recognition accuracy, particularly for multilingual and degraded text, within complex table digitization tasks. The model achieves superior performance with a 0.049 word error rate and a 0.035 character error rate, outperforming existing methods by up to 41% in OCR tasks and 10.74% in table reconstruction tasks. This framework offers a robust solution for large-scale digitization of tabular documents, extending its applications beyond climate records to other domains requiring structured document preservation. The dataset and implementation are available as open-source resources.
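The paper's unified single-pass TrOCR-ctx + ByT5 framework is not reproduced here, but the general two-stage pattern it builds on — a TrOCR recognizer whose output is passed to a ByT5 corrector — can be sketched with off-the-shelf Hugging Face checkpoints. The model names below are generic stand-ins, not the authors' fine-tuned weights:

```python
# Sketch: two-stage OCR + post-OCR correction, assuming generic
# Hugging Face checkpoints rather than the paper's fine-tuned models.
from PIL import Image
from transformers import (TrOCRProcessor, VisionEncoderDecoderModel,
                          AutoTokenizer, T5ForConditionalGeneration)

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
recognizer = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")
corr_tok = AutoTokenizer.from_pretrained("google/byt5-small")
corrector = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

def read_cell(cell_image: Image.Image) -> tuple[str, str]:
    """Return (raw OCR text, corrected text) for one table-cell crop."""
    pixels = processor(images=cell_image, return_tensors="pt").pixel_values
    raw_ids = recognizer.generate(pixels, max_new_tokens=64)
    raw_text = processor.batch_decode(raw_ids, skip_special_tokens=True)[0]
    # Byte-level correction pass; an untuned ByT5 will mostly echo its input,
    # so in practice this stage is fine-tuned on (OCR output, ground truth) pairs.
    inputs = corr_tok(raw_text, return_tensors="pt")
    fixed_ids = corrector.generate(**inputs, max_new_tokens=64)
    fixed_text = corr_tok.batch_decode(fixed_ids, skip_special_tokens=True)[0]
    return raw_text, fixed_text
```

The byte-level tokenizer is what makes ByT5 a natural fit for post-OCR correction: character-level OCR errors map to small byte-sequence edits.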

Citations: 0
Redacted text detection using neural image segmentation methods.
IF 2.5 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-01 | Epub Date: 2025-01-30 | DOI: 10.1007/s10032-025-00513-1
Ruben van Heusden, Kaj Meijer, Maarten Marx

The redaction of sensitive information in documents is common practice in specific types of organizations. This happens, for example, in court proceedings or in documents released under the Freedom of Information Act (FOIA). The ability to automatically detect when information has been redacted has several practical applications, such as gathering statistics on the amount of redaction present in documents, enabling a critical view of redaction practices. It can also be used to further investigate redactions, and whether or not the techniques used provide sufficient anonymization. The task is particularly challenging because of the large variety of redaction methods and techniques, from software for automatic redaction to manual redaction by pen. Any detection system must be robust to a large variety of inputs, as it will be run on many documents that might not even contain redactions. In this study, we evaluate two neural methods for the task, namely a Mask R-CNN model and a Mask2Former model, and compare them to a rule-based model built on optical character recognition and morphological operations. The best-performing model, Mask R-CNN, achieves a recall of 0.94 with a precision of 0.96 on a challenging dataset containing several redaction types. Adding many pages without redactions barely lowers these scores (precision drops to 0.90, recall to 0.92). The Mask2Former model is the most robust to inputs without redactions, producing the fewest false positives of all models.
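The rule-based baseline in this comparison rests on thresholding plus morphological operations; a minimal OpenCV sketch of that idea follows. The threshold, kernel size, and size/fill filters are illustrative guesses, not the paper's settings:

```python
# Sketch: find solid redaction-style boxes via thresholding + morphology.
# Threshold values and size filters are illustrative, not the paper's.
import cv2
import numpy as np

def find_redaction_boxes(page_bgr: np.ndarray) -> list[tuple[int, int, int, int]]:
    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    # Dark, near-uniform regions become foreground.
    _, binary = cv2.threshold(gray, 60, 255, cv2.THRESH_BINARY_INV)
    # Close small gaps so a box with slight scanning noise stays one blob.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 5))
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        fill = cv2.countNonZero(binary[y:y + h, x:x + w]) / float(w * h)
        # Keep wide, well-filled rectangles; ordinary glyphs fail the fill test.
        if w > 40 and h > 10 and fill > 0.85:
            boxes.append((x, y, w, h))
    return boxes
```

The fill-ratio test is what separates a solid redaction bar from dense but sparse-inked text, and it is also where such rules break down on pen-drawn redactions — the gap the neural models are meant to close.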

Citations: 0
A survey on artificial intelligence-based approaches for personality analysis from handwritten documents
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-27 | DOI: 10.1007/s10032-024-00496-5
Suparna Saha Biswas, Himadri Mukherjee, Ankita Dhar, Obaidullah Sk Md, Kaushik Roy

Human personality is a blend of different traits and virtues. Its modeling is challenging due to its inherent complexity. There are multitudinous cues for predicting personality, and handwriting is one of them. This is because handwriting is distinctive to a large extent and varies at the individual level. The allied field of science that deals with the analysis of handwriting for understanding personality is known as graphology. Researchers have discovered disparate features of handwriting that can reveal the personality traits of an individual. Several attempts have been made to model personality from handwriting in different languages, but significant advancement is required for commercialization. In this paper, we present the reported aspects of handwriting, techniques for processing handwritten documents, and evaluation measures for personality identification, to map out the field and aid further advancement of research in it.

Citations: 0
In-domain versus out-of-domain transfer learning for document layout analysis
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-08-19 | DOI: 10.1007/s10032-024-00497-4
Axel De Nardin, Silvia Zottin, Claudio Piciarelli, Gian Luca Foresti, Emanuela Colombi

Data availability is a big concern in the field of document analysis, especially when working on tasks that require a high degree of precision in the definition of the ground truths on which to train deep learning models. A notable example is the task of document layout analysis in handwritten documents, which requires pixel-precise segmentation maps to highlight the different layout components of each document page. These segmentation maps are typically very time-consuming to define and require a high degree of domain knowledge, as they are intrinsically characterized by the content of the text. For this reason, in the present work, we explore the effects of different initialization strategies for deep learning models employed for this type of task by relying on both in-domain and cross-domain datasets for their pre-training. To test the employed models we use two publicly available datasets with heterogeneous characteristics regarding both their structure and the languages of the contained documents. We show how a combination of cross-domain and in-domain transfer learning approaches leads to the best overall performance of the models, as well as speeding up their convergence.
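The abstract does not fix a backbone, but the strategies being compared reduce to where the starting weights come from before fine-tuning. A minimal PyTorch sketch of the three options, assuming a generic torchvision segmentation model and a hypothetical in-domain checkpoint file:

```python
# Sketch: three initialization strategies for a layout-segmentation model.
# The architecture and the checkpoint path are placeholders, not the paper's setup.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_LAYOUT_CLASSES = 4  # e.g. background, main text, comments, decorations

def build_model(init: str) -> torch.nn.Module:
    if init == "scratch":
        model = deeplabv3_resnet50(weights=None, num_classes=NUM_LAYOUT_CLASSES)
    elif init == "cross_domain":
        # Out-of-domain pre-training: generic natural-image (COCO/VOC) weights,
        # then swap the classification head for the layout classes.
        model = deeplabv3_resnet50(weights="DEFAULT")
        model.classifier[4] = torch.nn.Conv2d(256, NUM_LAYOUT_CLASSES, 1)
    elif init == "in_domain":
        # In-domain pre-training: weights from another document dataset
        # (hypothetical checkpoint file).
        model = deeplabv3_resnet50(weights=None, num_classes=NUM_LAYOUT_CLASSES)
        model.load_state_dict(torch.load("docs_pretrained.pt"), strict=False)
    else:
        raise ValueError(init)
    return model  # then fine-tune on the target manuscript dataset
```

The paper's finding — that combining the two pre-training sources works best — would correspond to chaining them: natural-image weights first, then document-dataset pre-training, then target fine-tuning.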

Citations: 0
Deep learning-based modified-EAST scene text detector: insights from a novel multiscript dataset
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-31 | DOI: 10.1007/s10032-024-00491-w
Shilpa Mahajan, Rajneesh Rani, Aman Kamboj

The field of computer vision has seen significant transformation with the emergence and advancement of deep learning models. Deep learning has had a significant impact on scene text detection, a vital and active area in computer vision. Numerous scientific, industrial, and academic procedures make use of text analysis. Natural scene text detection is more difficult than document image text detection owing to variations in font, size, style, brightness, etc. In this study we put forward the National Institute of Technology Jalandhar-Text Detection dataset (NITJ-TD), a new dataset for various text analysis tasks including text detection, text segmentation, script identification, and text recognition, together with a deep learning model that seeks to identify the location of text within images gathered in an unrestricted setting. The system consists of a modified EAST to pinpoint the exact ROI in the image and an NMS stage to choose the best match and prevent repeated predictions. To improve the model's performance, an enhancement module is added to the fundamental Efficient and Accurate Scene Text detector (EAST). The suggested approach is compared with several pre-trained models on word-level text detection in images, at various Intersection over Union (IoU) thresholds. We made use of our NITJ-TD dataset, which is made up of 1500 photos gathered from various North Indian sites. Punjabi, English, and Hindi scripts appear in the images. We also evaluated on the ICDAR-2013 benchmark dataset. Our approach performed better on both the suggested dataset and the benchmark dataset.
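Two pieces of this pipeline are standard enough to sketch exactly: axis-aligned IoU and the greedy NMS used to drop repeated predictions. This is the textbook form, not the authors' exact implementation:

```python
# Sketch: axis-aligned IoU and greedy non-maximum suppression.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.5) -> list[int]:
    """Keep the best-scoring box, drop overlapping rivals, repeat."""
    order = scores.argsort()[::-1].tolist()
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```

Evaluating "at various IoU thresholds" then just means counting a predicted word box as correct when its IoU with a ground-truth box exceeds each chosen threshold.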

Citations: 0
Towards fully automated processing and analysis of construction diagrams: AI-powered symbol detection
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-25 | DOI: 10.1007/s10032-024-00492-9
Laura Jamieson, Carlos Francisco Moreno-Garcia, Eyad Elyan

Construction drawings are frequently stored in undigitised formats and, consequently, their analysis requires substantial manual effort. This is true for many crucial tasks, including material takeoff, where the purpose is to obtain a list of the equipment and respective amounts required for a project. Engineering drawing digitisation has recently attracted increased attention; however, construction drawings have received considerably less interest than other types. To address these issues, this paper presents a novel framework for the automatic processing of construction drawings. Extensive experiments were performed using two state-of-the-art deep learning models for object detection in challenging high-resolution drawings sourced from industry. The results show a significant reduction in the time required for drawing analysis. Promising performance was achieved for symbol detection across various classes, with a mean average precision of 79% for the YOLO-based method and 83% for the Faster R-CNN-based method. This framework enables the digital transformation of construction drawings, improving tasks such as material takeoff and many others.
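Of the two detectors compared, the Faster R-CNN variant is straightforward to set up with torchvision: start from COCO-pretrained weights and swap in a head for the symbol classes. The class count below is a placeholder, not the paper's symbol list:

```python
# Sketch: torchvision Faster R-CNN with a custom symbol-detection head.
# NUM_SYMBOL_CLASSES is a placeholder; the paper's class list is not reproduced.
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_SYMBOL_CLASSES = 21  # 20 symbol types + background, illustrative

def build_symbol_detector():
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")  # COCO-pretrained
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_SYMBOL_CLASSES)
    return model

# Training then follows the usual torchvision detection loop:
# model(images, targets) returns a loss dict in train mode, and a list of
# {"boxes", "labels", "scores"} dicts in eval mode, from which mAP is computed.
```

For the high-resolution drawings the paper describes, such detectors are typically run on overlapping tiles of the sheet rather than the full image at once.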

Citations: 0
GAN-based text line segmentation method for challenging handwritten documents
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-21 | DOI: 10.1007/s10032-024-00488-5
İbrahim Özşeker, Ali Alper Demir, Ufuk Özkaya

Text line segmentation (TLS) is an essential step in end-to-end document analysis systems. The main purpose of this step is to extract the individual text lines of any handwritten document with high accuracy. Handwritten and historical documents mostly contain touching and overlapping characters, heavy diacritics, and footnotes and side notes added over the years. In this work, we present a new TLS method based on generative adversarial networks (GANs). The TLS problem is tackled as an image-to-image translation problem, and the GAN model is trained to learn the spatial mapping between document images and the masks of their individual text lines. To evaluate the segmentation performance of the proposed GAN model, two challenging datasets, VML-AHTE and VML-MOC, were used. According to the qualitative and quantitative results, the proposed GAN model achieved the best segmentation accuracy on the VML-MOC dataset and showed competitive performance on the VML-AHTE dataset.
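Framing TLS as image-to-image translation points at a pix2pix-style conditional GAN. A compact sketch of one training step under that framing — the generator/discriminator architectures, loss mix, and L1 weight are assumptions, not the paper's configuration:

```python
# Sketch: one conditional-GAN (pix2pix-style) training step for
# page -> line-mask translation. Architectures and lambda_l1 are assumptions.
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, page, true_mask, lambda_l1=100.0):
    fake_mask = G(page)

    # --- discriminator: real (page, mask) pairs vs. generated pairs ---
    opt_d.zero_grad()
    d_real = D(torch.cat([page, true_mask], dim=1))          # D outputs logits
    d_fake = D(torch.cat([page, fake_mask.detach()], dim=1))
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    loss_d.backward()
    opt_d.step()

    # --- generator: fool D and stay close to the ground-truth mask ---
    opt_g.zero_grad()
    d_fake = D(torch.cat([page, fake_mask], dim=1))
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + lambda_l1 * F.l1_loss(fake_mask, true_mask))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```

The L1 term keeps the predicted mask pixel-accurate while the adversarial term pushes it toward plausible, well-separated line shapes — useful exactly where touching characters and diacritics defeat purely pixel-wise losses.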

Citations: 0
Image quality determination of palm leaf heritage documents using integrated discrete cosine transform features with vision transformer
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-17 | DOI: 10.1007/s10032-024-00490-x
Remya Sivan, Peeta Basa Pati, Made Windu Antara Kesiman

Classification of palm leaf images into various quality categories is an important step towards the digitization of these heritage documents. Manual inspection and categorization are not only laborious, time-consuming, and costly but also subject to inspectors' biases and errors. This study aims to automate the classification of palm leaf document images into three visual quality categories. A comparative analysis pits various structural and statistical features with classical classifiers against deep neural networks. VGG16, VGG19, and ResNet152v2 architectures along with a custom CNN model are used, while Discrete Cosine Transform (DCT), Grey Level Co-occurrence Matrix (GLCM), Tamura, and Histogram of Oriented Gradients (HOG) features are chosen from the traditional methods. Based on these extracted features, various classifiers, namely k-Nearest Neighbors (k-NN), multi-layer perceptron (MLP), Support Vector Machines (SVM), Decision Tree (DT), and Logistic Regression (LR), are trained and evaluated. Accuracy, precision, recall, and F1 scores are used as performance metrics for the evaluation of the various algorithms. Results demonstrate that CNN embeddings and DCT features emerge as superior. Based on these findings, we integrated DCT with a Vision Transformer (ViT) for the document classification task. This combination of DCT and ViT outperforms all other methods, with a 96% train F1 score and a test F1 score of 90%.
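The DCT side of the feature set is easy to sketch: take a 2-D DCT of the page image and keep the low-frequency block as a compact quality descriptor. The resize target and the number of retained coefficients are illustrative choices, not the paper's:

```python
# Sketch: low-frequency 2-D DCT coefficients as an image-quality descriptor.
# The resize target and the 8x8 retained block are illustrative choices.
import cv2
import numpy as np

def dct_features(gray: np.ndarray, size: int = 224, keep: int = 8) -> np.ndarray:
    img = cv2.resize(gray, (size, size)).astype(np.float32) / 255.0
    coeffs = cv2.dct(img)            # 2-D DCT; low frequencies sit at top-left
    block = coeffs[:keep, :keep]     # global contrast/sharpness cues concentrate here
    return block.flatten()           # 64-dim vector, e.g. to concatenate with
                                     # ViT embeddings before a classification head
```

Integrating this with a ViT would then amount to concatenating (or otherwise fusing) this vector with the transformer's image embedding before the final classifier, which is one plausible reading of the integration the abstract describes.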

Citations: 0
End-to-end semi-supervised approach with modulated object queries for table detection in documents
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-10 | DOI: 10.1007/s10032-024-00471-0
Iqraa Ehsan, Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal

Table detection, a pivotal task in document analysis, aims to precisely recognize and locate tables within document images. Although deep learning has shown remarkable progress in this realm, it typically requires an extensive dataset of labeled data for proficient training. Current CNN-based semi-supervised table detection approaches use an anchor generation process and non-maximum suppression in their detection process, limiting training efficiency. Meanwhile, transformer-based semi-supervised techniques adopt a one-to-one matching strategy that produces noisy pseudo-labels, limiting overall efficiency. This study presents an innovative transformer-based semi-supervised table detector. It improves the quality of pseudo-labels through a novel matching strategy combining one-to-one and one-to-many assignment techniques. This approach significantly enhances training efficiency during the early stages, ensuring superior pseudo-labels for further training. Our semi-supervised approach is comprehensively evaluated on benchmark datasets, including PubLayNet, ICDAR-19, and TableBank. It achieves new state-of-the-art results, with a mAP of 95.7% and 97.9% on TableBank (word) and PubLayNet with 30% label data, marking 7.4 and 7.6 point improvements over the previous semi-supervised table detection approach, respectively. The results clearly show the superiority of our semi-supervised approach, surpassing all existing state-of-the-art methods by substantial margins. This research represents a significant advancement in semi-supervised table detection methods, offering a more efficient and accurate solution for practical document analysis tasks.
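The pseudo-labeling core of such semi-supervised detectors is a teacher-student loop: a teacher (an EMA copy of the student) labels unlabeled pages, and only confident boxes are kept. A generic sketch follows — the threshold, EMA decay, and torchvision-style model interface are assumptions, and the paper's hybrid one-to-one/one-to-many matcher is not reproduced here:

```python
# Sketch: generic teacher-student pseudo-labeling for detection.
# Threshold and EMA decay are assumptions; the paper's hybrid
# one-to-one / one-to-many matching is not reproduced here.
import torch

@torch.no_grad()
def make_pseudo_labels(teacher, unlabeled_images, score_thresh=0.7):
    teacher.eval()
    targets = []
    for pred in teacher(unlabeled_images):        # torchvision-style outputs
        keep = pred["scores"] > score_thresh      # drop low-confidence boxes
        targets.append({"boxes": pred["boxes"][keep],
                        "labels": pred["labels"][keep]})
    return targets                                # used as targets for the student

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher tracks an exponential moving average of the student weights,
    # which stabilizes the pseudo-labels across training steps.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)
```

The paper's contribution sits inside the labeling step: relaxing the usual DETR-style one-to-one assignment with a one-to-many scheme early in training, so the student sees denser and less noisy supervision.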

Citations: 0
ChemScraper: leveraging PDF graphics instructions for molecular diagram parsing
IF 2.3 | CAS Tier 4, Computer Science | JCR Q3, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-07-05 | DOI: 10.1007/s10032-024-00486-7
Ayush Kumar Shah, Bryan Amador, Abhisek Dey, Ming Creekmore, Blake Ocampo, Scott Denmark, Richard Zanibbi

Most molecular diagram parsers recover chemical structure from raster images (e.g., PNGs). However, many PDFs include commands giving explicit locations and shapes for characters, lines, and polygons. We present a new parser that uses these born-digital PDF primitives as input. The parsing model is fast and accurate, and does not require GPUs, Optical Character Recognition (OCR), or vectorization. We use the parser to annotate raster images and then train a new multi-task neural network for recognizing molecules in raster images. We evaluate our parsers using SMILES and standard benchmarks, along with a novel evaluation protocol comparing molecular graphs directly that supports automatic error compilation and reveals errors missed by SMILES-based evaluation. On the synthetic USPTO benchmark, our born-digital parser obtains a recognition rate of 98.4% (1% higher than previous models) and our relatively simple neural parser for raster images obtains a rate of 85% using less training data than existing neural approaches (thousands vs. millions of molecules).
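Extracting born-digital primitives of the kind ChemScraper consumes — characters with exact positions, plus line and curve commands — can be sketched with PyMuPDF. The returned schema here is an illustrative simplification, not ChemScraper's actual input format:

```python
# Sketch: pull character and line/curve primitives from a born-digital PDF
# with PyMuPDF. The returned schema is an illustrative simplification.
import fitz  # PyMuPDF

def pdf_primitives(path: str, page_no: int = 0):
    page = fitz.open(path)[page_no]
    chars, strokes = [], []
    # Characters with exact positions, grouped page -> block -> line -> span.
    for block in page.get_text("rawdict")["blocks"]:
        for line in block.get("lines", []):       # image blocks have no "lines"
            for span in line["spans"]:
                for ch in span["chars"]:
                    chars.append((ch["c"], ch["bbox"]))
    # Vector graphics: each drawing holds line/curve/rect items.
    for drawing in page.get_drawings():
        for item in drawing["items"]:
            strokes.append(item)   # e.g. ("l", p1, p2) for a line segment
    return chars, strokes
```

With atom labels and bond lines available as exact geometry, parsing becomes a matter of grouping primitives into a molecular graph rather than recovering them from pixels, which is why no OCR or vectorization step is needed.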

Citations: 0