Tab this folder of documents: page stream segmentation of business documents
Authors: Thisanaporn Mungmeeprued, Yuxin Ma, Nisarg Mehta, Aldo Lipani
Published in: Proceedings of the 22nd ACM Symposium on Document Engineering, 2022-09-20
DOI: https://doi.org/10.1145/3558100.3563852
Citation count: 3
Abstract
In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important in order to allow correct indexing, archiving, and processing. In many organizations, different types of documents are usually scanned together in folders, so it is essential to automate the task of segmenting the folders into documents which then proceed to further analysis tailored to specific document types. This task is known as Page Stream Segmentation (PSS). In this paper, we propose a deep learning solution to solve the task of determining whether or not a page is a breaking-point given a sequence of scanned pages (a folder) as input. We also provide a dataset called TABME (TAB this folder of docuMEnts) generated specifically for this task. Our proposed architecture combines LayoutLM and ResNet to exploit both textual and visual features of the document pages and achieves an F1 score of 0.953. The dataset and code used to run the experiments in this paper are available at the following web link: https://github.com/aldolipani/TABME.
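The task formulation in the abstract (one binary breaking-point decision per page) can be illustrated with a minimal sketch. This is not the authors' LayoutLM + ResNet model; it only shows, under stated assumptions, how per-page labels map to document segments and how an F1 score over break labels could be computed. The label convention (1 marks the first page of a new document), the function names `segment_pages` and `f1_score`, and the example labels are all hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Tuple


def segment_pages(breaks: List[int]) -> List[Tuple[int, int]]:
    """Turn per-page break labels into (start, end) page ranges, end exclusive.

    Assumption: breaks[i] == 1 means page i starts a new document.
    Page 0 is always treated as the start of the first document.
    """
    if not breaks:
        return []
    starts = [0] + [i for i, b in enumerate(breaks) if b == 1 and i > 0]
    ends = starts[1:] + [len(breaks)]
    return list(zip(starts, ends))


def f1_score(truth: List[int], pred: List[int]) -> float:
    """F1 over the positive (break) class for two equal-length label lists."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical folder of 8 pages: documents start at pages 0, 3, and 5.
truth = [1, 0, 0, 1, 0, 1, 0, 0]
# Hypothetical model output: one spurious break (page 2), one missed (page 5).
pred = [1, 0, 1, 1, 0, 0, 0, 0]

print(segment_pages(truth))  # ground-truth documents as page ranges
print(segment_pages(pred))   # predicted documents as page ranges
print(round(f1_score(truth, pred), 3))
```

Framing segmentation as per-page binary classification means a single wrong label either splits one document into two or merges two neighbouring documents, which is why a break-level F1 is a natural summary metric for this kind of task.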