
Proceedings of the 22nd ACM Symposium on Document Engineering: Latest Publications

From print to online newspapers on small displays: a layout generation approach aimed at preserving entry points
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563847
Sebastián Gallardo Díaz, Dorian Mazauric, Pierre Kornprobst
Simply transposing print newspapers into digital media cannot be satisfactory because they were not designed for small displays. One key feature lost is the notion of entry points, which are essential for navigation. Focusing on headlines as entry points, we show how to produce alternative layouts for small displays that preserve entry-point quality (readability and usability) while optimizing aesthetics and style. Our approach consists of a relayouting method implemented via a genetic-inspired algorithm. We tested it on realistic newspaper pages. For the case discussed here, we obtained more than 2000 different layouts in which the font size was doubled. We show that the quality of headlines is globally much better with the new layouts than with the original layout. Future work will generalize this promising approach, accounting for the complexity of real newspapers, with user experience quality as the primary goal.
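
As a rough illustration of how such a genetic-inspired relayouting search might be organized, the sketch below evolves a population of candidate layouts under a fitness function that trades off entry-point readability against aesthetics. The `Layout` genome, the weights, and the mutation scheme are illustrative assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical layout genome: one headline font scale and a column count.
# The paper's real representation is richer; this is an illustrative stand-in.
@dataclass
class Layout:
    font_scale: float   # headline font multiplier (the paper doubles it)
    columns: int        # number of columns on the small display

def fitness(layout: Layout) -> float:
    """Toy fitness: reward large, readable headlines (entry points)
    while penalizing layouts that waste horizontal space on phones."""
    readability = min(layout.font_scale / 2.0, 1.0)   # saturates at 2x
    aesthetics = 1.0 / layout.columns                 # fewer columns reads better
    return 0.7 * readability + 0.3 * aesthetics

def mutate(layout: Layout) -> Layout:
    return Layout(
        font_scale=max(1.0, layout.font_scale + random.uniform(-0.2, 0.2)),
        columns=max(1, layout.columns + random.choice([-1, 0, 1])),
    )

# Genetic-style loop: keep the best half, refill with mutated survivors.
population = [Layout(random.uniform(1.0, 2.5), random.randint(1, 3)) for _ in range(50)]
for generation in range(100):
    population.sort(key=fitness, reverse=True)
    survivors = population[: len(population) // 2]
    population = survivors + [mutate(random.choice(survivors)) for _ in survivors]

print(fitness(population[0]), population[0])
```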
Citations: 0
A cascaded approach for page-object detection in scientific papers
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563851
Erika Spiteri Bailey, Alexandra Bonnici, Stefania Cristina
In recent years, Page Object Detection (POD) has become a popular document understanding task, and it is non-trivial given the potential complexity of documents. The rise of neural networks has facilitated a more general learning approach to this task. However, in the literature, the different object classes, such as formulae or figures, are generally considered individually. In this paper, we describe the joint localisation of six object classes relevant to scientific papers, namely isolated formulae, embedded formulae, figures, tables, variables and references. Through a qualitative analysis of these object classes, we note a hierarchy among them and propose a new localisation approach using two cascaded You Only Look Once (YOLO) networks. We also present a new data set consisting of labelled bounding boxes for all six object classes. This data set combines two data sets commonly used in the literature for formula localisation, adding to their document images the labels for figures, tables, variables and references. Using this data set, we achieve an average F1-score of 0.755 across all classes, which is comparable to the state of the art for the object classes when each is localised individually.
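
A minimal sketch of the cascade idea follows, assuming two trained YOLO-style detectors exposed as plain callables. `coarse_detector` and `fine_detector` are hypothetical stand-ins, not the authors' released models: the first stage localises coarse regions, and the second re-runs detection inside each region and maps boxes back to page coordinates.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x0, y0, x1, y1)
Detection = Tuple[str, Box, float]        # (class name, box, confidence)

def cascade(page_image,
            coarse_detector: Callable[..., List[Detection]],
            fine_detector: Callable[..., List[Detection]]) -> List[Detection]:
    """Stage 1 finds coarse regions (e.g. figures, tables, isolated formulae);
    stage 2 re-detects inside each region for finer classes
    (e.g. embedded formulae, variables, references)."""
    results: List[Detection] = []
    for cls, (x0, y0, x1, y1), conf in coarse_detector(page_image):
        results.append((cls, (x0, y0, x1, y1), conf))
        crop = page_image[int(y0):int(y1), int(x0):int(x1)]  # numpy-style crop
        for sub_cls, (sx0, sy0, sx1, sy1), sub_conf in fine_detector(crop):
            # Map the fine-stage box back into page coordinates.
            results.append((sub_cls, (x0 + sx0, y0 + sy0, x0 + sx1, y0 + sy1), sub_conf))
    return results
```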
Citations: 0
Academic writing and publishing beyond documents
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563840
C. Mahlow, M. Piotrowski
Research on writing tools stopped in the late 1980s, when Microsoft Word had achieved monopoly status. However, the development of the Web and the advent of mobile devices are increasingly rendering static print-like documents obsolete. In this vision paper, we reflect on the impact of this development on scholarly writing and publishing. Academic publications increasingly include dynamic elements, e.g., code, data plots, and other visualizations, which clearly requires document production tools other than traditional word processors. When the printed page is no longer the desired final product, content and form can be addressed explicitly and separately, thus emphasizing the structure of texts rather than the structure of documents. The resulting challenges have not yet been fully addressed by document engineering.
Citations: 2
Binarization of photographed documents image quality, processing time and size assessment
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3564159
R. Lins, R. Bernardino, Ricardo da Silva Barboza, S. Simske
Today, over eighty percent of the world's population owns a smartphone with a built-in camera, and these are very often used to photograph documents. Document binarization is a key process in many document processing platforms. This competition on binarizing photographed documents assessed the quality, time, space, and performance of five new algorithms and sixty-four "classical" and alternative algorithms. The evaluation dataset is composed of offset, laser, and deskjet printed documents, photographed using six widely used mobile devices with the strobe flash on and off, from two different angles and places of capture.
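
For context, the sketch below runs two classical binarization baselines of the kind such competitions benchmark, global Otsu and local adaptive thresholding, via OpenCV; the file names are placeholders.

```python
import cv2

gray = cv2.imread("photographed_page.jpg", cv2.IMREAD_GRAYSCALE)

# Global Otsu thresholding: a single threshold for the whole page.
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive (local) thresholding: more robust to the uneven illumination
# typical of camera capture with or without flash.
adaptive = cv2.adaptiveThreshold(
    gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY,
    blockSize=31, C=10)

cv2.imwrite("otsu.png", otsu)
cv2.imwrite("adaptive.png", adaptive)
```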
Citations: 2
Detecting malware using text documents extracted from spam email through machine learning
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563854
Luis Ángel Redondo-Gutierrez, Francisco Jáñez-Martino, Eduardo FIDALGO, Enrique Alegre, V. González-Castro, R. Alaíz-Rodríguez
Spam has become an effective way for cybercriminals to spread malware. Although cybersecurity agencies and companies develop products and organise courses to help people detect malicious spam email patterns, spam attacks are still not entirely avoided. In this work, we present and make publicly available "Spam Email Malware Detection - 600" (SEMD-600), a new dataset, based on Bruce Guenter's, for detecting malware in spam using only the text of the email. We also introduce a pipeline for malware detection based on traditional Natural Language Processing (NLP) techniques. Using SEMD-600, we compare the text representation techniques Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF), in combination with three different supervised classifiers: Support Vector Machine, Naive Bayes and Logistic Regression, to detect malware in plain text documents. We found that combining TF-IDF with Logistic Regression achieved the best performance, with a macro F1 score of 0.763.
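
The best-performing combination reported above (TF-IDF plus Logistic Regression) can be sketched with scikit-learn as follows; the toy emails stand in for SEMD-600, whose actual split and preprocessing may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder samples; 1 = malware-bearing spam, 0 = benign.
emails = [
    "click to claim your prize now",
    "meeting notes attached, see you monday",
    "your invoice is overdue, open the attachment",
    "lunch tomorrow?",
]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.5, random_state=42, stratify=labels)

# TF-IDF text representation feeding a Logistic Regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("macro F1:", f1_score(y_test, model.predict(X_test), average="macro"))
```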
Citations: 0
Anonymizing and obfuscating PDF content while preserving document structure
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563849
Charlotte Curtis
The portable document format (PDF) is both versatile and complex, with a specification running well over a thousand pages. For independent developers writing software that reads, displays, or transforms PDFs, it is difficult to account comprehensively for all of the variations that might exist in the wild. Compounding this problem are the usage agreements that often accompany purchased and proprietary PDFs, preventing end users from uploading a troublesome document as part of a bug report and limiting the set of test cases that can be made public for open-source development. In this paper, pdf-mangler is presented as a solution to this problem. The goal of pdf-mangler is to remove information in the form of text, images, and vector graphics while retaining as much of the document structure and general visual appearance as possible. The intention is for pdf-mangler to be deployed as part of an automated bug reporting tool for PDF software.
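
In the spirit of pdf-mangler, the hedged sketch below overwrites the strings shown by `Tj` operators with same-length filler using pikepdf, which roughly preserves layout while destroying content. This is not the author's implementation: a real tool must also handle `TJ` arrays, multi-byte encodings, images, and vector graphics, and the file names are placeholders.

```python
import pikepdf

with pikepdf.open("proprietary.pdf") as pdf:
    for page in pdf.pages:
        instructions = []
        for instr in pikepdf.parse_content_stream(page):
            operands, operator = instr.operands, instr.operator
            if str(operator) == "Tj" and operands:
                # Replace the shown string with same-byte-length filler.
                text = bytes(operands[0])
                operands = [pikepdf.String(b"x" * len(text))]
            instructions.append((operands, operator))
        # Write the rewritten instructions back as the page's content stream.
        page.Contents = pdf.make_stream(pikepdf.unparse_content_stream(instructions))
    pdf.save("mangled.pdf")
```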
Citations: 0
Triplet transformer network for multi-label document classification
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563843
J. Melsbach, Sven Stahlmann, Stefan Hirschmeier, D. Schoder
Multi-label document classification is the task of assigning one or more labels to a document and has become common in various businesses. Typically, current state-of-the-art models based on pretrained language models tackle this task without taking the textual information of label names into account, thereby omitting possibly valuable information. We present an approach that leverages the information stored in label names by reformulating multi-label classification as a document similarity problem. To achieve this, we use a triplet transformer network that learns to embed labels and documents into a joint vector space. Our approach is fast at inference, classifying documents by determining the closest and therefore most similar labels. We evaluate our approach on a challenging real-world dataset from a German radio broadcaster and find that our model provides competitive results compared to other established approaches.
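
A minimal sketch of triplet training for a joint label/document space is given below, using PyTorch's `TripletMarginLoss`; the tiny bag-of-embeddings encoder stands in for the paper's transformer, and all shapes, margins, and token ids are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared encoder mapping token ids (labels or documents) to unit vectors."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return nn.functional.normalize(self.proj(self.emb(token_ids)), dim=-1)

encoder = Encoder()
loss_fn = nn.TripletMarginLoss(margin=0.5)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# One toy step: anchor = document tokens, positive = a true label's name
# tokens, negative = a wrong label's name tokens (random ids here).
doc = torch.randint(0, 1000, (8, 20))
pos_label = torch.randint(0, 1000, (8, 3))
neg_label = torch.randint(0, 1000, (8, 3))

loss = loss_fn(encoder(doc), encoder(pos_label), encoder(neg_label))
loss.backward()
opt.step()
print(float(loss))

# At inference, embed all label names once and assign each document its
# nearest labels in the joint space.
```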
Citations: 1
Tab this folder of documents: page stream segmentation of business documents
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563852
Thisanaporn Mungmeeprued, Yuxin Ma, Nisarg Mehta, Aldo Lipani
In the midst of digital transformation, automatically understanding the structure and composition of scanned documents is important to allow correct indexing, archiving, and processing. In many organizations, different types of documents are usually scanned together into folders, so it is essential to automate the task of segmenting the folders into documents, which can then proceed to further analysis tailored to specific document types. This task is known as Page Stream Segmentation (PSS). In this paper, we propose a deep learning solution to the task of determining whether or not a page is a breaking point, given a sequence of scanned pages (a folder) as input. We also provide a dataset called TABME (TAB this folder of docuMEnts) generated specifically for this task. Our proposed architecture combines LayoutLM and ResNet to exploit both the textual and visual features of the document pages and achieves an F1 score of 0.953. The dataset and code used to run the experiments in this paper are available at the following web link: https://github.com/aldolipani/TABME.
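
As a hedged sketch of the fusion idea, the snippet below classifies whether a page is a breaking point from the concatenated textual and visual features of the previous and current pages; the precomputed feature vectors stand in for the paper's LayoutLM and ResNet backbones, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BreakPointHead(nn.Module):
    """Predicts whether the current page starts a new document, from the
    fused (text + image) features of the previous and current pages."""
    def __init__(self, text_dim=768, image_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * (text_dim + image_dim), 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, prev_feats, curr_feats):
        return self.mlp(torch.cat([prev_feats, curr_feats], dim=-1))  # logit

# Toy batch: 4 page pairs with precomputed fused features per page.
head = BreakPointHead()
prev = torch.randn(4, 768 + 512)
curr = torch.randn(4, 768 + 512)
is_break = torch.sigmoid(head(prev, curr)) > 0.5  # True => new document starts
print(is_break.flatten())
```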
Citations: 3
Optical character recognition with transformers and CTC
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563845
Israel Campiotti, R. Lotufo
Text recognition tasks are commonly solved using a deep learning pipeline called CRNN. The classical CRNN is a sequence of a convolutional network, followed by a bidirectional LSTM and a CTC layer. In this paper, we perform an extensive analysis of the components of a CRNN to find what is crucial to the entire pipeline and which components can be exchanged for a more effective choice. Given the results of our experiments, we propose two different architectures for the task of text recognition. The first model, CNN + CTC, is a convolutional model followed by a CTC layer. The second model, CNN + Tr + CTC, adds an encoder-only Transformer between the convolutional network and the CTC layer. To the best of our knowledge, this is the first time that a Transformer has been successfully trained using just the CTC loss. To assess the capabilities of our proposed architectures, we train and evaluate them on the SROIE 2019 data set. Our CNN + CTC achieves an F1 score of 89.66% with only 4.7 million parameters. CNN + Tr + CTC attained an F1 score of 93.76% with 11 million parameters, which is almost 97% of the performance achieved by TrOCR using 334 million parameters and more than 600 million synthetic images for pretraining.
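
A minimal sketch of the first model's ingredients, assuming 32-pixel-tall line images and a toy alphabet: a small convolutional backbone collapses the image height into per-timestep features, and `nn.CTCLoss` aligns the class scores with the target text without per-character segmentation. It is illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 37  # 26 letters + 10 digits + CTC blank (index 0)

class CnnCtc(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),       # H/2, W/2
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)), # H/4
        )
        self.classifier = nn.Linear(64 * 8, NUM_CLASSES)  # assumes 32-px-tall input

    def forward(self, images):                 # (N, 1, 32, W)
        feats = self.backbone(images)          # (N, 64, 8, W/2)
        n, c, h, w = feats.shape
        feats = feats.permute(3, 0, 1, 2).reshape(w, n, c * h)  # (T, N, C*H)
        return self.classifier(feats).log_softmax(-1)           # (T, N, classes)

model = CnnCtc()
ctc = nn.CTCLoss(blank=0)

images = torch.randn(4, 1, 32, 128)            # toy batch of text-line images
log_probs = model(images)                      # T = 64 timesteps
targets = torch.randint(1, NUM_CLASSES, (4, 10))
input_lengths = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lengths = torch.full((4,), 10, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
print(float(loss))
```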
Citations: 1
Chinese public procurement document harvesting pipeline
Pub Date : 2022-09-20 DOI: 10.1145/3558100.3563848
Danrun Cao, Oussama Ahmia, Nicolas Béchet, P. Marteau
We present a processing pipeline for harvesting Chinese public procurement documents, with the aim of producing strategic data with greater added value. It consists of three micro-modules: data collection, information extraction, and database indexing. The information extraction part is implemented as a hybrid system that combines rule-based and machine learning approaches. The rule-based method is used to extract information that presents recurring morphological features, such as dates, amounts, and contract awardee information. The machine learning method is used for trade detection in the titles of procurement documents.
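
Two of the recurring patterns named above (dates and monetary amounts) can be captured with regular expressions along these lines; the patterns are illustrative, not the authors' rule set.

```python
import re

DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*(万元|元)")  # e.g. 120万元 = 1.2M yuan

text = "合同于2022年9月20日签订，中标金额为120.5万元。"

for y, m, d in DATE.findall(text):
    print("date:", f"{y}-{int(m):02d}-{int(d):02d}")
for value, unit in AMOUNT.findall(text):
    # Normalize to yuan: 万元 means units of ten thousand yuan.
    amount_yuan = float(value) * (10_000 if unit == "万元" else 1)
    print("amount (yuan):", amount_yuan)
```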
Citations: 1