μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

arXiv - CS - Digital Libraries Pub Date : 2024-08-28 DOI:arxiv-2408.15646

Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara

{"title":"μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context","authors":"Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara","doi":"arxiv-2408.15646","DOIUrl":null,"url":null,"abstract":"Regesta are catalogs of summaries of other documents and, in some cases, are\nthe only source of information about the content of such full-length documents.\nFor this reason, they are of great interest to scholars in many social and\nhumanities fields. In this work, we focus on Regesta Pontificum Romanum, a\nlarge collection of papal registers. Regesta are visually rich documents, where\nthe layout is as important as the text content to convey the contained\ninformation through the structure, and are inherently multi-page documents.\nAmong Digital Humanities techniques that can help scholars efficiently exploit\nregesta and other documental sources in the form of scanned documents, Document\nParsing has emerged as a task to process document images and convert them into\nmachine-readable structured representations, usually markup language. However,\ncurrent models focus on scientific and business documents, and most of them\nconsider only single-paged documents. To overcome this limitation, in this\nwork, we propose {\\mu}gat, an extension of the recently proposed Document\nparsing Nougat architecture, which can handle elements spanning over the single\npage limits. Specifically, we adapt Nougat to process a larger, multi-page\ncontext, consisting of the previous and the following page, while parsing the\ncurrent page. Experimental results, both qualitative and quantitative,\ndemonstrate the effectiveness of our proposed approach also in the case of the\nchallenging Regesta Pontificum Romanorum.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"22 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.15646","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documental sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually markup language. However, current models focus on scientific and business documents, and most of them consider only single-paged documents. To overcome this limitation, in this work, we propose {\mu}gat, an extension of the recently proposed Document parsing Nougat architecture, which can handle elements spanning over the single page limits. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

μgat：通过提供多页面上下文改进单页面文档解析

Regesta 是其他文件摘要的目录，在某些情况下，是有关此类完整文件内容的唯一信息来源。在这项工作中，我们的重点是 Regesta Pontificum Romanum，它是教皇登记簿的大型合集。Regesta 是视觉丰富的文档，其布局与文本内容同等重要，可通过结构传达所包含的信息，而且本身就是多页文档。在可帮助学者有效利用 Regesta 和其他扫描文档形式的文档源的数字人文技术中，文档解析（DocumentParsing）已成为一项处理文档图像并将其转换为机器可读的结构化表示（通常是标记语言）的任务。然而，目前的模型主要集中在科学和商业文档上，而且大多数模型只考虑单页文档。为了克服这一限制，我们在本研究中提出了{\mu}gat，它是最近提出的文档解析 Nougat 架构的扩展，可以处理跨越单页限制的元素。具体来说，我们对Nougat进行了调整，使其能够在解析当前页面的同时，处理由上一页和下一页组成的更大的多页面上下文。实验结果从定性和定量两个方面证明了我们提出的方法在处理具有挑战性的《罗马教规》（Regesta Pontificum Romanorum）时的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

arXiv - CS - Digital Libraries

自引率

0.00%

发文量