μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara
arXiv - CS - Digital Libraries. Published 2024-08-28. DOI: arxiv-2408.15646 (https://doi.org/arxiv-2408.15646)

Abstract

Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of those full-length documents. For this reason, they are of great interest to scholars in many social science and humanities fields. In this work, we focus on the Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, in which the layout is as important as the text in conveying information through structure, and they are inherently multi-page documents. Among the Digital Humanities techniques that can help scholars efficiently exploit regesta and other documentary sources available as scanned documents, Document Parsing has emerged as the task of processing document images and converting them into machine-readable structured representations, usually in a markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, we propose μgat, an extension of the recently proposed Nougat document parsing architecture that can handle elements spanning beyond the limits of a single page. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of the proposed approach, including on the challenging Regesta Pontificum Romanorum.
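
The multi-page context described above can be pictured as a sliding window over the pages of a document. The following is only a minimal sketch of that windowing idea, not the authors' implementation: the function name `page_windows` is hypothetical, pages are represented by arbitrary objects (here, file names), and the use of `None` for the missing neighbor of the first and last page is an assumption, since the abstract does not say how document boundaries are handled.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def page_windows(pages: Sequence[T]) -> List[Tuple[Optional[T], T, Optional[T]]]:
    """Pair every page with its previous and following page.

    Hypothetical helper: each returned triple is
    (previous page, current page, following page). A missing neighbor at
    the start or end of the document is represented by None, which is an
    assumption made for this sketch.
    """
    windows = []
    for i, current in enumerate(pages):
        previous = pages[i - 1] if i > 0 else None
        following = pages[i + 1] if i + 1 < len(pages) else None
        windows.append((previous, current, following))
    return windows


if __name__ == "__main__":
    # Illustrative file names only; any page representation would work.
    for window in page_windows(["page_001.png", "page_002.png", "page_003.png"]):
        print(window)
```

In such a setup, only the middle element of each triple would be the page being parsed, while the other two serve purely as context.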