Understand Layout and Translate Text: Unified Feature-Conductive End-to-End Document Image Translation

Zhiyang Zhang;Yaping Zhang;Yupu Liang;Cong Ma;Lu Xiang;Yang Zhao;Yu Zhou;Chengqing Zong
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 5, pp. 3358-3376. Published 2025-01-17. DOI: 10.1109/TPAMI.2025.3530998. https://ieeexplore.ieee.org/document/10844563/

Abstract

Document Image Translation (DIT) aims to translate the text in document images from one language to another. It is a multimodal task that requires text and layout to work together. Current approaches either handle layout and translation as separate processes, risking error accumulation, or use vanilla end-to-end encoder-decoder models that capture layout only implicitly and often incorporate it inadequately. We argue that a favorable framework should explicitly engage layout-specific modules and properly organize them toward translation. To this end, we first revisit two key layouts: the geometric layout, which reflects words’ spatial positions, and the logical layout, which describes words’ logical order. We then define a novel pipeline (understand layout $\rightarrow$ translate text) that puts layout understanding first, so that the preceding layout stages contribute to translation. Following this pipeline, we introduce Unified Document Image Translation (UniDIT), a comprehensive framework that unifies layout with translation in one network. It is designed to leverage each module’s strengths and to provide an elaborate feature-conductive flow for global communication among modules. A novel bridging mechanism is also introduced to adapt layout features so that they are conducive to translation. We further contribute DITransv2, a large-scale, fine-grained benchmark that includes heterogeneous and complex document layouts. Extensive experiments on DITransv2 and additional established benchmarks demonstrate that UniDIT outperforms previous state-of-the-art methods in all aspects.
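The distinction between geometric and logical layout can be illustrated with a toy example. The sketch below is not UniDIT and does not reflect the authors' modules; the `Word` class, the two ordering functions, the `split_x` column boundary, and the four-word page are all hypothetical names and data chosen for illustration. It only shows that sorting words by their box coordinates (geometric information alone) can disagree with the reading order (logical layout) on a two-column page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """A word plus its bounding box; the box is the geometric layout."""
    text: str
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def raster_order(words: List[Word]) -> List[str]:
    """Naive order: top-to-bottom, then left-to-right, from box corners only."""
    return [w.text for w in sorted(words, key=lambda w: (w.box[1], w.box[0]))]


def column_order(words: List[Word], split_x: int) -> List[str]:
    """Toy logical order for a two-column page: left column first, then right.

    split_x is an assumed, hand-picked column boundary, not a learned value.
    """
    def key(w: Word) -> Tuple[int, int, int]:
        column = 0 if w.box[0] < split_x else 1
        return (column, w.box[1], w.box[0])

    return [w.text for w in sorted(words, key=key)]


# Hypothetical two-column page: each column holds two stacked words.
page = [
    Word("Understand", (10, 10, 120, 30)),
    Word("translate", (210, 10, 300, 30)),
    Word("layout,", (10, 40, 80, 60)),
    Word("text.", (210, 40, 260, 60)),
]

geometric = raster_order(page)           # interleaves the two columns
logical = column_order(page, split_x=200)  # follows the reading order

print(" ".join(geometric))
print(" ".join(logical))
```

In this toy page the raster order mixes the two columns, while the column-aware order recovers the intended sentence. A translation model fed the raster order would see scrambled source text, which is one way to read the abstract's motivation for modeling logical layout explicitly before translating.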