{"title":"Understand Layout and Translate Text: Unified Feature-Conductive End-to-End Document Image Translation","authors":"Zhiyang Zhang;Yaping Zhang;Yupu Liang;Cong Ma;Lu Xiang;Yang Zhao;Yu Zhou;Chengqing Zong","doi":"10.1109/TPAMI.2025.3530998","DOIUrl":null,"url":null,"abstract":"Document Image Translation (DIT) aims to translate texts on document images from one language to another. It is a multi-modal task involving cooperation of text and layout. Current approaches either handle layout and translation as separate processes, risking accumulative errors, or use vanilla end-to-end encoder-decoder models to capture layout implicitly, often suffering inadequate layout incorporation. We argue that a favorable framework should explicitly engage layout-specific modules and properly organize them toward translation. For this, we first revisit two key layouts: the geometric layout reflecting word’s spatial positions, and the logical layout depicting word’s logical order. Then, a novel pipeline (understand layout <inline-formula><tex-math>$\\rightarrow$</tex-math></inline-formula> translate text) is determined to prioritize layouts such that preceding layouts contribute to translation. Following this pipeline, we introduce Unified Document Image Translation (UniDIT), a comprehensive framework that unifies layout with translation in one network. It is devised to leverage each module’s advantage, and provide an elaborate feature-conductive flow for module communication globally. A novel bridging mechanism is also introduced to adapt layout features conducive to translation. We further contribute DITransv2, a large-scale fine-grained benchmark that includes heterogeneous and complex document layouts. 
Extensive experiments on DITransv2 and additional established benchmarks demonstrate UniDIT outperforms previous state-of-the-arts in all aspects.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3358-3376"},"PeriodicalIF":18.6000,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10844563/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Document Image Translation (DIT) aims to translate the text on document images from one language to another. It is a multi-modal task requiring the cooperation of text and layout. Current approaches either handle layout and translation as separate processes, risking accumulated errors, or use vanilla end-to-end encoder-decoder models that capture layout only implicitly, often incorporating it inadequately. We argue that a favorable framework should explicitly engage layout-specific modules and properly organize them toward translation. To this end, we first revisit two key layouts: the geometric layout, reflecting words' spatial positions, and the logical layout, depicting words' logical order. We then determine a novel pipeline (understand layout → translate text) that prioritizes layouts so that the preceding layout stages contribute to translation. Following this pipeline, we introduce Unified Document Image Translation (UniDIT), a comprehensive framework that unifies layout with translation in one network. It is devised to leverage each module's advantages and to provide an elaborate feature-conductive flow for global communication among modules. A novel bridging mechanism is also introduced to adapt layout features so that they are conducive to translation. We further contribute DITransv2, a large-scale, fine-grained benchmark that includes heterogeneous and complex document layouts. Extensive experiments on DITransv2 and additional established benchmarks demonstrate that UniDIT outperforms previous state-of-the-art methods in all aspects.
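To make the "understand layout → translate text" pipeline concrete, here is a minimal sketch in plain Python. It is not the authors' UniDIT implementation: the function names, the `line_tol` parameter, and the heuristic top-to-bottom, left-to-right ordering are illustrative stand-ins for the learned geometric-layout, logical-layout, bridging, and translation modules described in the abstract.

```python
# Hedged sketch of the abstract's pipeline: geometric layout (word positions)
# -> logical layout (reading order) -> bridging (layout-adapted sequence)
# -> translation. All module names here are hypothetical.
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    box: tuple  # (x0, y0, x1, y1) in image coordinates


def geometric_layout(words):
    """Geometric layout: each word's spatial position (its bounding box)."""
    return [w.box for w in words]


def logical_layout(words, line_tol=10):
    """Logical layout: a simple top-to-bottom, left-to-right reading order.

    Words whose top edges fall within `line_tol` pixels are grouped onto the
    same line; this heuristic stands in for the learned ordering module.
    """
    return sorted(words, key=lambda w: (round(w.box[1] / line_tol), w.box[0]))


def bridge(ordered_words):
    """Bridging: pack layout-aware features into a translation-ready sequence."""
    return " ".join(w.text for w in ordered_words)


def translate(source_text):
    """Placeholder for the translation decoder (identity here)."""
    return source_text


# Words arrive in arbitrary (e.g. OCR) order; layout understanding restores
# the reading order before translation.
words = [
    Word("layout,", (90, 12, 150, 30)),
    Word("text.", (80, 52, 120, 70)),
    Word("Understand", (10, 10, 85, 30)),
    Word("translate", (10, 50, 75, 70)),
]
print(translate(bridge(logical_layout(words))))  # -> Understand layout, translate text.
```

The point of the sketch is the ordering of stages: layout understanding runs first and its output conditions translation, rather than layout being inferred implicitly inside a monolithic encoder-decoder.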