DocMamba: Efficient Document Pre-training with State Space Model

Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Shuhang Liu, Jun Du, Jianshu Zhang

arXiv:2409.11887 (arXiv - CS - Computation and Language), 2024-09-18
Citations: 0
Abstract
In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the quadratic computational complexity of self-attention hinders their efficiency and their ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation. The code will be made available online.
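The abstract describes the Segment-First Bidirectional Scan only at a high level. As a rough illustration of the idea, the sketch below shows one way a segment-first ordering could be realized: group tokens by their layout segment, keep each segment's tokens contiguous, visit segments in a simple reading order, and pair the forward sequence with its reverse for the bidirectional scan. The names (`Token`, `sfbs_order`) and the top-left reading-order heuristic are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a segment-first bidirectional token ordering.
# Assumptions (not from the paper): tokens carry a segment id and a bounding
# box from an upstream layout parser; segments are visited top-to-bottom,
# left-to-right; "bidirectional" means the forward order plus its reverse.

from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Token:
    text: str
    segment_id: int                   # id of the text segment (e.g. a field or paragraph)
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) on the page


def sfbs_order(tokens: List[Token]) -> Tuple[List[Token], List[Token]]:
    """Return (forward, backward) token sequences for a bidirectional scan.

    Tokens of the same segment stay contiguous; segments are ordered by the
    top-left corner of their first token. Illustrative only.
    """
    # Group tokens by segment, preserving their in-segment order.
    segments: Dict[int, List[Token]] = {}
    for tok in tokens:
        segments.setdefault(tok.segment_id, []).append(tok)

    # Visit segments in a simple top-left reading order.
    ordered_ids = sorted(
        segments,
        key=lambda sid: (segments[sid][0].bbox[1], segments[sid][0].bbox[0]),
    )

    forward = [tok for sid in ordered_ids for tok in segments[sid]]
    backward = list(reversed(forward))  # second scan direction
    return forward, backward


if __name__ == "__main__":
    toks = [
        Token("Total:", 2, (40, 300, 90, 315)),
        Token("Invoice", 1, (40, 50, 110, 70)),
        Token("#123", 1, (115, 50, 150, 70)),
        Token("$9.99", 2, (95, 300, 140, 315)),
    ]
    fwd, bwd = sfbs_order(toks)
    print([t.text for t in fwd])  # ['Invoice', '#123', 'Total:', '$9.99']
```

Feeding both orderings to a linear-time state space model is one plausible way to retain global context without quadratic attention, which is the trade-off the abstract highlights.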