Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 1431-1447. Published 2024-11-11.
DOI: 10.1109/TPAMI.2024.3495831
URL: https://ieeexplore.ieee.org/document/10750316/

Abstract

The Segment Anything Model (SAM), a vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM excels at segmentation across four hierarchies (pixel-level text, word, text-line, and paragraph) and performs layout analysis as well. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we build the end-to-end trainable Hi-SAM on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation, achieving layout analysis in passing. In PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show state-of-the-art performance for our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on paragraph-level layout analysis, while requiring $20\times$ fewer training epochs.
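
The fgIOU figures quoted above are foreground intersection-over-union scores for pixel-level text masks. As a rough illustration only, the sketch below computes such a score with NumPy. It assumes binary masks and assumes that intersection and union are accumulated over the whole image set before dividing; the abstract does not state the exact evaluation protocol, and the function name `fg_iou` is hypothetical rather than taken from the Hi-SAM code.

```python
import numpy as np


def fg_iou(pred_masks, gt_masks):
    """Foreground IoU over a set of binary masks (illustrative sketch).

    Assumption: intersection and union are summed across all images
    before dividing; the paper's exact protocol may differ.
    """
    intersection = 0
    union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        # Count pixels that both masks mark as text foreground.
        intersection += int(np.logical_and(pred, gt).sum())
        # Count pixels that either mask marks as text foreground.
        union += int(np.logical_or(pred, gt).sum())
    # Define the score as 0.0 when neither mask has any foreground.
    return intersection / union if union > 0 else 0.0


# Toy example: 2 overlapping foreground pixels out of 4 in the union.
pred = np.array([[1, 1, 0], [0, 1, 0]])
gt = np.array([[1, 0, 0], [0, 1, 1]])
score = fg_iou([pred], [gt])
```

In this toy case the two masks share two foreground pixels and cover four in total, so `score` is 0.5.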