Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation.

Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang
{"title":"反思自我监督语义分割:实现端到端分割。","authors":"Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang","doi":"10.1109/TPAMI.2024.3432326","DOIUrl":null,"url":null,"abstract":"<p><p>The challenge of semantic segmentation with scarce pixel-level annotations has induced many self-supervised works, however most of which essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they need to resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates the real-time and end-to-end inference practice, but also escalates the problem from segmenting per image to clustering all pixels at once, which results in downgraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inferring paradigm where inferring is performed in an end-to-end manner. Specifically, based on our observations in probing dense representation by image-level self-supervised ViT, i.e. semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with attention map constraint to train a tailored Transformer Decoder with learnable prototypes and utilize adaptive prototypes for segmentation inference per image. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and the generalizability of our proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Rethinking Self-Supervised Semantic Segmentation: Achieving End-to-End Segmentation.\",\"authors\":\"Yue Liu, Jun Zeng, Xingzhen Tao, Gang Fang\",\"doi\":\"10.1109/TPAMI.2024.3432326\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The challenge of semantic segmentation with scarce pixel-level annotations has induced many self-supervised works, however most of which essentially train an image encoder or a segmentation head that produces finer dense representations, and when performing segmentation inference they need to resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates the real-time and end-to-end inference practice, but also escalates the problem from segmenting per image to clustering all pixels at once, which results in downgraded performance. To remedy this issue, we propose a novel self-supervised semantic segmentation training and inferring paradigm where inferring is performed in an end-to-end manner. Specifically, based on our observations in probing dense representation by image-level self-supervised ViT, i.e. semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with attention map constraint to train a tailored Transformer Decoder with learnable prototypes and utilize adaptive prototypes for segmentation inference per image. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and the generalizability of our proposed method. 
The code is available at: https://github.com/yliu1229/AlignSeg.</p>\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TPAMI.2024.3432326\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3432326","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The challenge of semantic segmentation with scarce pixel-level annotations has motivated many self-supervised works. Most of them, however, essentially train an image encoder or a segmentation head that produces finer dense representations, and at segmentation inference time they must resort to supervised linear classifiers or traditional clustering. Segmentation by dataset-level clustering not only deviates from real-time, end-to-end inference practice, but also escalates the problem from segmenting each image to clustering all pixels at once, which degrades performance. To remedy this, we propose a novel self-supervised semantic segmentation training and inference paradigm in which inference is performed end to end. Specifically, based on our observations from probing the dense representations of an image-level self-supervised ViT, i.e., semantic inconsistency between patches and poor semantic quality in non-salient regions, we propose prototype-image alignment and global-local alignment with an attention-map constraint to train a tailored Transformer decoder with learnable prototypes, and we utilize adaptive prototypes for per-image segmentation inference. Extensive experiments under fully unsupervised semantic segmentation settings demonstrate the superior performance and generalizability of the proposed method. The code is available at: https://github.com/yliu1229/AlignSeg.
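The abstract only outlines the architecture, so the following is a minimal, hypothetical sketch of the inference path it describes: a Transformer decoder adapts learnable prototypes to each image via cross-attention over dense ViT patch features, and each patch is then assigned to its most similar adapted prototype. The class name `PrototypeSegDecoder`, the dimensions, the prototype count, and the cosine-similarity readout are all illustrative assumptions, not the authors' actual AlignSeg implementation.

```python
# Hypothetical sketch of prototype-based end-to-end segmentation inference,
# reconstructed from the abstract alone; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeSegDecoder(nn.Module):
    """Adapts shared learnable prototypes to each image, then labels patches."""

    def __init__(self, dim=384, num_prototypes=27, depth=2, heads=6):
        super().__init__()
        # One learnable prototype per (assumed) semantic category.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim) dense features from a frozen
        # image-level self-supervised ViT; N = number of patches.
        B = patch_tokens.size(0)
        # Adapt the shared prototypes to this image via cross-attention.
        queries = self.prototypes.unsqueeze(0).expand(B, -1, -1)
        adapted = self.decoder(tgt=queries, memory=patch_tokens)
        # Cosine similarity between each patch and each adapted prototype.
        logits = torch.einsum("bnd,bkd->bnk",
                              F.normalize(patch_tokens, dim=-1),
                              F.normalize(adapted, dim=-1))
        # Per-patch class ids, (B, N); reshape to the patch grid for a mask.
        return logits.argmax(dim=-1)


if __name__ == "__main__":
    feats = torch.randn(2, 196, 384)   # e.g. 14x14 patches from a ViT-S
    seg = PrototypeSegDecoder()(feats)
    print(seg.shape)                   # torch.Size([2, 196])
```

The property the sketch illustrates is end-to-end inference: nothing in the forward pass depends on other images, in contrast to pipelines that must cluster all pixels of a dataset at once.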
