PartSeg: Few-shot part segmentation via part-aware prompt learning

IF 7.6; Zone 1, Computer Science; Q1, Computer Science, Artificial Intelligence; Pattern Recognition; Pub Date: 2025-01-02; DOI: 10.1016/j.patcog.2024.111326

Mengya Han, Heliang Zheng, Chaoyue Wang, Yong Luo, Han Hu, Jing Zhang, Bo Du

Pattern Recognition, Volume 162, Article 111326. Journal article. https://www.sciencedirect.com/science/article/pii/S003132032401077X

Citations: 0

Abstract

In this work, we address few-shot part segmentation, which aims to segment the different parts of an unseen object using very few labeled examples. Leveraging the textual space of a powerful pre-trained image-language model, such as CLIP, has been found to substantially enhance the learning of visual features in few-shot tasks. However, CLIP-based methods primarily focus on high-level visual features that are fully aligned with textual features representing the “summary” of the image, so they often struggle to understand the concept of object parts through textual descriptions. To address this, we propose PartSeg, a novel method that learns part-aware prompts to grasp the concept of “part” and better utilize the textual space of CLIP for few-shot part segmentation. Specifically, we design a part-aware prompt learning module that includes a part-specific prompt generator, which produces part-specific tokens for each part class. Furthermore, since the concept of the same part is general across different object categories, we establish relationships between these parts to estimate part-shared tokens during prompt learning. Finally, the part-specific and part-shared tokens, along with the textual tokens encoded from textual descriptions of parts (i.e., part labels), are combined to form the part-aware prompt used to generate textual prototypes for segmentation. We conduct extensive experiments on the PartImageNet and Pascal_Part datasets, and the results demonstrate that our proposed method achieves state-of-the-art performance.
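The prompt composition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the tokens are generated or combined, so the sketch assumes plain concatenation, uses random placeholder tokens instead of learned ones, and replaces CLIP's text encoder with a hypothetical mean-pooling stand-in (`toy_text_encoder`). All function names, the embedding width `DIM`, and the example part classes are assumptions made for illustration.

```python
import math
import random

DIM = 8  # toy embedding width (assumption); real text encoders are far wider


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_part_aware_prompt(shared_tokens, specific_tokens, label_tokens):
    """Combine the three token groups into one prompt (a list of token vectors).

    Plain concatenation in this order is an assumption of the sketch.
    """
    return shared_tokens + specific_tokens + label_tokens


def toy_text_encoder(prompt):
    """Hypothetical stand-in for a text encoder: mean-pool tokens, then normalize."""
    pooled = [sum(tok[d] for tok in prompt) / len(prompt) for d in range(DIM)]
    return l2_normalize(pooled)


def segment(pixel_features, prototypes):
    """Give each pixel feature the index of its most cosine-similar prototype."""
    labels = []
    for feat in pixel_features:
        unit = l2_normalize(feat)
        sims = [sum(a * b for a, b in zip(unit, proto)) for proto in prototypes]
        labels.append(max(range(len(sims)), key=lambda i: sims[i]))
    return labels


rng = random.Random(0)


def random_tokens(count):
    """Random placeholder tokens; in a real system these would be learned."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


# A part class is an (object category, part) pair (illustrative names).
part_classes = [("dog", "head"), ("dog", "leg"), ("cat", "head"), ("cat", "leg")]

# Part-shared tokens: one set per part name, reused across object categories.
shared = {part: random_tokens(2) for part in ["head", "leg"]}

prototypes = []
for _category, part in part_classes:
    specific = random_tokens(2)  # part-specific tokens for this part class
    label = random_tokens(1)     # placeholder for the encoded part label
    prompt = build_part_aware_prompt(shared[part], specific, label)
    prototypes.append(toy_text_encoder(prompt))  # one textual prototype per class

# Toy "image": three pixel features, each copied from one prototype.
pixels = [prototypes[3], prototypes[0], prototypes[2]]
print(segment(pixels, prototypes))
```

In this sketch the "head" prototypes for "dog" and "cat" reuse the same part-shared tokens, which mirrors the abstract's point that the same part is a general concept across object categories, while the part-specific tokens keep the two part classes distinguishable.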

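Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months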

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.

Latest articles from this journal

Editorial Board
Contrastive calibration on consensus and complementary multi-view representations
Adversarial supervised contrastive feature learning for cross-modal retrieval
A visual-textual mutual guidance fusion network for remote sensing visual question answering
Generalizable face forgery detection via mining single-step reconstruction difference