LPN: Language-Guided Prototypical Network for Few-Shot Classification
Kaihui Cheng; Chule Yang; Xiao Liu; Naiyang Guan; Zhiyuan Wang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 632-642, published 2024-09-09
DOI: 10.1109/TCSVT.2024.3456127 (https://ieeexplore.ieee.org/document/10669382/)
Abstract
Few-shot classification aims to adapt to new tasks with limited labeled examples. Recent methods have explored various techniques for measuring the similarity between query and support images, along with meta-training and pre-training strategies, to leverage visual features more effectively. However, the potential of multi-modal information remains largely unexplored, presenting a promising avenue for improving few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification that requires no image-level captions. LPN leverages the complementarity of the vision and language modalities through two parallel branches with pre-fusion and post-fusion. First, we introduce the language modality by utilizing a pre-trained text encoder to extract class-level text features from class names. In the visual branch, we process images using a conventional image encoder and leverage the class-level text features to align the visual features, effectively capturing more class-relevant visual information. In the text branch, we combine the class-level text features with the visual features using a language-guided decoder. This decoder generates image-specific text features for the pre-fusion step. Additionally, we utilize these class-level text features to refine the prototypical head, creating robust prototypes for subsequent similarity measurements. Finally, to enhance overall performance, we aggregate the visual and text logits, adjusting for discrepancies between the modalities during the post-fusion process. Extensive experiments demonstrate that LPN outperforms several state-of-the-art methods on benchmark datasets, showcasing its effectiveness and robustness.
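To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-branch design outlined in the abstract. It is an illustration reconstructed from the abstract alone: the cross-attention stand-in for the language-guided decoder, the additive prototype refinement, the cosine-style logits, and the mixing weight `alpha` are all assumptions rather than the authors' implementation, and `image_encoder`/`text_encoder` are hypothetical placeholders for the actual backbones.

```python
# Hypothetical sketch of an LPN-style episode, built only from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LPNSketch(nn.Module):
    """Two-branch few-shot classifier with pre- and post-fusion (illustrative)."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 512, alpha: float = 0.5):
        super().__init__()
        self.image_encoder = image_encoder  # conventional visual backbone
        self.text_encoder = text_encoder    # pre-trained text encoder
        # Stand-in for the language-guided decoder: cross-attention that lets
        # class-level text features attend to each query image's visual feature.
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.alpha = alpha                  # assumed post-fusion mixing weight

    def forward(self, support, support_labels, query, class_name_tokens, n_way):
        # Language modality: class-level text features from the class names.
        t = F.normalize(self.text_encoder(class_name_tokens), dim=-1)  # (n_way, d)

        # Visual branch: encode support and query images.
        zs = F.normalize(self.image_encoder(support), dim=-1)  # (n_support, d)
        zq = F.normalize(self.image_encoder(query), dim=-1)    # (n_query, d)

        # Prototypes from support means, refined with the class-level text
        # features (one plausible reading of "refine the prototypical head").
        proto = torch.stack(
            [zs[support_labels == c].mean(dim=0) for c in range(n_way)])
        proto = F.normalize(proto + t, dim=-1)                 # (n_way, d)

        # Text branch (pre-fusion): the decoder turns class-level text features
        # into image-specific text features, one set per query image.
        q = t.unsqueeze(0).expand(len(zq), -1, -1)             # (n_query, n_way, d)
        kv = zq.unsqueeze(1)                                   # (n_query, 1, d)
        t_img, _ = self.decoder(q, kv, kv)
        t_img = F.normalize(t_img, dim=-1)                     # (n_query, n_way, d)

        # Post-fusion: aggregate visual and text logits; alpha absorbs the
        # scale discrepancy between the two modalities' scores.
        vis_logits = zq @ proto.t()                            # (n_query, n_way)
        txt_logits = (zq.unsqueeze(1) * t_img).sum(dim=-1)     # (n_query, n_way)
        return self.alpha * vis_logits + (1.0 - self.alpha) * txt_logits
```

In this reading, pre-fusion happens inside the episode (class-level text features reshape the prototypes and yield image-specific text features), while post-fusion is the final weighted sum of the two logit sets; the paper itself should be consulted for the actual decoder architecture and fusion rule.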
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.