LPN: Language-Guided Prototypical Network for Few-Shot Classification

IEEE Transactions on Circuits and Systems for Video Technology (IF 11.1, Q1, Engineering, Electrical & Electronic) · Pub Date: 2024-09-09 · DOI: 10.1109/TCSVT.2024.3456127
Kaihui Cheng;Chule Yang;Xiao Liu;Naiyang Guan;Zhiyuan Wang
Volume 35, Issue 1, pp. 632-642. Full text: https://ieeexplore.ieee.org/document/10669382/
Citations: 0

Abstract

Few-shot classification aims to adapt to new tasks with limited labeled examples. Recent methods have explored various techniques for measuring the similarity between query and support images, along with meta-training and pre-training strategies, to leverage visual features more effectively. However, the potential of multi-modality information remains unexplored, presenting a promising avenue for improvement in few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification without image-level captions. LPN leverages the complementarity of vision and language modalities through two parallel branches with pre-fusion and post-fusion. Firstly, we introduce the language modality by utilizing a pre-trained text encoder to extract class-level text features from class names. In the visual branch, we process images using a conventional image encoder and leverage the class-level features to align the visual features, effectively capturing more class-relevant visual information. In the text branch, we combine the class-level text features with the visual features using a language-guided decoder. This decoder generates image-specific text features for the pre-fusion step. Additionally, we utilize these class-level text features to refine the prototypical head, creating robust prototypes for subsequent measurements. Finally, to enhance overall performance, we aggregate the visual and text logits, adjusting for discrepancies between the modalities during the post-fusion process. Extensive experiments demonstrate that LPN outperforms several state-of-the-art methods on benchmark datasets, showcasing its effectiveness and robustness.
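The pipeline described above — prototypes refined by class-level text features before measurement (pre-fusion), and visual and text logits aggregated afterwards (post-fusion) — can be illustrated with a toy episode. This is a minimal sketch of the general idea, not the authors' implementation: the function name, the mixing weights `alpha`/`beta`, and the specific fusion rules (convex combination of prototypes, weighted sum of logits) are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors along `axis` to unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def lpn_sketch(support, text_feats, query, alpha=0.5, beta=0.5):
    """Toy LPN-style episode (illustrative, not the paper's exact method).

    support:    (N, K, D) visual features of the support images (N classes, K shots)
    text_feats: (N, D)    class-level text features extracted from class names
    query:      (Q, D)    visual features of the query images
    alpha:      weight of the text features when refining prototypes (pre-fusion)
    beta:       weight of the text logits in the final aggregation (post-fusion)
    """
    # Prototypical head: mean of support features, refined by the
    # class-level text features (a simple convex-combination pre-fusion).
    visual_proto = support.mean(axis=1)                              # (N, D)
    proto = l2_normalize((1 - alpha) * visual_proto + alpha * text_feats)

    q = l2_normalize(query)
    # Visual logits: negative squared Euclidean distance to the prototypes,
    # as in a standard prototypical network.
    vis_logits = -((q[:, None, :] - proto[None, :, :]) ** 2).sum(-1)  # (Q, N)
    # Text logits: cosine similarity between query features and text features.
    txt_logits = q @ l2_normalize(text_feats).T                       # (Q, N)
    # Post-fusion: aggregate the two modalities' logits.
    return (1 - beta) * vis_logits + beta * txt_logits
```

With well-separated synthetic clusters, the fused logits pick the correct class for each query; in the paper the analogous fusion also compensates for discrepancies between the two modalities.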
Journal metrics: CiteScore 13.80 · Self-citation rate 27.40% · Articles per year: 660 · Review time: 5 months
About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. It encourages submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display, as well as contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; and storage, retrieval, indexing, and search. Papers focusing on hardware and software design and implementation are also highly valued.