LPN: Language-Guided Prototypical Network for Few-Shot Classification
Kaihui Cheng; Chule Yang; Xiao Liu; Naiyang Guan; Zhiyuan Wang
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 632-642, published 2024-09-09
DOI: 10.1109/TCSVT.2024.3456127 (https://ieeexplore.ieee.org/document/10669382/)
Abstract
Few-shot classification aims to adapt to new tasks with limited labeled examples. Recent methods have explored various techniques for measuring the similarity between query and support images, along with meta-training and pre-training strategies, to leverage visual features more effectively. However, the potential of multi-modal information remains largely unexplored, presenting a promising avenue for improving few-shot classification. In this paper, we propose a Language-guided Prototypical Network (LPN) for few-shot classification that requires no image-level captions. LPN leverages the complementarity of the vision and language modalities through two parallel branches with pre-fusion and post-fusion. First, we introduce the language modality by utilizing a pre-trained text encoder to extract class-level text features from class names. In the visual branch, we process images using a conventional image encoder and leverage the class-level text features to align the visual features, effectively capturing more class-relevant visual information. In the text branch, we combine the class-level text features with the visual features using a language-guided decoder. This decoder generates image-specific text features for the pre-fusion step. Additionally, we utilize these class-level text features to refine the prototypical head, creating robust prototypes for subsequent similarity measurements. Finally, to enhance overall performance, we aggregate the visual and text logits, adjusting for discrepancies between the modalities during the post-fusion process. Extensive experiments demonstrate that LPN outperforms several state-of-the-art methods on benchmark datasets, showcasing its effectiveness and robustness.
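To make the described pipeline concrete, below is a minimal PyTorch sketch of the two-branch design outlined in the abstract. It is an illustration reconstructed from the abstract alone: the cross-attention stand-in for the language-guided decoder, the additive prototype refinement, the cosine-style logits, and the mixing weight `alpha` are all assumptions rather than the authors' implementation, and `image_encoder`/`text_encoder` are hypothetical placeholders for the actual backbones.

```python
# Hypothetical sketch of an LPN-style episode, built only from the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LPNSketch(nn.Module):
    """Two-branch few-shot classifier with pre- and post-fusion (illustrative)."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 512, alpha: float = 0.5):
        super().__init__()
        self.image_encoder = image_encoder  # conventional visual backbone
        self.text_encoder = text_encoder    # pre-trained text encoder
        # Stand-in for the language-guided decoder: cross-attention that lets
        # class-level text features attend to each query image's visual feature.
        self.decoder = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.alpha = alpha                  # assumed post-fusion mixing weight

    def forward(self, support, support_labels, query, class_name_tokens, n_way):
        # Language modality: class-level text features from the class names.
        t = F.normalize(self.text_encoder(class_name_tokens), dim=-1)  # (n_way, d)

        # Visual branch: encode support and query images.
        zs = F.normalize(self.image_encoder(support), dim=-1)  # (n_support, d)
        zq = F.normalize(self.image_encoder(query), dim=-1)    # (n_query, d)

        # Prototypes from support means, refined with the class-level text
        # features (one plausible reading of "refine the prototypical head").
        proto = torch.stack(
            [zs[support_labels == c].mean(dim=0) for c in range(n_way)])
        proto = F.normalize(proto + t, dim=-1)                 # (n_way, d)

        # Text branch (pre-fusion): the decoder turns class-level text features
        # into image-specific text features, one set per query image.
        q = t.unsqueeze(0).expand(len(zq), -1, -1)             # (n_query, n_way, d)
        kv = zq.unsqueeze(1)                                   # (n_query, 1, d)
        t_img, _ = self.decoder(q, kv, kv)
        t_img = F.normalize(t_img, dim=-1)                     # (n_query, n_way, d)

        # Post-fusion: aggregate visual and text logits; alpha absorbs the
        # scale discrepancy between the two modalities' scores.
        vis_logits = zq @ proto.t()                            # (n_query, n_way)
        txt_logits = (zq.unsqueeze(1) * t_img).sum(dim=-1)     # (n_query, n_way)
        return self.alpha * vis_logits + (1.0 - self.alpha) * txt_logits
```

In this reading, pre-fusion happens inside the episode (class-level text features reshape the prototypes and yield image-specific text features), while post-fusion is the final weighted sum of the two logit sets; the paper itself should be consulted for the actual decoder architecture and fusion rule.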
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.