Pedestrian Attribute Recognition via CLIP-Based Prompt Vision-Language Fusion

IF 11.1 1区 工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2024-09-04 DOI:10.1109/TCSVT.2024.3454366
Xiao Wang;Jiandong Jin;Chenglong Li;Jin Tang;Cheng Zhang;Wei Wang
{"title":"Pedestrian Attribute Recognition via CLIP-Based Prompt Vision-Language Fusion","authors":"Xiao Wang;Jiandong Jin;Chenglong Li;Jin Tang;Cheng Zhang;Wei Wang","doi":"10.1109/TCSVT.2024.3454366","DOIUrl":null,"url":null,"abstract":"Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on <uri>https://github.com/Event-AHU/OpenPAR</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"148-161"},"PeriodicalIF":11.1000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10664464/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Existing pedestrian attribute recognition (PAR) algorithms adopt pre-trained CNN (e.g., ResNet) as their backbone network for visual feature learning, which might obtain sub-optimal results due to the insufficient employment of the relations between pedestrian images and attribute labels. In this paper, we formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels. Specifically, the attribute phrases are first expanded into sentences, and then the pre-trained vision-language model CLIP is adopted as our backbone for feature embedding of visual images and attribute descriptions. The contrastive learning objective connects the vision and language modalities well in the CLIP-based feature space, and the Transformer layers used in CLIP can capture the long-range relations between pixels. Then, a multi-modal Transformer is adopted to fuse the dual features effectively and feed-forward network is used to predict attributes. To optimize our network efficiently, we propose the region-aware prompt tuning technique to adjust very few parameters (i.e., only the prompt vectors and classification heads) and fix both the pre-trained VL model and multi-modal Transformer. Our proposed PAR algorithm only adjusts 0.75% learnable parameters compared with the fine-tuning strategy. It also achieves new state-of-the-art performance on both standard and zero-shot settings for PAR, including RAPv1, RAPv2, WIDER, PA100K, and PETA-ZS, RAP-ZS datasets. The source code and pre-trained models will be released on https://github.com/Event-AHU/OpenPAR.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
通过基于 CLIP 的视觉语言融合提示进行行人属性识别
现有的行人属性识别(PAR)算法采用预训练的CNN(如ResNet)作为其视觉特征学习的骨干网络,由于没有充分利用行人图像与属性标签之间的关系,可能会获得次优结果。在本文中,我们将PAR描述为一个视觉语言融合问题,充分利用行人图像与属性标签之间的关系。具体而言,首先将属性短语扩展成句子,然后采用预训练好的视觉语言模型CLIP作为我们的主干,对视觉图像进行特征嵌入和属性描述。在基于CLIP的特征空间中,对比学习目标很好地连接了视觉和语言模态,并且在CLIP中使用的Transformer层可以捕获像素之间的远程关系。然后,采用多模态变压器有效融合双特征,并采用前馈网络进行属性预测。为了有效地优化我们的网络,我们提出了区域感知提示调谐技术来调整很少的参数(即只有提示向量和分类头),并固定预训练的VL模型和多模态变压器。与微调策略相比,我们提出的PAR算法只调整0.75%的可学习参数。它还在PAR的标准和零射击设置上实现了新的最先进的性能,包括RAPv1, RAPv2, WIDER, PA100K和PETA-ZS, RAP-ZS数据集。源代码和预训练模型将在https://github.com/Event-AHU/OpenPAR上发布。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
CiteScore
13.80
自引率
27.40%
发文量
660
审稿时长
5 months
期刊介绍: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.
期刊最新文献
IEEE Circuits and Systems Society Information IEEE Circuits and Systems Society Information 2025 Index IEEE Transactions on Circuits and Systems for Video Technology IEEE Circuits and Systems Society Information IEEE Circuits and Systems Society Information
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1