{"title":"S2CA: Shared Concept Prototypes and Concept-level Alignment for text–video retrieval","authors":"Yuxiao Li, Yu Xin, Jiangbo Qian, Yihong Dong","doi":"10.1016/j.neucom.2024.128851","DOIUrl":null,"url":null,"abstract":"<div><div>Text–video retrieval, as a fundamental task of cross-modal learning, relies on effectively establishing the semantic association between text and video. At present, mainstream semantic alignment methods for text–video adopt instance-level alignment strategies, ignoring the fine-grained concept association and the “concept-level alignment” characteristics of text–video. In this regard, we propose <strong>S</strong>hared <strong>C</strong>oncept Prototypes and <strong>C</strong>oncept-level <strong>A</strong>lignment (<strong>S2CA</strong>) to achieve concept-level alignment. Specifically, we utilize the text–video <strong>Shared Concept Prototypes</strong> mechanism to bridge the correspondence between text and video. On this basis, we use cross-attention and Gumbel-softmax to obtain <strong>Discrete Concept Allocation Matrices</strong> and then assign text and video tokens to corresponding concept prototypes. In this way, texts and videos are decoupled into multiple <strong>Conceptual Aggregated Features</strong>, thereby achieving <strong>Concept-level Alignment</strong>. In addition, we use CLIP as the teacher model and adopt the Align-Transform-Reconstruct distillation framework to strengthen the multimodal semantic learning ability. The extensive experiments on MSR-VTT, DiDeMo, ActivityNet and MSVD prove the effectiveness of our method.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128851"},"PeriodicalIF":5.5000,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224016229","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Text–video retrieval, a fundamental task in cross-modal learning, relies on effectively establishing the semantic association between text and video. Current mainstream semantic alignment methods for text–video adopt instance-level alignment strategies, ignoring fine-grained concept associations and the "concept-level alignment" characteristics of text and video. To address this, we propose Shared Concept Prototypes and Concept-level Alignment (S2CA) to achieve concept-level alignment. Specifically, we utilize a text–video Shared Concept Prototypes mechanism to bridge the correspondence between text and video. On this basis, we use cross-attention and Gumbel-softmax to obtain Discrete Concept Allocation Matrices and then assign text and video tokens to the corresponding concept prototypes. In this way, texts and videos are decoupled into multiple Conceptual Aggregated Features, thereby achieving Concept-level Alignment. In addition, we use CLIP as the teacher model and adopt an Align-Transform-Reconstruct distillation framework to strengthen multimodal semantic learning. Extensive experiments on MSR-VTT, DiDeMo, ActivityNet and MSVD demonstrate the effectiveness of our method.
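To make the allocation idea concrete, the sketch below illustrates one plausible reading of the concept-allocation step: tokens are scored against shared concept prototypes via cross-attention, discretized with Gumbel-softmax into an allocation matrix, and pooled per concept into conceptual aggregated features that can then be compared across modalities. All names, dimensions, the mean-pooling choice, and the cosine-similarity comparison are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal PyTorch sketch of concept allocation with shared prototypes,
# assuming a simplified dot-product cross-attention and hard Gumbel-softmax.
import torch
import torch.nn.functional as F


def concept_allocate(tokens: torch.Tensor, prototypes: torch.Tensor, tau: float = 1.0):
    """
    tokens:     (B, N, D) text or video token features
    prototypes: (K, D)    shared concept prototypes (learnable, shared across modalities)
    Returns:
        concept_feats: (B, K, D) conceptual aggregated features
        allocation:    (B, N, K) discrete concept allocation matrix (near one-hot per token)
    """
    # Cross-attention logits: each token scored against every concept prototype.
    logits = torch.einsum("bnd,kd->bnk", tokens, prototypes) / tokens.shape[-1] ** 0.5

    # Gumbel-softmax yields a (near) one-hot allocation while remaining differentiable.
    allocation = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)  # (B, N, K)

    # Aggregate the tokens assigned to each concept (weighted mean over assigned tokens).
    weights = allocation / allocation.sum(dim=1, keepdim=True).clamp(min=1e-6)
    concept_feats = torch.einsum("bnk,bnd->bkd", weights, tokens)
    return concept_feats, allocation


if __name__ == "__main__":
    B, N_text, N_video, D, K = 2, 12, 20, 512, 8
    prototypes = torch.randn(K, D)            # shared between both modalities
    text_tokens = torch.randn(B, N_text, D)
    video_tokens = torch.randn(B, N_video, D)

    text_concepts, _ = concept_allocate(text_tokens, prototypes)
    video_concepts, _ = concept_allocate(video_tokens, prototypes)

    # Concept-level alignment (illustrative): compare text and video features
    # concept by concept, e.g. cosine similarity averaged over the K concepts.
    sim = F.cosine_similarity(text_concepts, video_concepts, dim=-1).mean(dim=-1)
    print(sim.shape)  # (B,)
```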
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics covered.