Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen
Title: Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts
Journal: Neurocomputing, volume 614, Article 128773
Published: 2024-10-28
DOI: 10.1016/j.neucom.2024.128773
URL: https://www.sciencedirect.com/science/article/pii/S0925231224015443
Impact factor: 5.5 (JCR Q1, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Current Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP are predominantly based on prompt learning. These methods incur significant computational cost when the number of categories is large. Moreover, the prompts must be learned again for each new classification task, which is both time-consuming and resource-intensive. To address these challenges, we present the Mixture of Pretrained Experts (MoPE), which enhances CZSL through logit-level fusion with a Multi-Expert Fusion Module. MoPE combines the strengths of large pre-trained models such as CLIP, BERT, GPT-3 and Word2Vec. First, we extract the text label space of each language model individually, then map the visual feature vectors into each of these text spaces. This preserves the integrity and structure of the original text spaces. Throughout this process, the parameters of the pre-trained experts are kept frozen; only the mappings from visual features to the text spaces are learned, and they can be regarded as multiple learnable visual experts. In the fusion phase, we propose a new strategy in which a gating mechanism dynamically adjusts the contribution of each model, enabling the approach to adapt more effectively to a range of tasks and datasets. The method is robust in the sense that the language models are not tailored to specific downstream datasets or losses, which preserves the topology of the larger models and broadens the scope of application. Preliminary experiments on the UT-Zappos, AO-CLEVr, and C-GQA datasets indicate that MoPE performs competitively with existing techniques.
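The pipeline described in the abstract (frozen label embeddings per language model, a learnable visual mapping per text space, and a gate that mixes per-expert logits) can be illustrated with a small sketch. The abstract does not specify the form of the mappings, the similarity measure, or the gate, so the linear projections, cosine-similarity logits, and softmax gate over the visual feature used here are assumptions, and all names (`expert_logits`, `fuse`, `gate_W`) are hypothetical. The random matrices stand in for real CLIP/BERT/Word2Vec label embeddings.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def l2_normalize(v):
    """Scale v to unit length (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def expert_logits(visual, W, label_embs):
    """Map the visual feature into one expert's text space (learnable W)
    and score it against that expert's frozen label embeddings.
    Cosine similarity is an assumption, not taken from the paper."""
    z = l2_normalize(matvec(W, visual))
    return [sum(a * b for a, b in zip(z, l2_normalize(e))) for e in label_embs]


def fuse(visual, experts, gate_W):
    """Logit-level fusion: a softmax gate (assumed form) weights each
    expert's logits, and the weighted logits are summed per label."""
    weights = softmax(matvec(gate_W, visual))
    per_expert = [expert_logits(visual, W, E) for W, E in experts]
    n_labels = len(per_expert[0])
    fused = [
        sum(w * logits[c] for w, logits in zip(weights, per_expert))
        for c in range(n_labels)
    ]
    return fused, weights


def rand_matrix(rows, cols, rng):
    """Random placeholder for learned or pre-trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual_dim, n_labels = 4, 3

# Two toy experts with different text-space dimensions (3 and 5).
# Each pair is (learnable visual mapping, frozen label embeddings).
experts = [
    (rand_matrix(3, visual_dim, rng), rand_matrix(n_labels, 3, rng)),
    (rand_matrix(5, visual_dim, rng), rand_matrix(n_labels, 5, rng)),
]
gate_W = rand_matrix(len(experts), visual_dim, rng)  # learnable gate
visual = [rng.uniform(-1.0, 1.0) for _ in range(visual_dim)]

fused, weights = fuse(visual, experts, gate_W)
predicted = max(range(n_labels), key=lambda c: fused[c])
print("gate weights:", weights)
print("fused logits:", fused)
print("predicted label index:", predicted)
```

In this sketch only the per-expert mappings and `gate_W` would receive gradients during training; the label embeddings are never updated, which mirrors the abstract's point that the text spaces stay intact.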
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.