Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts

Neurocomputing (IF 5.5; Region 2, Computer Science; JCR Q1, Computer Science, Artificial Intelligence) · Pub Date: 2024-10-28 · DOI: 10.1016/j.neucom.2024.128773
Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen
Zehua Hao et al., "Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts", Neurocomputing, Volume 614, Article 128773 (journal article, published 2024-10-28). https://www.sciencedirect.com/science/article/pii/S0925231224015443
Citations: 0

Abstract

Among current Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is prompt learning. These methods incur significant computational cost when the number of categories is large. Moreover, each new classification task requires learning the prompts again, which is time-consuming and resource-intensive. To address these challenges, we present the Mixture of Pretrained Expert (MoPE), which enhances CZSL through logit-level fusion with a Multi Expert Fusion Module. MoPE combines the strengths of large pre-trained models such as CLIP, BERT, GPT-3 and Word2Vec. First, we extract the text label space of each language model individually, then map the visual feature vectors to each of these text spaces. This maintains the integrity and structure of the original text spaces. Throughout this process, the pre-trained expert parameters are kept frozen; only the mappings from visual features to the corresponding text spaces are learned, and these mappings can be regarded as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy with a gating mechanism that dynamically adjusts the contribution of each model, which lets the approach adapt more effectively to a range of tasks and datasets. The method's robustness is reflected in the fact that the language models are not tailored to specific downstream datasets or losses. This preserves the larger model's topology and broadens the potential for application. Preliminary experiments on the UT-Zappos, AO-Clever, and C-GQA datasets indicate that MoPE performs competitively with existing techniques.
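The abstract gives no equations or code, so the following is only a minimal sketch of what gated logit-level fusion over frozen text spaces might look like, not the authors' implementation. Everything concrete here is an assumption made for illustration: the expert names and embedding sizes in `TEXT_DIMS`, the random stand-ins for the frozen label embeddings, the linear form of the visual mappings in `PROJECTIONS`, the cosine-similarity logits, and the softmax gate in `GATE`.

```python
import math
import random

random.seed(0)

def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def matvec(matrix, v):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only to keep the example small.
NUM_CLASSES = 6    # attribute-object compositions
VISUAL_DIM = 32    # size of the image feature vector
TEXT_DIMS = {"clip": 16, "bert": 24, "word2vec": 8}

# Frozen text label spaces: one unit-length embedding per class and per expert.
# Random stand-ins here; in practice they would come from the language models.
TEXT_SPACES = {
    name: [normalize(row) for row in random_matrix(NUM_CLASSES, dim)]
    for name, dim in TEXT_DIMS.items()
}

# Learnable parts (assumed linear): one visual mapping per expert, plus a gate.
PROJECTIONS = {
    name: random_matrix(dim, VISUAL_DIM, 0.1) for name, dim in TEXT_DIMS.items()
}
GATE = random_matrix(len(TEXT_DIMS), VISUAL_DIM, 0.1)

def fused_logits(visual_feature):
    """Return gate-weighted class logits and the per-expert gate weights."""
    gates = softmax(matvec(GATE, visual_feature))
    fused = [0.0] * NUM_CLASSES
    for gate, name in zip(gates, TEXT_DIMS):
        # Map the image into this expert's text space, then score every label
        # by cosine similarity against the frozen label embeddings.
        mapped = normalize(matvec(PROJECTIONS[name], visual_feature))
        expert_logits = matvec(TEXT_SPACES[name], mapped)
        for c, logit in enumerate(expert_logits):
            fused[c] += gate * logit
    return fused, gates

image_feature = [random.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
logits, gates = fused_logits(image_feature)
predicted_class = max(range(NUM_CLASSES), key=lambda c: logits[c])
```

In this sketch only `PROJECTIONS` and `GATE` would be updated during training, while `TEXT_SPACES` stays fixed, which mirrors the abstract's point that the text spaces are left intact. Adding a further expert would mean adding one entry to each dictionary and one row to the gate.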
Source journal
Neurocomputing (Engineering and Technology, Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.