Peng Wang, Dagang Li, Xuesi Hu, Yongmei Wang, Youhua Zhang
{"title":"CLIPMulti:探索用于零镜头文本分类的多模态增强型 CLIP 的性能","authors":"Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang","doi":"10.1016/j.csl.2024.101748","DOIUrl":null,"url":null,"abstract":"<div><div>Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, which shows good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention is paid to the help of multimodality for text classification. In order to incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to find the corresponding text based on the classified images automatically. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101748"},"PeriodicalIF":3.1000,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification\",\"authors\":\"Peng Wang , Dagang Li , Xuesi Hu , Yongmei Wang , Youhua Zhang\",\"doi\":\"10.1016/j.csl.2024.101748\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification uses either a text–text matching paradigm or a text–image matching paradigm, which shows good performance on different benchmark datasets. However, the existing classification paradigms only consider a single modality for text matching, and little attention is paid to the help of multimodality for text classification. In order to incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to enhance the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effects on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to find the corresponding text based on the classified images automatically. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters on the Match-CLIPMulti experiments. 
We hope this work will bring more thoughts and explorations on multimodal fusion in language tasks.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":\"90 \",\"pages\":\"Article 101748\"},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824001311\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824001311","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
CLIPMulti: Explore the performance of multimodal enhanced CLIP for zero-shot text classification
Zero-shot text classification does not require large amounts of labeled data and is designed to handle text classification tasks that lack annotated training data. Existing zero-shot text classification methods use either a text–text matching paradigm or a text–image matching paradigm, both of which perform well on different benchmark datasets. However, these paradigms consider only a single modality for text matching, and little attention has been paid to how multiple modalities can help text classification. To incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text–image&text matching paradigm to improve the effectiveness of zero-shot text classification. Three different image and text combinations are tested for their effect on zero-shot text classification, and a matching method (Match-CLIPMulti) is further proposed to automatically find the text corresponding to the classified images. We conducted experiments on seven publicly available zero-shot text classification datasets and achieved competitive performance. In addition, we analyzed the effect of different parameters in the Match-CLIPMulti experiments. We hope this work encourages further exploration of multimodal fusion in language tasks.
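The abstract does not spell out the fusion details, but the core text–image&text matching step can be sketched with the public CLIP checkpoints. Below is a minimal sketch using Hugging Face transformers, assuming each class is represented by a text prompt plus an image, and assuming a weighted average of the two class embeddings as the fusion rule; the prompt template, the fusion weight alpha, and the image file names are illustrative assumptions, not the authors' actual configuration.

```python
# Hypothetical sketch of text-image&text matching for zero-shot text
# classification with CLIP. The fusion rule (weighted average), prompt
# template, and image paths are assumptions, not the paper's exact method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Each class pairs a text prompt with an image (placeholder paths standing
# in for whatever class images a method like Match-CLIPMulti would select).
classes = {
    "sports":   ("a piece of news about sports",   "sports.jpg"),
    "politics": ("a piece of news about politics", "politics.jpg"),
}
alpha = 0.5  # assumed weight balancing text vs. image class embeddings

def embed_text(texts):
    # CLIP's text encoder is capped at 77 tokens, so truncate long inputs.
    inputs = processor(text=texts, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

def embed_image(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

# Fuse each class's prompt embedding with its image embedding, then
# re-normalize so cosine similarity stays well-scaled.
names = list(classes)
prompt_emb = embed_text([classes[c][0] for c in names])
image_emb = torch.cat([embed_image(classes[c][1]) for c in names])
class_emb = alpha * prompt_emb + (1 - alpha) * image_emb
class_emb = class_emb / class_emb.norm(dim=-1, keepdim=True)

# Classify an input document by cosine similarity against the fused
# class representations.
doc_emb = embed_text(["The team clinched the championship in overtime."])
scores = doc_emb @ class_emb.T
print(names[scores.argmax().item()])  # expected: "sports"
```

The weight alpha here plays the same role as the fusion parameters whose effect the paper analyzes: alpha = 1 reduces the sketch to plain text–text matching, alpha = 0 to text–image matching, and intermediate values blend the two modalities.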
Journal Introduction:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of, and experimentation with, complex models of speech and language processing have become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.