Title: Multi-modal Few-shot Image Recognition with enhanced semantic and visual integration
Authors: Chunru Dong, Lizhen Wang, Feng Zhang, Qiang Hua
Journal: Image and Vision Computing, volume 157, Article 105490
Publication date: 2025-03-15
Publication type: Journal Article
DOI: 10.1016/j.imavis.2025.105490
URL: https://www.sciencedirect.com/science/article/pii/S0262885625000782
Impact factor: 4.2 (JCR Q2, Computer Science, Artificial Intelligence)
Open access: No
Citation count: 0
Abstract
Few-Shot Learning (FSL) enables models to recognize new classes with only a few examples by leveraging knowledge from known classes. Although some methods incorporate class names as prior knowledge, effectively integrating visual and semantic information remains challenging. Additionally, conventional similarity measurement techniques often result in information loss, obscure distinctions between samples, and fail to capture intra-sample diversity. To address these issues, this paper presents a Multi-modal Few-shot Image Recognition (MFSIR) approach. We first introduce the Multi-Scale Interaction Module (MSIM), which facilitates multi-scale interactions between semantic and visual features, significantly enhancing the representation of visual features. We also propose the Hybrid Similarity Measurement Module (HSMM), which integrates information from multiple dimensions to evaluate the similarity between samples by dynamically adjusting the weights of various similarity measurement methods, thereby improving the accuracy and robustness of similarity assessments. Experimental results demonstrate that our approach significantly outperforms existing methods on four FSL benchmarks, with marked improvements in FSL accuracy under 1-shot and 5-shot scenarios.
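The abstract does not give HSMM's formulation, so the following is only a minimal sketch of the general idea it describes: scoring a query against each class with several similarity measures and combining them with normalized weights. Everything specific here is an assumption made for illustration and not the authors' method: the choice of cosine similarity and negative Euclidean distance as the two measures, the mean-of-support class prototypes, the softmax over `weight_logits` (fixed here, whereas the paper adjusts its weights dynamically), and all function names and toy embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)  # epsilon guards against zero vectors


def neg_euclidean(a, b):
    """Negative Euclidean distance, so that larger means more similar."""
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def softmax(values):
    """Numerically stable softmax; turns raw logits into weights summing to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_similarity(a, b, weight_logits):
    """Weighted sum of the two measures (hypothetical stand-in for HSMM).

    In a trained model the logits would be learned or predicted per sample;
    here they are plain inputs.
    """
    weights = softmax(weight_logits)
    scores = [cosine_similarity(a, b), neg_euclidean(a, b)]
    return sum(w * s for w, s in zip(weights, scores))


def mean_vector(vectors):
    """Class prototype: element-wise mean of the support embeddings (assumed)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def classify(query, prototypes, weight_logits):
    """Return the class whose prototype has the highest hybrid similarity."""
    return max(
        prototypes,
        key=lambda name: hybrid_similarity(query, prototypes[name], weight_logits),
    )


# Toy 2-way 2-shot episode with made-up 2-D embeddings.
support = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
prototypes = {name: mean_vector(vecs) for name, vecs in support.items()}
prediction = classify([0.8, 0.3], prototypes, weight_logits=[0.0, 0.0])
print(prediction)
```

With equal logits the two measures contribute equally; raising one logit shifts the decision toward that measure, which is the behavior a learned weighting would exploit.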
About the journal:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.