Zero-Shot Detection (ZSD), the ability to detect novel objects without training samples, exhibits immense potential in an ever-changing world, particularly in scenarios requiring the identification of emerging categories. However, effectively applying ZSD to fine-grained domains, characterized by high inter-class similarity and notable intra-class diversity, remains a significant challenge. This is particularly pronounced in the food domain, where the intricate nature of food attributes, notably the pervasive visual ambiguity among related culinary categories and the extensive spectrum of appearances within each food category, severely constrains the performance of existing methods. To address these challenges, we introduce Zero-Shot Food Detection with Semantic Space and Feature Fusion (ZeSF), a novel framework tailored to this task. ZeSF integrates two key modules: (1) a Multi-Scale Context Integration Module (MSCIM), which employs dilated convolutions for hierarchical feature extraction and adaptive multi-scale fusion to capture subtle, fine-grained visual distinctions; and (2) a Contextual Text Feature Enhancement Module (CTFEM), which leverages Large Language Models to generate semantically rich textual embeddings encompassing both global attributes and discriminative local descriptors. Critically, a cross-modal alignment mechanism further harmonizes the visual and textual features. Comprehensive evaluations on the UEC FOOD 256 and Food Objects With Attributes (FOWA) datasets affirm ZeSF's superiority, with significant improvements in the Harmonic Mean under the Generalized ZSD (GZSD) setting. We further validate the framework's generalization capability on the MS COCO and PASCAL VOC benchmarks, where it again outperforms strong baselines. The source code will be made publicly available upon publication.
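
To make the multi-scale idea behind MSCIM concrete, the following is a minimal sketch, assuming a PyTorch implementation with parallel dilated-convolution branches fused by learned, input-dependent weights; the dilation rates, the gating scheme, and the residual connection are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class MultiScaleContextBlock(nn.Module):
    """Parallel dilated convolutions fused by adaptive, per-branch weights."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 branch per dilation rate; padding = dilation preserves H x W.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
             for d in dilations]
        )
        # Lightweight gate predicting one fusion weight per branch from
        # globally pooled features (the "adaptive" part of the fusion).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(dilations), kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [branch(x) for branch in self.branches]   # each [B, C, H, W]
        weights = self.gate(x)                            # [B, K, 1, 1]
        fused = sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))
        return x + fused  # residual connection: an assumption, common practice


# Shape check: the block preserves the feature-map size.
block = MultiScaleContextBlock(channels=256)
x = torch.randn(2, 256, 32, 32)
assert block(x).shape == x.shape
```

Because each branch sees a different receptive field, the gate can emphasize fine local texture (small dilation) or broader plate-level context (large dilation) depending on the input, which matches the abstract's goal of separating visually similar dishes.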
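Likewise, a minimal sketch of the cross-modal alignment step, assuming temperature-scaled cosine similarity between projected region features and class text embeddings in a shared space; the projection dimensions, temperature value, and class names are hypothetical, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAligner(nn.Module):
    """Scores each candidate region against every class text embedding."""

    def __init__(self, visual_dim=256, text_dim=512, embed_dim=256,
                 temperature=0.07):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.temperature = temperature

    def forward(self, region_feats: torch.Tensor,
                class_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: [N, visual_dim], one row per candidate region.
        # class_embeds: [C, text_dim], one row per (seen or unseen) class.
        v = F.normalize(self.visual_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(class_embeds), dim=-1)
        # Temperature-scaled cosine similarity -> per-region class logits.
        return v @ t.t() / self.temperature


aligner = CrossModalAligner()
logits = aligner(torch.randn(10, 256), torch.randn(300, 512))
print(logits.shape)  # torch.Size([10, 300])
```

Because classification reduces to similarity against text embeddings, unseen categories can be scored at inference simply by appending their LLM-generated descriptions to the class-embedding matrix, which is what enables the zero-shot setting.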
