Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging the vision and language domains. Attention-based methods have significantly advanced the field. However, empirical observations indicate that attention mechanisms often distribute focus uniformly across the full feature sequence, which inadvertently weakens the emphasis on long-range dependencies. Such remote elements nevertheless play a critical role in producing high-quality captions. We therefore pursue strategies that balance comprehensive feature representation with targeted prioritization of key signals, and propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we design a hybrid encoder (HE) that integrates attention mechanisms with Mamba blocks; the Mamba branch complements attention with its superior long-sequence modeling capability, enabling a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to the evolving decoding context, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is available at https://github.com/simple-boy/DH-Net.
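To make the hybrid-encoder idea concrete, the sketch below pairs a standard self-attention branch with a Mamba block and fuses the two outputs. It is only an illustrative reading of the abstract, not the authors' implementation: the class name HybridEncoderLayer, the residual/LayerNorm placement, and the linear fusion layer are assumptions, and the Mamba call follows the public mamba-ssm package rather than the DH-Net code.

    # Minimal sketch of a hybrid attention + Mamba encoder layer (illustrative only).
    import torch
    import torch.nn as nn
    from mamba_ssm import Mamba  # assumed dependency: pip install mamba-ssm (needs a CUDA GPU)

    class HybridEncoderLayer(nn.Module):
        """Attention branch for feature interactions, Mamba branch for
        long-sequence modeling, followed by a learned fusion of the two."""

        def __init__(self, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mamba = Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
            self.norm_attn = nn.LayerNorm(d_model)
            self.norm_mamba = nn.LayerNorm(d_model)
            self.fuse = nn.Linear(2 * d_model, d_model)  # illustrative fusion choice

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model) visual features from a backbone encoder
            attn_out, _ = self.attn(x, x, x)
            attn_out = self.norm_attn(x + attn_out)         # attention branch
            mamba_out = self.norm_mamba(x + self.mamba(x))  # long-range Mamba branch
            return self.fuse(torch.cat([attn_out, mamba_out], dim=-1))

    # Usage (on a CUDA device, since mamba-ssm's kernels are GPU-only):
    #   feats = torch.randn(2, 49, 512, device="cuda")   # e.g., 7x7 grid features
    #   layer = HybridEncoderLayer().cuda()
    #   out = layer(feats)                                # shape (2, 49, 512)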