Cross-modal retrieval aims to establish semantic associations between heterogeneous modalities, among which image-text retrieval is a key application that seeks efficient semantic alignment between images and texts. Existing approaches often rely on fixed patch selection strategies for fine-grained alignment; however, such static strategies struggle to adapt to complex scene variations. Moreover, fine-grained alignment methods tend to fall into local optima by overemphasizing local feature details while neglecting global semantic context. These limitations significantly hinder both retrieval accuracy and generalization performance. To address these challenges, we propose a Dynamic Patch Selection and Dual-Granularity Alignment (DPSDGA) framework that jointly enhances global semantic consistency and local feature interactions for robust cross-modal alignment. Specifically, we introduce a dynamic sparse module that adaptively adjusts the number of retained visual patches based on scene complexity, effectively filtering redundant information while preserving critical semantic features. Furthermore, we design a dual-granularity alignment mechanism that combines global contrastive learning with local fine-grained alignment to enhance semantic consistency across modalities. Extensive experiments on two benchmark datasets, Flickr30k and MS-COCO, demonstrate that our method significantly outperforms existing approaches in image-text retrieval.
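
For concreteness, the sketch below illustrates one possible realization of the two components named above: an adaptive top-k patch selector whose kept count varies with a scene-complexity proxy, and a loss that combines global contrastive learning with local patch-word alignment. This is a minimal PyTorch sketch under assumptions of our own; the function names `dynamic_patch_selection` and `dual_granularity_loss`, the entropy-based complexity proxy, and the weighting `alpha` are illustrative and are not specified by the abstract.

```python
# Minimal sketch (PyTorch) of adaptive patch selection + dual-granularity alignment.
# The complexity proxy, scoring rule, and loss weighting are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F


def dynamic_patch_selection(patch_feats, min_keep=16, max_keep=None):
    """Keep an adaptive number of visual patches per image.

    patch_feats: (B, N, D) patch embeddings. The kept count is driven by a
    simple scene-complexity proxy (entropy of patch importance scores).
    """
    B, N, D = patch_feats.shape
    max_keep = max_keep or N
    scores = patch_feats.norm(dim=-1)                       # (B, N) importance per patch (assumed)
    probs = F.softmax(scores, dim=-1)
    # Normalized entropy in [0, 1] as a proxy for scene complexity.
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1) / torch.log(torch.tensor(float(N)))
    keep = (min_keep + (max_keep - min_keep) * entropy).round().long()   # (B,) patches to keep
    ranks = scores.argsort(dim=-1, descending=True).argsort(dim=-1)      # rank of each patch
    mask = ranks < keep.unsqueeze(-1)                       # (B, N), True = kept
    return patch_feats * mask.unsqueeze(-1), mask


def dual_granularity_loss(img_global, txt_global, patch_feats, word_feats,
                          patch_mask, temperature=0.07, alpha=0.5):
    """Global contrastive (InfoNCE) loss plus a local patch-word alignment term."""
    # Global granularity: symmetric InfoNCE over the batch.
    img_g = F.normalize(img_global, dim=-1)
    txt_g = F.normalize(txt_global, dim=-1)
    logits = img_g @ txt_g.t() / temperature                # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_global = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Local granularity: each word attends to its best-matching kept patch.
    p = F.normalize(patch_feats, dim=-1)                    # (B, N, D)
    w = F.normalize(word_feats, dim=-1)                     # (B, L, D)
    sim = torch.einsum('bld,bnd->bln', w, p)                # (B, L, N) word-patch similarity
    sim = sim.masked_fill(~patch_mask.unsqueeze(1), -1e4)   # ignore dropped patches
    loss_local = (1.0 - sim.max(dim=-1).values).mean()      # reward strong best matches

    return loss_global + alpha * loss_local


if __name__ == "__main__":
    B, N, L, D = 4, 49, 12, 256
    patches, words = torch.randn(B, N, D), torch.randn(B, L, D)
    img_g, txt_g = torch.randn(B, D), torch.randn(B, D)
    kept, mask = dynamic_patch_selection(patches)
    print("patches kept per image:", mask.sum(-1).tolist())
    print("loss:", dual_granularity_loss(img_g, txt_g, kept, words, mask).item())
```

In this sketch the number of retained patches grows with the entropy of the patch-score distribution, so cluttered scenes keep more patches than simple ones, while the combined loss couples batch-level contrastive alignment with token-level matching over only the retained patches.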