Weakly-supervised action localization (WSAL) aims to identify and temporally localize action instances in untrimmed videos using only video-level labels. Existing methods (e.g., multi-instance learning, pseudo-label learning) primarily leverage coarse-grained category labels to infer fine-grained action intervals. With the rise of vision-language models, integrating the language modality has become a promising direction for WSAL, as class-based supervision can guide the localization of fine-grained action instances. However, current methods largely concentrate on mapping class labels to visual representations rather than capturing the semantic relationships within action sequences themselves. To address this issue, we propose a novel framework that aligns vision-semantic and category-semantic knowledge for fine-grained action localization. Specifically, we introduce a residual hierarchical spatial-temporal predictive model that learns the temporal variations of feature semantics; combined with a class-based supervision module, it enables more accurate action localization. Extensive experiments and ablation studies demonstrate that our method consistently surpasses previous state-of-the-art approaches.
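To make the residual predictive idea concrete, the sketch below shows one plausible form of such a module: a small network predicts the next-step snippet feature as a residual on the current one and is trained so the prediction matches the observed next feature, thereby modeling temporal variations of feature semantics. All names, shapes, and the single-scale design are assumptions for illustration (the hierarchical multi-scale structure and the class-based supervision module are omitted); this is not the paper's actual implementation.

```python
# Hypothetical sketch of a residual temporal predictive module (assumption,
# not the paper's architecture). Requires PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualTemporalPredictor(nn.Module):
    """Predicts the next-step snippet feature as a residual update on the
    current one, so the network models temporal *variation* of feature
    semantics rather than the feature itself."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.delta = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) snippet features; output is the predicted
        # feature one step ahead, expressed as x plus a learned residual.
        return x + self.delta(x)


def predictive_loss(feats: torch.Tensor, predictor: nn.Module) -> torch.Tensor:
    """Self-supervised objective: the prediction made at step t should
    match the observed feature at step t+1."""
    pred = predictor(feats[:, :-1])    # predictions for steps 1..T-1
    target = feats[:, 1:].detach()     # observed next-step features
    return F.mse_loss(pred, target)


if __name__ == "__main__":
    B, T, D = 2, 16, 1024              # assumed batch/length/feature sizes
    feats = torch.randn(B, T, D)
    predictor = ResidualTemporalPredictor(D)
    loss = predictive_loss(feats, predictor)
    loss.backward()
    print(f"predictive loss: {loss.item():.4f}")
```

In a full WSAL pipeline, a loss of this kind would typically be added to the video-level classification objective, so that snippet features carry both category semantics and temporal-dynamics information.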