Zero-shot group activity recognition (ZS-GAR) aims to identify activities unseen during training. However, conventional methods deploy models with parameters frozen at test time. This static nature prevents the model from adapting to the inherent distributional shift of unseen classes, severely impairing its generalization capability. To address this problem, we propose a test-time adaptation (TTA) framework that dynamically adapts the model during inference by employing two synergistic self-supervised mechanisms. First, an Actor-Drop Feature Augmentation strategy leverages group relational structure as a potent self-supervised signal by enforcing predictive consistency on samples where individuals are randomly masked. Second, our Label-Semantic Contrastive Learning mechanism generates pseudo-labels from high-confidence predictions and uses a dynamic memory bank, aligning features with their inferred semantic prototypes. This process not only enhances vision-language alignment for unseen classes but also demonstrates robustness against data corruptions, as validated on two new benchmarks, VD-C and CAD-C, featuring various corruption types. Extensive experiments on standard ZS-GAR benchmarks show our method significantly outperforms existing techniques, validating TTA’s effectiveness for this task.
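The two self-supervised signals described above can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling of actor features, cosine-similarity logits against class text prototypes, the KL-divergence form of the consistency term, the temperature, the drop probability, and the confidence threshold are all assumptions made for illustration, and every function name is hypothetical. The memory bank and the contrastive loss are omitted.

```python
import math
import random

def mean_pool(actor_feats):
    """Average per-actor feature vectors into one group feature (assumed pooling)."""
    dim = len(actor_feats[0])
    n = len(actor_feats)
    return [sum(f[d] for f in actor_feats) / n for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def class_probs(group_feat, prototypes, temperature=0.1):
    """Softmax over cosine similarities to class text prototypes (assumed scoring)."""
    logits = [cosine(group_feat, p) / temperature for p in prototypes]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def actor_drop(actor_feats, drop_prob, rng):
    """Randomly mask actors, always keeping at least one."""
    kept = [f for f in actor_feats if rng.random() >= drop_prob]
    return kept if kept else [rng.choice(actor_feats)]

def consistency_loss(actor_feats, prototypes, drop_prob=0.3, rng=None):
    """KL(full || masked): grows when the prediction changes after dropping actors."""
    rng = rng or random.Random(0)
    p_full = class_probs(mean_pool(actor_feats), prototypes)
    masked = actor_drop(actor_feats, drop_prob, rng)
    p_drop = class_probs(mean_pool(masked), prototypes)
    return sum(p * math.log((p + 1e-12) / (q + 1e-12))
               for p, q in zip(p_full, p_drop))

def pseudo_label(probs, threshold=0.8):
    """Return the argmax class index if its probability passes the threshold, else None."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

# Toy usage: three actors with 2-D features and two class prototypes.
actors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
loss = consistency_loss(actors, prototypes, drop_prob=0.3, rng=random.Random(0))
label = pseudo_label(class_probs(mean_pool(actors), prototypes))
```

In a test-time adaptation loop, a loss of this kind would be minimized on each incoming batch, and only samples for which `pseudo_label` returns a class would be eligible for the prototype-alignment step.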
