Rotation-invariant frameworks are crucial in many computer vision tasks, such as human action recognition (HAR), especially when deployed in real-world scenarios. Since most datasets, including those for fall detection, have been collected in controlled environments with fixed camera angles, heights, and movements, approaches developed for such tasks tend to fail when the appearance of individuals varies. To address this challenge, our study proposes the use of the lightweight EVA-02-Ti vision transformer to process polar mappings of people for the task of fall detection. In particular, we strive to exploit the rotation-invariant property of the polar transformation and correctly classify rotated images. Towards this goal, we present a polar-based rotary position embedding (P-RoPE), which encodes relative positions among polar patches along the r and θ axes instead of the Cartesian x and y axes. By replacing the original RoPE with P-RoPE, we improve the ViT's performance, as demonstrated in our experimental protocol, while also outperforming a state-of-the-art approach. The evaluation was conducted on E-FPDS and VFP290k, where the model was trained on the original images and tested on rotated ones. Finally, when assessed on Fashion-MNIST-rot-12k, a standard benchmark for rotation-invariant scenarios, P-RoPE again surpasses both the baseline version and another benchmark method.
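
To make the key idea concrete, the sketch below illustrates the two ingredients the abstract names: resampling a person crop onto an (r, θ) grid, so that an in-plane rotation becomes a circular shift along θ, and a RoPE-style feature rotation driven by (r, θ) patch indices rather than Cartesian (x, y) ones. This is a minimal illustration under our own assumptions, not the paper's implementation; the function names (`to_polar`, `polar_rope`) and the frequency scheme, including how periodicity along θ is handled, are hypothetical.

```python
import numpy as np

def to_polar(image, out_r=64, out_theta=64):
    """Resample a square crop onto an (r, theta) grid centred at the image
    midpoint; an in-plane rotation of the input becomes a circular shift
    along the theta axis (assumed preprocessing, not the paper's exact
    mapping)."""
    h, w = image.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    rs = np.linspace(0.0, min(cy, cx), out_r)
    ts = np.linspace(0.0, 2.0 * np.pi, out_theta, endpoint=False)
    ys = cy + rs[:, None] * np.sin(ts[None, :])
    xs = cx + rs[:, None] * np.cos(ts[None, :])
    # Nearest-neighbour sampling keeps the sketch dependency-free.
    yi = np.clip(np.round(ys).astype(int), 0, h - 1)
    xi = np.clip(np.round(xs).astype(int), 0, w - 1)
    return image[yi, xi]

def _rope_angles(pos, dim, base=10000.0):
    # Standard RoPE frequency ladder for a 1-D position sequence (dim even).
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos[:, None] * inv_freq[None, :]            # (N, dim // 2)

def polar_rope(q, r_idx, t_idx, n_theta):
    """Rotate feature channels by angles derived from (r, theta) patch
    indices instead of Cartesian (x, y): half the channels encode r, half
    encode theta. q: (N, D) with D divisible by 4; r_idx, t_idx: (N,)
    integer patch coordinates; n_theta: patches per full turn."""
    n, d = q.shape
    # theta is periodic, so its index is first mapped onto [0, 2*pi);
    # r is treated as an ordinary 1-D position (an illustrative choice).
    ang_r = _rope_angles(r_idx.astype(float), d // 2)
    ang_t = _rope_angles(t_idx * (2.0 * np.pi / n_theta), d // 2)
    ang = np.concatenate([ang_r, ang_t], axis=1)       # (N, D // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    q_even, q_odd = q[:, 0::2], q[:, 1::2]
    out = np.empty_like(q)
    out[:, 0::2] = q_even * cos - q_odd * sin          # interleaved rotation
    out[:, 1::2] = q_even * sin + q_odd * cos
    return out
```

Under this scheme, a global rotation of the input shifts every patch's θ index by the same constant, so the relative (Δr, Δθ) offsets that RoPE-style attention responds to are largely preserved. Note that exact wrap-around at θ = 2π would require integer frequency multiples along the θ channels; the sketch glosses over this detail.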
