Human activity recognition (HAR) using wearable sensors has advanced rapidly, improving the precision with which complex movements can be identified. However, most existing methods rely on single-modal time-series features, which limits their ability to represent spatiotemporal structure and capture global dependencies, hinders the characterization of movement dynamics, and reduces recognition accuracy. To overcome these limitations, this paper proposes a novel multimodal fusion deep learning approach for complex action recognition. First, we adopt a multimodal input strategy that integrates time-series data and Gramian Angular Difference Field (GADF) images to comprehensively capture the spatiotemporal characteristics of motion data. Second, we design a dual-stream feature fusion network in which a Bidirectional Gated Recurrent Unit (BiGRU) combined with a multi-head self-attention mechanism (MSA) extracts time-series features, while Efficient Channel Attention (ECA) and residual blocks enhance image feature representation, effectively leveraging the complementary information across modalities. Finally, we introduce an interpretability analysis method based on submodular optimization, enabling cross-modal attribution analysis that identifies the regions of both the time-series and image inputs that contribute most to the model's decisions. Experimental results demonstrate that the proposed method achieves an accuracy of 96.88% on a 16-class sports activity recognition task, significantly outperforming traditional machine learning methods and existing deep learning models. This study provides an effective solution for complex action recognition and lays a technological foundation for real-time motion monitoring in wearable smart devices and broader HAR applications.
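The sketch below illustrates, in a hedged and simplified form, the pipeline summarized above: a GADF transform that converts a 1-D sensor series into an image (sin of pairwise angular differences after rescaling to [-1, 1]), and a dual-stream network with a BiGRU plus multi-head self-attention branch for the raw series and an ECA-augmented residual convolutional branch for the GADF image. This is not the authors' code; the layer widths, number of attention heads, pooling choices, concatenation-based fusion, and the 16-class head are illustrative assumptions only.

```python
# Minimal sketch of the multimodal HAR pipeline described in the abstract.
# All hyperparameters here are assumptions, not the paper's exact settings.
import numpy as np
import torch
import torch.nn as nn

def gadf(x: np.ndarray) -> np.ndarray:
    """Gramian Angular Difference Field of a 1-D series: sin(phi_i - phi_j)."""
    x = 2.0 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1.0   # rescale to [-1, 1]
    phi = np.arccos(np.clip(x, -1.0, 1.0))                        # angular encoding
    return np.sin(phi[:, None] - phi[None, :])                    # pairwise differences

class ECA(nn.Module):
    """Efficient Channel Attention: 1-D conv over pooled channel descriptors."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
    def forward(self, x):                        # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                   # global average pooling -> (B, C)
        w = torch.sigmoid(self.conv(w.unsqueeze(1))).squeeze(1)
        return x * w[:, :, None, None]           # channel re-weighting

class ResidualECABlock(nn.Module):
    """Convolutional residual block with ECA on the main path."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            ECA(),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

class DualStreamHAR(nn.Module):
    """Time-series stream (BiGRU + self-attention) fused with a GADF image stream."""
    def __init__(self, in_features: int, img_channels: int, num_classes: int = 16, hidden: int = 64):
        super().__init__()
        self.bigru = nn.GRU(in_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.img_stream = nn.Sequential(
            nn.Conv2d(img_channels, 32, 3, padding=1), nn.ReLU(),
            ResidualECABlock(32),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.classifier = nn.Linear(2 * hidden + 32, num_classes)
    def forward(self, series, images):           # series: (B, T, F); images: (B, C, H, W)
        h, _ = self.bigru(series)                 # (B, T, 2*hidden)
        a, _ = self.attn(h, h, h)                 # self-attention over time steps
        ts_feat = a.mean(dim=1)                   # temporal pooling
        img_feat = self.img_stream(images)        # pooled image features
        return self.classifier(torch.cat([ts_feat, img_feat], dim=1))
```

As a usage example, a window of accelerometer data of shape (T, F) would be fed to the BiGRU branch directly, while each channel is passed through `gadf` and stacked into an (F, T, T) image for the convolutional branch; the two pooled representations are concatenated before classification.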
