With the increasing demand for flexibility and adaptability in modern manufacturing systems, intelligent perception and recognition of human actions in human-robot collaborative assembly (HRCA) tasks have garnered significant attention. However, accurate action recognition in complex and dynamic environments remains challenging due to difficulties in multimodal fusion and semantic understanding. To address these issues, a semantically contrastive action recognition network (SCAR) is proposed, which enhances fine-grained modeling and discrimination of assembly actions. SCAR integrates structural motion information from skeleton sequences with semantic and contextual features extracted from RGB images, thereby improving comprehensive scene perception. Furthermore, task-relevant textual descriptions are introduced as semantic priors to guide cross-modal feature learning, and a contrastive learning strategy is employed to reinforce semantic alignment and discriminability across modalities, facilitating the learning of task-aware representations. Evaluations on the NTU RGB+D benchmark action dataset and on practical HRCA tasks demonstrate that SCAR significantly outperforms mainstream methods in recognition accuracy. The advantage is particularly evident in scenarios involving ambiguous operations and semantically similar assembly tasks. Ablation studies further validate the efficacy of the semantic guidance mechanism and the contrastive learning strategy in enhancing modality complementarity and system robustness.
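To make the contrastive alignment idea concrete, the sketch below shows a common symmetric InfoNCE formulation that pulls matched visual and textual embeddings together, in the spirit of the strategy the abstract describes. The function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' implementation; the fused skeleton+RGB embedding and text-prior embedding are assumed to be precomputed and projected to a shared dimension.

```python
# Hedged sketch of a cross-modal contrastive alignment loss (CLIP-style
# symmetric InfoNCE). All names/shapes are assumptions for illustration,
# not SCAR's actual implementation.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over matched visual/text pairs.

    visual_feats: (B, D) fused skeleton+RGB embeddings for a batch of clips.
    text_feats:   (B, D) embeddings of the matching action descriptions.
    """
    # L2-normalize so dot products are cosine similarities.
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # (B, B) similarity matrix; diagonal entries are the matched pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions: visual -> text and text -> visual.
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.T, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

Under this formulation, semantically similar assembly actions are pushed apart in the shared embedding space unless their text descriptions match, which is one plausible mechanism for the improved discrimination of ambiguous operations reported above.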