SigFormer: Sparse Signal-Guided Transformer for Multi-Modal Action Segmentation

ACM Transactions on Multimedia Computing Communications and Applications (IF 5.2; Zone 3, Computer Science; JCR Q1, Computer Science, Information Systems) Pub Date: 2024-04-10 DOI: 10.1145/3657296

Qi Liu, Xinchen Liu, Kun Liu, Xiaoyan Gu, Wu Liu



Abstract

Multi-modal human action segmentation is a critical and challenging task with a wide range of applications. Most current approaches concentrate on fusing dense signals (i.e., RGB, optical flow, and depth maps). The potential contribution of sparse IoT sensor signals, which can be crucial for accurate recognition, has not been fully explored. To address this gap, we introduce a Sparse signal-guided Transformer (SigFormer) that combines dense and sparse signals. We employ mask attention to fuse localized features by constraining cross-attention to the regions where sparse signals are valid. However, because sparse signals are discrete, they carry insufficient information about temporal action boundaries. SigFormer therefore emphasizes boundary information at two stages. In the first stage, feature extraction, we introduce an intermediate bottleneck module that jointly learns the category and boundary features of each dense modality through inner loss functions. After fusing the dense modalities with the sparse signals, we devise a two-branch architecture that explicitly models the interrelationship between action category and temporal boundary. Experimental results demonstrate that SigFormer outperforms state-of-the-art approaches on a multi-modal action segmentation dataset from real industrial environments, reaching an F1 score of 0.958. The code and pre-trained models are available at https://github.com/LIUQI-creat/SigFormer.
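
The abstract's idea of constraining cross-attention to the regions where sparse signals are valid can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository above for that). It assumes, purely for illustration, that dense per-frame features act as queries, sparse-signal embeddings act as keys and values, and a boolean matrix `valid` marks which sparse entries each frame may attend to. The function name, argument names, shapes, and the residual fusion are all hypothetical.

```python
import numpy as np


def masked_cross_attention(dense, sparse, valid):
    """Illustrative masked cross-attention (hypothetical, not the paper's code).

    dense:  (T, d) dense per-frame features, used as queries.
    sparse: (S, d) sparse-signal embeddings, used as keys and values.
    valid:  (T, S) boolean mask; True where frame t may attend to sparse entry s.
    Returns the fused features (T, d) and the attention weights (T, S).
    """
    d = dense.shape[-1]
    scores = dense @ sparse.T / np.sqrt(d)             # (T, S) scaled dot products
    scores = np.where(valid, scores, -np.inf)          # block invalid positions

    # Rows with no valid entry would be all -inf; give them dummy zeros so the
    # max-subtraction below stays finite, then zero their weights afterwards.
    has_valid = valid.any(axis=1, keepdims=True)
    safe = np.where(has_valid, scores, 0.0)
    safe = safe - safe.max(axis=1, keepdims=True)      # numerically stable softmax
    weights = np.exp(safe) * valid                     # invalid positions get 0
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom,
                        out=np.zeros_like(weights), where=denom > 0)

    # Residual fusion (an assumption): frames with no valid sparse signal
    # keep their dense features unchanged.
    fused = dense + weights @ sparse
    return fused, weights


# Toy usage: 4 frames, 3 sparse-signal embeddings, feature size 8.
rng = np.random.default_rng(0)
dense = rng.normal(size=(4, 8))
sparse = rng.normal(size=(3, 8))
valid = np.array([[False, False, False],   # no sparse signal for frame 0
                  [False, False, True],    # only entry 2 is valid for frame 1
                  [True, True, True],
                  [True, True, False]])
fused, weights = masked_cross_attention(dense, sparse, valid)
```

In this sketch a frame with no valid sparse signal receives zero attention weight everywhere, so its dense feature passes through untouched. Each remaining frame distributes its attention only over the sparse entries that the mask allows.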




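Source journal: ACM Transactions on Multimedia Computing Communications and Applications (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 8.50

Self-citation rate: 5.90%

Articles published: 285

Review time: 7.5 months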

About the journal: The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The journal consists primarily of research papers. As an archival journal, it is intended to carry papers of lasting importance and value. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.