Weakly-supervised temporal action localization using multi-branch attention weighting

IF 4.3 | Zone 3, Materials Science | Q1, ENGINEERING, ELECTRICAL & ELECTRONIC | ACS Applied Electronic Materials | Pub Date: 2024-08-30 | DOI: 10.1007/s00530-024-01445-2
Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao
Citations: 0

Abstract

Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Because frame-level temporal annotations are unavailable, existing weakly-supervised methods typically rely on multiple instance learning to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task and neglect the many ambiguous background and context snippets in the video. We believe that these ambiguous snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action attention and background attention. In addition, considering the correlation among context, action, and background, we use the difference between action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets from the localization results. Extensive experiments were performed on the THUMOS-14 and ActivityNet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods and that its performance is comparable to that of fully-supervised approaches.
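
The attention weighting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the absolute difference used for context attention, the top-k mean used as the multiple instance learning aggregator, and all function and variable names (`weight_cas`, `build_weighted_cas`, `topk_video_scores`) are assumptions made for illustration only.

```python
from typing import Dict, List

# A class activation sequence (CAS): T snippets x (C + 1) classes,
# where the extra column stands for the additional non-action class.
CAS = List[List[float]]


def weight_cas(cas: CAS, attention: List[float]) -> CAS:
    """Scale every snippet's class scores by that snippet's attention value."""
    return [[a * score for score in row] for row, a in zip(cas, attention)]


def build_weighted_cas(
    cas: CAS, action_att: List[float], background_att: List[float]
) -> Dict[str, CAS]:
    """Return three attention-weighted sequences: action, background, context.

    ASSUMPTION: context attention is taken as |action - background|.
    The abstract only says it is built from the difference of the two.
    """
    context_att = [abs(a - b) for a, b in zip(action_att, background_att)]
    return {
        "action": weight_cas(cas, action_att),
        "background": weight_cas(cas, background_att),
        "context": weight_cas(cas, context_att),
    }


def topk_video_scores(cas: CAS, k: int) -> List[float]:
    """Aggregate snippet scores into video-level scores per class.

    ASSUMPTION: mean of the k highest snippet scores, a common
    multiple-instance-learning pooling choice, not confirmed by the source.
    """
    num_classes = len(cas[0])
    scores = []
    for c in range(num_classes):
        column = sorted((row[c] for row in cas), reverse=True)
        scores.append(sum(column[:k]) / k)
    return scores


# Toy input: 3 snippets, 2 action classes plus 1 non-action class.
cas = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [1.0, 1.0, 1.0]]
action_att = [0.9, 0.1, 0.5]
background_att = [0.1, 0.9, 0.5]

weighted = build_weighted_cas(cas, action_att, background_att)
scores = topk_video_scores(weighted["action"], k=2)
```

In this sketch, the video-level scores from each weighted sequence could be compared against the video-level label during training, which is the only supervision the weakly-supervised setting provides.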
