Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

IF 2.4 | Zone 4 (Computer Science) | Eurasip Journal on Image and Video Processing | Pub Date: 2022-07-18 | DOI: 10.1186/s13640-022-00588-4

Jinlei Zhu, Houjin Chen, Pan Pan, Jia Sun


Citations: 2

Abstract



Weakly supervised spatial–temporal attention network driven by tracking and consistency loss for action detection

This study proposes a novel network model for video action tube detection. The model is based on a location-interactive, weakly supervised spatial–temporal attention mechanism driven by multiple loss functions. Annotating every target location in video frames is especially costly and time-consuming. Thus, we first propose a cross-domain weakly supervised learning method with a spatial–temporal attention mechanism for action tube detection. In the source domain, we trained a newly designed multi-loss spatial–temporal attention–convolution network on the source data set, which has both object location and classification annotations. In the target domain, we introduced an internal tracking loss and a neighbor-consistency loss, and starting from the pre-trained model we trained the network on the target data set, which has only inaccurate temporal positions of actions. Although this is a location-unsupervised method, it outperforms typical weakly supervised methods and even achieves results comparable to some recent fully supervised methods. We also visualize the activation maps, which reveal the intrinsic reason behind the higher performance of the proposed method.
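The abstract names a neighbor-consistency loss and an internal tracking loss but does not define either. The sketch below shows only one plausible reading of the general idea: attention maps of adjacent frames are encouraged to agree, and that term is added to the other losses with a weight. The function names, the mean-squared-difference formulation, and the weights `w_cons` and `w_track` are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import List

Frame = List[List[float]]  # one H x W attention map for a single frame


def neighbor_consistency_loss(maps: List[Frame]) -> float:
    """Mean squared difference between attention maps of adjacent frames.

    ASSUMPTION: illustrative definition only. The abstract does not give
    the paper's actual neighbor-consistency loss.
    """
    if len(maps) < 2:
        return 0.0  # a single frame has no neighbor to disagree with
    total, count = 0.0, 0
    for prev, curr in zip(maps, maps[1:]):
        for row_prev, row_curr in zip(prev, curr):
            for a, b in zip(row_prev, row_curr):
                total += (a - b) ** 2
                count += 1
    return total / count


def total_loss(cls_loss: float, consistency: float, tracking: float,
               w_cons: float = 1.0, w_track: float = 1.0) -> float:
    """Weighted sum of loss terms.

    ASSUMPTION: the weights and the additive form are hypothetical; the
    tracking term is passed in as a plain number because the abstract
    does not say how it is computed.
    """
    return cls_loss + w_cons * consistency + w_track * tracking


# Attention that stays put across frames incurs no penalty,
# while attention that jumps between frames is penalized.
steady = [[[0.0, 1.0], [0.0, 0.0]] for _ in range(3)]
jumping = [[[1.0, 0.0]], [[0.0, 1.0]]]
steady_penalty = neighbor_consistency_loss(steady)
jumping_penalty = neighbor_consistency_loss(jumping)
```

Under this reading, a term of this kind needs no location labels on the target data set, which is consistent with the abstract's description of the target-domain stage as location-unsupervised.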


Source journal

Eurasip Journal on Image and Video Processing (Engineering: Electrical and Electronic Engineering)

CiteScore: 7.10
Self-citation rate: 0.00%
Articles published:
Review time: 6.8 months

About the journal: EURASIP Journal on Image and Video Processing is intended for researchers from both academia and industry who are active in the multidisciplinary field of image and video processing. The scope of the journal covers all theoretical and practical aspects of the domain, from basic research to the development of applications.