COWO: towards real-time spatiotemporal action localization in videos

Yang Yi, Yang Sun, Saimei Yuan, Yiji Zhu, Mengyi Zhang, Wenjun Zhu
Robotic Intelligence and Automation, published 2022-01-18.
DOI: 10.1108/aa-07-2021-0098

Abstract

Purpose

The purpose of this paper is to provide a fast and accurate network for spatiotemporal action localization in videos. The network detects human actions in both time and space in real time, making it applicable to real-world scenarios such as safety monitoring and collaborative assembly.

Design/methodology/approach

This paper designs an end-to-end deep learning network called collaborator only watch once (COWO). COWO recognizes ongoing human activities in real time with enhanced accuracy. COWO inherits the architecture of you only watch once (YOWO), the best-performing network for online action localization to date, but with three major structural modifications. First, COWO enhances intraclass compactness and enlarges interclass separability at the feature level: a new correlation channel fusion and attention mechanism is designed based on the Pearson correlation coefficient, and a corresponding correction loss function minimizes within-class distances. Second, a probabilistic K-means clustering technique selects the initial seed points, the idea being that the initial distance between cluster centers should be as large as possible. Third, the CIoU regression loss function is applied instead of the Smooth L1 loss function to help the model converge stably.
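The abstract does not give the exact form of the Pearson-based channel fusion, so the following is only a minimal NumPy sketch of one plausible reading: compute a per-channel Pearson correlation between the 2D and 3D feature streams, convert the correlations into channel weights, and gate the fused features. The tensor shapes, the softmax gating, and the additive fusion are all assumptions, not the authors' implementation.

```python
import numpy as np

def pearson_channel_attention(feat_2d, feat_3d):
    """Sketch of Pearson-correlation-based channel attention for fusing a
    2D (spatial) stream with a 3D (spatiotemporal) stream.

    feat_2d, feat_3d: arrays of shape (C, H, W) with matching shapes.
    Returns a fused feature map of shape (C, H, W).
    Shapes, softmax gating, and additive fusion are assumptions.
    """
    c = feat_2d.shape[0]
    a = feat_2d.reshape(c, -1)
    b = feat_3d.reshape(c, -1)

    # Per-channel Pearson correlation between the two streams.
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    eps = 1e-8  # guard against zero-variance channels
    corr = (a_c * b_c).sum(axis=1) / (
        np.sqrt((a_c ** 2).sum(axis=1)) * np.sqrt((b_c ** 2).sum(axis=1)) + eps
    )

    # Turn correlations into channel weights (softmax is an assumption).
    weights = np.exp(corr) / np.exp(corr).sum()

    # Gate the fused (here: summed) features channel-wise.
    return weights[:, None, None] * (feat_2d + feat_3d)
```

Highly correlated channels, where the two streams agree, receive larger weights under this reading, which is one way to realize the abstract's stated goal of tightening intraclass structure at the feature level.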

Findings

COWO outperforms the original YOWO, improving frame mAP by 3% and 2.1% at a speed of 35.12 fps. Compared with two-stream, T-CNN and C3D methods, the improvement is about 5% and 14.5% on the J-HMDB-21, UCF101-24 and AGOT data sets.

Originality/value

COWO offers greater flexibility for assembly scenarios, as it perceives spatiotemporal human actions in real time. It contributes to many real-world scenarios such as safety monitoring and collaborative assembly.
