COWO: towards real-time spatiotemporal action localization in videos
Yang Yi, Yang Sun, Saimei Yuan, Yiji Zhu, Mengyi Zhang, Wenjun Zhu
Robotic Intelligence and Automation, published 18 January 2022. DOI: 10.1108/aa-07-2021-0098 (https://doi.org/10.1108/aa-07-2021-0098)
Abstract
Purpose
The purpose of this paper is to provide a fast and accurate network for spatiotemporal action localization in videos. The network detects human actions in both time and space in real time, which makes it applicable to real-world scenarios such as safety monitoring and collaborative assembly.
Design/methodology/approach
This paper designs an end-to-end deep learning network called collaborator only watch once (COWO). COWO recognizes ongoing human activities in real time with enhanced accuracy. COWO inherits the architecture of you only watch once (YOWO), regarded as the best-performing network for online action localization to date, but introduces three major structural modifications. First, COWO enhances intraclass compactness and enlarges interclass separability at the feature level: a new correlation channel fusion and attention mechanism is designed based on the Pearson correlation coefficient, and a corresponding correction loss function is designed that minimizes the distance between samples of the same class. Second, a probabilistic K-means clustering technique is used to select the initial seed points, the idea being that the initial distance between cluster centers should be as large as possible. Third, the CIoU regression loss function is applied instead of the Smooth L1 loss function to help the model converge stably.
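To make the last two modifications concrete, the sketches below illustrate the general techniques the abstract refers to. They are generic reconstructions under stated assumptions, not the authors' published code; the data layouts, box format and function names are assumptions. The first sketch shows distance-proportional (K-means++ style) seed selection, where each new seed is drawn with probability proportional to its squared distance from the seeds already chosen, so that initial cluster centers start far apart:

```python
import numpy as np

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial cluster centers that start far apart (K-means++ style seeding).

    Generic sketch: `points` is an (n, d) array of samples; the paper does not
    specify its exact sampling rule, so squared-distance-proportional sampling
    is an assumption here.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    seeds = [points[rng.integers(n)]]  # first seed: uniform at random
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen seed
        d2 = np.min([np.sum((points - s) ** 2, axis=1) for s in seeds], axis=0)
        # sample the next seed with probability proportional to that distance
        seeds.append(points[rng.choice(n, p=d2 / d2.sum())])
    return np.stack(seeds)
```

The second sketch shows a standard form of the CIoU regression loss mentioned in the abstract, assuming predicted and target boxes in (x1, y1, x2, y2) corner format; the exact implementation used in COWO may differ:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for (N, 4) boxes in (x1, y1, x2, y2) format (a minimal sketch)."""
    # Intersection and plain IoU
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between box centers, normalized by the squared diagonal
    # of the smallest enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term
    wp, hp = pred[:, 2] - pred[:, 0], (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    wt, ht = target[:, 2] - target[:, 0], (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```

Unlike Smooth L1 on raw coordinates, CIoU adds center-distance and aspect-ratio penalties to plain IoU, which keeps gradients informative even when predicted and ground-truth boxes barely overlap; this is the usual rationale for its more stable convergence.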
Findings
COWO outperforms the original YOWO, improving frame mAP by 3% and 2.1% while running at 35.12 fps. Compared with two-stream, T-CNN and C3D methods, the improvement is about 5% and 14.5% on the J-HMDB-21, UCF101-24 and AGOT data sets.
Originality/value
COWO offers greater flexibility for assembly scenarios because it perceives spatiotemporal human actions in real time. It contributes to many real-world scenarios such as safety monitoring and collaborative assembly.