Position-aware Location Regression Network for Temporal Video Grounding

2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS) Pub Date : 2021-11-16 DOI:10.1109/AVSS52988.2021.9663815

Sunoh Kim, Kimin Yun, J. Choi

Citations: 1

Abstract

The key to successful grounding in video surveillance is understanding a semantic phrase that corresponds to the important actors and objects. Conventional methods either ignore the comprehensive context of the phrase or require heavy computation over multiple phrases. To capture comprehensive context with only one semantic phrase, we propose the Position-aware Location Regression Network (PLRN), which exploits position-aware features of a query and a video. Specifically, PLRN first encodes both the video and the query using the positional information of words and video segments. Then, a semantic phrase feature is extracted from the encoded query with attention. The semantic phrase feature and the encoded video are merged into a context-aware feature that reflects local and global contexts. Finally, PLRN predicts the start, end, center, and width values of a grounding boundary. Our experiments show that PLRN achieves performance competitive with existing methods while using less computation time and memory.
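The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which positional encoding, attention form, or feature sizes PLRN uses, so the sinusoidal encoding, the single scoring vector `weight_vec`, and all function names below are assumptions chosen for illustration. The last function only shows the geometric relation between the (center, width) and (start, end) parameterizations of a boundary; how PLRN combines its four predicted values is not stated in the abstract.

```python
import math


def sinusoidal_position(pos, dim):
    """Sinusoidal encoding of one position (assumed; not confirmed for PLRN)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def add_positions(feats):
    """Add a positional encoding to each word or video-segment feature."""
    dim = len(feats[0])
    return [
        [x + p for x, p in zip(feat, sinusoidal_position(pos, dim))]
        for pos, feat in enumerate(feats)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_feats, weight_vec):
    """Pool word features into one phrase feature with softmax attention.

    weight_vec is a hypothetical learned scoring vector.
    """
    scores = [sum(w * x for w, x in zip(weight_vec, feat)) for feat in word_feats]
    alphas = softmax(scores)
    dim = len(word_feats[0])
    phrase = [
        sum(a * feat[d] for a, feat in zip(alphas, word_feats))
        for d in range(dim)
    ]
    return phrase, alphas


def center_width_to_boundary(center, width):
    """Convert a normalized (center, width) pair to (start, end) in [0, 1]."""
    start = max(0.0, center - width / 2)
    end = min(1.0, center + width / 2)
    return start, end
```

For example, `attend(add_positions(word_feats), weight_vec)` returns a single phrase vector together with one attention weight per word, and `center_width_to_boundary(c, w)` gives a boundary that can be compared against directly predicted start and end values.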



