Li Yang, Peixuan Wu, Chunfen Yuan, Bing Li, Weiming Hu
Human-centric spatio-temporal video grounding (HC-STVG) is a challenging task that aims to localize the spatio-temporal tube of the target person in a video based on a natural language description. In this report, we present our approach to this task. Specifically, based on the TubeDETR framework, we propose two cascaded decoders that decouple spatial and temporal grounding, allowing the model to capture the features best suited to each of these two grounding subtasks. We also devise a multi-stage inference strategy that reasons about the target in a coarse-to-fine manner and thereby produces more precise grounding results. To further improve accuracy, we propose a model ensemble strategy that incorporates the results of models with stronger spatial or temporal grounding performance. We validated the effectiveness of our proposed method on the HC-STVG 2.0 dataset and won second place in the HC-STVG track of the 4th Person in Context (PIC) workshop at ACM MM 2022.
{"title":"Cascaded Decoding and Multi-Stage Inference for Spatio-Temporal Video Grounding","authors":"Li Yang, Peixuan Wu, Chunfen Yuan, Bing Li, Weiming Hu","doi":"10.1145/3552455.3555814","DOIUrl":"https://doi.org/10.1145/3552455.3555814","url":null,"abstract":"Human-centric spatio-temporal video grounding (HC-STVG) is a challenging task that aims to localize the spatio-temporal tube of the target person in a video based on a natural language description. In this report, we present our approach for this challenging HC-STVG task. Specifically, based on the TubeDETR framework, we propose two cascaded decoders to decouple spatial and temporal grounding, which allows the model to capture respective favorable features for these two grounding subtasks. We also devise a multi-stage inference strategy to reason about the target in a coarse-to-fine manner and thereby produce more precise grounding results for the target. To further improve accuracy, we propose a model ensemble strategy that incorporates the results of models with better performance in spatial or temporal grounding. We validated the effectiveness of our proposed method on the HC-STVG 2.0 dataset and won second place in the HC-STVG track of the 4th Person in Context (PIC) workshop at ACM MM 2022.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128694041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, a variety of strong dense video captioning models have emerged. However, most of these models focus on global features and salient events in the video. In the makeup dataset used in this competition, the videos are very similar to one another, differing only in fine details. Because existing models lack the ability to attend to such fine-grained features, they do not generate these captions well. To address this, this paper proposes a keypoint detection algorithm for the human face and hands that runs in step with video frame extraction, and encapsulates the detected auxiliary features into the existing features, so that the existing video captioning system can attend to fine-grained cues. To further improve caption generation, we use the TSP model to extract more effective video features. Our model outperforms the baseline.
{"title":"Fine-grained Video Captioning via Precise Key Point Positioning","authors":"Yunjie Zhang, Tiangyang Xu, Xiaoning Song, Zhenghua Feng, Xiaojun Wu","doi":"10.1145/3552455.3555817","DOIUrl":"https://doi.org/10.1145/3552455.3555817","url":null,"abstract":"In recent years, a variety of excellent dense video caption models have emerged. However, most of these models focus on global features and salient events in the video. For the makeup data set used in this competition, the video content is very similar with only slight variations. Because the model lacks the ability to focus on fine-grained features, it does not generate captions very well. Based on this, this paper proposes a key point detection algorithm for the human face and human hand to synchronize and coordinate the detection of video frame extraction, and encapsulate the detected auxiliary features into the existing features, so that the existing video subtitle system can focus on fine-grained features. In order to improve the effect of generating subtitles, we further use the TSP model to extract more efficient video features. Our model has better performance than the baseline.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134343931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this technical report, we introduce our solution to the human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding within a single frame and learns to localize the target object spatially according to intra-frame visual cues such as object appearance. The dynamic branch performs cross-modal understanding across multiple frames and learns to predict the starting and ending time of the target moment according to dynamic visual cues such as motion. Both branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block that enables the two branches to transfer useful and complementary information to each other, which is shown to be effective in improving predictions on hard cases. Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.
{"title":"STVGFormer","authors":"Zihang Lin, Chaolei Tan, Jianfang Hu, Zhi Jin, Tiancai Ye, Weishi Zheng","doi":"10.1145/3552455.3555813","DOIUrl":"https://doi.org/10.1145/3552455.3555813","url":null,"abstract":"In this technical report, we introduce our solution to human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding in a single frame and learns to localize the target object spatially according to intra-frame visual cues like object appearances. The dynamic branch performs cross-modal understanding across multiple frames. It learns to predict the starting and ending time of the target moment according to dynamic visual cues like motions. Both the static and dynamic branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block to enable the static and dynamic branches to transfer useful and complementary information from each other, which is shown to be effective to improve the prediction on hard cases. Our proposed method achieved 39.6% vIoU and won the first place in the HC-STVG track of the 4th Person in Context Challenge.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121340017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This technical report presents the 3rd-place solution for MTVG, a new task introduced in the 4th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims at localizing the temporal boundary of a step in an untrimmed video based on a textual description. The biggest challenge of this task is the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are not sufficient to provide fine-grained cues. To address this issue, we propose to achieve fine-grained representations by exploiting feature diversity. Specifically, we propose a series of methods spanning feature extraction, network optimization, and model ensembling. As a result, we achieved 3rd place in the MTVG competition.
{"title":"Exploiting Feature Diversity for Make-up Temporal Video Grounding","authors":"Xiujun Shu, Wei Wen, Taian Guo, Su He, Chen Wu, Ruizhi Qiao","doi":"10.1145/3552455.3555818","DOIUrl":"https://doi.org/10.1145/3552455.3555818","url":null,"abstract":"This technical report presents the 3rd winning solution for MTVG, a new task introduced in the 4-th Person in Context (PIC) Challenge at ACM MM 2022. MTVG aims at localizing the temporal boundary of the step in an untrimmed video based on a textual description. The biggest challenge of this task is the fine-grained video-text semantics of make-up steps. However, current methods mainly extract video features using action-based pre-trained models. As actions are more coarse-grained than make-up steps, action-based features are not suffi cient to provide fi ne-grained cues. To address this issue,we propose to achieve fi ne-grained representation via exploiting feature diversities. Specifi cally, we proposed a series of methods from feature extraction, network optimization, to model ensemble. As a result, we achieved 3rd place in the MTVG competition.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128557346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu
In this technical report, we present our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution is built on the basis of TubeDETR and the Mutual Matching Network (MMN). Specifically, TubeDETR exploits a video-text encoder and a space-time decoder to predict the starting time, the ending time, and the tube of the target person. MMN detects persons in images, links them into tubes, extracts features of the person tubes and the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Our solution then refines the results by combining the spatial localization of MMN with the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieves third place.
{"title":"Human-centric Spatio-Temporal Video Grounding via the Combination of Mutual Matching Network and TubeDETR","authors":"Fan Yu, Zhixiang Zhao, Yuchen Wang, Yi Xu, Tongwei Ren, Gangshan Wu","doi":"10.1145/3552455.3555815","DOIUrl":"https://doi.org/10.1145/3552455.3555815","url":null,"abstract":"In this technical report, we represent our solution for the Human-centric Spatio-Temporal Video Grounding (HC-STVG) track of the 4th Person in Context (PIC) workshop and challenge. Our solution is built on the basis of TubeDETR and Mutual Matching Network (MMN). Specifically, TubeDETR exploits a video-text encoder and a space-time decoder to predict the starting time, the ending time and the tube of the target person. MMN detects persons in images, links them as tubes, extracts features of person tubes and the text description, and predicts the similarities between them to choose the most likely person tube as the grounding result. Our solution finally finetunes the results by combining the spatio localization of MMN and the temporal localization of TubeDETR. In the HC-STVG track of the 4th PIC challenge, our solution achieves the third place.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126051862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video. Semantic information plays an important role in both the localization and the description of events in DVC. We present a semantic-assisted dense video captioning model based on an encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to sufficiently represent the input video. In the decoding stage, we design a classification head, parallel to the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset [Wang et al. 2019] under DVC evaluation metrics and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of the PIC 4th Challenge (http://picdataset.com/challenge/task/mdvc/).
{"title":"PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning","authors":"Yifan Lu, Ziqi Zhang, Yuxin Chen, Chunfen Yuan, Bing Li, Weiming Hu","doi":"10.1145/3552455.3555816","DOIUrl":"https://doi.org/10.1145/3552455.3555816","url":null,"abstract":"The task of Dense Video Captioning (DVC) aims to generate captions with timestamps for multiple events in one video. Semantic information plays an important role for both localization and description of DVC. We present a semantic-assisted dense video captioning model based on the encoding-decoding framework. In the encoding stage, we design a concept detector to extract semantic information, which is then fused with multi-modal visual features to sufficiently represent the input video. In the decoding stage, we design a classification head, paralleled with the localization and captioning heads, to provide semantic supervision. Our method achieves significant improvements on the YouMakeup dataset citewang2019youmakeup under DVC evaluation metrics and achieves high performance in the Makeup Dense Video Captioning (MDVC) task of hrefhttp://picdataset.com/challenge/task/mdvc/ PIC 4th Challenge.","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127529179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 4th on Person in Context Workshop","authors":"","doi":"10.1145/3552455","DOIUrl":"https://doi.org/10.1145/3552455","url":null,"abstract":"","PeriodicalId":309164,"journal":{"name":"Proceedings of the 4th on Person in Context Workshop","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116894006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}