学习全局-局部视频-文本表示

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pub Date : 2020-06-01 DOI:10.1109/cvpr42600.2020.00877

Linchao Zhu, Yi Yang

{"title":"学习全局-局部视频-文本表示","authors":"Linchao Zhu, Yi Yang","doi":"10.1109/cvpr42600.2020.00877","DOIUrl":null,"url":null,"abstract":"In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint videotext representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperform the state-of-the-arts, demonstrating its superiority in video-text representation learning.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"8743-8752"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"337","resultStr":"{\"title\":\"ActBERT: Learning Global-Local Video-Text Representations\",\"authors\":\"Linchao Zhu, Yi Yang\",\"doi\":\"10.1109/cvpr42600.2020.00877\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint videotext representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperform the state-of-the-arts, demonstrating its superiority in video-text representation learning.\",\"PeriodicalId\":6715,\"journal\":{\"name\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"27 1\",\"pages\":\"8743-8752\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"337\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/cvpr42600.2020.00877\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/cvpr42600.2020.00877","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 337

摘要

在本文中，我们引入了ActBERT，用于从未标记数据中联合视频文本表示的自监督学习。首先，我们利用全局行为信息来催化语言文本和局部区域对象之间的相互作用。它从成对的视频序列和文本描述中揭示全局和局部视觉线索，用于详细的视觉和文本关系建模。其次，我们引入了一个纠缠变压器块(ENT)来编码三个信息源，即全局动作、局部区域对象和语言描述。通过从上下文信息中明智地提取线索来发现全局-局部对应关系。它强制联合视频文本表示意识到细粒度对象以及全局的人类意图。我们验证了ActBERT在下游视频和语言任务上的泛化能力，即文本视频剪辑检索、视频字幕、视频问答、动作分割和动作步骤定位。ActBERT的表现明显优于最先进的技术，证明了其在视频文本表示学习方面的优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

ActBERT: Learning Global-Local Video-Text Representations

In this paper, we introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data. First, we leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects. It uncovers global and local visual clues from paired video sequences and text descriptions for detailed visual and text relation modeling. Second, we introduce an ENtangled Transformer block (ENT) to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions. Global-local correspondences are discovered via judicious clues extraction from contextual information. It enforces the joint videotext representation to be aware of fine-grained objects as well as global human intention. We validate the generalization capability of ActBERT on downstream video-and language tasks, i.e., text-video clip retrieval, video captioning, video question answering, action segmentation, and action step localization. ActBERT significantly outperform the state-of-the-arts, demonstrating its superiority in video-text representation learning.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

自引率

0.00%

发文量

期刊最新文献

Geometric Structure Based and Regularized Depth Estimation From 360 Indoor Imagery 3D Part Guided Image Editing for Fine-Grained Object Understanding SDC-Depth: Semantic Divide-and-Conquer Network for Monocular Depth Estimation Approximating shapes in images with low-complexity polygons PFRL: Pose-Free Reinforcement Learning for 6D Pose Estimation