Audio-Visual Predictive Coding for Self-Supervised Visual Representation Learning

2020 25th International Conference on Pattern Recognition (ICPR). Pub Date: 2021-01-10. DOI: 10.1109/ICPR48806.2021.9413295

M. Tellamekala, M. Valstar, Michael P. Pound, T. Giesbrecht

Pages: 9912-9919

Citations: 1

Abstract

Self-supervised learning has emerged as a candidate approach for learning semantic visual features from unlabeled video data. In self-supervised learning, intrinsic correspondences between data points are used to define a proxy task that forces the model to learn semantic representations. Most existing proxy tasks applied to video data exploit either intra-modal (e.g. temporal) or cross-modal (e.g. audio-visual) correspondences, but not both. In theory, jointly learning both kinds of correspondence may result in richer visual features; but, as we show in this work, doing so is non-trivial in practice. To address this problem, we introduce ‘Audio-Visual Permutative Predictive Coding’ (AV-PPC), a multi-task learning framework designed to fully leverage temporal and cross-modal correspondences as natural supervision signals. In AV-PPC, the model is trained to learn multiple intra- and cross-modal predictive coding sub-tasks simultaneously. Using visual speech recognition (lip-reading) as the downstream evaluation task, we show that our proposed proxy task learns higher-quality visual features than existing proxy tasks. We also show that AV-PPC visual features are highly data-efficient. Without further fine-tuning, the AV-PPC visual encoder achieves an 80.30% spoken word classification rate on the LRW dataset, performing on par with directly supervised visual encoders learned from large amounts of labeled data.
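The abstract does not specify the loss or the exact set of sub-tasks, so the following is only an illustrative sketch of how intra- and cross-modal predictive coding sub-tasks might be combined in one multi-task objective. It assumes an InfoNCE-style contrastive loss (as used in contrastive predictive coding), four source-to-target pairings (visual to visual, audio to audio, visual to audio, audio to visual), and equal task weights. The names `info_nce` and `multi_task_loss` are hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(predictions, targets):
    """Mean InfoNCE loss over a batch.

    predictions[i] should score highest against targets[i]; the other
    targets in the batch act as negatives. (Assumed loss, not confirmed
    by the abstract.)
    """
    total = 0.0
    for i, p in enumerate(predictions):
        scores = [dot(p, t) for t in targets]
        m = max(scores)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(predictions)


# Assumed sub-tasks: (source modality, target modality).
# "v" = visual, "a" = audio. Same letters = intra-modal (temporal),
# different letters = cross-modal.
SUB_TASKS = [("v", "v"), ("a", "a"), ("v", "a"), ("a", "v")]


def multi_task_loss(pred, future):
    """Average the sub-task losses with equal weights (an assumption).

    pred[(src, tgt)] holds predictions of modality tgt's future
    embeddings made from modality src's context; future[tgt] holds the
    true future embeddings of modality tgt.
    """
    losses = {task: info_nce(pred[task], future[task[1]]) for task in SUB_TASKS}
    return sum(losses.values()) / len(losses), losses
```

In this sketch, a batch where every prediction matches its own target far better than the others drives each sub-task loss toward zero, while uninformative predictions give a loss of log(batch size) per sub-task.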


