Non-Local Neural Networks With Grouped Bilinear Attentional Transforms

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11801-11810. Pub Date: 2020-06-01. DOI: 10.1109/cvpr42600.2020.01182

Lu Chi, Zehuan Yuan, Yadong Mu, Changhu Wang


Citations: 16

Abstract

Modeling long-range spatial or temporal dependencies plays a key role in deep neural networks. The conventional, dominant solutions are recurrent operations on sequential data and deep stacks of convolutional layers with small kernels. More recently, a number of non-local operators (for example, those based on self-attention) have been devised. They are typically generic and can be plugged into many existing network pipelines to compute global interactions between any two neurons in a feature map. This work proposes a novel non-local operator, inspired by the attention mechanism of the human visual system, which quickly attends to important local parts of a scene and suppresses less relevant information. The core of our method is a learnable, data-adaptive bilinear attentional transform (BA-Transform), whose merits are threefold. First, the BA-Transform is versatile: it can model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions, and each BA-Transform is learned in a data-adaptive way. Second, to address the discrepancy among features, we further design grouped BA-Transforms, which apply different attentional operations to different groups of feature channels. Third, whereas many existing non-local operators are computation-intensive, the proposed BA-Transform is implemented by simple matrix multiplication and is more efficient. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet for image classification and Kinetics for video classification. The accuracies achieved and the ablation experiments consistently show substantial improvements.
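
The abstract describes the operator only in words, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes the bilinear transform of one H x W feature map X takes the form Y = P X Q, with P (H x H) mixing rows and Q (W x W) mixing columns, and that each channel group gets its own pair (P, Q). The function `attention_from_group` is hypothetical: it derives P and Q from softmaxed Gram matrices of the group-mean map purely so the example is data-adaptive and runnable, whereas the abstract says the attention matrices are learned. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax: every row becomes non-negative weights summing to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def ba_transform(x, p, q):
    """Assumed bilinear form Y = P X Q for a single H x W feature map."""
    return matmul(matmul(p, x), q)


def attention_from_group(channels):
    """Hypothetical generator of (P, Q) from the input (not the paper's learned one).

    P is H x W-independent: an H x H matrix whose rows sum to 1.
    Q is a W x W matrix whose columns sum to 1.
    """
    n = len(channels)
    h, w = len(channels[0]), len(channels[0][0])
    mean_map = [[sum(c[i][j] for c in channels) / n for j in range(w)]
                for i in range(h)]
    p = softmax_rows(matmul(mean_map, transpose(mean_map)))
    q = transpose(softmax_rows(matmul(transpose(mean_map), mean_map)))
    return p, q


def grouped_ba_transform(feature, num_groups):
    """Apply a separate (P, Q) pair to each group of channels.

    feature is a list of C channels, each an H x W list of rows.
    """
    c = len(feature)
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = c // num_groups
    out = []
    for g in range(num_groups):
        group = feature[g * size:(g + 1) * size]
        p, q = attention_from_group(group)
        out.extend(ba_transform(ch, p, q) for ch in group)
    return out


if __name__ == "__main__":
    # Four 2 x 3 channels split into two groups of two channels each.
    feat = [[[float(i + j + c) for j in range(3)] for i in range(2)]
            for c in range(4)]
    result = grouped_ba_transform(feat, num_groups=2)
    print(len(result), len(result[0]), len(result[0][0]))
```

Under the assumed form, the cost per channel is two small matrix products, on the order of H²W + HW² multiply-adds, whereas a pairwise self-attention operator over all HW positions needs an (HW) x (HW) affinity matrix. This is one way to read the abstract's claim that the transform is cheaper than existing non-local operators.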
