Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision: latest papers, page 9

DevNet: Self-supervised Monocular Depth Learning via Density Volume Construction

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-14 DOI: 10.48550/arXiv.2209.06351

Kaichen Zhou, Lanqing Hong, Changhao Chen, Hang Xu, Chao Ye, Qingyong Hu, Zhenguo Li

Self-supervised depth learning from monocular images normally relies on the 2D pixel-wise photometric relation between temporally adjacent image frames. However, such methods neither fully exploit the 3D point-wise geometric correspondences nor effectively tackle the ambiguities in photometric warping caused by occlusions or illumination inconsistency. To address these problems, this work proposes the Density Volume Construction Network (DevNet), a novel self-supervised monocular depth learning framework that considers 3D spatial information and exploits stronger geometric constraints among adjacent camera frustums. Instead of directly regressing the pixel value from a single image, DevNet divides the camera frustum into multiple parallel planes and predicts the pointwise occlusion probability density on each plane. The final depth map is generated by integrating the density along the corresponding rays. During training, novel regularization strategies and loss functions are introduced to mitigate photometric ambiguities and overfitting. Without noticeably increasing model parameter size or running time, DevNet outperforms several representative baselines on both the KITTI-2015 outdoor dataset and the NYU-V2 indoor dataset. In particular, the root-mean-square deviation is reduced by around 4% with DevNet on both KITTI-2015 and NYU-V2 in the task of depth estimation. Code is available at https://github.com/gitkaichenzhou/DevNet.
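The step of integrating density along a ray can be illustrated with the standard volume-rendering quadrature. This is only a minimal sketch under stated assumptions: the abstract does not give DevNet's exact formulation, so the function name `expected_depth`, the alpha-compositing weights, and the fallback to the farthest plane are all illustrative choices rather than the authors' implementation.

```python
import math


def expected_depth(densities, plane_depths):
    """Collapse per-plane densities along one ray into a single depth value.

    Assumption: standard alpha compositing, where the weight of plane i is
    (transmittance up to i) * (opacity of plane i).
    """
    if len(densities) != len(plane_depths):
        raise ValueError("densities and plane_depths must have equal length")

    transmittance = 1.0   # fraction of the ray not yet absorbed
    weighted_depth = 0.0
    weight_sum = 0.0
    for i, (sigma, depth) in enumerate(zip(densities, plane_depths)):
        # Spacing to the next plane; the last plane reuses the previous spacing.
        if i + 1 < len(plane_depths):
            delta = plane_depths[i + 1] - depth
        elif i > 0:
            delta = depth - plane_depths[i - 1]
        else:
            delta = 1.0
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this plane
        weight = transmittance * alpha
        weighted_depth += weight * depth
        weight_sum += weight
        transmittance *= 1.0 - alpha

    # Hypothetical fallback: a ray that hits nothing is assigned the far plane.
    if weight_sum == 0.0:
        return plane_depths[-1]
    return weighted_depth / weight_sum
```

A ray whose density is concentrated on one plane returns roughly that plane's depth, which is the behavior a depth map built this way would need.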

Volume: 29 1 Pages: 125-142

Citations: 11

Robust Category-Level 6D Pose Estimation with Coarse-to-Fine Rendering of Neural Features

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-12 DOI: 10.48550/arXiv.2209.05624

Wufei Ma, Angtian Wang, A. Yuille, Adam Kortylewski

We consider the problem of category-level 6D pose estimation from a single RGB image. Our approach represents an object category as a cuboid mesh and learns a generative model of the neural feature activations at each mesh vertex to perform pose estimation through differentiable rendering. A common problem of rendering-based approaches is that they rely on bounding box proposals, which do not convey information about the 3D rotation of the object and are not reliable when objects are partially occluded. Instead, we introduce a coarse-to-fine optimization strategy that utilizes the rendering process to estimate a sparse set of 6D object proposals, which are subsequently refined with gradient-based optimization. The key to enabling the convergence of our approach is a neural feature representation that is trained to be scale- and rotation-invariant using contrastive learning. Our experiments demonstrate an enhanced category-level 6D pose estimation performance compared to prior work, particularly under strong partial occlusion.

Volume: 17 1 Pages: 492-508

Citations: 9

CenterFormer: Center-based Transformer for 3D Object Detection

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-12 DOI: 10.48550/arXiv.2209.05588

Zixiang Zhou, Xian Zhao, Yu Wang, Panqu Wang, H. Foroosh

Query-based transformers have shown great potential in constructing long-range attention in many image-domain tasks, but have rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of each center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box from the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer.
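The first two steps described in the abstract (pick center candidates from a heatmap, then use their features as queries) can be sketched as a top-k selection. This is a plain-Python illustration, not the released CenterFormer code: the function name `select_center_queries`, the nested-list layout, and the absence of any peak suppression are assumptions made for brevity.

```python
def select_center_queries(heatmap, features, k):
    """Pick the k highest-scoring heatmap cells and gather their features.

    heatmap:  H x W nested list of center scores.
    features: H x W nested list of feature vectors (one per cell).
    Returns (positions, queries), both ordered by descending score.
    """
    scored = [
        (heatmap[row][col], row, col)
        for row in range(len(heatmap))
        for col in range(len(heatmap[0]))
    ]
    # Highest score first; ties keep row-major order because sort is stable.
    scored.sort(key=lambda item: item[0], reverse=True)

    positions = [(row, col) for _, row, col in scored[:k]]
    # The feature at each selected center would serve as a query embedding.
    queries = [features[row][col] for row, col in positions]
    return positions, queries
```

Restricting the queries to k candidates is what keeps the number of transformer queries small relative to the full point cloud feature map.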

Volume: 139 1 Pages: 496-513

Citations: 51

Cross-Modal Knowledge Transfer Without Task-Relevant Source Data

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-08 DOI: 10.48550/arXiv.2209.04027

Sk. Miraj Ahmed, Suhas Lohit, Kuan-Chuan Peng, Michael Jones, A. Roy-Chowdhury

Cost-effective depth and infrared sensors as alternatives to the usual RGB sensors are now a reality, and have some advantages over RGB in domains like autonomous navigation and remote sensing. As such, building computer vision and deep learning systems for depth and infrared data is crucial. However, large labeled datasets for these modalities are still lacking. In such cases, transferring knowledge from a neural network trained on a well-labeled large dataset in the source modality (RGB) to a neural network that works on a target modality (depth, infrared, etc.) is of great value. For reasons like memory and privacy, it may not be possible to access the source data, and knowledge transfer needs to work with only the source models. We describe an effective solution, SOCKET: SOurce-free Cross-modal KnowledgE Transfer, for this challenging task of transferring knowledge from one source modality to a different target modality without access to task-relevant source data. The framework reduces the modality gap using paired task-irrelevant data, as well as by matching the mean and variance of the target features with the batch-norm statistics that are present in the source models. We show through extensive experiments that, on classification tasks, our method significantly outperforms existing source-free methods, which do not account for the modality gap.
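The idea of matching target feature statistics to stored batch-norm statistics can be sketched as a simple loss. This is a hedged illustration only: the abstract does not say which distance SOCKET uses, so the squared difference below, the function name `bn_stat_matching_loss`, and the plain-list data layout are assumptions.

```python
def bn_stat_matching_loss(target_features, source_mean, source_var):
    """Penalize the gap between target batch statistics and stored BN statistics.

    target_features: list of per-sample feature vectors (batch x channels).
    source_mean, source_var: per-channel running statistics taken from a
    batch-norm layer of the source model.
    Assumption: squared differences, summed over channels.
    """
    batch_size = len(target_features)
    loss = 0.0
    for channel in range(len(source_mean)):
        values = [sample[channel] for sample in target_features]
        mean = sum(values) / batch_size
        # Biased (population) variance, as used for batch statistics.
        var = sum((v - mean) ** 2 for v in values) / batch_size
        loss += (mean - source_mean[channel]) ** 2
        loss += (var - source_var[channel]) ** 2
    return loss
```

The loss is zero when the target batch already has the same per-channel mean and variance as the source statistics, and it needs no source data, only the numbers stored in the source model.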

Volume: 7 1 Pages: 111-127

Citations: 8

Multi-Granularity Prediction for Scene Text Recognition

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-08 DOI: 10.48550/arXiv.2209.03592

P. Wang, Cheng Da, C. Yao

Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been successively proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way, i.e., subword representations (BPE and WordPiece) widely used in NLP are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.

Volume: 112 1 Pages: 339-355

Citations: 19

Unpaired Image Translation via Vector Symbolic Architectures

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-06 DOI: 10.48550/arXiv.2209.02686

Justin D. Theiss, Jay Leverett, Daeil Kim, Aayush Prakash

Image-to-image translation has played an important role in enabling synthetic data for computer vision. However, if the source and target domains have a large semantic mismatch, existing techniques often suffer from source content corruption, also known as semantic flipping. To address this problem, we propose a new paradigm for image-to-image translation using Vector Symbolic Architectures (VSA), a theoretical framework that defines algebraic operations in a high-dimensional vector (hypervector) space. We introduce VSA-based constraints on adversarial learning for source-to-target translations by learning a hypervector mapping that inverts the translation to ensure consistency with source content. We show both qualitatively and quantitatively that our method improves over other state-of-the-art techniques.
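To make "algebraic operations in a hypervector space" concrete, here is a small sketch of binding and unbinding with bipolar hypervectors. It assumes one common VSA variant (elementwise multiplication as the binding operator); the abstract does not say which VSA the paper uses, and the helper names `random_hypervector`, `bind`, and `similarity` are illustrative.

```python
import random


def random_hypervector(dim, rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Bind two hypervectors by elementwise multiplication.

    With bipolar entries, binding is its own inverse: bind(bind(a, b), b) == a.
    """
    return [x * y for x, y in zip(a, b)]


def similarity(a, b):
    """Normalized dot product; 1.0 for identical bipolar vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)
```

For example, with `rng = random.Random(0)` and two 10,000-dimensional vectors `a` and `b`, `bind(bind(a, b), b)` recovers `a` exactly, while `bind(a, b)` is nearly orthogonal to `a`. An invertible mapping of this kind is what lets a translation be undone and compared against the source content.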

Citations: 14

Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-03 DOI: 10.48550/arXiv.2209.01404

Xingrun Xing, Yangguang Li, Wei Li, Wenrui Ding, Yalong Jiang, Yufeng Wang, Jinghua Shao, Chunlei Liu, Xianglong Liu

Existing Binary Neural Networks (BNNs) mainly operate on local convolutions with a binarization function. However, such simple bit operations lack the ability to model contextual dependencies, which is critical for learning discriminative deep representations in vision models. In this work, we tackle this issue by presenting new designs of binary neural modules, which enable BNNs to learn effective contextual dependencies. First, we propose a binary multi-layer perceptron (MLP) block as an alternative to binary convolution blocks to directly model contextual dependencies. Both short-range and long-range feature dependencies are modeled by binary MLPs, where the former provides local inductive bias and the latter breaks the limited receptive field in binary convolutions. Second, to improve the robustness of binary models with contextual dependencies, we compute contextual dynamic embeddings to determine the binarization thresholds in general binary convolutional blocks. Armed with our binary MLP blocks and improved binary convolution, we build BNNs with explicit Contextual Dependency modeling, termed BCDNet. On the standard ImageNet-1K classification benchmark, BCDNet achieves 72.3% Top-1 accuracy and outperforms leading binary methods by a large margin. In particular, the proposed BCDNet exceeds the state-of-the-art ReActNet-A by 2.9% Top-1 accuracy with similar operations. Our code is available at https://github.com/Sense-GVT/BCDNet.
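Binarization against a threshold, the operation that the contextual embeddings are said to control, can be sketched in a few lines. This shows only the generic sign-with-threshold step; how BCDNet actually computes its thresholds is not given in the abstract, so the per-channel `thresholds` argument here is simply supplied by the caller, and the function name `binarize` is hypothetical.

```python
def binarize(activations, thresholds):
    """Map each activation to +1 or -1 by comparing it with its threshold.

    activations, thresholds: equal-length lists of floats.
    With all thresholds at 0.0 this reduces to the usual sign function
    (ties at the threshold map to +1 in this sketch).
    """
    if len(activations) != len(thresholds):
        raise ValueError("activations and thresholds must have equal length")
    return [1 if a >= t else -1 for a, t in zip(activations, thresholds)]
```

Shifting the thresholds changes which activations fire as +1 without touching the weights, which is why an input-dependent threshold can make a binary layer sensitive to context.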

Volume: 8 1 Pages: 536-552

Citations: 1

Meta-Learning with Less Forgetting on Large-Scale Non-Stationary Task Distributions

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-03 DOI: 10.48550/arXiv.2209.01501

Zhenyi Wang, Li Shen, Le Fang, Qiuling Suo, Dongling Zhan, Tiehang Duan, Mingchen Gao

The paradigm of machine intelligence moves from purely supervised learning to a more practical scenario in which many loosely related unlabeled data are available and labeled data is scarce. Most existing algorithms assume that the underlying task distribution is stationary. Here we consider a more realistic and challenging setting in which task distributions evolve over time. We name this problem Semi-supervised meta-learning with Evolving Task diStributions, abbreviated as SETS. Two key challenges arise in this more realistic setting: (i) how to use unlabeled data in the presence of a large amount of unlabeled out-of-distribution (OOD) data; and (ii) how to prevent catastrophic forgetting on previously learned task distributions due to the task distribution shift. We propose an OOD Robust and knowleDge presErved semi-supeRvised meta-learning approach (ORDER) to tackle these two major challenges. Specifically, ORDER introduces a novel mutual information regularization to robustify the model with unlabeled OOD data and adopts an optimal transport regularization to remember previously learned knowledge in feature space. In addition, we test our method on a very challenging dataset: SETS on large-scale non-stationary semi-supervised task distributions consisting of (at least) 72K tasks. With extensive experiments, we demonstrate that the proposed ORDER alleviates forgetting on evolving task distributions and is more robust to OOD data than related strong baselines.

Volume: 38 1 Pages: 221-238

Citations: 7

Revisiting Outer Optimization in Adversarial Training

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-02 DOI: 10.48550/arXiv.2209.01199

Ali Dabouei, Fariborz Taherkhani, Sobhan Soleymani, N. Nasrabadi

Despite the fundamental distinction between adversarial and natural training (AT and NT), AT methods generally adopt momentum SGD (MSGD) for the outer optimization. This paper aims to analyze this choice by investigating the overlooked role of outer optimization in AT. Our exploratory evaluations reveal that AT induces higher gradient norm and variance compared to NT. This phenomenon hinders the outer optimization in AT since the convergence rate of MSGD is highly dependent on the variance of the gradients. To this end, we propose an optimization method called ENGM which regularizes the contribution of each input example to the average mini-batch gradients. We prove that the convergence rate of ENGM is independent of the variance of the gradients, and thus, it is suitable for AT. We introduce a trick to reduce the computational cost of ENGM using empirical observations on the correlation between the norm of gradients w.r.t. the network parameters and input examples. Our extensive evaluations and ablation studies on CIFAR-10, CIFAR-100, and TinyImageNet demonstrate that ENGM and its variants consistently improve the performance of a wide range of AT methods. Furthermore, ENGM alleviates major shortcomings of AT including robust overfitting and high sensitivity to hyperparameter settings.
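The abstract does not spell out how ENGM regularizes each example's contribution, so the following is not the authors' method. It is only a minimal sketch of the general idea of bounding every example's share of the mini-batch gradient, here by rescaling per-example gradients to a maximum norm before averaging. The linear least-squares model and the names `per_example_grads`, `bounded_mean_grad`, and `max_norm` are assumptions made for illustration.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Per-example gradients of 0.5 * (x . w - y)^2 w.r.t. w, shape (n, d)."""
    residual = X @ w - y              # shape (n,)
    return residual[:, None] * X      # one gradient row per example

def bounded_mean_grad(grads, max_norm):
    """Rescale each example's gradient to norm <= max_norm, then average.

    Hypothetical helper: a stand-in for 'regularizing the contribution of
    each input example', not the ENGM update itself.
    """
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    return (grads * scale).mean(axis=0)

# Toy batch: the third example has a far larger gradient than the others.
X = np.array([[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]])
y = np.zeros(3)
w = np.array([1.0, 1.0])

g = per_example_grads(w, X, y)
plain = g.mean(axis=0)                      # dominated by the third example
bounded = bounded_mean_grad(g, max_norm=1.0)  # no example exceeds norm 1
```

In this toy batch the plain average is dominated by the single large-gradient example, while the bounded average has norm at most `max_norm` by construction.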

Citations: 1

Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

Pub Date : 2022-09-01 DOI: 10.48550/arXiv.2209.00638

Nadine Behrmann, S. Golestaneh, Zico Kolter, Juergen Gall, M. Noroozi

This paper introduces a unified framework for video action segmentation via sequence to sequence (seq2seq) translation in a fully and timestamp supervised setup. In contrast to current state-of-the-art frame-level prediction methods, we view action segmentation as a seq2seq translation task, i.e., mapping a sequence of video frames to a sequence of action segments. Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model to cope with long input sequences, as opposed to short output sequences, and relatively few videos. We incorporate an auxiliary supervision signal for the encoder via a frame-wise loss and propose a separate alignment decoder for an implicit duration prediction. Finally, we extend our framework to the timestamp supervised setting via our proposed constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed framework performs consistently on both fully and timestamp supervised settings, outperforming or competing with the state of the art on several datasets. Our code is publicly available at https://github.com/boschresearch/UVAST.
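To make the difference between frame-level and segment-level targets concrete, the sketch below collapses per-frame labels into a short sequence of (action, duration) pairs and expands them back. It only illustrates the two representations under the assumption that durations are counted in frames; it is not taken from the UVAST code, and the function names and labels are hypothetical.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (action, duration_in_frames) pairs."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(frame_labels)]

def segments_to_frames(segments):
    """Expand (action, duration_in_frames) pairs back into per-frame labels."""
    frames = []
    for label, duration in segments:
        frames.extend([label] * duration)
    return frames

# Six frames become three segments: a long input maps to a short output.
frames = ["pour", "pour", "pour", "stir", "stir", "pour"]
segments = frames_to_segments(frames)
```

The round trip `segments_to_frames(frames_to_segments(frames))` recovers the original frame labels, which is why a predicted segment sequence with durations is enough to produce a frame-wise segmentation.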

Citations: 27