Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00962
Arnav Chavan, Rishabh Tiwari, Udbhav Bamba, D. Gupta
Gradient-based meta-learning methods are prone to overfitting on the meta-training set, and this behaviour is more prominent with large and complex networks. Moreover, large networks restrict the application of meta-learning models on low-power edge devices. While choosing smaller networks avoids these issues to a certain extent, it hurts overall generalization and reduces performance. Clearly, there is an approximately optimal choice of network architecture that is best suited for every meta-learning problem; however, identifying it beforehand is not straightforward. In this paper, we present MetaDOCK, a task-specific dynamic kernel selection strategy for designing compressed CNN models that generalize well on unseen tasks in meta-learning. Our method is based on the hypothesis that, for a given set of similar tasks, not all kernels of the network are needed by each individual task. Rather, each task uses only a fraction of the kernels, and the selection of kernels per task can be learnt dynamically as part of the inner update steps. MetaDOCK compresses the meta-model as well as the task-specific inner models, thus providing a significant reduction in model size for each task, and by constraining the number of active kernels for every task, it implicitly mitigates the issue of meta-overfitting. We show that, for the same inference budget, pruned versions of large CNN models obtained using our approach consistently outperform the conventional choices of CNN models. MetaDOCK couples well with popular meta-learning approaches such as iMAML [22]. The efficacy of our method is validated on the CIFAR-fs [1] and mini-ImageNet [28] datasets, and we observe that our approach can improve model accuracy by up to 2% on standard meta-learning benchmarks while reducing model size by more than 75%. Our code is available at https://github.com/transmuteAI/MetaDOCK.
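The kernel-selection idea can be pictured as a per-kernel gate that the inner loop adapts for each task. Below is a minimal, hypothetical PyTorch sketch (not the authors' code): GatedConv, inner_adapt, and sparsity_weight are illustrative names, and the inner loop here adapts only the gates with a simple sparsity penalty to limit how many kernels stay active.

```python
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    """Convolution whose output kernels are scaled by learnable, task-adaptable gates."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.gate_logits = nn.Parameter(torch.zeros(out_ch))  # one gate per output kernel

    def forward(self, x):
        gate = torch.sigmoid(self.gate_logits).view(1, -1, 1, 1)  # soft kernel selection
        return self.conv(x) * gate

def inner_adapt(model, loss_fn, x_support, y_support, steps=5, lr=0.1, sparsity_weight=1e-3):
    """Adapt only the kernel gates on a task's support set (task-specific kernel selection)."""
    gates = [p for name, p in model.named_parameters() if "gate_logits" in name]
    opt = torch.optim.SGD(gates, lr=lr)
    for _ in range(steps):
        loss = loss_fn(model(x_support), y_support)
        loss = loss + sparsity_weight * sum(torch.sigmoid(g).sum() for g in gates)  # keep few kernels active
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```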
{"title":"Dynamic Kernel Selection for Improved Generalization and Memory Efficiency in Meta-learning","authors":"Arnav Chavan, Rishabh Tiwari, Udbhav Bamba, D. Gupta","doi":"10.1109/CVPR52688.2022.00962","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00962","url":null,"abstract":"Gradient based meta-learning methods are prone to overfit on the meta-training set, and this behaviour is more prominent with large and complex networks. Moreover, large networks restrict the application of meta-learning models on low-power edge devices. While choosing smaller networks avoid these issues to a certain extent, it affects the overall generalization leading to reduced performance. Clearly, there is an approximately optimal choice of network architecture that is best suited for every meta-learning problem, however, identifying it beforehand is not straight-forward. In this paper, we present Metadock, a task-specific dynamic kernel selection strategy for designing compressed CNN models that generalize well on unseen tasks in meta-learning. Our method is based on the hypothesis that for a given set of similar tasks, not all kernels of the network are needed by each individual task. Rather, each task uses only a fraction of the kernels, and the selection of the kernels per task can be learnt dynamically as a part of the inner update steps. Metadockcompresses the meta-model as well as the task-specific inner models, thus providing significant reduction in model size for each task, and through constraining the number of active kernels for every task, it implicitly mitigates the issue of meta-overfitting. We show that for the same inference budget, pruned versions of large CNN models obtained using our approach consistently outperform the conventional choices of CNN models. Metadock couples well with popular meta-learning approaches such as iMAML [22]. The efficacy of our method is validated on CIFAR-fs [1] and mini-ImageNet [28] datasets, and we have observed that our approach can provide improvements in model accuracy of up to 2% on standard meta-learning benchmark, while reducing the model size by more than 75%. Our code is available at https://github.com/transmuteAI/MetaDOCK.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134050051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PCL: Proxy-based Contrastive Learning for Domain Generalization
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00696
Xu Yao, Yang Bai, Xinyun Zhang, Yuechen Zhang, Qi Sun, Ran Chen, Ruiyu Li, Bei Yu
Domain generalization refers to the problem of training a model from a collection of different source domains such that it can directly generalize to unseen target domains. A promising solution is contrastive learning, which attempts to learn domain-invariant representations by exploiting rich semantic relations among sample-to-sample pairs from different domains. A simple approach is to pull positive sample pairs from different domains closer while pushing other negative pairs further apart. In this paper, we find that directly applying contrastive-based methods (e.g., supervised contrastive learning) is not effective in domain generalization. We argue that aligning positive sample-to-sample pairs tends to hinder model generalization due to the significant distribution gaps between different domains. To address this issue, we propose a novel proxy-based contrastive learning method, which replaces the original sample-to-sample relations with proxy-to-sample relations, significantly alleviating the positive alignment issue. Experiments on four standard benchmarks demonstrate the effectiveness of the proposed method. Furthermore, we also consider a more challenging scenario where no ImageNet pre-trained models are provided. Our method consistently shows better performance.
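To illustrate what replacing sample-to-sample positives with proxy-to-sample relations can look like, here is a minimal hedged sketch (an assumption, not the paper's exact loss): each class owns a learnable proxy, a sample is pulled toward its class proxy and pushed away from the other proxies and from other samples, so no positive pair of samples from different domains has to be aligned directly.

```python
import torch
import torch.nn.functional as F

def proxy_contrastive_loss(features, labels, proxies, temperature=0.1):
    """features: (N, D) embeddings; labels: (N,) long; proxies: (C, D) learnable class proxies."""
    f = F.normalize(features, dim=1)
    p = F.normalize(proxies, dim=1)
    pos = (f * p[labels]).sum(dim=1, keepdim=True) / temperature       # (N, 1) proxy-to-sample positives
    neg_proxy = f @ p.t() / temperature                                 # (N, C) proxy negatives
    neg_proxy.scatter_(1, labels.view(-1, 1), float("-inf"))            # drop each sample's positive proxy
    neg_sample = f @ f.t() / temperature                                # (N, N) sample negatives only
    neg_sample.fill_diagonal_(float("-inf"))                            # drop self-similarity
    logits = torch.cat([pos, neg_proxy, neg_sample], dim=1)
    target = torch.zeros(len(f), dtype=torch.long, device=f.device)     # the positive sits at index 0
    return F.cross_entropy(logits, target)
```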
{"title":"PCL: Proxy-based Contrastive Learning for Domain Generalization","authors":"Xu Yao, Yang Bai, Xinyun Zhang, Yuechen Zhang, Qi Sun, Ran Chen, Ruiyu Li, Bei Yu","doi":"10.1109/CVPR52688.2022.00696","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00696","url":null,"abstract":"Domain generalization refers to the problem of training a model from a collection of different source domains that can directly generalize to the unseen target domains. A promising solution is contrastive learning, which attempts to learn domain-invariant representations by exploiting rich semantic relations among sample-to-sample pairs from different domains. A simple approach is to pull positive sample pairs from different domains closer while pushing other negative pairs further apart. In this paper, we find that directly applying contrastive-based methods (e.g., supervised contrastive learning) are not effective in domain generalization. We argue that aligning positive sample-to-sample pairs tends to hinder the model generalization due to the significant distribution gaps between different domains. To address this issue, we propose a novel proxy-based contrastive learning method, which replaces the original sample-to-sample relations with proxy-to-sample relations, significantly alleviating the positive alignment issue. Experiments on the four standard benchmarks demonstrate the effectiveness of the proposed method. Furthermore, we also consider a more complex scenario where no ImageNet pre-trained models are provided. Our method consistently shows better performance.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131889196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bounded Adversarial Attack on Deep Content Features
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.01477
Qiuling Xu, Guanhong Tao, Xiangyu Zhang
We propose a novel adversarial attack targeting content features in a deep layer, that is, individual neurons in the layer. A naive method that enforces a fixed value/percentage bound on neuron activation values can hardly work and generates very noisy samples. The reason is that the level of perceptual variation entailed by a fixed value bound is non-uniform across neurons and even for the same neuron. We hence propose a novel distribution quantile bound for activation values and a polynomial barrier loss function. Given a benign input, a fixed quantile bound is translated into many value bounds, one for each neuron, based on the distribution of the neuron's activations and its current activation value on the given input. These individualized bounds enable fine-grained regulation, allowing content feature mutations with bounded perceptual variations. Our evaluation on ImageNet and five different model architectures demonstrates that our attack is effective. Compared with seven recent adversarial attacks in both the pixel space and the feature space, our attack achieves a state-of-the-art trade-off between attack success rate and imperceptibility. Code and samples are available on GitHub [37].
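The quantile-bound idea can be sketched as follows: per-neuron empirical distributions turn a single quantile budget into individual value intervals around each neuron's activation on the benign input, and a polynomial penalty discourages leaving those intervals. This is an illustrative assumption of the mechanism, not the authors' implementation; per_neuron_bounds, polynomial_barrier, and the exact quantile-to-bound mapping are hypothetical.

```python
import torch

def per_neuron_bounds(activation_bank, benign_act, q=0.1):
    """activation_bank: (S, D) activations of one layer over S reference images.
    benign_act: (D,) activations of the benign input. Returns per-neuron (low, high) bounds."""
    ranks = (activation_bank < benign_act.unsqueeze(0)).float().mean(dim=0)  # empirical quantile of each benign value
    low_q = (ranks - q).clamp(0.0, 1.0)
    high_q = (ranks + q).clamp(0.0, 1.0)
    sorted_bank, _ = activation_bank.sort(dim=0)
    idx_low = (low_q * (len(activation_bank) - 1)).long()
    idx_high = (high_q * (len(activation_bank) - 1)).long()
    low = sorted_bank.gather(0, idx_low.unsqueeze(0)).squeeze(0)
    high = sorted_bank.gather(0, idx_high.unsqueeze(0)).squeeze(0)
    return low, high

def polynomial_barrier(act, low, high, power=4):
    """Penalty that grows polynomially once an activation leaves its per-neuron interval."""
    violation = torch.relu(low - act) + torch.relu(act - high)
    return (violation ** power).sum()
```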
{"title":"Bounded Adversarial Attack on Deep Content Features","authors":"Qiuling Xu, Guanhong Tao, Xiangyu Zhang","doi":"10.1109/CVPR52688.2022.01477","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01477","url":null,"abstract":"We propose a novel adversarial attack targeting content features in some deep layer, that is, individual neurons in the layer. A naive method that enforces a fixed value/percentage bound for neuron activation values can hardly work and generates very noisy samples. The reason is that the level of perceptual variation entailed by a fixed value bound is non-uniform across neurons and even for the same neuron. We hence propose a novel distribution quantile bound for activation values and a polynomial barrier loss function. Given a benign input, a fixed quantile bound is translated to many value bounds, one for each neuron, based on the distributions of the neuron's activations and the current activation value on the given input. These individualized bounds enable fine-grained regulation, allowing content feature mutations with bounded perceptional variations. Our evaluation on ImageNet and five different model architectures demonstrates that our attack is effective. Compared to seven other latest adversarial attacks in both the pixel space and the feature space, our attack can achieve the state-of-the-art trade-off between attack success rate and imperceptibility. 11Code and Samples are available on Github [37].","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127595030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Segment, Magnify and Reiterate: Detecting Camouflaged Objects the Hard Way
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00467
Qi Jia, Shuilian Yao, Yu Liu, Xin Fan, Risheng Liu, Zhongxuan Luo
It is challenging to accurately detect camouflaged objects from their highly similar surroundings. Existing methods mainly follow a single-stage detection fashion, neglecting that small objects with low-resolution fine edges require more operations than larger ones. To tackle camouflaged object detection (COD), we take inspiration from human attention coupled with a coarse-to-fine detection strategy, and thereby propose an iterative refinement framework, coined SegMaR, which integrates Segment, Magnify and Reiterate in a multi-stage detection fashion. Specifically, we design a new discriminative mask which makes the model attend to the fixation and edge regions. In addition, we leverage an attention-based sampler to magnify the object region progressively without enlarging the image size. Extensive experiments show that our SegMaR achieves remarkable and consistent improvements over other state-of-the-art methods. In particular, we surpass two competitive methods by 7.4% and 20.0%, respectively, on average over standard evaluation metrics on small camouflaged objects. Additional studies provide more promising insights into SegMaR, including the effectiveness of the discriminative mask and its generalization to other network architectures. Code is available at https://github.com/dlut-dimt/SegMaR.
LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00330
Dan Liu, Libo Zhang, Yanjun Wu
Gesture recognition plays an important role in natural human-computer interaction and sign language recognition. Existing research on gesture recognition is limited to close-range interaction such as vehicle gesture control and face-to-face communication. To apply gesture recognition to long-distance interactive scenes such as meetings and smart homes, a large RGB-D video dataset, LD-ConGR, is established in this paper. LD-ConGR is distinguished from existing gesture datasets by its long-distance gesture collection, fine-grained annotations, and high video quality. Specifically, 1) the farthest gesture in LD-ConGR is captured 4 m away from the camera, while existing gesture datasets collect gestures within 1 m of the camera; 2) besides the gesture category, the temporal segmentation of gestures and hand locations are also annotated in LD-ConGR; 3) videos are captured at high resolution (1280 x 720 for color streams and 640 x 576 for depth streams) and high frame rate (30 fps). On top of LD-ConGR, a series of experiments and studies are conducted, and the proposed gesture region estimation and key frame sampling strategies are demonstrated to be effective in dealing with long-distance gesture recognition and the uncertainty of gesture duration. The dataset and experimental results presented in this paper are expected to boost research on long-distance gesture recognition. The dataset is available at https://github.com/Diananini/LD-ConGR-CVPR2022.
{"title":"LD-ConGR: A Large RGB-D Video Dataset for Long-Distance Continuous Gesture Recognition","authors":"Dan Liu, Libo Zhang, Yanjun Wu","doi":"10.1109/CVPR52688.2022.00330","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00330","url":null,"abstract":"Gesture recognition plays an important role in natural human-computer interaction and sign language recognition. Existing research on gesture recognition is limited to close-range interaction such as vehicle gesture control and face-to-face communication. To apply gesture recognition to long-distance interactive scenes such as meetings and smart homes, a large RGB-D video dataset LD-ConGR is established in this paper. LD-ConGR is distinguished from existing gesture datasets by its long-distance gesture collection, fine-grained annotations, and high video qual-ity. Specifically, 1) the farthest gesture provided by the LD-ConGR is captured 4m away from the camera while existing gesture datasets collect gestures within 1m from the camera; 2) besides the gesture category, the temporal segmentation of gestures and hand location are also anno-tated in LD-ConGR; 3) videos are captured at high reso-lution (1280 x 720 for color streams and 640 x 576 for depth streams) and high frame rate (30 fps). On top of the LD-ConGR, a series of experimental and studies are conducted, and the proposed gesture region estimation and key frame sampling strategies are demonstrated to be effective in dealing with long-distance gesture recognition and the uncertainty of gesture duration. The dataset and experimen-tal results presented in this paper are expected to boost the research of long-distance gesture recognition. The dataset is available at https://github.com/Diananini/LD-ConGR-CVPR2022.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131249940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ISDNet: Integrating Shallow and Deep Networks for Efficient Ultra-high Resolution Segmentation
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00432
Shaohua Guo, Liang Liu, Zhenye Gan, Yabiao Wang, Wuhao Zhang, Chengjie Wang, Guannan Jiang, Wei Zhang, Ran Yi, Lizhuang Ma, Ke Xu
The huge burdens of computation and memory are two obstacles in ultra-high resolution image segmentation. To tackle these issues, most previous works follow a global-local refinement pipeline, which pays more attention to memory consumption but neglects inference speed. In contrast to pipelines that partition the large image into small local regions, we focus on inferring the whole image directly. In this paper, we propose ISDNet, a novel ultra-high resolution segmentation framework that integrates shallow and deep networks in a new manner, significantly accelerating inference while achieving accurate segmentation. To further exploit the relationship between the shallow and deep features, we propose a novel Relational-Aware Feature Fusion module, which ensures the high performance and robustness of our framework. Extensive experiments on the DeepGlobe, Inria Aerial, and Cityscapes datasets demonstrate that our performance is consistently superior to the state of the art. Specifically, ISDNet achieves 73.30 mIoU at 27.70 FPS on DeepGlobe, which is more accurate and 172x faster than the most recent competitor. Code is available at https://github.com/cedricgsh/ISDNet.
SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.02068
Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, L. Gool, B. Schiele, F. Tombari, F. Yu
Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous-driving systems. Existing image- and video-based driving datasets, however, fall short of capturing the mutable nature of the real world. In this paper, we introduce SHIFT, the largest multi-task synthetic dataset for autonomous driving. It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density. Featuring a comprehensive sensor suite and annotations for several mainstream perception tasks, SHIFT allows us to investigate how a perception system's performance degrades at increasing levels of domain shift, fostering the development of continuous adaptation strategies to mitigate this problem and assessing the robustness and generality of a model. Our dataset and benchmark toolkit are publicly available at www.vis.xyz/shift.
{"title":"SHIFT: A Synthetic Driving Dataset for Continuous Multi-Task Domain Adaptation","authors":"Tao Sun, Mattia Segu, Janis Postels, Yuxuan Wang, L. Gool, B. Schiele, F. Tombari, F. Yu","doi":"10.1109/CVPR52688.2022.02068","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.02068","url":null,"abstract":"Adapting to a continuously evolving environment is a safety-critical challenge inevitably faced by all autonomous-driving systems. Existing image- and video-based driving datasets, however, fall short of capturing the mutable nature of the real world. In this paper, we introduce the largest multi-task synthetic dataset for autonomous driving, SHIFT. It presents discrete and continuous shifts in cloudiness, rain and fog intensity, time of day, and vehicle and pedestrian density. Featuring a comprehensive sensor suite and annotations for several mainstream perception tasks, SHIFT allows to investigate how a perception systems' performance degrades at increasing levels of domain shift, fostering the development of continuous adaptation strategies to mitigate this problem and assessing the robustness and generality of a model. Our dataset and benchmark toolkit are publicly available at www.vis.xyz/shift.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132819592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Meta Agent Teaming Active Learning for Pose Estimation
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.01080
Jia Gong, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, J. Liu
Existing pose estimation approaches often require a large number of annotated images, which are laborious to acquire, to attain good estimation performance. To reduce the human effort required for pose annotation, we propose a novel Meta Agent Teaming Active Learning (MATAL) framework to actively select and label informative images for effective learning. MATAL formulates the image selection procedure as a Markov Decision Process and learns an optimal sampling policy that directly maximizes the performance of the pose estimator based on the reward. Our framework consists of a novel state-action representation as well as a multi-agent team to enable batch sampling in the active learning procedure. The framework can be effectively optimized via meta-optimization to accelerate adaptation to the gradually expanding labeled data during deployment. Finally, we show experimental results on both human hand and body pose estimation benchmark datasets and demonstrate that our method consistently and significantly outperforms all baselines under the same annotation budget. Moreover, to obtain similar pose estimation accuracy, our MATAL framework saves around 40% of the labeling effort on average compared to state-of-the-art active learning frameworks.
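The MDP framing can be sketched at a high level as the loop below. Everything here is a placeholder interface of my own naming (policy, estimator, oracle and their methods are assumptions, not the paper's API); it only illustrates the state, the batch action, and a reward defined as the change in the estimator's validation score.

```python
def active_learning_episode(policy, estimator, unlabeled_pool, labeled_set, oracle,
                            rounds=10, batch_size=16):
    """One active-learning episode framed as an MDP: state -> batch action -> reward."""
    history = []
    prev_score = estimator.evaluate()
    for _ in range(rounds):
        state = policy.build_state(estimator, unlabeled_pool, labeled_set)   # state-action representation
        batch = policy.select_batch(state, unlabeled_pool, batch_size)       # multi-agent batch selection
        labeled_set.extend((img, oracle.annotate(img)) for img in batch)     # query labels for the batch
        unlabeled_pool = [img for img in unlabeled_pool if img not in batch]
        estimator.train_on(labeled_set)
        score = estimator.evaluate()
        history.append((state, batch, score - prev_score))                   # reward = performance gain
        prev_score = score
    policy.update(history)                                                   # e.g. a policy-gradient step
    return estimator, policy
```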
{"title":"Meta Agent Teaming Active Learning for Pose Estimation","authors":"Jia Gong, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, J. Liu","doi":"10.1109/CVPR52688.2022.01080","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01080","url":null,"abstract":"The existing pose estimation approaches often require a large number of annotated images to attain good estimation performance, which are laborious to acquire. To reduce the human efforts on pose annotations, we propose a novel Meta Agent Teaming Active Learning (MATAL) framework to actively select and label informative images for effective learning. Our MATAL formulates the image selection procedure as a Markov Decision Process and learns an optimal sampling policy that directly maximizes the performance of the pose estimator based on the reward. Our framework consists of a novel state-action representation as well as a multi-agent team to enable batch sampling in the active learning procedure. The framework could be effectively optimized via Meta-Optimization to accelerate the adaptation to the gradually expanded labeled data during deployment. Finally, we show experimental results on both human hand and body pose estimation benchmark datasets and demonstrate that our method significantly outperforms all baselines continuously under the same amount of annotation budget. Moreover, to obtain similar pose estimation accuracy, our MATAL framework can save around 40% labeling efforts on average compared to state-of-the-art active learning frameworks.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132751443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Modal-Invariant and Temporal-Memory for Video-based Visible-Infrared Person Re-Identification
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.02030
Xinyu Lin, Jinxing Li, Zeyu Ma, Huafeng Li, Shuang Li, Kaixiong Xu, Guangming Lu, Dafan Zhang
Thanks to cross-modal retrieval techniques, visible-infrared (RGB-IR) person re-identification (Re-ID) is achieved by projecting the two modalities into a common space, enabling person Re-ID in 24-hour surveillance systems. However, with respect to probe-to-gallery matching, almost all existing RGB-IR cross-modal person Re-ID methods focus on image-to-image matching, while video-to-video matching, which contains much richer spatial and temporal information, remains under-explored. In this paper, we primarily study a video-based cross-modal person Re-ID method. To support this task, a video-based RGB-IR dataset is constructed, in which 927 valid identities with 463,259 frames and 21,863 tracklets captured by 12 RGB/IR cameras are collected. Based on our constructed dataset, we show that performance improves as the number of frames in a tracklet increases, demonstrating the significance of video-to-video matching in RGB-IR person Re-ID. Additionally, a novel method is proposed, which not only projects the two modalities into a modal-invariant subspace, but also extracts a temporal memory for motion-invariant features. Thanks to these two strategies, much better results are achieved on our video-based cross-modal person Re-ID benchmark. The code and dataset are released at: https://github.com/VCM-project233/MITML.
AIM: an Auto-Augmenter for Images and Meshes
Pub Date : 2022-06-01  DOI: 10.1109/CVPR52688.2022.00080
Vinit Veerendraveer Singh, C. Kambhamettu
Data augmentations are commonly used to increase the robustness of deep neural networks. In most contemporary research, the networks do not decide the augmentations; the augmentations are task-agnostic, and grid search determines their magnitudes. Furthermore, augmentations applicable to lower-dimensional data do not easily extend to higher-dimensional data and vice versa. This paper presents an auto-augmenter for images and meshes (AIM) that easily incorporates into neural networks at training and inference times. It jointly optimizes with the network to produce constrained, non-rigid deformations of the data. AIM predicts sample-aware deformations suited to a task, and our experiments confirm its effectiveness with various networks.
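One way to picture a jointly optimized, sample-aware, constrained non-rigid deformation for images (an assumption, not the authors' implementation): a small network predicts a bounded offset field per image, the image is warped through it with a differentiable sampler, and the task loss trains both the augmenter and the downstream network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAutoAugmenter(nn.Module):
    def __init__(self, max_offset=0.05):
        super().__init__()
        self.max_offset = max_offset                     # offsets bounded to a fraction of the image size
        self.offset_net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1))              # 2-channel (x, y) offset field

    def forward(self, x):
        b, _, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base_grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)        # identity sampling grid
        offsets = torch.tanh(self.offset_net(x)) * self.max_offset          # constrained, sample-aware
        grid = base_grid + offsets.permute(0, 2, 3, 1)                       # (B, H, W, 2)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)   # differentiable non-rigid warp
```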
{"title":"AIM: an Auto-Augmenter for Images and Meshes","authors":"Vinit Veerendraveer Singh, C. Kambhamettu","doi":"10.1109/CVPR52688.2022.00080","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00080","url":null,"abstract":"Data augmentations are commonly used to increase the robustness of deep neural networks. In most contemporary research, the networks do not decide the augmentations; they are task-agnostic, and grid search determines their magnitudes. Furthermore, augmentations applicable to lower-dimensional data do not easily extend to higher-dimensional data and vice versa. This paper presents an auto-augmenter for images and meshes (AIM) that easily incorporates into neural networks at training and inference times. It Jointly optimizes with the network to produce constrained, non-rigid deformations in the data. AIM predicts sample-aware deformations suited for a task, and our experiments confirm its effectiveness with various networks.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128866139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}