Structural re-parameterization has drawn increasing attention in various computer vision tasks. It aims at improving the performance of deep models without introducing any inference-time cost. Though efficient during inference, such models rely heavily on complicated training-time blocks to achieve high accuracy, which leads to large extra training cost. In this paper, we present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. To achieve this goal, we introduce a linear scaling layer for better optimizing the online blocks. Aided by the reduced training cost, we also explore more effective re-param components. Compared with state-of-the-art re-param models, OREPA saves the training-time memory cost by about 70% and accelerates training by around 2×. Meanwhile, equipped with OREPA, the models outperform previous methods on ImageNet by up to +0.6%. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on these downstream tasks. Code is available at https://github.com/JUGGHM/OREPA_CVPR2022.
{"title":"Online Convolutional Reparameterization","authors":"Mu Hu, Junyi Feng, Jiashen Hua, Baisheng Lai, Jianqiang Huang, Xiaojin Gong, Xiansheng Hua","doi":"10.1109/CVPR52688.2022.00065","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00065","url":null,"abstract":"Structural re-parameterization has drawn increasing attention in various computer vision tasks. It aims at improving the performance of deep models without introducing any inference-time cost. Though efficient during inference, such models rely heavily on the complicated training-time blocks to achieve high accuracy, leading to large extra training cost. In this paper, we present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. To achieve this goal, we introduce a linear scaling layer for better optimizing the online blocks. Assisted with the reduced training cost, we also explore some more effective re-param components. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2×. Meanwhile, equipped with OREPA, the models out-perform previous methods on ImageNet by up to +0.6%. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks. Codes are available at https://github.com/JUGGHM/OREPA_CVPR2022.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131281869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.01387
Muli Yang, Yuehua Zhu, Jiaping Yu, Aming Wu, Cheng Deng
In response to the explosively increasing demand for annotated data, Novel Class Discovery (NCD) has emerged as a promising alternative for automatically recognizing unknown classes without any annotation. To this end, a model makes use of a base set to learn basic semantic discriminability that can be transferred to recognize novel classes. Most existing works handle the base and novel sets using separate objectives within a two-stage training paradigm. Despite showing competitive performance on novel classes, they fail to generalize to recognizing samples from both the base and novel sets. In this paper, we focus on this generalized setting of NCD (GNCD) and propose to divide and conquer it with two groups of Compositional Experts (ComEx). Each group of experts is designed to characterize the whole dataset in a comprehensive yet complementary fashion. With their union, we can solve GNCD in an efficient end-to-end manner. We further look into the drawbacks of current NCD methods and propose to strengthen ComEx with global-to-local and local-to-local regularization. ComEx (code: https://github.com/muliyangm/ComEx) is evaluated on four popular benchmarks, showing clear superiority towards the goal of GNCD.
{"title":"Divide and Conquer: Compositional Experts for Generalized Novel Class Discovery","authors":"Muli Yang, Yuehua Zhu, Jiaping Yu, Aming Wu, Cheng Deng","doi":"10.1109/CVPR52688.2022.01387","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01387","url":null,"abstract":"In response to the explosively-increasing requirement of annotated data, Novel Class Discovery (NCD) has emerged as a promising alternative to automatically recognize unknown classes without any annotation. To this end, a model makes use of a base set to learn basic semantic discriminability that can be transferred to recognize novel classes. Most existing works handle the base and novel sets using separate objectives within a two-stage training paradigm. Despite showing competitive performance on novel classes, they fail to generalize to recognizing samples from both base and novel sets. In this paper, we focus on this generalized setting of NCD (GNCD), and propose to divide and conquer it with two groups of Compositional Experts (ComEx). Each group of experts is designed to characterize the whole dataset in a comprehensive yet complementary fashion. With their union, we can solve GNCD in an efficient end-to-end manner. We further look into the draw-back in current NCD methods, and propose to strengthen ComEx with global-to-local and local-to-local regularization. ComEx11Code: https://github.com/muliyangm/ComEx. is evaluated on four popular benchmarks, showing clear superiority towards the goal of GNCD.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131405290","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00602
Yeongwoo Nam, Sayed Mohammad Mostafavi Isfahani, Kuk-Jin Yoon, Jonghyun Choi
Neuromorphic cameras, or event cameras, mimic human vision by reporting changes in intensity in a scene, instead of capturing the whole scene at once as an image frame the way conventional cameras do. Events are streamed data that are often dense when the scene changes or the camera moves rapidly. Rapid movement causes events to be overridden or missed when creating a tensor for the machine to learn on. To alleviate this event missing or overriding issue, we propose to learn to concentrate on the dense events to produce a compact event representation with high details for depth estimation. Specifically, we learn a model with events from both the past and the future, but infer only with past data and the predicted future. We initially estimate depth in an event-only setting, but also propose to further incorporate images and events via a hierarchical event and intensity combination network for better depth estimation. Through experiments in challenging real-world scenarios, we validate that our method outperforms prior art even with low computational cost. Code is available at: https://github.com/yonseivnl/se-cff.
{"title":"Stereo Depth from Events Cameras: Concentrate and Focus on the Future","authors":"Yeongwoo Nam, Sayed Mohammad Mostafavi Isfahani, Kuk-Jin Yoon, Jonghyun Choi","doi":"10.1109/CVPR52688.2022.00602","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00602","url":null,"abstract":"Neuromorphic cameras or event cameras mimic human vision by reporting changes in the intensity in a scene, instead of reporting the whole scene at once in a form of an image frame as performed by conventional cameras. Events are streamed data that are often dense when either the scene changes or the camera moves rapidly. The rapid movement causes the events to be overridden or missed when creating a tensor for the machine to learn on. To alleviate the event missing or overriding issue, we propose to learn to concentrate on the dense events to produce a compact event representation with high details for depth estimation. Specifically, we learn a model with events from both past and future but infer only with past data with the predicted future. We initially estimate depth in an event-only setting but also propose to further incorporate images and events by a hier-archical event and intensity combination network for better depth estimation. By experiments in challenging real-world scenarios, we validate that our method outperforms prior arts even with low computational cost. Code is available at: https://github.com/yonseivnl/se-cff.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115529375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00417
Yiwei Bao, Yunfei Liu, Haofei Wang, Feng Lu
Recent advances in deep learning-based approaches have achieved remarkable performance on appearance-based gaze estimation. However, due to the shortage of target-domain data and the absence of target labels, generalizing gaze estimation algorithms to unseen environments is still challenging. In this paper, we discover the rotation-consistency property in gaze estimation and introduce the ‘sub-label’ for unsupervised domain adaptation. Consequently, we propose Rotation-enhanced Unsupervised Domain Adaptation (RUDA) for gaze estimation. First, we rotate the original images by different angles for training. Then we conduct domain adaptation under the constraint of rotation consistency. The target-domain images are assigned sub-labels, derived from relative rotation angles rather than inaccessible real labels. With such sub-labels, we propose a novel distribution loss that facilitates the domain adaptation. We evaluate the RUDA framework on four cross-domain gaze estimation tasks. Experimental results demonstrate that it improves performance over the baselines, with gains ranging from 12.2% to 30.5%. Our framework has the potential to be used in other computer vision tasks with physical constraints.
{"title":"Generalizing Gaze Estimation with Rotation Consistency","authors":"Yiwei Bao, Yunfei Liu, Haofei Wang, Feng Lu","doi":"10.1109/CVPR52688.2022.00417","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00417","url":null,"abstract":"Recent advances of deep learning-based approaches have achieved remarkable performance on appearance-based gaze estimation. However, due to the shortage of target domain data and absence of target labels, generalizing gaze estimation algorithm to unseen environments is still challenging. In this paper, we discover the rotation-consistency property in gaze estimation and introduce the ‘sub-label’ for unsupervised domain adaptation. Consequently, we propose the Rotation-enhanced Unsupervised Domain Adaptation (RUDA) for gaze estimation. First, we rotate the original images with different angles for training. Then we conduct domain adaptation under the constraint of rotation consistency. The target domain images are assigned with sub-labels, derived from relative rotation angles rather than untouchable real labels. With such sub-labels, we propose a novel distribution loss that facilitates the domain adaptation. We evaluate the RUDA framework on four cross-domain gaze estimation tasks. Experimental results demonstrate that it improves the performance over the baselines with gains ranging from 12.2% to 30.5%. Our framework has the potential to be used in other computer vision tasks with physical constraints.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115591742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00817
Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin
Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advantages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the benefit, i.e., they are architecture-agnostic. In particular, we focus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often outperforming convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined SelfPatch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neighbors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with SelfPatch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, SelfPatch significantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.
{"title":"Patch-level Representation Learning for Self-supervised Vision Transformers","authors":"Sukmin Yun, Hankook Lee, Jaehyung Kim, Jinwoo Shin","doi":"10.1109/CVPR52688.2022.00817","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00817","url":null,"abstract":"Recent self-supervised learning (SSL) methods have shown impressive results in learning visual representations from unlabeled images. This paper aims to improve their performance further by utilizing the architectural advan-tages of the underlying neural network, as the current state-of-the-art visual pretext tasks for SSL do not enjoy the ben-efit, i.e., they are architecture-agnostic. In particular, we fo-cus on Vision Transformers (ViTs), which have gained much attention recently as a better architectural choice, often out-performing convolutional networks for various visual tasks. The unique characteristic of ViT is that it takes a sequence of disjoint patches from an image and processes patch-level representations internally. Inspired by this, we design a simple yet effective visual pretext task, coined Self Patch, for learning better patch-level representations. To be specific, we enforce invariance against each patch and its neigh-bors, i.e., each patch treats similar neighboring patches as positive samples. Consequently, training ViTs with Self-Patch learns more semantically meaningful relations among patches (without using human-annotated labels), which can be beneficial, in particular, to downstream tasks of a dense prediction type. Despite its simplicity, we demonstrate that it can significantly improve the performance of existing SSL methods for various visual tasks, including object detection and semantic segmentation. Specifically, Self Patch signif-icantly improves the recent self-supervised ViT, DINO, by achieving +1.3 AP on COCO object detection, +1.2 AP on COCO instance segmentation, and +2.9 mIoU on ADE20K semantic segmentation.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"290 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124173957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00811
M. Alexa
Super-Fibonacci spirals are an extension of Fibonacci spirals, enabling fast generation of an arbitrary but fixed number of 3D orientations. The algorithm is simple and fast. A comprehensive evaluation comparing to other methods shows that the generated sets of orientations have low discrepancy, minimal spurious components in the power spectrum, and almost identical Voronoi volumes. This makes them useful for a variety of applications, in particular Monte Carlo sampling.
{"title":"Super-Fibonacci Spirals: Fast, Low-Discrepancy Sampling of SO(3)","authors":"M. Alexa","doi":"10.1109/CVPR52688.2022.00811","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00811","url":null,"abstract":"Super-Fibonacci spirals are an extension of Fibonacci spirals, enabling fast generation of an arbitrary but fixed number of 3D orientations. The algorithm is simple and fast. A comprehensive evaluation comparing to other meth-ods shows that the generated sets of orientations have low discrepancy, minimal spurious components in the power spectrum, and almost identical Voronoi volumes. This makes them useful for a variety of applications, in partic-ular Monte Carlo sampling.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115012455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00319
Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei
Motion, as a defining characteristic of video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by executing spatio-temporal 3D convolutions, factorizing 3D convolutions into separate spatial and temporal convolutions, or computing self-attention along the temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, this assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe for an inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), which delves into the deformation across frames to estimate local self-attention at each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. SIFA then measures the similarity between the query and keys as stand-alone attention to compute a weighted average of the values for temporal aggregation. We further plug the SIFA block into ConvNets and Vision Transformers, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/SIFA.
{"title":"Stand-Alone Inter-Frame Attention in Video Models","authors":"Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei","doi":"10.1109/CVPR52688.2022.00319","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00319","url":null,"abstract":"Motion, as the uniqueness of a video, has been critical to the development of video understanding models. Modern deep learning models leverage motion by either executing spatio-temporal 3D convolutions, factorizing 3D convolutions into spatial and temporal convolutions separately, or computing self-attention along temporal dimension. The implicit assumption behind such successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold especially for the regions with large deformation. In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location. Technically, SIFA remoulds the deformable design via re-scaling the offset predictions by the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. Then, SIFA measures the similarity between query and keys as stand-alone attention to weighted average the values for temporal aggregation. We further plug SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/SIFA.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115105622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.01685
Shoichiro Takeda, K. Niwa, Mariko Isogawa, S. Shimizu, Kazuki Okami, Y. Aono
Eulerian video magnification (EVM) has progressed to magnify subtle motions at a target frequency even in the presence of large object motions. However, existing EVM methods often fail to produce desirable results on real videos due to (1) mis-extracting subtle motions at non-target frequencies and (2) collapsing when large de/acceleration motions occur (e.g., objects suddenly start, stop, or change direction). To enhance EVM performance on real videos, this paper proposes a bilateral video magnification filter (BVMF) that offers simple yet robust temporal filtering. BVMF has two kernels: (I) one kernel performs temporal bandpass filtering via a Laplacian of Gaussian whose passband peaks at the target frequency with unity gain, and (II) the other kernel excludes large motions outside the magnitude of interest by Gaussian filtering on the intensity of the input signal via the Fourier shift theorem. Thus, BVMF extracts only subtle motions at the target frequency while excluding large motions outside the magnitude of interest, regardless of motion dynamics. In addition, BVMF runs the two kernels in the temporal and intensity domains simultaneously, just as the bilateral filter does in the spatial and intensity domains. This simplifies implementation and, as a secondary effect, keeps memory usage low. Experiments conducted on synthetic and real videos show that BVMF outperforms state-of-the-art methods.
{"title":"Bilateral Video Magnification Filter","authors":"Shoichiro Takeda, K. Niwa, Mariko Isogawa, S. Shimizu, Kazuki Okami, Y. Aono","doi":"10.1109/CVPR52688.2022.01685","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01685","url":null,"abstract":"Eulerian video magnification (EVM) has progressed to magnify subtle motions with a target frequency even under the presence of large motions of objects. However, existing EVM methods often fail to produce desirable results in real videos due to (1) misextracting subtle motions with a non-target frequency and (2) collapsing results when large de/acceleration motions occur (e.g., objects suddenly start, stop, or change direction). To enhance EVM performance on real videos, this paper proposes a bilateral video magnification filter (BVMF) that offers simple yet robust temporal filtering. BVMF has two kernels; (I) one kernel performs temporal bandpass filtering via a Laplacian of Gaussian whose passband peaks at the target frequency with unity gain and (II) the other kernel excludes large motions outside the magnitude of interest by Gaussian filtering on the intensity of the input signal via the Fourier shift theorem. Thus, BVMF extracts only subtle motions with the target frequency while excluding large motions outside the magnitude of interest, regardless of motion dynamics. In addition, BVMF runs the two kernels in the temporal and intensity domains simultaneously like the bilateral filter does in the spatial and intensity domains. This simplifies implementation and, as a secondary effect, keeps the memory usage low. Experiments conducted on synthetic and real videos show that BVMF outperforms state-of-the-art methods.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116024009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.00188
C. Tsalicoglou, T. Rösgen
Volumetric flow velocimetry for experimental fluid dynamics relies primarily on the 3D reconstruction of point objects, namely the detected positions of tracer particles identified in images obtained by a multi-camera setup. By assuming that the particles accurately follow the observed flow, their displacement over a known time interval is a measure of the local flow velocity. The number of particles imaged in a 1-megapixel image is typically on the order of 10^3-10^4, resulting in a large number of consistent but incorrect reconstructions (no real particle in 3D) that must be eliminated through tracking or intensity constraints. In an alternative method, 3D Particle Streak Velocimetry (3D-PSV), the exposure time is increased and the particles' pathlines are imaged as “streaks”. We treat these streaks (a) as connected endpoints and (b) as conic section segments, and develop a theoretical model that describes the mechanisms of 3D ambiguity generation and shows that streaks can drastically reduce reconstruction ambiguities. Moreover, we propose a method for simultaneously estimating these short, low-curvature conic section segments and their 3D positions from multiple camera views. Our results validate the theory, and the streak and conic section reconstruction method produces far fewer ambiguities than simple particle reconstruction, outperforming current state-of-the-art particle tracking software on the evaluated cases.
{"title":"Using 3D Topological Connectivity for Ghost Particle Reduction in Flow Reconstruction","authors":"C. Tsalicoglou, T. Rösgen","doi":"10.1109/CVPR52688.2022.00188","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.00188","url":null,"abstract":"Volumetric flow velocimetry for experimental fluid dynamics relies primarily on the 3D reconstruction of point objects, which are the detected positions of tracer particles identified in images obtained by a multi-camera setup. By assuming that the particles accurately follow the observed flow, their displacement over a known time interval is a measure of the local flow velocity. The number of particles imaged in a 1 Megapixel image is typically in the order of 103-1 04, resulting in a large number of consistent but in-correct reconstructions (no real particle in 3D), that must be eliminated through tracking or intensity constraints. In an alternative method, 3D Particle Streak Velocimetry (3D-PSV), the exposure time is increased, and the particles' pathlines are imaged as “streaks”. We treat these streaks (a) as connected endpoints and (b) as conic section segments and develop a theoretical model that describes the mechanisms of 3D ambiguity generation and shows that streaks can drastically reduce reconstruction ambiguities. Moreover, we propose a method for simultaneously estimating these short, low-curvature conic section segments and their 3D position from multiple camera views. Our results validate the theory, and the streak and conic section reconstruction method produces far fewer ambiguities than simple particle reconstruction, outperforming current state-of-the-art particle tracking software on the evaluated cases.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116025611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2022-06-01. DOI: 10.1109/CVPR52688.2022.01961
Tianrui Chai, Annan Li, Shaoxiong Zhang, Zilong Li, Yunhong Wang
Gait is considered the walking pattern of the human body, which includes both shape and motion cues. However, mainstream appearance-based methods for gait recognition rely on the shape of the silhouette, and it is unclear whether motion can be explicitly represented in gait sequence modeling. In this paper, we analyze human walking using Lagrange's equation and conclude that second-order information in the temporal dimension is necessary for identification. We design a second-order motion extraction module based on this conclusion. In addition, a lightweight view-embedding module is designed by analyzing the problem that current methods for the cross-view task do not explicitly take the view itself into consideration. Experiments on the CASIA-B and OU-MVLP datasets show the effectiveness of our method, and visualizations of the extracted motion demonstrate the interpretability of our motion extraction module.
{"title":"Lagrange Motion Analysis and View Embeddings for Improved Gait Recognition","authors":"Tianrui Chai, Annan Li, Shaoxiong Zhang, Zilong Li, Yunhong Wang","doi":"10.1109/CVPR52688.2022.01961","DOIUrl":"https://doi.org/10.1109/CVPR52688.2022.01961","url":null,"abstract":"Gait is considered the walking pattern of human body, which includes both shape and motion cues. However, the main-stream appearance-based methods for gait recognition rely on the shape of silhouette. It is unclear whether motion can be explicitly represented in the gait sequence modeling. In this paper, we analyzed human walking using the Lagrange's equation and come to the conclusion that second-order information in the temporal dimension is necessary for identification. We designed a second-order motion extraction module based on the conclusions drawn. Also, a light weight view-embedding module is designed by analyzing the problem that current methods to cross-view task do not take view itself into consideration explicitly. Experiments on CASIA-B and OU-MVLP datasets show the effectiveness of our method and some visualization for extracted motion are done to show the interpretability of our motion extraction module.","PeriodicalId":355552,"journal":{"name":"2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116362079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}