2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Papers (Page 6)

Joint Deep Model-based MR Image and Coil Sensitivity Reconstruction Network (Joint-ICNet) for Fast MRI

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00523

Yohan Jun, Hyungseob Shin, Taejoon Eo, D. Hwang

Magnetic resonance imaging (MRI) can provide diagnostic information with high-resolution and high-contrast images. However, MRI requires a relatively long scan time compared to other medical imaging techniques, and a long scan time can cause patient discomfort and limit the achievable resolution of magnetic resonance (MR) images. In this study, we propose a Joint Deep Model-based MR Image and Coil Sensitivity Reconstruction Network, called Joint-ICNet, which jointly reconstructs an MR image and coil sensitivity maps from undersampled multi-coil k-space data using deep learning networks combined with MR physical models. Joint-ICNet has two main blocks: an MR image reconstruction block that reconstructs an MR image from undersampled multi-coil k-space data, and a coil sensitivity map reconstruction block that estimates coil sensitivity maps from the same data. The desired MR image and coil sensitivity maps are obtained by estimating them sequentially with the two blocks in an unrolled network architecture. To demonstrate the performance of Joint-ICNet, we performed experiments with a fastMRI brain dataset for two reduction factors (R = 4 and 8). With qualitative and quantitative results, we demonstrate that our proposed Joint-ICNet outperforms conventional parallel imaging and deep-learning-based methods in reconstructing MR images from undersampled multi-coil k-space data.
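The MR physical model that unrolled reconstruction networks commonly build on is the multi-coil (SENSE-style) forward operator: weight the image by each coil sensitivity map, Fourier transform, and keep only the sampled k-space locations. The sketch below illustrates that operator and one data-consistency gradient step. It is an illustration under that assumption, not the authors' implementation; the function names, array shapes, and step size are hypothetical.

```python
import numpy as np

def forward(image, sens_maps, mask):
    """Multi-coil forward model: coil-weight the image, FFT, then undersample."""
    coil_images = sens_maps * image[None, ...]        # shape (coils, H, W)
    kspace = np.fft.fft2(coil_images, norm="ortho")   # per-coil k-space
    return mask[None, ...] * kspace

def adjoint(kspace, sens_maps, mask):
    """Adjoint operator: mask, inverse FFT, combine coils with conjugate maps."""
    coil_images = np.fft.ifft2(mask[None, ...] * kspace, norm="ortho")
    return np.sum(np.conj(sens_maps) * coil_images, axis=0)

def data_consistency_step(image, kspace_meas, sens_maps, mask, step=1.0):
    """One gradient step on 0.5 * ||A x - y||^2 (step size is an assumption)."""
    residual = forward(image, sens_maps, mask) - kspace_meas
    return image - step * adjoint(residual, sens_maps, mask)
```

In an unrolled network, a step like this would typically alternate with learned regularization blocks; how Joint-ICNet combines them is not specified in the abstract.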

Citations: 32

Verifiability and Predictability: Interpreting Utilities of Network Architectures for Point Cloud Processing

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01056

Wen Shen, Zhihua Wei, Shikun Huang, Binbin Zhang, Panyue Chen, Ping Zhao, Quanshi Zhang

In this paper, we diagnose deep neural networks (DNNs) for 3D point cloud processing to explore the utilities of different intermediate-layer network architectures. We propose a number of hypotheses on the effects of specific intermediate-layer network architectures on the representation capacity of DNNs. To test these hypotheses, we design five metrics that diagnose various types of DNNs from the following perspectives: information discarding, information concentration, rotation robustness, adversarial robustness, and neighborhood inconsistency. We conduct comparative studies based on these metrics to verify the hypotheses. We further use the verified hypotheses to revise intermediate-layer architectures of existing DNNs and improve their utilities. Experiments demonstrate the effectiveness of our method. The code will be released when this paper is accepted.

Citations: 2

Camera Pose Matters: Improving Depth Prediction by Mitigating Pose Distribution Bias

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01550

Yunhan Zhao, Shu Kong, Charless C. Fowlkes

Monocular depth predictors are typically trained on large-scale training sets that are naturally biased with respect to the distribution of camera poses. As a result, trained predictors fail to make reliable depth predictions for test examples captured under uncommon camera poses. To address this issue, we propose two novel techniques that exploit the camera pose during training and prediction. First, we introduce a simple perspective-aware data augmentation that synthesizes new training examples with more diverse views by perturbing the existing ones in a geometrically consistent manner. Second, we propose a conditional model that exploits the per-image camera pose as prior knowledge by encoding it as a part of the input. We show that jointly applying the two methods improves depth prediction on images captured under uncommon and even never-before-seen camera poses. We also show that our methods improve performance when applied to a range of different predictor architectures. Lastly, we show that explicitly encoding the camera pose distribution improves the generalization performance of a synthetically trained depth predictor when evaluated on real images.

Citations: 20

L2M-GAN: Learning to Manipulate Latent Space Semantics for Facial Attribute Editing

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00297

Guoxing Yang, Nanyi Fei, Mingyu Ding, Guangzhen Liu, Zhiwu Lu, T. Xiang

A deep facial attribute editing model strives to meet two requirements: (1) attribute correctness: the target attribute should correctly appear on the edited face image; (2) irrelevance preservation: any irrelevant information (e.g., identity) should not be changed after editing. Meeting both requirements is challenging for state-of-the-art methods, which resort to either spatial attention or latent space factorization. Specifically, the former assume that each attribute has well-defined local support regions; they are often more effective for editing a local attribute than a global one. The latter factorize the latent space of a fixed pretrained GAN into different attribute-relevant parts, but they cannot be trained end-to-end with the GAN, leading to sub-optimal solutions. To overcome these limitations, we propose a novel latent space factorization model, called L2M-GAN, which is learned end-to-end and is effective for editing both local and global attributes. The key novel components are: (1) A latent space vector of the GAN is factorized into attribute-relevant and attribute-irrelevant codes, with an orthogonality constraint imposed to ensure disentanglement. (2) An attribute-relevant code transformer is learned to manipulate the attribute value; crucially, the transformed code is subject to the same orthogonality constraint. By forcing both the original attribute-relevant latent code and the edited code to be disentangled from any attribute-irrelevant code, our model strikes the perfect balance between attribute correctness and irrelevance preservation. Extensive experiments on CelebA-HQ show that our L2M-GAN achieves significant improvements over the state of the art.
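One way to picture an orthogonality constraint between an attribute-relevant code and an attribute-irrelevant code is a penalty that vanishes when the two vectors are orthogonal. The sketch below uses squared cosine similarity, which is an assumed form chosen for illustration; the abstract does not state the exact loss, and the function names are hypothetical.

```python
import math

def orthogonality_penalty(relevant, irrelevant):
    """Squared cosine similarity between two codes; zero when orthogonal."""
    dot = sum(a * b for a, b in zip(relevant, irrelevant))
    norm_r = math.sqrt(sum(a * a for a in relevant))
    norm_i = math.sqrt(sum(b * b for b in irrelevant))
    return (dot / (norm_r * norm_i)) ** 2

def total_penalty(relevant, transformed, irrelevant):
    """Apply the same penalty to the original and the edited attribute code."""
    return (orthogonality_penalty(relevant, irrelevant)
            + orthogonality_penalty(transformed, irrelevant))
```

Penalizing both the original and the transformed code mirrors the statement that the edited code is subject to the same constraint as the original one.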

Citations: 35

What’s in the Image? Explorable Decoding of Compressed Images

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00293

Yuval Bahat, T. Michaeli

The ever-growing amount of visual content captured on a daily basis necessitates the use of lossy compression methods in order to save storage space and transmission bandwidth. While extensive research efforts are devoted to improving compression techniques, every method inevitably discards information. Especially at low bit rates, this information often corresponds to semantically meaningful visual cues, so that decompression involves significant ambiguity. In spite of this fact, existing decompression algorithms typically produce only a single output, and do not allow the viewer to explore the set of images that map to the given compressed code. In this work we propose the first image decompression method to facilitate user exploration of the diverse set of natural images that could have given rise to the compressed input code, thus granting users the ability to determine what could and what could not have been there in the original scene. Specifically, we develop a novel deep-network-based decoder architecture for the ubiquitous JPEG standard, which allows traversing the set of decompressed images that are consistent with the compressed JPEG file. To allow for simple user interaction, we develop a graphical user interface comprising several intuitive exploration tools, including an automatic tool for examining specific solutions of interest. We exemplify our framework on graphical, medical and forensic use cases, demonstrating its wide range of potential applications.
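The notion of an image being "consistent with the compressed JPEG file" can be illustrated at the level of a single 8x8 block: JPEG stores rounded, quantized DCT coefficients, so many different pixel blocks map to the same code. The sketch below is a simplified check under stated assumptions (one grayscale block, no level shift, no chroma handling, a hypothetical quantization table); it is not the decoder proposed in the paper.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, as used for JPEG's 8x8 blocks."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat

def quantize(block, qtable):
    """Map a pixel block to integer quantized DCT coefficients."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T
    return np.round(coeffs / qtable).astype(int)

def is_consistent(candidate, code, qtable):
    """True if the candidate block would compress to the same quantized code."""
    return np.array_equal(quantize(candidate, qtable), code)
```

Any candidate for which `is_consistent` holds is a valid explanation of the stored code, which is the ambiguity an explorable decoder lets the user traverse.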

Citations: 3

EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00067

Lin Wang, Yujeong Chae, Sung-Hoon Yoon, Tae-Kyun Kim, Kuk-Jin Yoon

Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur, showing advantages over conventional cameras. A hurdle in training event-based models is the lack of large, high-quality labeled datasets. Prior works that learn end-tasks mostly rely on labeled or pseudo-labeled datasets obtained from the active pixel sensor (APS) frames; however, the quality of such datasets is far from rivaling those based on canonical images. In this paper, we propose a novel approach, called EvDistill, to learn a student network on unlabeled and unpaired event data (target modality) via knowledge distillation (KD) from a teacher network trained with large-scale, labeled image data (source modality). To enable KD across the unpaired modalities, we first propose a bidirectional modality reconstruction (BMR) module to bridge both modalities and simultaneously exploit them to distill knowledge via the crafted pairs, incurring no extra computation at inference. The BMR is improved by the end-task and KD losses in an end-to-end manner. Second, we leverage the structural similarities of both modalities and adapt the knowledge by matching their distributions. Moreover, as most prior feature KD methods are uni-modal and less applicable to our problem, we propose an affinity graph KD loss to boost the distillation. Our extensive experiments on semantic segmentation and object recognition demonstrate that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
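For readers unfamiliar with knowledge distillation, the generic logit-based KD loss compares temperature-softened class distributions of a teacher and a student. The sketch below shows only that standard formulation as background; it does not reproduce the BMR module or the affinity graph KD loss described above, and the function names and default temperature are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened class distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge.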

Citations: 40

Hierarchical Video Prediction using Relational Layouts for Human-Object Interactions

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01197

Navaneeth Bodla, G. Shrivastava, R. Chellappa, Abhinav Shrivastava

Learning to model and predict how humans interact with objects while performing an action is challenging, and most of the existing video prediction models are ineffective in modeling complicated human-object interactions. Our work builds on hierarchical video prediction models, which disentangle the video generation process into two stages: predicting a high-level representation, such as a pose sequence, and then learning a pose-to-pixels translation model for pixel generation. An action sequence for a human-object interaction task is typically very complicated, involving the evolution of pose, the person’s appearance, object locations, and object appearances over time. To this end, we propose a Hierarchical Video Prediction model using Relational Layouts. In the first stage, we learn to predict a sequence of layouts. A layout is a high-level representation of the video containing both pose and object information for every frame. The layout sequence is learned by modeling the relationships between the pose and objects using relational reasoning and recurrent neural networks. The layout sequence acts as a strong structural prior for the second stage, which learns to map the layouts into pixel space. Experimental evaluation of our method on two datasets, UMD-HOI and Bimanual, shows significant improvements in standard video evaluation metrics such as LPIPS, PSNR, and SSIM. We also perform a detailed qualitative analysis of our model to demonstrate various generalizations.
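Of the metrics named above, PSNR has a simple closed form: 10 * log10(MAX^2 / MSE). The sketch below computes it for two frames given as flat lists of pixel values; the flat-list representation and the default peak value of 255 are assumptions made for illustration.

```python
import math

def psnr(reference, prediction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    errors = [(r - p) ** 2 for r, p in zip(reference, prediction)]
    mse = sum(errors) / len(errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate a predicted frame closer to the reference; LPIPS and SSIM capture perceptual and structural similarity and are not covered by this sketch.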

Citations: 8

Cross-View Gait Recognition with Deep Universal Linear Embeddings

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00898

Shaoxiong Zhang, Yunhong Wang, Annan Li

Gait is considered an attractive biometric identifier for its non-invasive and non-cooperative features compared with other biometric identifiers such as fingerprint and iris. At present, cross-view gait recognition methods typically build representations from various deep convolutional networks for recognition and ignore the potential dynamical information of the gait sequences. Assuming that pedestrians have different walking patterns, gait recognition can be performed by calculating their dynamical features from each view. This paper introduces Koopman operator theory to gait recognition, which can find an embedding space for a global linear approximation of a nonlinear dynamical system. Furthermore, a novel framework based on a convolutional variational autoencoder and deep Koopman embedding is proposed to approximate the Koopman operators, which are used as dynamical features from the linearized embedding space for cross-view gait recognition. It gives solid physical interpretability for a gait recognition system. Experiments on a large public dataset, OU-MVLP, demonstrate the effectiveness of the proposed method.
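The idea of a linear operator acting in an embedding space can be illustrated with a least-squares fit: given a sequence of embedded states z_0, ..., z_T, find the matrix K that best satisfies z_{t+1} ≈ K z_t. The sketch below shows only this fitting step (in the style of dynamic mode decomposition) on embeddings that are assumed to be given; the learned autoencoder from the paper is not reproduced, and the function name is hypothetical.

```python
import numpy as np

def fit_linear_operator(embeddings):
    """Least-squares K such that embeddings[t + 1] ≈ K @ embeddings[t].

    embeddings: array of shape (T, dim), one embedded state per time step.
    """
    past = embeddings[:-1].T      # shape (dim, T - 1)
    future = embeddings[1:].T     # shape (dim, T - 1)
    return future @ np.linalg.pinv(past)
```

The fitted matrix, or quantities derived from it, could then serve as a dynamical feature of the sequence, which is the role the abstract assigns to the approximated Koopman operators.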

Citations: 31

Interventional Video Grounding with Dual Contrastive Learning

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00279

Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, H. Zhang, Wei Lu

Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of causal inference, i.e., interventional video grounding (IVG), which leverages backdoor adjustment to deconfound the selection bias based on a structured causal model (SCM) and do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder, as it cannot be directly sampled from the dataset. 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between query and video clips, and the MI between start/end frames of a target moment and the others within a video, to learn more informative visual representations. Experiments on three standard benchmarks show the effectiveness of our approaches.

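The abstract does not say which estimator is used to maximize mutual information between queries and video clips. As an illustration only, the sketch below uses InfoNCE, a widely used contrastive objective whose minimization raises a lower bound on mutual information. The function name `info_nce_loss`, the temperature value, and the random embeddings are assumptions made for this example.

```python
import numpy as np


def info_nce_loss(query_emb, clip_emb, temperature=0.1):
    """InfoNCE loss where row i of query_emb is the positive pair of row i of clip_emb."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    logits = q @ c.T / temperature                       # cosine similarities, scaled
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching clip (the diagonal) as the target class.
    return -np.mean(np.diag(log_probs))


rng = np.random.default_rng(0)
clips = rng.normal(size=(8, 16))
aligned_queries = clips + 0.01 * rng.normal(size=(8, 16))
shuffled_queries = aligned_queries[::-1]  # every query now faces a wrong clip

loss_aligned = info_nce_loss(aligned_queries, clips)
loss_shuffled = info_nce_loss(shuffled_queries, clips)
```

Well-aligned query and clip embeddings give a lower loss than mismatched ones, which is the signal a contrastive objective of this kind uses to pull matching text and video together.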


Citations: 93

Practical Single-Image Super-Resolution Using Look-Up Table

2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00075

Younghyun Jo, Seon Joo Kim

A number of super-resolution (SR) algorithms, from interpolation to deep neural networks (DNNs), have emerged to restore or create missing details of the input low-resolution (LR) image. As mobile devices and display hardware develop, the demand for practical SR technology has increased. Current state-of-the-art SR methods are based on DNNs for better quality. However, they are feasible only when executed on a parallel computing module (e.g., GPUs), and have been difficult to apply to general uses such as end-user software, smartphones, and televisions. To this end, we propose an efficient and practical approach to SR by adopting a look-up table (LUT). We train a deep SR network with a small receptive field and transfer the output values of the learned deep model to the LUT. At test time, we retrieve the precomputed high-resolution (HR) output values from the LUT for query LR input pixels. The proposed method can be performed very quickly because it does not require a large number of floating-point operations. Experimental results show the efficiency and the effectiveness of our method. In particular, our method runs faster while showing better quality compared to bicubic interpolation.

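The transfer-then-look-up idea can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `toy_sr_model` is a hypothetical stand-in for the trained network, the receptive field is assumed to be 2x2, the scale factor is 2, and inputs are quantized to 4 bits to keep the table small.

```python
import numpy as np

SCALE = 2    # upscaling factor (assumed)
LEVELS = 16  # 4-bit quantization of 8-bit pixel values (assumed)


def toy_sr_model(p00, p01, p10, p11):
    """Hypothetical stand-in for a trained SR network with a 2x2 receptive field.

    Maps four LR pixel values to the 2x2 HR block, stacked on the last axis
    in the order (top-left, top-right, bottom-left, bottom-right).
    """
    return np.stack([p00,
                     0.5 * (p00 + p01),
                     0.5 * (p00 + p10),
                     0.25 * (p00 + p01 + p10 + p11)], axis=-1)


def build_lut():
    """Evaluate the model once for every quantized 2x2 input patch."""
    levels = np.arange(LEVELS) * 16.0  # representative values 0, 16, ..., 240
    grid = np.meshgrid(levels, levels, levels, levels, indexing="ij")
    table = toy_sr_model(*grid)        # shape (16, 16, 16, 16, 4)
    return np.clip(np.rint(table), 0, 255).astype(np.uint8)


def upscale_with_lut(lr, lut):
    """Upscale a uint8 grayscale image using table look-ups only."""
    h, w = lr.shape
    padded = np.pad(lr, ((0, 1), (0, 1)), mode="edge")
    idx = padded >> 4                  # quantize 0..255 to 0..15
    blocks = lut[idx[:h, :w], idx[:h, 1:], idx[1:, :w], idx[1:, 1:]]  # (h, w, 4)
    blocks = blocks.reshape(h, w, SCALE, SCALE)
    # Interleave the per-pixel 2x2 blocks into the HR image.
    return blocks.transpose(0, 2, 1, 3).reshape(h * SCALE, w * SCALE)


lut = build_lut()
lr = np.array([[0, 64], [128, 255]], dtype=np.uint8)
hr = upscale_with_lut(lr, lut)
```

At test time the model is never evaluated: each output block is a single indexed read from `lut`, which is why the approach needs very few floating-point operations.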


Citations: 38