
2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Publications

LipSync3D: Data-Efficient Learning of Personalized 3D Talking Faces from Video using Pose and Lighting Normalization
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00278
A. Lahiri, Vivek Kwatra, C. Frueh, John Lewis, C. Bregler
In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment benchmarks in terms of realism, lip-sync and visual quality scores. We illustrate several applications enabled by our framework.
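A minimal numpy sketch of the two training-time normalizations described above, under assumed conventions: pose normalization undoes an estimated rigid head pose (R, t) to bring 3D landmarks into a canonical frame, and lighting normalization exploits facial symmetry and approximate albedo constancy by averaging the texture atlas with its horizontal mirror. The function names and the simple mirror-averaging rule are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def normalize_pose(landmarks_3d, R, t):
    """Map per-frame 3D face landmarks (N, 3) into a canonical head frame
    by undoing the estimated rigid head pose: x_canonical = R^T (x - t)."""
    return (landmarks_3d - t) @ R

def normalize_lighting(texture_atlas):
    """Rough lighting normalization of an (H, W, 3) texture atlas: assuming a
    symmetric face with near-constant skin albedo, averaging the atlas with
    its horizontal mirror cancels left/right illumination bias."""
    return 0.5 * (texture_atlas + texture_atlas[:, ::-1, :])

# Toy usage with random data standing in for one tracked video frame.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:            # make it a proper rotation
    R[:, 0] *= -1
t = np.array([0.1, -0.05, 0.3])
landmarks = rng.normal(size=(468, 3)) @ R.T + t     # landmarks in camera space
canonical = normalize_pose(landmarks, R, t)         # decoupled from head pose
atlas_norm = normalize_lighting(rng.random((256, 256, 3)))
```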
Citations: 66
Seeing in Extra Darkness Using a Deep-Red Flash
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00987
J. Xiong, Jian Wang, W. Heidrich, S. Nayar
We propose a new flash technique for low-light imaging, using deep-red light as an illuminating source. Our main observation is that in a dim environment, the human eye mainly uses rods for the perception of light, which are not sensitive to wavelengths longer than 620nm, yet the camera sensor still has a spectral response. We propose a novel modulation strategy when training a modern CNN model for guided image filtering, fusing a noisy RGB frame and a flash frame. This fusion network is further extended for video reconstruction. We have built a prototype with minor hardware adjustments and tested the new flash technique on a variety of static and dynamic scenes. The experimental results demonstrate that our method produces compelling reconstructions, even in extra dim conditions.
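The fusion step can be pictured with a classical guided filter in place of the paper's trained CNN: the sharp, low-noise deep-red flash frame serves as the guide for smoothing the noisy no-flash RGB frame. This numpy/scipy sketch is a stand-in under that assumption, not the actual fusion network.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Edge-preserving smoothing of `src` (H, W) steered by `guide` (H, W),
    using the classical guided-filter formulation with box windows of size 2r+1."""
    size = 2 * radius + 1
    mean_g = uniform_filter(guide, size)
    mean_s = uniform_filter(src, size)
    corr_gg = uniform_filter(guide * guide, size)
    corr_gs = uniform_filter(guide * src, size)
    var_g = corr_gg - mean_g * mean_g
    cov_gs = corr_gs - mean_g * mean_s
    a = cov_gs / (var_g + eps)                 # local linear coefficients
    b = mean_s - a * mean_g
    return uniform_filter(a, size) * guide + uniform_filter(b, size)

# Toy usage: denoise each channel of a noisy RGB frame with a flash-frame guide.
rng = np.random.default_rng(0)
clean = rng.random((128, 128))
flash = np.clip(clean + 0.01 * rng.normal(size=clean.shape), 0, 1)   # sharp deep-red flash frame
noisy_rgb = np.stack([np.clip(clean + 0.2 * rng.normal(size=clean.shape), 0, 1)
                      for _ in range(3)], axis=-1)
fused = np.stack([guided_filter(flash, noisy_rgb[..., c]) for c in range(3)], axis=-1)
```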
Citations: 7
Double low-rank representation with projection distance penalty for clustering
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00528
Zhiqiang Fu, Yao Zhao, Dongxia Chang, Xingxing Zhang, Yiming Wang
This paper presents a novel, simple yet robust self-representation method, i.e., Double Low-Rank Representation with Projection Distance penalty (DLRRPD), for clustering. With the learned optimal projected representations, DLRRPD is capable of obtaining an effective similarity graph to capture the multi-subspace structure. Besides the global low-rank constraint, the local geometrical structure is additionally exploited via a projection distance penalty in our DLRRPD, thus facilitating a more favorable graph. Moreover, to improve the robustness of DLRRPD to noise, we introduce a Laplacian rank constraint, which further encourages the learned graph to be more discriminative for clustering tasks. Meanwhile, the Frobenius norm (instead of the popularly used nuclear norm) is employed to enforce the graph to be more block-diagonal with lower complexity. Extensive experiments have been conducted on synthetic, real, and noisy data to show that the proposed method outperforms currently available alternatives by margins of 1.0%~10.1%.
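DLRRPD itself is solved by an alternating optimization; as a much-simplified illustration of the self-representation idea it builds on, the sketch below computes a Frobenius-norm-regularized self-representation matrix in closed form and symmetrizes it into a similarity graph. The ridge-style solver and variable names are assumptions, and the projection distance penalty and Laplacian rank constraint are omitted.

```python
import numpy as np

def self_representation_graph(X, lam=0.1):
    """X: (d, n) data matrix with samples as columns.
    Solve min_C ||X - X C||_F^2 + lam ||C||_F^2 in closed form, then
    symmetrize |C| into a similarity graph suitable for spectral clustering."""
    n = X.shape[1]
    G = X.T @ X
    C = np.linalg.solve(G + lam * np.eye(n), G)      # ridge-style self-representation
    np.fill_diagonal(C, 0.0)                          # a sample should not represent itself
    return 0.5 * (np.abs(C) + np.abs(C).T)            # symmetric affinity

# Toy usage: two clusters around orthogonal centers.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 0.1, size=(5, 30)) + np.array([3, 0, 0, 0, 0])[:, None],
               rng.normal(0, 0.1, size=(5, 30)) + np.array([0, 3, 0, 0, 0])[:, None]])
W = self_representation_graph(X)
# Off-diagonal blocks of W are small, so spectral clustering on W recovers the two groups.
```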
Citations: 9
Single-View 3D Object Reconstruction from Shape Priors in Memory
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00317
Shuo Yang, Min Xu, Haozhe Xie, Stuart W. Perry, Jiahao Xia
Existing methods for single-view 3D object reconstruction directly learn to transform image features into 3D representations. However, these methods are vulnerable to images containing noisy backgrounds and heavy occlusions because the extracted image features do not contain enough information to reconstruct high-quality 3D shapes. Humans routinely use incomplete or noisy visual cues from an image to retrieve similar 3D shapes from their memory and reconstruct the 3D shape of an object. Inspired by this, we propose a novel method, named Mem3D, that explicitly constructs shape priors to supplement the missing information in the image. Specifically, the shape priors take the form of "image-voxel" pairs in a memory network, which are stored via a well-designed writing strategy during training. We also propose a voxel triplet loss function that helps to retrieve, from the shape priors, precise 3D shapes that are highly related to the input image. An LSTM-based shape encoder is introduced to extract information from the retrieved 3D shapes, which is useful in recovering the 3D shape of an object that is heavily occluded or placed in a complex environment. Experimental results demonstrate that Mem3D significantly improves reconstruction quality and performs favorably against state-of-the-art methods on the ShapeNet and Pix3D datasets.
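A minimal sketch of the key-value memory idea: keys are image feature vectors, values are voxel grids, and retrieval returns the top-k stored shapes whose keys are most similar to the query image feature. The cosine-similarity retrieval and names here are illustrative; the paper's writing strategy, voxel triplet loss, and LSTM shape encoder are not reproduced.

```python
import numpy as np

class ShapeMemory:
    """Key-value memory of ("image feature", "voxel grid") pairs."""

    def __init__(self):
        self.keys = []    # list of (d,) image feature vectors (unit norm)
        self.values = []  # list of (32, 32, 32) boolean voxel grids

    def write(self, image_feat, voxels):
        self.keys.append(image_feat / (np.linalg.norm(image_feat) + 1e-8))
        self.values.append(voxels)

    def read(self, query_feat, k=3):
        """Return the k stored voxel grids whose keys best match the query."""
        q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
        sims = np.array([key @ q for key in self.keys])    # cosine similarities
        top = np.argsort(-sims)[:k]
        return [self.values[i] for i in top], sims[top]

# Toy usage: store random "training" pairs, query with a perturbed feature.
rng = np.random.default_rng(0)
mem = ShapeMemory()
for _ in range(100):
    mem.write(rng.normal(size=128), rng.random((32, 32, 32)) > 0.5)
query = mem.keys[7] + 0.05 * rng.normal(size=128)
retrieved, scores = mem.read(query, k=3)   # shape priors to fuse with the image features
```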
Citations: 10
Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00344
Xiao Zhang, Yixiao Ge, Y. Qiao, Hongsheng Li
Unsupervised object re-identification aims to learn discriminative representations for object retrieval without any annotations. Clustering-based methods [27], [46], [10] conduct training with the generated pseudo labels and currently dominate this research direction. However, they still suffer from the issue of pseudo label noise. To tackle the challenge, we propose to properly estimate pseudo label similarities between consecutive training generations with clustering consensus and to refine pseudo labels with temporally propagated and ensembled pseudo labels. To the best of our knowledge, this is the first attempt to leverage the spirit of temporal ensembling [25] to improve classification with dynamically changing classes over generations. The proposed pseudo label refinery strategy is simple yet effective and can be seamlessly integrated into existing clustering-based unsupervised re-identification methods. With our proposed approach, the state-of-the-art method [10] can be further boosted with up to 8.8% mAP improvement on the challenging MSMT17 [39] dataset.
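A small numpy sketch of the consensus idea: clusters from consecutive generations are matched by their overlap (IoU over member sets), and the current hard pseudo labels are softened by blending in labels propagated from the previous generation. The IoU matching and the momentum-style blend are illustrative simplifications of the paper's refinery strategy.

```python
import numpy as np

def cluster_consensus(prev_labels, curr_labels):
    """IoU between every previous cluster and every current cluster.
    Returns an (n_prev, n_curr) consensus matrix; labels are assumed to be 0..K-1."""
    prev_ids, curr_ids = np.unique(prev_labels), np.unique(curr_labels)
    consensus = np.zeros((len(prev_ids), len(curr_ids)))
    for i, p in enumerate(prev_ids):
        for j, c in enumerate(curr_ids):
            inter = np.sum((prev_labels == p) & (curr_labels == c))
            union = np.sum((prev_labels == p) | (curr_labels == c))
            consensus[i, j] = inter / max(union, 1)
    return consensus

def refine_soft_labels(prev_soft, curr_labels, consensus, momentum=0.5):
    """Blend propagated previous soft labels into the current one-hot labels,
    weighting the propagation by cluster consensus."""
    n_curr = consensus.shape[1]
    curr_onehot = np.eye(n_curr)[curr_labels]
    propagated = prev_soft @ consensus              # map previous clusters onto current ones
    propagated /= propagated.sum(axis=1, keepdims=True) + 1e-8
    return momentum * curr_onehot + (1 - momentum) * propagated

# Toy usage: 6 samples, two clusters in both generations, one sample switched cluster.
prev_labels = np.array([0, 0, 0, 1, 1, 1])
curr_labels = np.array([0, 0, 1, 1, 1, 1])
consensus = cluster_consensus(prev_labels, curr_labels)
refined = refine_soft_labels(np.eye(2)[prev_labels], curr_labels, consensus)
```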
Citations: 66
Monocular 3D Object Detection: An Extrinsic Parameter Free Approach
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00747
Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, Qinhong Jiang
Monocular 3D object detection is an important task in autonomous driving. It can easily become intractable when the ego-car pose changes w.r.t. the ground plane, which is common due to slight fluctuations in road smoothness and slope. Due to the lack of insight into industrial applications, existing methods on open datasets neglect the camera pose information, which inevitably results in the detector being susceptible to camera extrinsic parameters. Such pose perturbation is very common in industrial autonomous driving settings. To this end, we propose a novel method that captures the camera pose to formulate a detector free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting the vanishing point and horizon change. A converter is designed to rectify perturbative features in the latent space. By doing so, our 3D detector works independently of the extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, where almost all existing monocular detectors fail. Experiments demonstrate that our method yields the best performance compared with the other state-of-the-art methods, by a large margin, on both the KITTI 3D and nuScenes datasets.
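The geometric core of recovering extrinsics from the horizon/vanishing point can be sketched with a zero-yaw, ideal-pinhole approximation: the horizon's vertical offset from the principal point gives the pitch, and its slope gives the roll. The formulas and sign conventions below are assumptions for illustration, not the paper's learned predictor or its latent-space converter.

```python
import numpy as np

def extrinsics_from_horizon(horizon_y_at_cx, horizon_slope, fx, fy, cy):
    """Estimate camera pitch and roll (radians) relative to the ground plane from
    a detected horizon line, under a zero-yaw, ideal-pinhole approximation.

    horizon_y_at_cx : vertical image coordinate of the horizon at the principal column
    horizon_slope   : dy/dx of the horizon line in pixel coordinates
    Positive pitch here means the camera is tilted toward the ground.
    """
    pitch = np.arctan2(cy - horizon_y_at_cx, fy)
    roll = np.arctan(horizon_slope * fx / fy)       # undo the pixel aspect ratio
    return pitch, roll

# Toy usage: a 1920x1080 camera whose detected horizon sits 40 px above the principal point.
fx = fy = 1000.0
cy = 540.0
pitch, roll = extrinsics_from_horizon(horizon_y_at_cx=cy - 40.0, horizon_slope=0.01,
                                      fx=fx, fy=fy, cy=cy)
# pitch ~ 0.04 rad, roll ~ 0.01 rad; the detector can use these to rectify features or boxes.
```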
Citations: 73
Defending Multimodal Fusion Models against Single-Source Adversaries
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00335
Karren D. Yang, Wan-Yi Lin, M. Barman, Filipe Condessa, Zico Kolter
Beyond achieving high performance across many vision tasks, multimodal models are expected to be robust to single-source faults due to the availability of redundant information between modalities. In this paper, we investigate the robustness of multimodal neural networks against worst-case (i.e., adversarial) perturbations on a single modality. We first show that standard multimodal fusion models are vulnerable to single-source adversaries: an attack on any single modality can overcome the correct information from multiple unperturbed modalities and cause the model to fail. This surprising vulnerability holds across diverse multimodal tasks and necessitates a solution. Motivated by this finding, we propose an adversarially robust fusion strategy that trains the model to compare information coming from all the input sources, detect inconsistencies in the perturbed modality compared to the other modalities, and only allow information from the unperturbed modalities to pass through. Our approach significantly improves on state-of-the-art methods in single-source robustness, achieving gains of 7.8-25.2% on action recognition, 19.7-48.2% on object detection, and 1.6-6.7% on sentiment analysis, without degrading performance on unperturbed (i.e., clean) data.
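One way to picture the defense: embed each modality separately, measure how much each embedding agrees with the others, and gate down the outlier before fusion. The agreement score and softmax gating below are an illustrative stand-in for the paper's learned inconsistency detection and robust fusion networks.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def robust_fuse(modality_embeddings, temperature=0.1):
    """modality_embeddings: list of (d,) vectors, one per modality.
    Weight each modality by its mean cosine agreement with the others, so a
    single perturbed (inconsistent) modality receives a small fusion weight."""
    E = np.stack([e / (np.linalg.norm(e) + 1e-8) for e in modality_embeddings])
    sims = E @ E.T                                     # pairwise cosine similarities
    m = len(modality_embeddings)
    agreement = (sims.sum(axis=1) - 1.0) / (m - 1)     # exclude self-similarity
    weights = softmax(agreement / temperature)
    return weights @ E, weights

# Toy usage: three modalities, the third one adversarially perturbed.
rng = np.random.default_rng(0)
clean = rng.normal(size=64)
emb_rgb = clean + 0.05 * rng.normal(size=64)
emb_depth = clean + 0.05 * rng.normal(size=64)
emb_attacked = rng.normal(size=64)                    # no longer consistent with the others
fused, w = robust_fuse([emb_rgb, emb_depth, emb_attacked])
# w puts most of its mass on the two consistent modalities.
```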
Citations: 16
Co-Attention for Conditioned Image Matching
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01566
Olivia Wiles, Sébastien Ehrhardt, Andrew Zisserman
We propose a new approach to determine correspondences between image pairs in the wild under large changes in illumination, viewpoint, context, and material. While other approaches find correspondences between pairs of images by treating the images independently, we instead condition on both images to implicitly take account of the differences between them. To achieve this, we introduce (i) a spatial attention mechanism (a co-attention module, CoAM) for conditioning the learned features on both images, and (ii) a distinctiveness score used to choose the best matches at test time. CoAM can be added to standard architectures and trained using self-supervision or supervised data, and achieves a significant performance improvement under hard conditions, e.g. large viewpoint changes. We demonstrate that models using CoAM achieve state of the art or competitive results on a wide range of tasks: local matching, camera localization, 3D reconstruction, and image stylization.
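A minimal sketch of spatial co-attention between two feature maps: every location in image A attends over all locations of image B, and the attended features are concatenated back onto A's features so they are conditioned on both images. The shapes and plain dot-product attention are assumptions for illustration; the paper's CoAM also involves learned projections and a distinctiveness score.

```python
import numpy as np

def coattend(feat_a, feat_b):
    """feat_a, feat_b: (H, W, C) feature maps from the two images.
    Returns (H, W, 2C): feat_a concatenated with B-features attended per A-location."""
    H, W, C = feat_a.shape
    A = feat_a.reshape(-1, C)                    # (HW, C)
    B = feat_b.reshape(-1, C)                    # (HW, C)
    logits = A @ B.T / np.sqrt(C)                # similarity of every A-B location pair
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    attended = attn @ B                          # (HW, C) B-features summarized per A-location
    return np.concatenate([A, attended], axis=1).reshape(H, W, 2 * C)

# Toy usage with random feature maps standing in for CNN features of two views.
rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(16, 16, 64)), rng.normal(size=(16, 16, 64))
conditioned = coattend(fa, fb)    # A's features, now conditioned on image B
```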
Citations: 32
Adversarial Invariant Learning
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.01226
Nanyang Ye, Jingxuan Tang, Huayu Deng, Xiao-Yun Zhou, Qianxiao Li, Zhenguo Li, Guang-Zhong Yang, Zhanxing Zhu
Though machine learning algorithms are able to achieve pattern recognition from the correlation between data and labels, the presence of spurious features in the data decreases the robustness of these learned relationships with respect to varied testing environments. This is known as the out-of-distribution (OoD) generalization problem. Recently, invariant risk minimization (IRM) attempts to tackle this issue by penalizing predictions based on the unstable spurious features in the data collected from different environments. However, similar to domain adaptation or domain generalization, a prevalent non-trivial limitation of these works is that the environment information is assigned by human specialists, i.e. a priori, or determined heuristically. An inappropriate group partitioning can dramatically deteriorate the OoD generalization, and this process is expensive and time-consuming. To deal with this issue, we propose a novel, theoretically principled min-max framework to iteratively construct a worst-case splitting, i.e. creating the most challenging environment splittings for the backbone learning paradigm (e.g. IRM) to learn the robust feature representation. We also design a differentiable training strategy to facilitate feasible gradient-based computation. Numerical experiments show that our algorithmic framework achieves superior and stable performance on various datasets, such as Colored MNIST and the Punctuated Stanford Sentiment Treebank (SST). Furthermore, we also find our algorithm to be robust even to a strong data poisoning attack. To the best of our knowledge, this is one of the first works to adopt a differentiable environment splitting method to enable stable predictions across environments without environment index information, achieving state-of-the-art performance on datasets with strong spurious correlations, such as Colored MNIST.
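To make the min-max idea concrete, the sketch below uses the IRMv1-style penalty with a squared loss, for which the dummy-classifier gradient has a closed form: the per-environment penalty is the squared mean of g_i = 2 z_i (z_i - y_i). The "adversarial split" then scans one-dimensional thresholds over g_i to find the two-way partition that maximizes the total penalty. This is a heavily simplified, assumption-laden stand-in for the paper's differentiable splitting strategy, not its actual algorithm.

```python
import numpy as np

def irm_penalty(z, y):
    """IRMv1-style penalty for squared loss on one environment: the gradient w.r.t.
    a scalar dummy classifier w (at w=1) of mean((w*z - y)^2), squared."""
    g = 2.0 * z * (z - y)
    return np.mean(g) ** 2

def worst_case_split(z, y):
    """Scan 1-D threshold splits on the per-sample gradient g_i and return the
    two-environment partition with the largest summed IRM penalty."""
    g = 2.0 * z * (z - y)
    order = np.argsort(g)
    best_penalty, best_mask = -np.inf, None
    for cut in range(1, len(g)):                 # keep at least one sample per environment
        env1, env2 = order[:cut], order[cut:]
        penalty = irm_penalty(z[env1], y[env1]) + irm_penalty(z[env2], y[env2])
        if penalty > best_penalty:
            best_penalty = penalty
            best_mask = np.isin(np.arange(len(g)), env2)
    return best_mask, best_penalty

# Toy usage: predictions z that rely on a spurious feature for half the samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
z = y + 0.1 * rng.normal(size=200)
z[:100] += 0.8 * (2 * y[:100] - 1)              # spurious boost on a subset
mask, penalty = worst_case_split(z, y)          # environments that expose the instability
```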
Citations: 7
A Self-boosting Framework for Automated Radiographic Report Generation
Pub Date : 2021-06-01 DOI: 10.1109/CVPR46437.2021.00246
Zhanyu Wang, Luping Zhou, Lei Wang, Xiu Li
Automated radiographic report generation is a challenging task since it requires generating paragraphs that describe fine-grained visual differences between cases, especially those between the diseased and the healthy. Existing image captioning methods commonly target generic images and lack a mechanism to meet this requirement. To bridge this gap, in this paper we propose a self-boosting framework that improves radiographic report generation through the cooperation of the main task of report generation and an auxiliary task of image-text matching. The two tasks are built as the two branches of a network model and influence each other in a cooperative way. On one hand, the image-text matching branch helps to learn highly text-correlated visual features so that the report generation branch outputs high-quality reports. On the other hand, the improved reports produced by the report generation branch provide additional harder samples for the image-text matching branch and push the latter to improve itself by learning better visual and text feature representations. This, in turn, helps improve the report generation branch again. The two branches are jointly trained so that they improve each other iteratively and progressively, and the whole model is self-boosted without requiring external resources. Experimental results demonstrate the effectiveness of our method on two public datasets, showing its superior performance over multiple state-of-the-art image captioning and medical report generation methods.
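A compact PyTorch sketch of how the two branches can share features and supervise each other: a report-generation loss and an image-text matching loss are combined, and the generated reports act as hard negatives for the matching branch. The tiny linear/embedding stand-ins, the single-token "report", the margin, and the loss weight are all placeholder assumptions; the actual system uses a full encoder-decoder trained on radiograph-report pairs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab, d = 100, 32

visual_enc = nn.Linear(64, d)          # stand-in for the shared image encoder
report_dec = nn.Linear(d, vocab)       # stand-in for the report decoder (one token for brevity)
text_enc = nn.Embedding(vocab, d)      # stand-in for the report/text encoder

params = list(visual_enc.parameters()) + list(report_dec.parameters()) + list(text_enc.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

images = torch.randn(8, 64)                      # pooled image features (toy data)
gt_tokens = torch.randint(0, vocab, (8,))        # ground-truth "reports" (one token each)

for step in range(100):
    v = visual_enc(images)                               # shared visual features
    # Branch 1: report generation loss.
    logits = report_dec(v)
    loss_gen = F.cross_entropy(logits, gt_tokens)
    # Branch 2: image-text matching, with generated reports as hard negatives.
    gen_tokens = logits.argmax(dim=1).detach()
    pos = F.cosine_similarity(v, text_enc(gt_tokens), dim=1)
    neg = F.cosine_similarity(v, text_enc(gen_tokens), dim=1)
    mask = (gen_tokens != gt_tokens).float()             # ignore negatives identical to the GT
    loss_match = (F.relu(0.2 - pos + neg) * mask).mean() # margin: GT report should match better
    loss = loss_gen + 0.5 * loss_match                   # the two branches boost each other
    opt.zero_grad()
    loss.backward()
    opt.step()
```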
Citations: 24