
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): Latest Publications

GateHUB: Gated History Unit with Background Suppression for Online Action Detection
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01930
Junwen Chen, Gaurav Mittal, Ye Yu, Yu Kong, Mei Chen
Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and has to solely rely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate parts of the history that are more informative to the prediction of the current frame. We present GateHUB, Gated History Unit with Background Suppression, that comprises a novel position-guided gated cross attention mechanism to enhance or suppress parts of the history as per how informative they are for current frame prediction. GateHUB further proposes Future-augmented History (FaH) to make history features more informative by using subsequently observed frames when available. In a single unified framework, GateHUB integrates the transformer's ability of long-range temporal modeling and the recurrent model's capacity to selectively encode relevant information. GateHUB also introduces a background suppression objective to further mitigate false positive background frames that closely resemble the action frames. Extensive validation on three benchmark datasets, THUMOS, TVSeries, and HDD, demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the existing best work. Furthermore, a flow free version of GateHUB is able to achieve higher or close accuracy at 2.8× higher frame rate compared to all existing methods that require both RGB and optical flow information for prediction.
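As a rough illustration of the gating idea described above, the following PyTorch sketch lets a current-frame query attend over history features that are first scaled by a learned per-token gate. It is only a minimal sketch under assumed shapes and names (`GatedHistoryAttention`, the sigmoid gate, 256-d features); it is not the authors' released model and omits the position guidance, FaH, and the background-suppression objective.

```python
# Minimal sketch, not the authors' code: a gated cross-attention step where the
# current frame queries the history and a learned gate suppresses or enhances
# each history token before attention.
import torch
import torch.nn as nn

class GatedHistoryAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # per-token gate in [0, 1]

    def forward(self, current, history):
        # current: (B, 1, D) query from the latest frame
        # history: (B, T, D) features of the frames observed so far
        g = self.gate(history)                       # how informative is each history token
        out, _ = self.attn(current, g * history, g * history)
        return out                                   # (B, 1, D) history-aware current feature

x_now, x_hist = torch.randn(2, 1, 256), torch.randn(2, 64, 256)
print(GatedHistoryAttention()(x_now, x_hist).shape)  # torch.Size([2, 1, 256])
```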
Cited by: 13
Recurring the Transformer for Video Action Recognition
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01367
Jie Yang, Xingbo Dong, Liujun Liu, Chaofu Zhang, Jiajun Shen, Dahai Yu
Existing video understanding approaches, such as 3D convolutional neural networks and Transformer-based methods, usually process videos in a clip-wise manner; hence huge GPU memory is needed and fixed-length video clips are usually required. To alleviate those issues, we introduce a novel Recurrent Vision Transformer (RViT) framework based on spatial-temporal representation learning to achieve the video action recognition task. Specifically, the proposed RViT is equipped with an attention gate to build interaction between the current frame input and the previous hidden state, thus temporally aggregating global-level inter-frame features through the hidden state. RViT is executed recurrently to process a video by taking the current frame and the previous hidden state. The RViT can capture both spatial and temporal features because of the attention gate and recurrent execution. Besides, the proposed RViT can work on variable-length video clips properly without requiring large GPU memory thanks to the frame-by-frame processing flow. Our experiment results demonstrate that RViT can achieve state-of-the-art performance on various datasets for the video recognition task. Specifically, RViT can achieve a top-1 accuracy of 81.5% on Kinetics-400, 92.31% on Jester, 67.9% on Something-Something-V2, and an mAP of 66.1% on Charades.
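The recurrent, frame-by-frame flow can be pictured with a small sketch: a hidden state of patch tokens is updated from each incoming frame through an attention gate. The layer sizes, the GRU-style convex update, and the `RecurrentAttentionCell` name are assumptions of this illustration, not the released RViT implementation.

```python
# Minimal sketch, not the released RViT: the hidden state attends to the current
# frame's tokens, and an attention gate blends the result into a new hidden state.
import torch
import torch.nn as nn

class RecurrentAttentionCell(nn.Module):
    def __init__(self, dim=192, heads=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, frame_tokens, hidden):
        # frame_tokens: (B, N, D) patch tokens of the current frame
        # hidden:       (B, N, D) state carrying the frames seen so far
        ctx, _ = self.attn(hidden, frame_tokens, frame_tokens)   # read the current frame
        z = self.gate(torch.cat([hidden, ctx], dim=-1))          # attention gate
        return z * ctx + (1.0 - z) * hidden                      # updated hidden state

cell, hidden = RecurrentAttentionCell(), torch.zeros(1, 196, 192)
for frame_tokens in torch.randn(8, 1, 196, 192):                 # process 8 frames in turn
    hidden = cell(frame_tokens, hidden)
print(hidden.shape)  # torch.Size([1, 196, 192])
```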
Cited by: 34
FAM: Visual Explanations for the Feature Representations from Deep Convolutional Networks
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01006
Yu-Xi Wu, Changhuai Chen, Jun Che, Shi Pu
In recent years, increasing attention has been drawn to the internal mechanisms of representation models. Traditional methods cannot fully explain the feature representations, especially if the images do not fit into any category. In this case, employing an existing class or the similarity to other images is unable to provide a complete and reliable visual explanation. To handle this task, we propose a novel visual explanation paradigm called Feature Activation Mapping (FAM) in this paper. Under this paradigm, Grad-FAM and Score-FAM are designed for visualizing feature representations. Unlike previous approaches, FAM locates the regions of images that contribute most to the feature vector itself. Extensive experiments and evaluations, both subjective and objective, showed that Score-FAM provides the most promising interpretable visual explanations for feature representations in Person Re-Identification. Furthermore, FAM can also be employed to analyze other vision tasks, such as self-supervised representation learning.
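In the spirit of this paradigm, a class-free saliency map can be obtained by backpropagating from the feature embedding itself rather than from a class score. The sketch below is an assumption-laden stand-in (ResNet-18 backbone, gradient-times-activation weighting), not the paper's exact Grad-FAM or Score-FAM formulation.

```python
# Minimal sketch, not the paper's method: weight the last conv activations by the
# gradient of the pooled feature's L2 norm, so no class label is involved.
import torch
import torch.nn.functional as F
import torchvision

backbone = torchvision.models.resnet18(weights=None).eval()
acts = {}

def keep(module, inputs, output):
    output.retain_grad()              # keep gradients of this intermediate activation
    acts["layer4"] = output

backbone.layer4.register_forward_hook(keep)

img = torch.randn(1, 3, 224, 224)
_ = backbone(img)                                     # forward pass fills acts["layer4"]
feature = acts["layer4"].mean(dim=(2, 3))             # (1, 512) pooled embedding
feature.norm().backward()                             # explain the feature itself

fam = F.relu((acts["layer4"].grad * acts["layer4"]).sum(dim=1, keepdim=True))
fam = F.interpolate(fam, size=img.shape[-2:], mode="bilinear", align_corners=False)
print(fam.shape)                                      # torch.Size([1, 1, 224, 224])
```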
Cited by: 0
Large-scale Video Panoptic Segmentation in the Wild: A Benchmark
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.02036
Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, Yi Yang
In this paper, we present a new large-scale dataset for the video panoptic segmentation task, which aims to assign semantic classes and track identities to all pixels in a video. As the ground truth for this task is difficult to annotate, previous datasets for video panoptic segmentation are limited by either small scales or the number of scenes. In contrast, our large-scale VIdeo Panoptic Segmentation in the Wild (VIPSeg) dataset provides 3,536 videos and 84,750 frames with pixel-level panoptic annotations, covering a wide range of real-world scenarios and categories. To the best of our knowledge, our VIPSeg is the first attempt to tackle the challenging video panoptic segmentation task in the wild by considering diverse scenarios. Based on VIPSeg, we evaluate existing video panoptic segmentation approaches and propose an efficient and effective clip-based baseline method to analyze our VIPSeg dataset. Our dataset is available at https://github.com/VIPSeg-Dataset/VIPSeg-Dataset/.
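To make the annotation target concrete, the toy layout below stores, per frame, a semantic-class map plus an instance-id map whose ids stay consistent across frames for tracking; the array names and id conventions are assumptions for illustration, not VIPSeg's released file format.

```python
# Minimal sketch of an assumed video panoptic target: semantic class per pixel,
# plus an instance id per pixel that is shared across frames by the same object.
import numpy as np

T, H, W = 4, 120, 160                                # a tiny 4-frame clip
semantic = np.zeros((T, H, W), dtype=np.int32)       # class per pixel (0 = background stuff)
instance = np.zeros((T, H, W), dtype=np.int32)       # 0 = stuff / no instance

for t in range(T):                                   # one "person" moving right across the clip
    x = 20 + 10 * t
    semantic[t, 40:80, x:x + 20] = 11                # hypothetical "person" class id
    instance[t, 40:80, x:x + 20] = 1                 # same id in every frame => same tracked identity

print(int((instance == 1).sum()))                    # pixels belonging to tracked instance 1
```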
Cited by: 37
End-to-End Reconstruction-Classification Learning for Face Forgery Detection
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00408
Junyi Cao, Chao Ma, Taiping Yao, Shen Chen, Shouhong Ding, Xiaokang Yang
Existing face forgery detectors mainly focus on specific forgery patterns like noise characteristics, local textures, or frequency statistics for forgery detection. This causes the learned representations to specialize to the known forgery patterns presented in the training set, and makes it difficult to detect forgeries with unknown patterns. In this paper, from a new perspective, we propose a forgery detection framework emphasizing the common compact representations of genuine faces based on reconstruction-classification learning. Reconstruction learning over real images enhances the learned representations to be aware of forgery patterns that are even unknown, while classification learning takes charge of mining the essential discrepancy between real and fake images, facilitating the understanding of forgeries. To achieve better representations, instead of only using the encoder in reconstruction learning, we build bipartite graphs over the encoder and decoder features in a multi-scale fashion. We further exploit the reconstruction difference as guidance of forgery traces on the graph output, which serves as the final representation and is fed into the classifier for forgery detection. The reconstruction and classification learning is optimized end-to-end. Extensive experiments on large-scale benchmark datasets demonstrate the superiority of the proposed method over the state of the art.
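The reconstruction-classification idea boils down to a joint objective: reconstruct genuine faces only, and classify real versus fake, trained end to end. The sketch below uses a toy encoder/decoder of my own choosing and omits the multi-scale bipartite graphs and the reconstruction-difference guidance, so it is an illustration rather than the paper's model.

```python
# Minimal sketch, not the paper's architecture: a reconstruction loss computed on
# real faces only, plus a binary real/fake classification loss, trained jointly.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
dec = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))

imgs = torch.randn(8, 3, 64, 64)
labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])      # 0 = real, 1 = fake
z = enc(imgs)
real = labels == 0
loss_rec = F.mse_loss(dec(z)[real], imgs[real])      # learn "genuine" appearance only
loss_cls = F.cross_entropy(head(z), labels)          # mine the real/fake discrepancy
(loss_rec + loss_cls).backward()                     # optimized end to end
```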
Cited by: 62
EI-CLIP: Entity-aware Interventional Contrastive Learning for E-commerce Cross-modal Retrieval
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01752
Haoyu Ma, Handong Zhao, Zhe Lin, Ajinkya Kale, Zhangyang Wang, Tong Yu, Jiuxiang Gu, Sunav Choudhary, Xiaohui Xie
Cross language-image modality retrieval in E-commerce is a fundamental problem for product search, recommendation, and marketing services. Extensive efforts have been made to conquer the cross-modal retrieval problem in the general domain. When it comes to E-commerce, a common practice is to adopt a pretrained model and finetune it on E-commerce data. Despite its simplicity, the performance is sub-optimal because the uniqueness of E-commerce multimodal data is overlooked. A few recent efforts [10], [72] have shown significant improvements over generic methods with customized designs for handling product images. Unfortunately, to the best of our knowledge, no existing method has addressed the unique challenges in the e-commerce language. This work studies a prominent one: the language contains a large collection of special-meaning entities, e.g., “Dissel (brand)”, “Top (category)”, “relaxed (fit)” in the fashion clothing business. By formulating such an out-of-distribution finetuning process in the Causal Inference paradigm, we view the erroneous semantics of these special entities as confounders that cause the retrieval failure. To rectify these semantics so that they align with e-commerce domain knowledge, we propose an intervention-based entity-aware contrastive learning framework with two modules, i.e., the Confounding Entity Selection Module and the Entity-Aware Learning Module. Our method achieves competitive performance on the E-commerce benchmark Fashion-Gen. Particularly, in top-1 accuracy (R@1), we observe 10.3% and 10.5% relative improvements over the closest baseline in image-to-text and text-to-image retrievals, respectively.
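The retrieval backbone behind such methods is a CLIP-style symmetric contrastive loss that pulls matched image-text product pairs together; the sketch below shows only that generic loss, not EI-CLIP's entity selection or intervention modules.

```python
# Minimal sketch of a generic CLIP-style contrastive objective (not EI-CLIP's
# entity-aware modules): matched image/text pairs sit on the diagonal.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0))            # i-th image matches i-th text
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(clip_contrastive_loss(torch.randn(16, 512), torch.randn(16, 512)))
```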
Cited by: 17
Improving Segmentation of the Inferior Alveolar Nerve through Deep Label Propagation
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.02046
Marco Cipriano, Stefano Allegretti, Federico Bolelli, F. Pollastri, C. Grana
Many recent works in dentistry and maxillofacial imagery have focused on Inferior Alveolar Nerve (IAN) canal detection. Unfortunately, the small size of available 3D maxillofacial datasets has strongly limited the performance of deep learning-based techniques. On the other hand, a huge amount of sparsely annotated data is produced every day by the regular procedures of maxillofacial practice. Although the amount of sparsely labeled images is significant, how to exploit those data remains an open problem, since the deep learning approach treats the presence of dense annotations as a crucial factor. Recent efforts in the literature have hence focused on developing label propagation techniques to expand sparse annotations into dense labels. However, the proposed methods proved only marginally effective for the purpose of segmenting the alveolar nerve in CBCT scans. This paper exploits and publicly releases a new 3D densely annotated dataset, through which we are able to train a deep label propagation model that obtains better results than those available in the literature. By combining a segmentation model trained on the 3D annotated data with label propagation, we significantly improve the state of the art in Inferior Alveolar Nerve segmentation.
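As a baseline intuition for label propagation, sparse voxel annotations can be densified by a purely geometric rule: give every voxel the label of its nearest annotated voxel. This naive stand-in (below, with made-up volume sizes and label ids) is not the paper's learned deep label propagation model.

```python
# Minimal sketch, not the paper's model: expand sparse voxel annotations into a
# dense label volume by nearest-annotated-voxel assignment.
import numpy as np
from scipy import ndimage

sparse = np.zeros((32, 32, 32), dtype=np.int64)      # 0 = unlabeled
sparse[10, 10, 10] = 1                               # a few annotated voxels (1 = nerve)
sparse[20, 25, 5] = 2                                # 2 = background

# Distance transform of the unlabeled mask, returning the index of the nearest label.
_, idx = ndimage.distance_transform_edt(sparse == 0, return_indices=True)
dense = sparse[tuple(idx)]
print(np.unique(dense))                              # [1 2] -- every voxel now carries a label
```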
Cited by: 11
Weakly Supervised Segmentation on Outdoor 4D point clouds with Temporal Matching and Spatial Graph Propagation
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01154
Hanyu Shi, Jiacheng Wei, Ruibo Li, Fayao Liu, Guosheng Lin
Existing point cloud segmentation methods require a large amount of annotated data, especially for outdoor point cloud scenes. Due to the complexity of outdoor 3D scenes, manual annotation of outdoor point clouds is time-consuming and expensive. In this paper, we study how to achieve scene understanding with limited annotated data. Treating 100 consecutive frames as a sequence, we divide the whole dataset into a series of sequences and annotate only 0.1% of the points in the first frame of each sequence to reduce the annotation requirements. This leads to a total annotation budget of 0.001%. We propose a novel temporal-spatial framework for effective weakly supervised learning to generate high-quality pseudo labels from these limited annotated data. Specifically, the framework contains two modules: a matching module in the temporal dimension to propagate pseudo labels across different frames, and a graph propagation module in the spatial dimension to propagate the information of pseudo labels to the entire point cloud of each frame. With only 0.001% annotations for training, experimental results on both SemanticKITTI and SemanticPOSS show that our weakly supervised two-stage framework is comparable to some existing fully supervised methods. We also evaluate our framework with 0.005% initial annotations on SemanticKITTI, and achieve a result close to the fully supervised backbone model.
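A bare-bones version of spatial propagation is a nearest-labeled-neighbour vote over the point cloud, shown below with synthetic points; the paper's temporal matching and learned graph propagation are not reproduced here, so treat it only as an illustration of how a handful of annotated seeds can yield dense pseudo labels.

```python
# Minimal sketch, not the paper's modules: spread the few annotated points' labels
# to every point in the frame via the nearest annotated neighbour.
import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(10000, 3) * 50                    # one LiDAR frame (x, y, z)
labels = np.full(10000, -1)                                # -1 = unlabeled
seed = np.random.choice(10000, size=10, replace=False)     # ~0.1% annotated points
labels[seed] = np.random.randint(0, 5, size=10)            # 5 hypothetical classes

tree = cKDTree(points[seed])
_, nn = tree.query(points, k=1)                            # nearest annotated point per point
pseudo = labels[seed][nn]                                  # dense pseudo labels
print(pseudo.shape, np.unique(pseudo))
```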
Cited by: 15
Data-Free Network Compression via Parametric Non-uniform Mixed Precision Quantization
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.00054
V. Chikin, Mikhail Antiukh
Deep Neural Networks (DNNs) usually have a large number of parameters and consume a huge volume of storage space, which limits the application of DNNs on memory-constrained devices. Network quantization is an appealing way to compress DNNs. However, most existing quantization methods require the training dataset and a fine-tuning procedure to preserve the quality of a full-precision model. These are unavailable in confidential scenarios due to personal privacy and security concerns. Focusing on this issue, we propose a novel data-free method for network compression called PNMQ, which employs Parametric Non-uniform Mixed precision Quantization to generate a quantized network. During the compression stage, the optimal parametric non-uniform quantization grid is calculated directly for each layer to minimize the quantization error. The user can directly specify the required compression ratio of a network, which the PNMQ algorithm uses to select the bitwidths of the layers. This method does not require any model retraining or expensive computation, which allows efficient implementations of network compression on edge devices. Extensive experiments have been conducted on various computer vision tasks and the results demonstrate that PNMQ achieves better performance than other state-of-the-art network compression methods.
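A non-uniform grid that reduces per-layer quantization error can be illustrated with a plain Lloyd (k-means-style) codebook fitted to the weights alone, i.e., without any training data; the function below is such a generic stand-in, not PNMQ's parametric grid or its bitwidth-selection step.

```python
# Minimal sketch of a data-free non-uniform quantizer (a Lloyd codebook), standing
# in for PNMQ's parametric grid: pick 2**bits levels that reduce the MSE of one
# layer's weights.
import numpy as np

def nonuniform_quantize(weights, bits=4, iters=20):
    w = weights.ravel()
    levels = np.quantile(w, np.linspace(0, 1, 2 ** bits))     # initial grid from quantiles
    for _ in range(iters):                                     # Lloyd iterations
        assign = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(len(levels)):
            if np.any(assign == k):
                levels[k] = w[assign == k].mean()              # move level to cluster mean
    return levels[assign].reshape(weights.shape), levels

w = np.random.randn(256, 128).astype(np.float32)               # one layer's weights
wq, grid = nonuniform_quantize(w, bits=3)
print(grid.size, float(np.mean((w - wq) ** 2)))                # 8 levels, small MSE
```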
Cited by: 1
Aug-NeRF: Training Stronger Neural Radiance Fields with Triple-Level Physically-Grounded Augmentations
Pub Date : 2022-06-01 DOI: 10.1109/CVPR52688.2022.01476
Tianlong Chen, Peihao Wang, Zhiwen Fan, Zhangyang Wang
Neural Radiance Field (NeRF) regresses a neural parameterized scene by differentiably rendering multi-view images with ground-truth supervision. However, when interpolating novel views, NeRF often yields inconsistent and visually non-smooth geometric results, which we consider a generalization gap between seen and unseen views. Recent advances in convolutional neural networks have demonstrated the promise of advanced robust data augmentations, either random or learned, in enhancing both in-distribution and out-of-distribution generalization. Inspired by that, we propose Augmented NeRF (Aug-NeRF), which for the first time brings the power of robust data augmentations into regularizing the NeRF training. Particularly, our proposal learns to seamlessly blend worst-case perturbations into three distinct levels of the NeRF pipeline with physical grounds, including (1) the input coordinates, to simulate imprecise camera parameters at image capture; (2) intermediate features, to smoothen the intrinsic feature manifold; and (3) pre-rendering output, to account for the potential degradation factors in the multi-view image supervision. Extensive results demonstrate that Aug-NeRF effectively boosts NeRF performance in both novel view synthesis (up to 1.5dB PSNR gain) and underlying geometry reconstruction. Furthermore, thanks to the implicit smooth prior injected by the triple-level augmentations, Aug-NeRF can even recover scenes from heavily corrupted images, a highly challenging setting untackled before. Our code is available at https://github.com/VITA-Group/Aug-NeRF.
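Level (1), worst-case perturbation of the input coordinates, can be sketched as a single PGD-style step: nudge the sampled points in the direction that increases the rendering loss, then train on the perturbed points. The toy MLP, the epsilon bound, and the squared-error target below are placeholders of this sketch, not the Aug-NeRF implementation.

```python
# Minimal sketch, not Aug-NeRF: one worst-case (adversarial) perturbation of the
# input coordinates, followed by a training step on the perturbed points.
import torch
import torch.nn as nn

field = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))  # toy radiance field
coords = torch.rand(1024, 3)                      # sampled 3D points in the unit cube
target = torch.rand(1024, 4)                      # placeholder supervision (rgb + density)
eps = 1e-2                                        # perturbation budget

delta = torch.zeros_like(coords, requires_grad=True)
loss = ((field(coords + delta) - target) ** 2).mean()
loss.backward()                                   # gradient w.r.t. the coordinate offset
with torch.no_grad():
    worst = (coords + eps * delta.grad.sign()).clamp(0, 1)   # worst-case coordinates

field.zero_grad()                                 # train the field on the perturbed input
train_loss = ((field(worst) - target) ** 2).mean()
train_loss.backward()
```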
Cited by: 32