Self-Supervised 2D/3D Registration for X-Ray to CT Image Fusion
Pub Date: 2023-01-01, DOI: 10.1109/wacv56688.2023.00281
Srikrishna Jaganathan, Maximilian Kukla, Jian Wang, Karthik Shetty, Andreas Maier
Deep learning-based 2D/3D registration enables fast, robust, and accurate X-ray to CT image fusion when large annotated paired datasets are available for training. However, the need for paired CT volumes and X-ray images with ground-truth registration limits the applicability in interventional scenarios. An alternative is to use simulated X-ray projections from CT volumes, thus removing the need for paired annotated datasets. Deep neural networks trained exclusively on simulated X-ray projections can perform significantly worse on real X-ray images due to the domain gap. We propose a self-supervised 2D/3D registration framework combining simulated training with unsupervised feature- and pixel-space domain adaptation to overcome the domain gap and eliminate the need for paired annotated datasets. Our framework achieves a registration accuracy of 1.83 ± 1.16 mm with a high success ratio of 90.1% on real X-ray images, showing a 23.9% increase in success ratio compared to reference annotation-free algorithms.
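As a rough illustration of the training recipe described above (supervised registration on simulated projections plus unsupervised feature-space alignment to real X-rays), the following PyTorch sketch combines an assumed pose-regression loss on DRRs with an adversarial domain-discriminator term. The toy encoder, loss choices, and weight are assumptions for illustration only, not the authors' actual architecture or objective.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegistrationNet(nn.Module):
    # Toy encoder plus 6-DoF pose head; stands in for the 2D/3D registration model.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose_head = nn.Linear(32, 6)  # rotation + translation update

    def forward(self, x):
        feat = self.encoder(x)
        return self.pose_head(feat), feat

model = RegistrationNet()
discriminator = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))
opt_model = torch.optim.Adam(model.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(drr_batch, gt_pose, real_xray_batch):
    # 1) Train the discriminator to separate simulated-image features from real-image features.
    with torch.no_grad():
        _, feat_sim = model(drr_batch)
        _, feat_real = model(real_xray_batch)
    d_loss = bce(discriminator(feat_sim), torch.ones(len(feat_sim), 1)) + \
             bce(discriminator(feat_real), torch.zeros(len(feat_real), 1))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) Train the registration model: pose supervision on simulated DRRs, plus an adversarial
    #    term that pushes real-image features toward the simulated-feature distribution.
    pred_pose, _ = model(drr_batch)
    _, feat_real = model(real_xray_batch)
    reg_loss = F.mse_loss(pred_pose, gt_pose)
    adv_loss = bce(discriminator(feat_real), torch.ones(len(feat_real), 1))
    loss = reg_loss + 0.1 * adv_loss
    opt_model.zero_grad(); loss.backward(); opt_model.step()
    return loss.item()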
{"title":"Self-Supervised 2D/3D Registration for X-Ray to CT Image Fusion","authors":"Srikrishna Jaganathan, Maximilian Kukla, Jian Wang, Karthik Shetty, Andreas Maier","doi":"10.1109/wacv56688.2023.00281","DOIUrl":"https://doi.org/10.1109/wacv56688.2023.00281","url":null,"abstract":"Deep Learning-based 2D/3D registration enables fast, robust, and accurate X-ray to CT image fusion when large annotated paired datasets are available for training. However, the need for paired CT volume and X-ray images with ground truth registration limits the applicability in interventional scenarios. An alternative is to use simulated X-ray projections from CT volumes, thus removing the need for paired annotated datasets. Deep Neural Networks trained exclusively on simulated X-ray projections can perform significantly worse on real X-ray images due to the domain gap. We propose a self-supervised 2D/3D registration framework combining simulated training with unsupervised feature and pixel space domain adaptation to overcome the domain gap and eliminate the need for paired annotated datasets. Our framework achieves a registration accuracy of 1.83 ± 1.16 mm with a high success ratio of 90.1% on real X-ray images showing a 23.9% increase in success ratio compared to reference annotation-free algorithms.","PeriodicalId":497882,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"199 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135127081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ego-Vehicle Action Recognition based on Semi-Supervised Contrastive Learning
Pub Date: 2023-01-01, DOI: 10.1109/wacv56688.2023.00593
Chihiro Noguchi, Toshihiro Tanizawa
In recent years, many automobiles have been equipped with cameras, which have accumulated an enormous amount of video footage of driving scenes. Autonomous driving demands the highest level of safety, for which even unimaginably rare driving scenes have to be collected in training data to improve the recognition accuracy for specific scenes. However, it is prohibitively costly to find the few relevant scenes in such an enormous amount of video. In this article, we show that proper video-to-video distances can be defined by focusing on ego-vehicle actions. It is well known that existing methods based on supervised learning cannot handle videos that do not fall into predefined classes, though they work well in defining video-to-video distances in the embedding space between labeled videos. To tackle this problem, we propose a method based on semi-supervised contrastive learning. We consider two related but distinct contrastive learning approaches: standard graph contrastive learning and our proposed SOIA-based contrastive learning. We observe that the latter approach provides more sensible video-to-video distances between unlabeled videos. The effectiveness of our method is then quantified by evaluating the classification performance of ego-vehicle action recognition on the HDD dataset, which shows that our method, which includes unlabeled data in training, significantly outperforms existing methods that use only labeled data.
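For context on the contrastive objective mentioned above, here is a minimal NT-Xent-style loss between two augmented views of the same clips. It is a generic stand-in for a contrastive term over video embeddings and does not implement the SOIA-based pairing or the semi-supervised label handling proposed in the paper; the temperature and embedding size are illustrative assumptions.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    # Contrastive loss between two augmented views z1, z2 (each N x D) of the same clips.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)              # 2N x D
    sim = z @ z.t() / temperature               # pairwise cosine similarities as logits
    n = z1.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float('-inf'))  # exclude self-similarity
    # The positive for sample i is its other augmented view: i <-> i + n.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: embeddings of two augmentations of the same unlabeled driving clips.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent(z1, z2)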
{"title":"Ego-Vehicle Action Recognition based on Semi-Supervised Contrastive Learning","authors":"Chihiro Noguchi, Toshihiro Tanizawa","doi":"10.1109/wacv56688.2023.00593","DOIUrl":"https://doi.org/10.1109/wacv56688.2023.00593","url":null,"abstract":"In recent years, many automobiles have been equipped with cameras, which have accumulated an enormous amount of video footage of driving scenes. Autonomous driving demands the highest level of safety, for which even unimaginably rare driving scenes have to be collected in training data to improve the recognition accuracy for specific scenes. However, it is prohibitively costly to find very few specific scenes from an enormous amount of videos. In this article, we show that proper video-to-video distances can be defined by focusing on ego-vehicle actions. It is well known that existing methods based on supervised learning cannot handle videos that do not fall into predefined classes, though they work well in defining video-to-video distances in the embedding space between labeled videos. To tackle this problem, we propose a method based on semi-supervised contrastive learning. We consider two related but distinct contrastive learning: standard graph contrastive learning and our proposed SOIA-based contrastive learning. We observe that the latter approach can provide more sensible video-to-video distances between unlabeled videos. Next, the effectiveness of our method is quantified by evaluating the classification performance of the ego-vehicle action recognition using HDD dataset, which shows that our method including unlabeled data in training significantly outperforms the existing methods using only labeled data in training.","PeriodicalId":497882,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135470414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
TTTFlow: Unsupervised Test-Time Training with Normalizing Flow
Pub Date: 2023-01-01, DOI: 10.1109/wacv56688.2023.00216
David Osowiechi, Gustavo A. Vargas Hakim, Mehrdad Noori, Milad Cheraghalikhani, Ismail Ben Ayed, Christian Desrosiers
A major problem of deep neural networks for image classification is their vulnerability to domain changes at test-time. Recent methods have proposed to address this problem with test-time training (TTT), where a two-branch model is trained to learn a main classification task and also a self-supervised task used to perform test-time adaptation. However, these techniques require defining a proxy task specific to the target application. To tackle this limitation, we propose TTTFlow: a Y-shaped architecture using an unsupervised head based on Normalizing Flows to learn the normal distribution of latent features and detect domain shifts in test examples. At inference, keeping the unsupervised head fixed, we adapt the model to domain-shifted examples by maximizing the log likelihood of the Normalizing Flow. Our results show that our method can significantly improve the accuracy with respect to previous works.
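The adaptation step described above (keep the flow-based unsupervised head fixed and update the feature extractor to maximize the flow's log-likelihood on test features) can be sketched as follows. The single coupling layer, toy encoder, optimizer, and step count are assumptions for illustration, not the paper's exact design.

import math
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Minimal RealNVP-style coupling layer over D-dimensional feature vectors.
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                    # bounded scales for numerical stability
        z2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)
        return torch.cat([x1, z2], dim=1), log_det

def flow_log_likelihood(flow, features):
    # Log-likelihood of features under the flow with a standard-normal base distribution.
    z, log_det = flow(features)
    log_base = -0.5 * (z ** 2).sum(dim=1) - 0.5 * z.size(1) * math.log(2 * math.pi)
    return log_base + log_det

encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))  # toy encoder
flow = AffineCoupling(128)                   # unsupervised head, kept fixed at test time
optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3)

def adapt(test_batch, steps=10):
    # Update only the encoder so that test features become likely under the (frozen) flow.
    for _ in range(steps):
        feats = encoder(test_batch)
        nll = -flow_log_likelihood(flow, feats).mean()
        optimizer.zero_grad()
        nll.backward()
        optimizer.step()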
{"title":"TTTFlow: Unsupervised Test-Time Training with Normalizing Flow","authors":"David Osowiechi, Gustavo A. Vargas Hakim, Mehrdad Noori, Milad Cheraghalikhani, Ismail Ben Ayed, Christian Desrosiers","doi":"10.1109/wacv56688.2023.00216","DOIUrl":"https://doi.org/10.1109/wacv56688.2023.00216","url":null,"abstract":"A major problem of deep neural networks for image classification is their vulnerability to domain changes at test-time. Recent methods have proposed to address this problem with test-time training (TTT), where a two-branch model is trained to learn a main classification task and also a self-supervised task used to perform test-time adaptation. However, these techniques require defining a proxy task specific to the target application. To tackle this limitation, we propose TTTFlow: a Y-shaped architecture using an unsupervised head based on Normalizing Flows to learn the nor-mal distribution of latent features and detect domain shifts in test examples. At inference, keeping the unsupervised head fixed, we adapt the model to domain-shifted examples by maximizing the log likelihood of the Normalizing Flow. Our results show that our method can significantly improve the accuracy with respect to previous works.","PeriodicalId":497882,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135127082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visually explaining 3D-CNN predictions for video classification with an adaptive occlusion sensitivity analysis
Pub Date: 2023-01-01, DOI: 10.1109/wacv56688.2023.00156
Tomoki Uchiyama, Naoya Sogi, Koichiro Niinuma, Kazuhiro Fukui
This paper proposes a method for visually explaining the decision-making process of 3D convolutional neural networks (CNNs) with a temporal extension of occlusion sensitivity analysis. The key idea here is to occlude a specific volume of data with a 3D mask in the input 3D temporal-spatial data space and then measure the change degree in the output score. The occluded volume data that produces a larger change degree is regarded as a more critical element for classification. However, while occlusion sensitivity analysis is commonly used to analyze single-image classification, it is not so straightforward to apply this idea to video classification, as a simple fixed cuboid cannot deal with motions. To this end, we adapt the shape of a 3D occlusion mask to the complicated motions of target objects. Our flexible mask adaptation is performed by considering the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video data. We further propose to approximate our method by using the first-order partial derivative of the score with respect to an input image to reduce its computational cost. We demonstrate the effectiveness of our method through various and extensive comparisons with conventional methods in terms of the deletion/insertion metric and the pointing metric on UCF101. The code is available at: https://github.com/uchiyama33/AOSA.
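For reference, here is a minimal sketch of plain 3D occlusion sensitivity with a fixed spatio-temporal cuboid, i.e., the baseline that the adaptive, flow-guided mask extends; the adaptive mask itself is not reproduced here. The function name, cube size, stride, and fill value are assumptions, and the official code at the repository above should be consulted for the actual method.

import torch

@torch.no_grad()
def occlusion_sensitivity_3d(model, clip, target_class,
                             cube=(4, 16, 16), stride=(4, 16, 16), fill=0.0):
    # clip: (1, C, T, H, W) tensor; returns a (T', H', W') grid of score drops.
    base_score = model(clip).softmax(dim=1)[0, target_class].item()
    _, _, T, H, W = clip.shape
    ct, ch, cw = cube
    st, sh, sw = stride
    drops = []
    for t0 in range(0, T - ct + 1, st):
        plane = []
        for y0 in range(0, H - ch + 1, sh):
            row = []
            for x0 in range(0, W - cw + 1, sw):
                occluded = clip.clone()
                occluded[:, :, t0:t0 + ct, y0:y0 + ch, x0:x0 + cw] = fill
                score = model(occluded).softmax(dim=1)[0, target_class].item()
                row.append(base_score - score)  # larger drop => more critical region
            plane.append(row)
        drops.append(plane)
    return torch.tensor(drops)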
{"title":"Visually explaining 3D-CNN predictions for video classification with an adaptive occlusion sensitivity analysis","authors":"Tomoki Uchiyama, Naoya Sogi, Koichiro Niinuma, Kazuhiro Fukui","doi":"10.1109/wacv56688.2023.00156","DOIUrl":"https://doi.org/10.1109/wacv56688.2023.00156","url":null,"abstract":"This paper proposes a method for visually explaining the decision-making process of 3D convolutional neural networks (CNN) with a temporal extension of occlusion sensitivity analysis. The key idea here is to occlude a specific volume of data by a 3D mask in an input 3D temporalspatial data space and then measure the change degree in the output score. The occluded volume data that produces a larger change degree is regarded as a more critical element for classification. However, while the occlusion sensitivity analysis is commonly used to analyze single image classification, it is not so straightforward to apply this idea to video classification as a simple fixed cuboid cannot deal with the motions. To this end, we adapt the shape of a 3D occlusion mask to complicated motions of target objects. Our flexible mask adaptation is performed by considering the temporal continuity and spatial co-occurrence of the optical flows extracted from the input video data. We further propose to approximate our method by using the first-order partial derivative of the score with respect to an input image to reduce its computational cost. We demonstrate the effectiveness of our method through various and extensive comparisons with the conventional methods in terms of the deletion/insertion metric and the pointing metric on the UCF101. The code is available at: https://github.com/uchiyama33/AOSA.","PeriodicalId":497882,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135127083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
X-Align: Cross-Modal Cross-View Alignment for Bird’s-Eye-View Segmentation
Pub Date: 2023-01-01, DOI: 10.1109/wacv56688.2023.00330
Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai, Abdulaziz Almuzairee, Senthil Yogamani, Fatih Porikli
The bird’s-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras’ perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.
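As a hedged illustration of how the components listed above could combine into a single training objective, the sketch below sums a main BEV segmentation loss, a cross-modal feature-alignment term between camera and LiDAR BEV features, and an auxiliary perspective-view segmentation loss. The loss choices (cross-entropy, cosine alignment) and weights are placeholders, not the paper's exact X-FA/X-SA definitions.

import torch
import torch.nn.functional as F

def x_align_style_loss(bev_logits, bev_labels,        # fused BEV segmentation output / targets
                       cam_bev_feat, lidar_bev_feat,  # unimodal BEV feature maps (N, C, H, W)
                       pv_logits, pv_labels,          # auxiliary perspective-view branch
                       w_align=0.5, w_pv=0.5):
    # Main BEV segmentation loss on the fused prediction.
    seg_loss = F.cross_entropy(bev_logits, bev_labels)
    # Feature alignment: encourage camera and LiDAR BEV features to agree per cell.
    align_loss = 1.0 - F.cosine_similarity(cam_bev_feat, lidar_bev_feat, dim=1).mean()
    # Auxiliary PV segmentation supervision for the camera branch.
    aux_pv_loss = F.cross_entropy(pv_logits, pv_labels)
    return seg_loss + w_align * align_loss + w_pv * aux_pv_loss

Weighting the alignment and auxiliary terms relative to the main segmentation loss is a common design choice in such multi-task setups; the specific values here are illustrative only.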
{"title":"X-Align: Cross-Modal Cross-View Alignment for Bird’s-Eye-View Segmentation","authors":"Shubhankar Borse, Marvin Klingner, Varun Ravi Kumar, Hong Cai, Abdulaziz Almuzairee, Senthil Yogamani, Fatih Porikli","doi":"10.1109/wacv56688.2023.00330","DOIUrl":"https://doi.org/10.1109/wacv56688.2023.00330","url":null,"abstract":"Bird’s-eye-view (BEV) grid is a typical representation of the perception of road components, e.g., drivable area, in autonomous driving. Most existing approaches rely on cameras only to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. The latest works leverage both camera and LiDAR modalities but suboptimally fuse their features using simple, concatenation-based mechanisms.In this paper, we address these problems by enhancing the alignment of the unimodal features in order to aid feature fusion, as well as enhancing the alignment between the cameras’ perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel CrossModal Feature Alignment (X-FA) loss, (ii) an attentionbased Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method across two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state-of-the-art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.","PeriodicalId":497882,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136298132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}