X-MIR: EXplainable Medical Image Retrieval
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00161
Brian Hu, Bhavan Kumar Vasu, Anthony J. Hoogs
Despite significant progress in the past few years, machine learning systems are still often viewed as "black boxes," which lack the ability to explain their output decisions. In high-stakes situations such as healthcare, there is a need for explainable AI (XAI) tools that can help open up this black box. In contrast to approaches which largely tackle classification problems in the medical imaging domain, we address the less-studied problem of explainable image retrieval. We test our approach on a COVID-19 chest X-ray dataset and the ISIC 2017 skin lesion dataset, showing that saliency maps help reveal the image features used by models to determine image similarity. We evaluated three different saliency algorithms, which were occlusion-based, attention-based, or based on a form of activation mapping. We also develop quantitative evaluation metrics that allow us to go beyond simple qualitative comparisons of the different saliency algorithms. Our results have the potential to aid clinicians when viewing medical images and address an urgent need for interventional tools in response to COVID-19. The source code is publicly available at: https://gitlab.kitware.com/brianhhu/x-mir.
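As a rough illustration of the occlusion-based family of saliency methods evaluated here, the sketch below perturbs patches of a retrieved image and records how much its similarity to the query embedding drops; `embed_fn`, the patch size and the stride are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of occlusion-based similarity saliency for image retrieval.
# `embed_fn` is a hypothetical stand-in for the retrieval backbone.
import numpy as np

def cosine(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def occlusion_similarity_saliency(query_emb, image, embed_fn,
                                  patch=32, stride=16, fill=0.0):
    """Slide an occluder over `image` (H x W x C) and record how much each
    region contributes to the image's similarity with the query embedding."""
    h, w = image.shape[:2]
    base_sim = cosine(query_emb, embed_fn(image))
    heat = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill
            drop = base_sim - cosine(query_emb, embed_fn(occluded))
            heat[y:y + patch, x:x + patch] += drop
            count[y:y + patch, x:x + patch] += 1
    return heat / np.maximum(count, 1.0)  # high values = regions driving similarity
```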
{"title":"X-MIR: EXplainable Medical Image Retrieval","authors":"Brian Hu, Bhavan Kumar Vasu, Anthony J. Hoogs","doi":"10.1109/WACV51458.2022.00161","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00161","url":null,"abstract":"Despite significant progress in the past few years, machine learning systems are still often viewed as \"black boxes,\" which lack the ability to explain their output decisions. In high-stakes situations such as healthcare, there is a need for explainable AI (XAI) tools that can help open up this black box. In contrast to approaches which largely tackle classification problems in the medical imaging domain, we address the less-studied problem of explainable image retrieval. We test our approach on a COVID-19 chest X-ray dataset and the ISIC 2017 skin lesion dataset, showing that saliency maps help reveal the image features used by models to determine image similarity. We evaluated three different saliency algorithms, which were either occlusion-based, attention-based, or relied on a form of activation mapping. We also develop quantitative evaluation metrics that allow us to go beyond simple qualitative comparisons of the different saliency algorithms. Our results have the potential to aid clinicians when viewing medical images and addresses an urgent need for interventional tools in response to COVID-19. The source code is publicly available at: https://gitlab.kitware.com/brianhhu/x-mir.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133179525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-supervised Test-time Adaptation on Video Data
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00266
Fatemeh Azimi, Sebastián M. Palacio, Federico Raue, Jörn Hees, Luca Bertinetto, A. Dengel
In typical computer vision problems revolving around video data, pre-trained models are simply evaluated at test time, without adaptation. This general approach clearly cannot capture the shifts that will likely arise between the distributions from which training and test data have been sampled. Adapting a pre-trained model to a new video encountered at test time could be essential to avoid the potentially catastrophic effects of such shifts. However, given the inherent impossibility of labeling data only available at test-time, traditional "fine-tuning" techniques cannot be leveraged in this highly practical scenario. This paper explores whether the recent progress in test-time adaptation in the image domain and self-supervised learning can be leveraged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts. In our experiments, we show that test-time adaptation approaches applied to self-supervised methods are always beneficial, but also that the extent of their effectiveness largely depends on the specific combination of the algorithms used for adaptation and self-supervision, and also on the type of covariate shift taking place.
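For readers unfamiliar with the setup, the sketch below shows one simple test-time adaptation recipe of the kind studied in combination with self-supervision here: a few steps of entropy minimisation on the unlabeled frames of a test clip before predicting (TENT-style). The model, step count and learning rate are placeholders, not the authors' exact protocol.

```python
# Hedged sketch of entropy-minimisation test-time adaptation on one test clip.
import torch
import torch.nn.functional as F

def adapt_then_predict(model, clip, steps=3, lr=1e-4):
    """Adapt `model` on one unlabeled test clip (N, C, H, W frames) by
    minimising prediction entropy, then return its predictions."""
    model.train()                                   # normalisation layers update their statistics
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(clip), dim=1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
        opt.zero_grad()
        entropy.backward()                          # no labels needed
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(clip).argmax(dim=1)
```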
{"title":"Self-supervised Test-time Adaptation on Video Data","authors":"Fatemeh Azimi, Sebastián M. Palacio, Federico Raue, Jörn Hees, Luca Bertinetto, A. Dengel","doi":"10.1109/WACV51458.2022.00266","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00266","url":null,"abstract":"In typical computer vision problems revolving around video data, pre-trained models are simply evaluated at test time, without adaptation. This general approach clearly cannot capture the shifts that will likely arise between the distributions from which training and test data have been sampled. Adapting a pre-trained model to a new video en-countered at test time could be essential to avoid the potentially catastrophic effects of such shifts. However, given the inherent impossibility of labeling data only available at test-time, traditional \"fine-tuning\" techniques cannot be lever-aged in this highly practical scenario. This paper explores whether the recent progress in test-time adaptation in the image domain and self-supervised learning can be lever-aged to adapt a model to previously unseen and unlabelled videos presenting both mild (but arbitrary) and severe covariate shifts. In our experiments, we show that test-time adaptation approaches applied to self-supervised methods are always beneficial, but also that the extent of their effectiveness largely depends on the specific combination of the algorithms used for adaptation and self-supervision, and also on the type of covariate shift taking place.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132446911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weakly-Supervised Convolutional Neural Networks for Vessel Segmentation in Cerebral Angiography
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00328
Arvind Vepa, Andy Choi, Noor Nakhaei, Wonjun Lee, Noah Stier, Andrew Vu, Greyson Jenkins, Xiaoyan Yang, Manjot Shergill, Moira Desphy, K. Delao, M. Levy, Cristopher Garduno, Lacy Nelson, Wan-Ching Liu, Fan Hung, F. Scalzo
Automated vessel segmentation in cerebral digital subtraction angiography (DSA) has significant clinical utility in the management of cerebrovascular diseases. Although deep learning has become the foundation for state-of-the-art image segmentation, a significant amount of labeled data is needed for training. Furthermore, due to domain differences, pre-trained networks cannot be applied to DSA data out-of-the-box. To address this, we propose a novel learning framework, which utilizes an active contour model for weak supervision and low-cost human-in-the-loop strategies to improve weak label quality. Our study produces several significant results, including state-of-the-art results for cerebral DSA vessel segmentation, which exceed human annotator quality, and an analysis of annotation cost and model performance trade-offs when utilizing weak supervision strategies. For comparison purposes, we also demonstrate our approach on the Digital Retinal Images for Vessel Extraction (DRIVE) dataset. Additionally, we will be publicly releasing code to reproduce our methodology and our dataset, the largest known high-quality annotated cerebral DSA vessel segmentation dataset.
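The sketch below illustrates the general idea of producing weak vessel labels with an off-the-shelf active contour model (morphological Chan-Vese from scikit-image); the paper's actual active contour formulation and its human-in-the-loop refinement are not reproduced here.

```python
# Hedged sketch: cheap, automatically generated weak labels via an active contour model.
from skimage import io, img_as_float
from skimage.segmentation import morphological_chan_vese
import numpy as np

def weak_vessel_label(dsa_path, iterations=100):
    """Produce a coarse, automatically generated vessel mask for one DSA frame."""
    frame = img_as_float(io.imread(dsa_path, as_gray=True))
    inverted = 1.0 - frame                      # vessels are dark in DSA; make them bright
    mask = morphological_chan_vese(inverted, iterations, smoothing=2)
    return mask.astype(np.uint8)                # weak label to train the segmentation CNN on
```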
{"title":"Weakly-Supervised Convolutional Neural Networks for Vessel Segmentation in Cerebral Angiography","authors":"Arvind Vepa, Andy Choi, Noor Nakhaei, Wonjun Lee, Noah Stier, Andrew Vu, Greyson Jenkins, Xiaoyan Yang, Manjot Shergill, Moira Desphy, K. Delao, M. Levy, Cristopher Garduno, Lacy Nelson, Wan-Ching Liu, Fan Hung, F. Scalzo","doi":"10.1109/WACV51458.2022.00328","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00328","url":null,"abstract":"Automated vessel segmentation in cerebral digital subtraction angiography (DSA) has significant clinical utility in the management of cerebrovascular diseases. Although deep learning has become the foundation for state-of-the-art image segmentation, a significant amount of labeled data is needed for training. Furthermore, due to domain differences, pre-trained networks cannot be applied to DSA data out-of-the-box. To address this, we propose a novel learning framework, which utilizes an active contour model for weak supervision and low-cost human-in-the-loop strategies to improve weak label quality. Our study produces several significant results, including state-of-the-art results for cerebral DSA vessel segmentation, which exceed human annotator quality, and an analysis of annotation cost and model performance trade-offs when utilizing weak supervision strategies. For comparison purposes, we also demonstrate our approach on the Digital Retinal Images for Vessel Extraction (DRIVE) dataset. Additionally, we will be publicly releasing code to reproduce our methodology and our dataset, the largest known high-quality annotated cerebral DSA vessel segmentation dataset.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132428115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SWAG-V: Explanations for Video using Superpixels Weighted by Average Gradients
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00164
Thomas Hartley, K. Sidorov, Christopher Willis, A. D. Marshall
CNN architectures that take video as input are often overlooked when it comes to the development of explanation techniques. This is despite their use in critical domains such as surveillance and healthcare. Explanation techniques developed for these networks must take into account the additional temporal domain if they are to be successful. In this paper we introduce SWAG-V, an extension of SWAG for use with networks that take video as an input. In addition, we show how these explanations can be created in such a way that they are balanced between fine and coarse explanations. By creating superpixels that incorporate the frames of the input video we are able to create explanations that better locate regions of the input that are important to the network's prediction. We compare SWAG-V against a number of similar techniques using metrics such as insertion and deletion, and weak localisation. We compute these metrics on Kinetics-400 with both the C3D and R(2+1)D network architectures and find that SWAG-V is able to outperform multiple techniques.
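The deletion metric mentioned above can be summarised as: erase the most salient locations first and track how quickly the class score collapses. The sketch below is a generic version under assumed tensor shapes, not the exact SWAG-V evaluation code.

```python
# Hedged sketch of the deletion metric for a video classifier.
import numpy as np
import torch

def deletion_score(model, video, saliency, target_class, step=0.02, fill=0.0):
    """`video` is a (1, C, T, H, W) tensor and `saliency` a (T, H, W) array
    aligned with it; returns the average class score as the most salient
    locations are progressively erased (lower = sharper explanation)."""
    order = np.argsort(saliency.ravel())[::-1].copy()   # most salient first
    flat = video.reshape(video.shape[0], video.shape[1], -1)
    scores = []
    with torch.no_grad():
        for frac in np.arange(0.0, 1.0 + step, step):
            k = int(frac * order.size)
            erased = flat.clone()
            erased[:, :, order[:k]] = fill               # erase top-k locations in every channel
            logits = model(erased.reshape(video.shape))
            scores.append(torch.softmax(logits, 1)[0, target_class].item())
    return float(np.mean(scores))                        # ~ area under the deletion curve
```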
{"title":"SWAG-V: Explanations for Video using Superpixels Weighted by Average Gradients","authors":"Thomas Hartley, K. Sidorov, Christopher Willis, A. D. Marshall","doi":"10.1109/WACV51458.2022.00164","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00164","url":null,"abstract":"CNN architectures that take videos as an input are often overlooked when it comes to the development of explanation techniques. This is despite their use in often critical domains such as surveillance and healthcare. Explanation techniques developed for these networks must take into account the additional temporal domain if they are to be successful. In this paper we introduce SWAG-V, an extension of SWAG for use with networks that take video as an input. In addition we show how these explanations can be created in such a way that they are balanced between fine and coarse explanations. By creating superpixels that incorporate the frames of the input video we are able to create explanations that better locate regions of the input that are important to the networks prediction. We compare SWAG-V against a number of similar techniques using metrics such as insertion and deletion, and weak localisation. We compute these using Kinetics-400 with both the C3D and R(2+1)D network architectures and find that SWAG-V is able to outperform multiple techniques.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115084659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Efficient Restoration of Extremely Dark Light Fields
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00321
Mohit Lamba, K. Mitra
The ability of Light Field (LF) cameras to capture the 3D geometry of a scene in a single photographic exposure has become central to several applications ranging from passive depth estimation to post-capture refocusing and view synthesis. But these LF applications break down in extreme low-light conditions due to excessive noise and poor image photometry. Existing low-light restoration techniques are inappropriate because they either do not leverage LF’s multi-view perspective or have enormous time and memory complexity. We propose a three-stage network that is simultaneously fast and accurate for real-world applications. Our accuracy comes from the fact that our three-stage architecture utilizes the global, local and view-specific information present in low-light LFs and fuses them using an RNN-inspired feedforward network. We are fast because we restore multiple views simultaneously and so require fewer forward passes. Besides these advantages, our network is flexible enough to restore an m × m LF during inference even if trained for a smaller n × n (n < m) LF, without any finetuning. Extensive experiments on real low-light LFs demonstrate that, compared to the current state-of-the-art, our model can achieve up to 1 dB higher restoration PSNR, with a 9× speedup, 23% smaller model size and about 5× fewer floating-point operations.
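For reference, the reported gains are in restoration PSNR; a 1 dB improvement corresponds to roughly 20% lower mean squared error. A minimal PSNR computation for a restored view (assuming images scaled to [0, 1]) looks like this:

```python
# Minimal PSNR reference implementation (not the authors' evaluation code).
import numpy as np

def psnr(restored, ground_truth, peak=1.0):
    """PSNR in dB between a restored view and its ground truth (floats in [0, 1])."""
    mse = np.mean((np.asarray(restored, dtype=np.float64) -
                   np.asarray(ground_truth, dtype=np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```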
{"title":"Fast and Efficient Restoration of Extremely Dark Light Fields","authors":"Mohit Lamba, K. Mitra","doi":"10.1109/WACV51458.2022.00321","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00321","url":null,"abstract":"The ability of Light Field (LF) cameras to capture the 3D geometry of a scene in a single photographic exposure has become central to several applications ranging from passive depth estimation to post-capture refocusing and view synthesis. But these LF applications break down in extreme low-light conditions due to excessive noise and poor image photometry. Existing low-light restoration techniques are inappropriate because they either do not leverage LF’s multi-view perspective or have enormous time and memory complexity. We propose a three-stage network that is simultaneously fast and accurate for real world applications. Our accuracy comes from the fact that our three stage architecture utilizes global, local and view-specific information present in low-light LFs and fuse them using an RNN inspired feedforward network. We are fast because we restore multiple views simultaneously and so require less number of forward passes. Besides these advantages, our network is flexible enough to restore a m × m LF during inference even if trained for a smaller n × n (n < m) LF without any finetuning. Extensive experiments on real low-light LF demonstrate that compared to the current state-of-the-art, our model can achieve up to 1 dB higher restoration PSNR, with 9× speedup, 23% smaller model size and about 5× lower floating-point operations.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116346339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SC-UDA: Style and Content Gaps aware Unsupervised Domain Adaptation for Object Detection
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00113
Fuxun Yu, Di Wang, Yinpeng Chen, Nikolaos Karianakis, Tong Shen, Pei Yu, Dimitrios Lymberopoulos, Sidi Lu, Weisong Shi, Xiang Chen
Current state-of-the-art object detectors can suffer a significant performance drop when deployed in the wild due to domain gaps with the training data. Unsupervised Domain Adaptation (UDA) is a promising approach to adapt detectors to new domains/environments without any expensive labeling cost. Previous mainstream UDA work for object detection has usually focused on image-level and/or feature-level adaptation using adversarial learning methods. In this work, we show that such adversarial-based methods can only reduce the domain style gap, but cannot address the domain content gap, which is also important for object detectors. To overcome this limitation, we propose the SC-UDA framework to concurrently reduce both gaps: we propose fine-grained domain style transfer to reduce the style gap while preserving finer image details for detecting small objects; we then leverage pseudo label-based self-training to reduce the content gap; and to address pseudo label error accumulation during self-training, we propose novel optimizations, including uncertainty-based pseudo labeling and an imbalanced mini-batch sampling strategy. Experimental results show that our approach consistently outperforms prior state-of-the-art methods (by up to 8.6%, 2.7% and 2.5% mAP on three UDA benchmarks).
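As a toy illustration of the pseudo label-based self-training step, the filtering below keeps only confident detections as pseudo ground truth for the next training round; the paper's uncertainty estimation and imbalanced mini-batch sampling are more involved than this sketch, and the detection format is assumed.

```python
# Hedged sketch of confidence-thresholded pseudo labeling for detector self-training.
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

def filter_pseudo_labels(detections: List[Tuple[Box, int, float]],
                         score_thresh: float = 0.8) -> List[Tuple[Box, int]]:
    """Keep only confident detections as pseudo ground truth for the next
    self-training round, dropping low-confidence boxes to limit error accumulation."""
    return [(box, cls) for box, cls, score in detections if score >= score_thresh]
```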
{"title":"SC-UDA: Style and Content Gaps aware Unsupervised Domain Adaptation for Object Detection","authors":"Fuxun Yu, Di Wang, Yinpeng Chen, Nikolaos Karianakis, Tong Shen, Pei Yu, Dimitrios Lymberopoulos, Sidi Lu, Weisong Shi, Xiang Chen","doi":"10.1109/WACV51458.2022.00113","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00113","url":null,"abstract":"Current state-of-the-art object detectors can have significant performance drop when deployed in the wild due to domain gaps with training data. Unsupervised Domain Adaptation (UDA) is a promising approach to adapt detectors for new domains/environments without any expensive label cost. Previous mainstream UDA works for object detection usually focused on image-level and/or feature-level adaptation by using adversarial learning methods. In this work, we show that such adversarial-based methods can only reduce domain style gap, but cannot address the domain content gap that is also important for object detectors. To overcome this limitation, we propose the SC-UDA framework to concurrently reduce both gaps: We propose fine-grained domain style transfer to reduce the style gaps with finer image details preserved for detecting small objects; Then we leverage the pseudo label-based self-training to reduce content gaps; To address pseudo label error accumulation during self-training, novel optimizations are proposed, including uncertainty-based pseudo labeling and imbalanced mini-batch sampling strategy. Experiment results show that our approach consistently outperforms prior state-of-the-art methods (up to 8.6%, 2.7% and 2.5% mAP on three UDA benchmarks).","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122992057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Feature Prior Guided Face Deblurring
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00096
S. Jung, Tae Bok Lee, Y. S. Heo
Most recent face deblurring methods have focused on utilizing facial shape priors such as face landmarks and parsing maps. While these priors can provide facial geometric cues effectively, they are insufficient to capture the local texture details that act as important clues for solving the face deblurring problem. To deal with this, we focus on estimating the deep features of pre-trained face recognition networks (e.g., the VGGFace network), which include rich information about sharp faces, as a prior, and adopt a generative adversarial network (GAN) to learn it. To this end, we propose a deep feature prior guided network (DFPGnet) that restores facial details using the deep feature prior estimated from a blurred image. In our DFPGnet, the generator is divided into two streams: a prior estimation stream and a deblurring stream. Since the estimated deep features of the prior estimation stream are learned from the VGGFace network, which is trained for face recognition rather than deblurring, we need to alleviate the discrepancy of feature distributions between the two streams. Therefore, we present feature transform modules at the connecting points of the two streams. In addition, we propose a channel-attention feature discriminator and a prior loss, which encourage the generator to focus, during training, on the channels of the deep feature prior that are more important for deblurring. Experimental results show that our method achieves state-of-the-art performance both qualitatively and quantitatively.
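A minimal sketch of what a deep-feature-prior supervision term could look like is given below: the prior estimation stream's output is pushed toward the features a frozen recognition network extracts from the sharp face. `face_net` is a generic stand-in for VGGFace, and the adversarial and channel-attention components of DFPGnet are not shown.

```python
# Hedged sketch of a feature-prior supervision loss.
import torch
import torch.nn.functional as F

def feature_prior_loss(estimated_prior, sharp_image, face_net):
    """L1 distance between the estimated deep feature prior and the features a
    frozen face recognition network extracts from the sharp image."""
    with torch.no_grad():
        target = face_net(sharp_image)      # reference features of the sharp face
    return F.l1_loss(estimated_prior, target)
```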
{"title":"Deep Feature Prior Guided Face Deblurring","authors":"S. Jung, Tae Bok Lee, Y. S. Heo","doi":"10.1109/WACV51458.2022.00096","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00096","url":null,"abstract":"Most recent face deblurring methods have focused on utilizing facial shape priors such as face landmarks and parsing maps. While these priors can provide facial geometric cues effectively, they are insufficient to contain local texture details that act as important clues to solve face deblurring problem. To deal with this, we focus on estimating the deep features of pre-trained face recognition networks (e.g., VGGFace network) that include rich information about sharp faces as a prior, and adopt a generative adversarial network (GAN) to learn it. To this end, we propose a deep feature prior guided network (DFPGnet) that restores facial details using the estimated the deep feature prior from a blurred image. In our DFPGnet, the generator is divided into two streams including prior estimation and deblurring streams. Since the estimated deep features of the prior estimation stream are learned from the VGGFace network which is trained for face recognition not for deblurring, we need to alleviate the discrepancy of feature distributions between the two streams. Therefore, we present feature transform modules at the connecting points of the two streams. In addition, we propose a channel-attention feature discriminator and prior loss, which encourages the generator to focus on more important channels for deblurring among the deep feature prior during training. Experimental results show that our method achieves state-of-the-art performance both qualitatively and quantitatively.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124951899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lane-Level Street Map Extraction from Aerial Imagery
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00156
Songtao He, Harinarayanan Balakrishnan
Digital maps with lane-level details are the foundation of many applications. However, creating and maintaining digital maps, especially maps with lane-level details, is labor-intensive and expensive. In this work, we propose a mapping pipeline to extract lane-level street maps from aerial imagery automatically. Our mapping pipeline first extracts lanes in non-intersection areas; it then enumerates all possible turning lanes at intersections, validates their connectivity, and extracts the valid turning lanes to complete the map. We evaluate the accuracy of our mapping pipeline on a dataset consisting of four U.S. cities, demonstrating the effectiveness of the proposed pipeline and the potential of scalable mapping solutions based on aerial imagery.
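The "enumerate, then validate" step for turning lanes can be pictured with the short sketch below; `is_valid_connection` stands in for whatever learned or geometric check the pipeline applies, and the lane representation is left abstract.

```python
# Hedged sketch of enumerating and filtering candidate turning lanes at an intersection.
from itertools import product

def candidate_turning_lanes(incoming_lanes, outgoing_lanes, is_valid_connection):
    """Enumerate every (incoming, outgoing) lane pair at an intersection and
    keep the connections that pass the validity check."""
    return [(src, dst)
            for src, dst in product(incoming_lanes, outgoing_lanes)
            if is_valid_connection(src, dst)]
```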
{"title":"Lane-Level Street Map Extraction from Aerial Imagery","authors":"Songtao He, Harinarayanan Balakrishnan","doi":"10.1109/WACV51458.2022.00156","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00156","url":null,"abstract":"Digital maps with lane-level details are the foundation of many applications. However, creating and maintaining digital maps especially maps with lane-level details, are labor-intensive and expensive. In this work, we propose a mapping pipeline to extract lane-level street maps from aerial imagery automatically. Our mapping pipeline first extracts lanes at non-intersection areas, then it enumerates all the possible turning lanes at intersections, validates the connectivity of them, and extracts the valid turning lanes to complete the map. We evaluate the accuracy of our mapping pipeline on a dataset consisting of four U.S. cities, demonstrating the effectiveness of our proposed mapping pipeline and the potential of scalable mapping solutions based on aerial imagery.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128692221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Temporal Video Procedure Segmentation from an Automatically Collected Large Dataset
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00279
Lei Ji, Chenfei Wu, Daisy Zhou, Kun Yan, Edward Cui, Xilin Chen, Nan Duan
Temporal Video Segmentation (TVS) is a fundamental video understanding task and has been widely researched in recent years. There are two subtasks of TVS: Video Action Segmentation (VAS) and Video Procedure Segmentation (VPS). VAS aims to recognize what actions happen inside the video, while VPS aims to segment the video into a sequence of video clips as a procedure. The VAS task inevitably relies on pre-defined action labels and is thus hard to scale to various open-domain videos. To overcome this limitation, the VPS task tries to divide a video into several category-independent procedure segments. However, the existing dataset for the VPS task is small (2k videos) and lacks diversity (only the cooking domain). To tackle these problems, we collect a large and diverse dataset called TIPS, specifically for the VPS task. TIPS contains 63k videos including more than 300k procedure segments from instructional videos on YouTube, covering plenty of how-to areas such as cooking, health, beauty, parenting, and gardening. We then propose a Multi-modal Transformer with Gaussian Boundary Detection (MT-GBD) model for VPS, built on a Transformer-and-convolution backbone. Furthermore, we propose a new EIOU metric for the VPS task, which helps evaluate VPS quality in a more comprehensive way. Experimental results show the effectiveness of our proposed model and metric.
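The abstract does not define EIOU, so as background the sketch below computes the plain temporal IoU between a predicted and a ground-truth procedure segment (given as start/end times), the overlap measure such segmentation metrics build on.

```python
# Plain temporal IoU between two procedure segments; EIOU itself is not specified here.
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) times in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```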
{"title":"Learning Temporal Video Procedure Segmentation from an Automatically Collected Large Dataset","authors":"Lei Ji, Chenfei Wu, Daisy Zhou, Kun Yan, Edward Cui, Xilin Chen, Nan Duan","doi":"10.1109/WACV51458.2022.00279","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00279","url":null,"abstract":"Temporal Video Segmentation (TVS) is a fundamental video understanding task and has been widely researched in recent years. There are two subtasks of TVS: Video Action Segmentation (VAS) and Video Procedure Segmentation (VPS): VAS aims to recognize what actions happen in-side the video while VPS aims to segment the video into a sequence of video clips as a procedure. The VAS task inevitably relies on pre-defined action labels and is thus hard to scale to various open-domain videos. To overcome this limitation, the VPS task tries to divide a video into several category-independent procedure segments. However, the existing dataset for the VPS task is small (2k videos) and lacks diversity (only cooking domain). To tackle these problems, we collect a large and diverse dataset called TIPS, specifically for the VPS task. TIPS contains 63k videos including more than 300k procedure segments from instructional videos on YouTube, which covers plenty of how-to areas such as cooking, health, beauty, parenting, gardening, etc. We then propose a multi-modal Transformer with Gaussian Boundary Detection (MT-GBD) model for VPS, with the backbone of the Transformer and Convolution. Furthermore, we propose a new EIOU metric for the VPS task, which helps better evaluate VPS quality in a more comprehensive way. Experimental results show the effectiveness of our proposed model and metric.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125158199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SporeAgent: Reinforced Scene-level Plausibility for Object Pose Refinement
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00027
Dominik Bauer, T. Patten, M. Vincze
Observational noise, inaccurate segmentation and ambiguity due to symmetry and occlusion lead to inaccurate object pose estimates. While depth- and RGB-based pose refinement approaches increase the accuracy of the resulting pose estimates, they are susceptible to ambiguity in the observation as they consider visual alignment. We propose to leverage the fact that we often observe static, rigid scenes. Thus, the objects therein need to be in physically plausible poses. We show that considering plausibility reduces ambiguity and, in consequence, allows poses to be more accurately predicted in cluttered environments. To this end, we extend a recent RL-based registration approach towards iterative refinement of object poses. Experiments on the LINEMOD and YCB-VIDEO datasets demonstrate the state-of-the-art performance of our depth-based refinement approach. Code is available at github.com/dornik/sporeagent.
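As a toy illustration of a physical-plausibility cue, the sketch below scores a candidate pose by how many transformed object points sink below an assumed z = 0 support plane; the paper's scene-level criteria are richer than this.

```python
# Toy plausibility check under an assumed known support plane (z = 0).
import numpy as np

def implausibility_score(object_points, rotation, translation, tolerance=0.005):
    """Fraction of object points that end up below the assumed z = 0 support
    plane after applying the candidate pose (rotation: 3x3, translation: 3-vector)."""
    scene_points = object_points @ rotation.T + translation
    below = scene_points[:, 2] < -tolerance
    return float(below.mean())
```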
{"title":"SporeAgent: Reinforced Scene-level Plausibility for Object Pose Refinement","authors":"Dominik Bauer, T. Patten, M. Vincze","doi":"10.1109/WACV51458.2022.00027","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00027","url":null,"abstract":"Observational noise, inaccurate segmentation and ambiguity due to symmetry and occlusion lead to inaccurate object pose estimates. While depth- and RGB-based pose refinement approaches increase the accuracy of the resulting pose estimates, they are susceptible to ambiguity in the observation as they consider visual alignment. We propose to leverage the fact that we often observe static, rigid scenes. Thus, the objects therein need to be under physically plausible poses. We show that considering plausibility reduces ambiguity and, in consequence, allows poses to be more accurately predicted in cluttered environments. To this end, we extend a recent RL-based registration approach towards iterative refinement of object poses. Experiments on the LINEMOD and YCB-VIDEO datasets demonstrate the state-of-the-art performance of our depth-based refinement approach. Code is available at github.com/dornik/sporeagent.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127230370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}