
Latest Publications: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

FalCon: Fine-grained Feature Map Sparsity Computing with Decomposed Convolutions for Inference Optimization
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00369
Zirui Xu, Fuxun Yu, Chenxi Liu, Zhe Wu, Hongcheng Wang, Xiang Chen
Many works focus on a model's static parameter optimization (e.g., filters and weights) for CNN inference acceleration. Compared to parameter sparsity, feature map sparsity is input-dependent and therefore adapts better to each input. Practical sparsity patterns are non-structural and randomly located on feature maps with non-identical shapes. However, existing feature map sparsity works take computing efficiency as the primary goal, so they can only remove structural sparsity and fail to match the above characteristics. In this paper, we develop a novel sparsity computing scheme called FalCon, which adapts well to practical sparsity patterns while still maintaining efficient computing. Specifically, we first propose a decomposed convolution design that enables a fine-grained computing unit for sparsity. Additionally, a decomposed convolution computing optimization paradigm is proposed to convert the sparse computing units into practical acceleration. Extensive experiments show that FalCon achieves up to 67.30% theoretical computation reduction with a negligible accuracy drop while accelerating CNN inference by 37%.
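The decomposition can be pictured with a standard identity: a KxK convolution equals the sum of K*K spatially shifted 1x1 convolutions, and each 1x1 tap can then skip spatial positions whose shifted input activations are zero. The sketch below only verifies this identity in PyTorch; it is an illustrative assumption about the decomposition, not the authors' optimized implementation.

import torch
import torch.nn.functional as F

x = torch.randn(1, 4, 8, 8)
w = torch.randn(6, 4, 3, 3)

# Reference: dense 3x3 convolution with padding 1.
ref = F.conv2d(x, w, padding=1)

# Decompose the 3x3 kernel into 9 spatially shifted 1x1 convolutions.
xp = F.pad(x, (1, 1, 1, 1))
out = torch.zeros_like(ref)
for di in range(3):
    for dj in range(3):
        shifted = xp[:, :, di:di + 8, dj:dj + 8]          # input aligned with tap (di, dj)
        out += F.conv2d(shifted, w[:, :, di:di + 1, dj:dj + 1])

# Each 1x1 tap could skip positions where its shifted input is zero,
# which is the kind of fine-grained feature-map sparsity the paper targets.
assert torch.allclose(out, ref, atol=1e-5)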
Citations: 4
Non-local Attention Improves Description Generation for Retinal Images
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00331
Jia-Hong Huang, Ting-Wei Wu, C. Yang, Zenglin Shi, I-Hung Lin, J. Tegnér, M. Worring
Automatically generating medical reports from retinal images is a difficult task in which an algorithm must generate semantically coherent descriptions for a given retinal image. Existing methods mainly rely on the input image to generate descriptions. However, many abstract medical concepts or descriptions cannot be generated based on image information only. In this work, we integrate additional information to help solve this task; we observe that early in the diagnosis process, ophthalmologists usually write down a small set of keywords denoting important information. These keywords are subsequently used to aid the later creation of medical reports for a patient. Since these keywords commonly exist and are useful for generating medical reports, we incorporate them into automatic report generation. Since we have two types of inputs, expert-defined unordered keywords and images, effectively fusing features from these different modalities is challenging. To that end, we propose a new keyword-driven medical report generation method based on a non-local attention-based multi-modal feature fusion approach, TransFuser, which is capable of fusing features from different types of inputs based on such attention. Our experiments show that the proposed method successfully captures the mutual information of keywords and image content. We further show that our keyword-driven generation model reinforced by the TransFuser is superior to baselines under the popular text evaluation metrics BLEU, CIDEr, and ROUGE. TransFuser GitHub: https://github.com/Jhhuangkay/Non-local-Attention-ImprovesDescription-Generation-for-Retinal-Images.
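As a rough illustration of keyword-image fusion with non-local (scaled dot-product) attention, the snippet below lets image-region features attend over keyword embeddings using PyTorch's MultiheadAttention. The tensor shapes and the choice of image regions as queries are assumptions for illustration, not the TransFuser architecture itself.

import torch
import torch.nn as nn

d = 256
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

img_feats = torch.randn(2, 49, d)    # e.g. a 7x7 grid of CNN features per image
kw_embeds = torch.randn(2, 5, d)     # embeddings of 5 expert-defined keywords

# Each image region attends over the keyword set, yielding a keyword-aware
# image representation that a report decoder could condition on.
fused, weights = attn(query=img_feats, key=kw_embeds, value=kw_embeds)
print(fused.shape, weights.shape)    # torch.Size([2, 49, 256]) torch.Size([2, 49, 5])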
Citations: 6
Leveraging Test-Time Consensus Prediction for Robustness against Unseen Noise
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00362
Anindya Sarkar, Anirban Sarkar, V. Balasubramanian
We propose a method to improve DNN robustness against unseen noisy corruptions, such as Gaussian, shot, impulse, and speckle noise at different severity levels, by leveraging an ensemble technique through a consensus-based prediction method that uses self-supervised learning at inference time. We also propose to enhance model training by considering other aspects of the issue, i.e., noise in the data and better representation learning, which shows even better generalization performance with the consensus-based prediction strategy. We report results for each noisy corruption on the standard CIFAR10-C and ImageNet-C benchmarks, which show a significant boost in performance over previous methods. We also present results for MNIST-C and TinyImagenet-C to show the usefulness of our method across datasets of different complexities in providing robustness against unseen noise. We show results with different architectures to validate our method against other baseline methods, and also conduct experiments to show the usefulness of each part of our method.
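A minimal sketch of test-time consensus prediction, assuming the consensus is taken as an average of softmax outputs over several transformed views of the same input (the paper's actual consensus mechanism and self-supervised component are not reproduced here):

import torch
import torch.nn.functional as F

def consensus_predict(model, x, views):
    """Average class probabilities over several transformed views of x."""
    model.eval()
    with torch.no_grad():
        probs = [F.softmax(model(v(x)), dim=1) for v in views]
    return torch.stack(probs, dim=0).mean(dim=0).argmax(dim=1)

# Example views for an NCHW image batch: identity and a horizontal flip.
views = [lambda t: t, lambda t: torch.flip(t, dims=[3])]
# labels = consensus_predict(model, batch, views)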
Citations: 3
Inpaint2Learn: A Self-Supervised Framework for Affordance Learning
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00383
Lingzhi Zhang, Weiyu Du, Shenghao Zhou, Jiancong Wang, Jianbo Shi
Perceiving affordances, the opportunities for interaction in a scene, is a fundamental ability of humans. It is an equally important skill for AI agents and robots to better understand and interact with the world. However, labeling affordances in the environment is not a trivial task. To address this issue, we propose a task-agnostic framework, named Inpaint2Learn, that generates affordance labels in a fully automatic manner and opens the door for affordance learning in the wild. To demonstrate its effectiveness, we apply it to three different tasks: human affordance prediction, Location2Object, and 6D object pose hallucination. Our experiments and user studies show that our models, trained with the Inpaint2Learn scaffold, are able to generate diverse and visually plausible results in all three scenarios.
Citations: 4
C-VTON: Context-Driven Image-Based Virtual Try-On Network
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00226
Benjamin Fele, Ajda Lampe, P. Peer, Vitomir Štruc
Image-based virtual try-on techniques have shown great promise for enhancing the user experience and improving customer satisfaction on fashion-oriented e-commerce platforms. However, existing techniques are still limited in the quality of the try-on results they can produce from input images of diverse characteristics. In this work, we propose a Context-Driven Virtual Try-On Network (C-VTON) that addresses these limitations and convincingly transfers selected clothing items to the target subjects, even under challenging pose configurations and in the presence of self-occlusions. At the core of the C-VTON pipeline are: (i) a geometric matching procedure that efficiently aligns the target clothing with the pose of the person in the input images, and (ii) a powerful image generator that utilizes various types of contextual information when synthesizing the final try-on result. C-VTON is evaluated in rigorous experiments on the VITON and MPV datasets and compared with state-of-the-art techniques from the literature. Experimental results show that the proposed approach produces photo-realistic and visually convincing results and significantly improves on the existing state of the art.
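The geometric matching step can be pictured as predicting a warp and resampling the clothing image onto the person's pose. The fragment below applies a fixed affine warp with grid_sample purely for illustration; C-VTON's actual matching module and the warp parameters it predicts are not shown here.

import torch
import torch.nn.functional as F

cloth = torch.rand(1, 3, 256, 192)            # clothing image to be warped

# A 2x3 affine matrix that a (hypothetical) matching network would predict;
# here a fixed scale + translation stands in for that prediction.
theta = torch.tensor([[[0.9, 0.0, 0.10],
                       [0.0, 0.9, -0.05]]])

grid = F.affine_grid(theta, size=cloth.shape, align_corners=False)
warped = F.grid_sample(cloth, grid, align_corners=False)
print(warped.shape)                            # torch.Size([1, 3, 256, 192])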
Citations: 23
FLUID: Few-Shot Self-Supervised Image Deraining
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00049
Shyam Nandan Rai, Rohit Saluja, Chetan Arora, V. Balasubramanian, A. Subramanian, C.V. Jawahar
Self-supervised methods have shown promising results in denoising and dehazing tasks, where collecting paired datasets is challenging and expensive. However, we find that these methods fail to remove rain streaks when applied to image deraining tasks. Their poor performance is due to two explicit assumptions: (i) the distribution of noise or haze is uniform, and (ii) the value of a noisy or hazy pixel is independent of its neighbors. Rainy pixels, in contrast, are non-uniformly distributed and not necessarily dependent on their neighboring pixels. Hence, we conclude that a self-supervised method needs some prior knowledge about the rain distribution to perform the deraining task. To provide this knowledge, we hypothesize a network trained with minimal supervision to estimate the likelihood of rainy pixels. This leads to our proposed method, FLUID: Few-Shot Self-Supervised Image Deraining. We perform extensive experiments and comparisons with existing image deraining and few-shot image-to-image translation methods on the Rain 100L and DDN-SIRR datasets, which contain real and synthetic rainy images. In addition, we use the Rainy Cityscapes dataset to show that our method, trained in a few-shot setting, can improve semantic segmentation and object detection in rainy conditions. Our approach obtains a mIoU gain of 51.20 over the current best-performing deraining method. [Project Page]
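The prior knowledge described above amounts to a per-pixel rain-likelihood estimate. A toy version, assuming a small convolutional head with a sigmoid output (not the network or training scheme used in FLUID), might look like this:

import torch
import torch.nn as nn

# A tiny per-pixel rain-likelihood head (an assumption, not FLUID's network).
rain_head = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
)

rainy = torch.rand(1, 3, 64, 64)
p_rain = rain_head(rainy)                  # (1, 1, 64, 64): likelihood of rain per pixel
# A crude derained estimate could simply down-weight likely-rain pixels:
derained_guess = rainy * (1.0 - p_rain)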
Citations: 5
Less Can Be More: Sound Source Localization With a Classification Model
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00065
Arda Senocak, H. Ryu, Junsik Kim, In-So Kweon
In this paper, we tackle sound localization as a natural outcome of the audio-visual video classification problem. Unlike existing sound localization approaches, we do not use any explicit sub-modules or training mechanisms, but simply apply cross-modal attention on top of the representations learned by a classification loss. Our key contribution is to show that a simple audio-visual classification model can localize sound sources accurately and perform on par with state-of-the-art methods, proving that indeed "less is more". Furthermore, we propose potential applications that can be built on top of our model. First, we introduce informative moment selection to enhance localization task learning in existing approaches, compared to mid-frame usage. Then, we introduce a pseudo bounding box generation procedure that can significantly boost the performance of existing methods in semi-supervised settings, or be used for large-scale automatic annotation of any video dataset with minimal effort.
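The cross-modal attention can be approximated by cosine similarity between a clip-level audio embedding and each spatial location of the visual feature map, which yields a localization heatmap. The shapes below are assumptions for illustration; the classification backbone from the paper is not reproduced.

import torch
import torch.nn.functional as F

vis = torch.randn(1, 512, 7, 7)      # spatial visual features from a CNN
aud = torch.randn(1, 512)            # clip-level audio embedding

vis_flat = F.normalize(vis.flatten(2), dim=1)        # (1, 512, 49), unit-norm per location
aud_n = F.normalize(aud, dim=1).unsqueeze(2)         # (1, 512, 1)

# Cosine similarity per spatial location, reshaped into a 7x7 heatmap.
heatmap = (vis_flat * aud_n).sum(dim=1).view(1, 7, 7)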
Citations: 14
Few-Shot Open-Set Recognition of Hyperspectral Images with Outlier Calibration Network
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00215
Debabrata Pal, Valay Bundele, Renuka Sharma, Biplab Banerjee, Y. Jeppu
We tackle the few-shot open-set recognition (FSOSR) problem in the context of remote sensing hyperspectral image (HSI) classification. Prior research on OSR mainly relies on an empirical threshold over the class prediction scores to reject outlier samples. Further, recent efforts in few-shot HSI classification fail to recognize outliers due to the 'closed-set' nature of the problem and the fact that the full class distributions are unknown during training. To this end, we propose to optimize a novel outlier calibration network (OCN) together with a feature extraction module during the meta-training phase. The feature extractor is equipped with a novel residual 3D convolutional block attention network (R3CBAM) for enhanced spectral-spatial feature learning from HSI. Our method rejects outliers based on OCN prediction scores, removing the need for manual thresholding. Finally, we propose to augment the query set with synthesized support-set features during the similarity learning stage in order to combat the data scarcity issue of few-shot learning. The superiority of the proposed model is showcased on four benchmark HSI datasets.
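Conceptually, the OCN replaces a hand-tuned score threshold with a calibrated outlier probability. A minimal sketch of that decision rule, assuming prototype-based few-shot classification and an externally supplied outlier probability per query, is shown below; it is not the paper's network.

import torch

def classify_with_rejection(query_emb, prototypes, outlier_prob, reject_at=0.5):
    """Nearest-prototype label, or -1 for 'unknown' when the calibrated
    outlier probability exceeds the rejection level (illustrative only)."""
    dists = torch.cdist(query_emb, prototypes)   # (num_queries, num_classes)
    labels = dists.argmin(dim=1)
    labels[outlier_prob > reject_at] = -1
    return labels

queries = torch.randn(8, 128)                    # embedded query pixels/patches
protos = torch.randn(5, 128)                     # class prototypes from the support set
p_out = torch.rand(8)                            # stand-in for OCN outlier probabilities
print(classify_with_rejection(queries, protos, p_out))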
Citations: 7
Unsupervised Sounding Object Localization with Bottom-Up and Top-Down Attention
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00222
Jiaying Shi, Chao Ma
Learning to localize sounding objects in visual scenes without manual annotations has drawn increasing attention recently. In this paper, we propose an unsupervised sounding object localization algorithm that uses bottom-up and top-down attention in visual scenes. The bottom-up attention module generates an objectness confidence map, while the top-down attention draws the similarity between sound and visual regions. Moreover, we propose a bottom-up attention loss function that models the correlation between bottom-up and top-down attention. Extensive experimental results demonstrate that our proposed unsupervised method significantly outperforms state-of-the-art unsupervised methods. The source code is available at https://github.com/VISION-SJTU/USOL.
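The final localization can be thought of as an element-wise combination of the bottom-up objectness confidence map with the top-down audio-visual similarity map, so a region fires only when both cues agree. A toy combination with random maps standing in for the two modules' outputs is sketched below; the paper's attention modules and loss are not reproduced.

import torch

objectness = torch.rand(1, 1, 14, 14)      # stand-in for the bottom-up objectness map
av_similarity = torch.rand(1, 1, 14, 14)   # stand-in for the top-down audio-visual map

# The sounding-object map is high only where both cues agree.
localization = objectness * av_similarity
localization = localization / localization.max()   # normalised to [0, 1]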
Citations: 8
Learning to Generate the Unknowns as a Remedy to the Open-Set Domain Shift
Pub Date : 2022-01-01 DOI: 10.1109/WACV51458.2022.00379
Mahsa Baktash, Tianle Chen, M. Salzmann
In many situations, the data one has access to at test time follows a different distribution from the training data. Over the years, this problem has been tackled by closed-set domain adaptation techniques. Recently, open-set domain adaptation has emerged to address the more realistic scenario where additional unknown classes are present in the target data. In this setting, existing techniques focus on the challenging task of isolating the unknown target samples, so as to avoid the negative transfer resulting from aligning the source feature distributions with the broader target one that encompasses the additional unknown classes. Here, we propose a simpler and more effective solution consisting of complementing the source data distribution and making it comparable to the target one by enabling the model to generate source samples corresponding to the unknown target classes. We formulate this as a general module that can be incorporated into any existing closed-set approach and show that this strategy allows us to outperform the state of the art on open-set domain adaptation benchmark datasets.
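The core idea, generated samples standing in for the unknown classes so that a closed-set classifier can be trained over K known classes plus one "unknown" class, can be sketched as follows. The feature dimensionality, the random stand-in for generated samples, and the plain cross-entropy objective are assumptions for illustration, not the paper's training pipeline.

import torch
import torch.nn as nn
import torch.nn.functional as F

num_known = 10
classifier = nn.Linear(256, num_known + 1)      # one extra logit for "unknown"

real_feats = torch.randn(32, 256)               # features of labelled source samples
real_labels = torch.randint(0, num_known, (32,))
gen_feats = torch.randn(16, 256)                # stand-in for generated unknown-class samples
gen_labels = torch.full((16,), num_known)       # all assigned to the unknown class

feats = torch.cat([real_feats, gen_feats])
labels = torch.cat([real_labels, gen_labels])
loss = F.cross_entropy(classifier(feats), labels)
loss.backward()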
Citations: 4