
2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV): Latest Publications

Exploiting Visual Context Semantics for Sound Source Localization
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00517
Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang
Self-supervised sound source localization in unconstrained visual scenes is an important task in audio-visual learning. In this paper, we propose a visual reasoning module that explicitly exploits rich visual context semantics, alleviating the insufficient utilization of visual information in previous works. The learning objectives are carefully designed to provide stronger supervision signals for the extracted visual semantics while enhancing the audio-visual interactions, leading to more robust feature representations. Extensive experimental results demonstrate that our approach significantly boosts localization performance on various datasets, even without ImageNet-pretrained initialization. Moreover, by exploiting visual context, our framework can perform both audio-visual and purely visual inference, which expands the application scope of the sound source localization task and further raises the competitiveness of our approach.
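A minimal PyTorch sketch of the audio-visual correspondence idea underlying this line of work: an audio embedding is compared against every spatial position of a visual feature map to produce a localization heatmap. The module names, feature dimensions, and use of cosine similarity are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualLocalizer(nn.Module):
    """Compare an audio embedding against every spatial position of a visual
    feature map; high cosine similarity marks a likely sound source."""
    def __init__(self, vis_dim=512, aud_dim=128, emb_dim=128):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, emb_dim, kernel_size=1)  # project visual features
        self.aud_proj = nn.Linear(aud_dim, emb_dim)                 # project audio features

    def forward(self, vis_feat, aud_feat):
        # vis_feat: (B, vis_dim, H, W) from an image backbone
        # aud_feat: (B, aud_dim) from an audio backbone
        v = F.normalize(self.vis_proj(vis_feat), dim=1)   # (B, D, H, W)
        a = F.normalize(self.aud_proj(aud_feat), dim=1)   # (B, D)
        return torch.einsum("bdhw,bd->bhw", v, a)         # (B, H, W) heatmap

loc = AudioVisualLocalizer()
heatmap = loc(torch.randn(2, 512, 14, 14), torch.randn(2, 128))
print(heatmap.shape)  # torch.Size([2, 14, 14])
```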
Citations: 5
Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00227
Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, I. Kweon
To understand the surrounding world, our brains are continuously inundated with multisensory information and its complex interactions from the outside world at any given moment. While processing this information might seem effortless for the human brain, building a machine that can perform similar tasks is challenging, since complex interactions cannot be handled by a single type of integration and require more sophisticated approaches. In this paper, we propose a simple new method to address multisensory integration in video understanding. Unlike previous works that use a single fusion type, we design a multi-head model with individual event-specific layers to deal with different audio-visual relationships, enabling different ways of audio-visual fusion. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in videos, e.g., semantically matched moments and rhythmic events. Moreover, although our network is trained with single labels, our multi-head design can inherently output additional semantically meaningful multi-labels for a video. As an application, we demonstrate that our proposed method can expose the extent of event characteristics in popular benchmark datasets.
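A rough sketch of the multi-head, event-specific fusion idea: each event category gets its own fusion layer with separate parameters. The fusion structure, dimensions, and number of events below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EventSpecificFusion(nn.Module):
    """One fusion layer per event category, each with its own parameters, so
    different audio-visual relationships can be modeled by different heads."""
    def __init__(self, dim=256, num_events=4):
        super().__init__()
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in range(num_events)
        ])

    def forward(self, audio, visual):
        # audio, visual: (B, dim) clip-level features; output: (B, num_events) logits
        fused = torch.cat([audio, visual], dim=-1)
        return torch.cat([head(fused) for head in self.heads], dim=-1)

model = EventSpecificFusion()
logits = model(torch.randn(8, 256), torch.randn(8, 256))
# sigmoid(logits) gives independent per-event scores, i.e. a multi-label output
```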
Citations: 2
Learning Few-shot Segmentation from Bounding Box Annotations
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00374
Byeolyi Han, Tae-Hyun Oh
We present a new weakly-supervised few-shot semantic segmentation setting and a meta-learning method for tackling the new challenge. Different from existing settings, we leverage bounding box annotations as weak supervision signals during the meta-training phase, which is more label-efficient. A bounding box provides a cheaper label representation than a segmentation mask but contains both the object of interest and distracting background. We first show that meta-training with bounding boxes degrades recent few-shot semantic segmentation methods, which are typically meta-trained with full semantic segmentation supervision. We postulate that this challenge originates from the impure information of the bounding box representation. We propose a pseudo trimap estimator and trimap-attention based prototype learning to extract clearer supervision signals from bounding boxes. These developments also make our method robust and generalizable to noisy support masks at test time. We empirically show that our method consistently improves performance, gaining 1.4% and 3.6% mean-IoU over the competing method under full and weak test supervision, respectively, in the 1-way 5-shot setting on Pascal-5i.
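A toy illustration of how a bounding box could be turned into a pseudo trimap (background / uncertain / likely foreground). The shrink-by-margin rule and all sizes are my assumptions, not the paper's learned estimator.

```python
import torch

def pseudo_trimap_from_box(height, width, box, margin=0.2):
    """Turn a bounding box into a 3-class map: 0 = background (outside the box),
    1 = uncertain band near the box border, 2 = likely foreground (box center)."""
    x1, y1, x2, y2 = box
    trimap = torch.zeros(height, width, dtype=torch.long)      # background everywhere
    trimap[y1:y2, x1:x2] = 1                                    # whole box is uncertain
    my, mx = int((y2 - y1) * margin), int((x2 - x1) * margin)
    trimap[y1 + my:y2 - my, x1 + mx:x2 - mx] = 2                # shrunken box: likely object
    return trimap

print(pseudo_trimap_from_box(8, 8, (1, 1, 7, 7)))
```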
Citations: 2
Semantic Segmentation of Degraded Images Using Layer-Wise Feature Adjustor
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00322
Kazuki Endo, Masayuki Tanaka, M. Okutomi
Semantic segmentation of degraded images is important for practical applications such as autonomous driving and surveillance systems. The degradation level, which represents the strength of degradation, is usually unknown in practice; therefore, a semantic segmentation algorithm needs to account for various levels of degradation. In this paper, we propose a convolutional neural network for semantic segmentation that can cope with various levels of degradation. The proposed network is based on knowledge distillation from a source network trained only on clean images. More concretely, the proposed network is trained to acquire multi-layer features that remain consistent with the source network while adjusting for various levels of degradation. The effectiveness of the proposed method is confirmed for different types of degradation: JPEG distortion, Gaussian blur, and salt-and-pepper noise. Experimental comparisons validate that the proposed network outperforms existing networks for semantic segmentation of degraded images at various degradation levels.
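A minimal sketch of the layer-wise feature-consistency objective described above, assuming a frozen teacher fed clean images and a student fed the corresponding degraded inputs; the loss choice (MSE) and the uniform layer weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(student_feats, teacher_feats, weights=None):
    """Sum of per-layer feature-consistency losses between a student fed a
    degraded image and a frozen teacher fed the corresponding clean image."""
    weights = weights if weights is not None else [1.0] * len(student_feats)
    loss = 0.0
    for w, s, t in zip(weights, student_feats, teacher_feats):
        loss = loss + w * F.mse_loss(s, t.detach())   # teacher features are the targets
    return loss

student = [torch.randn(2, 64, 32, 32, requires_grad=True),
           torch.randn(2, 128, 16, 16, requires_grad=True)]
teacher = [torch.randn(2, 64, 32, 32), torch.randn(2, 128, 16, 16)]
print(layerwise_distillation_loss(student, teacher))
```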
Citations: 0
Bent & Broken Bicycles: Leveraging synthetic data for damaged object re-identification
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00486
Luca Piano, F. G. Pratticò, Alessandro Sebastian Russo, Lorenzo Lanari, L. Morra, F. Lamberti
Instance-level object re-identification is a fundamental computer vision task, with applications ranging from image retrieval to intelligent monitoring and fraud detection. In this work, we propose the novel task of damaged object re-identification, which aims at distinguishing changes in visual appearance due to deformations or missing parts from subtle intra-class variations. To explore this task, we leverage the power of computer-generated imagery to create, in a semi-automatic fashion, high-quality synthetic images of the same bike before and after damage occurs. The resulting dataset, Bent & Broken Bicycles (BB-Bicycles), contains 39,200 images and 2,800 unique bike instances spanning 20 different bike models. As a baseline for this task, we propose TransReI3D, a multi-task, transformer-based deep network unifying damage detection (framed as a multi-label classification task) with object re-identification. The BB-Bicycles dataset is available at https://tinyurl.com/37tepf7m
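An illustrative two-branch head for the multi-task setup described above (a re-identification embedding plus multi-label damage classification); names and dimensions are assumptions, and this is not TransReI3D itself.

```python
import torch
import torch.nn as nn

class MultiTaskReIDHead(nn.Module):
    """Two branches on top of a shared backbone feature: an embedding for
    instance re-identification and multi-label logits for damage types."""
    def __init__(self, feat_dim=768, emb_dim=256, num_damage_types=5):
        super().__init__()
        self.embed = nn.Linear(feat_dim, emb_dim)             # re-ID embedding
        self.damage = nn.Linear(feat_dim, num_damage_types)   # multi-label damage logits

    def forward(self, feats):
        return self.embed(feats), self.damage(feats)

head = MultiTaskReIDHead()
emb, damage_logits = head(torch.randn(4, 768))
# re-ID: rank gallery images by cosine similarity of `emb`;
# damage detection: apply a sigmoid to `damage_logits` per damage type
```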
Citations: 0
Deep Learning Methodology for Early Detection and Outbreak Prediction of Invasive Species Growth
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00627
Nathan Elias
Invasive species (IS) cause major environmental damage, costing approximately $1.4 trillion globally. Early detection and rapid response (EDRR) is key to mitigating IS growth, but current EDRR methods are highly inadequate for addressing it. In this paper, a machine-learning-based approach to combat IS spread is proposed, in which identification, detection, and prediction of IS growth are automated in a novel mobile application and scalable models. This paper details the techniques used to develop deep, multi-dimensional Convolutional Neural Networks (CNNs) that detect the presence of IS in both 2D and 3D spaces, as well as geospatial Long Short-Term Memory (LSTM) models that quantify, simulate, and project invasive species' future environmental spread. Results from training and in-field validation studies show that this new methodology significantly improves on current EDRR methods by drastically decreasing the intensity of manual field labor while providing a toolkit that increases the efficiency and efficacy of ongoing efforts to combat IS. Furthermore, this research presents scalable expansion into dynamic LIDAR and aerial detection of IS growth, with the proposed toolkit already being deployed by state parks and national environmental/wildlife services.
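A toy sketch of the forecasting stage of the CNN-to-LSTM pipeline described above, assuming per-time-step detections have already been summarized into feature vectors; the shapes and the scalar spread target are assumptions, not the paper's geospatial model.

```python
import torch
import torch.nn as nn

class SpreadForecaster(nn.Module):
    """An LSTM over a time series of per-step detection summaries that
    predicts a scalar describing the next step's invasive-species spread."""
    def __init__(self, in_dim=16, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # e.g. fraction of the surveyed area invaded

    def forward(self, x):                 # x: (B, T, in_dim) detection features over time
        h, _ = self.lstm(x)
        return self.out(h[:, -1])         # forecast for the next time step

model = SpreadForecaster()
print(model(torch.randn(2, 12, 16)).shape)  # torch.Size([2, 1])
```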
Citations: 1
Improving Predicate Representation in Scene Graph Generation by Self-Supervised Learning
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00276
So Hasegawa, Masayuki Hiromoto, Akira Nakagawa, Y. Umeda
Scene graph generation (SGG) aims to understand sophisticated visual information by detecting triplets of subject, object, and their relationship (predicate). Since the predicate labels are heavily imbalanced, existing supervised methods struggle to improve accuracy for rare predicates due to insufficient labeled data. In this paper, we propose SePiR, a novel self-supervised learning method for SGG that improves the representation of rare predicates. We first train a relational encoder by contrastive learning without using predicate labels, and then fine-tune a predicate classifier with labeled data. To apply contrastive learning to SGG, we propose a new data augmentation in which subject-object pairs are augmented by replacing their visual features with those from other images having the same object labels. Such augmentation increases the variation of the visual features while keeping the relationship between the objects. Comprehensive experimental results on the Visual Genome dataset show that the SGG performance of SePiR is comparable to the state-of-the-art, and with a limited labeled dataset in particular, our method significantly outperforms existing supervised methods. Moreover, SePiR's improved representation allows a simpler model architecture, yielding 3.6x and 6.3x reductions in parameters and inference time, respectively, compared with the existing method.
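A toy version of the label-preserving augmentation described above; the exact form of the swap is an assumption based on the abstract, not the paper's implementation.

```python
import torch

def swap_same_label_features(feats, labels):
    """Replace each object's visual feature with the feature of a randomly
    chosen object from the batch that carries the same object label."""
    out = feats.clone()
    for i in range(len(labels)):
        candidates = (labels == labels[i]).nonzero(as_tuple=True)[0]
        candidates = candidates[candidates != i]
        if len(candidates) > 0:
            j = candidates[torch.randint(len(candidates), (1,))].item()
            out[i] = feats[j]
    return out

feats = torch.randn(6, 128)                  # per-object visual features
labels = torch.tensor([0, 1, 0, 2, 1, 0])    # object class labels
augmented = swap_same_label_features(feats, labels)
```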
Citations: 0
Efficient Reference-based Video Super-Resolution (ERVSR): Single Reference Image Is All You Need
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00187
Youngrae Kim, Jinsu Lim, Hoonhee Cho, Minji Lee, Dongman Lee, Kuk-Jin Yoon, Ho-Jin Choi
Reference-based video super-resolution (RefVSR) is a promising branch of super-resolution that recovers high-frequency textures of a video using a reference video. The multiple cameras with different focal lengths in mobile devices aid recent RefVSR works, which aim to super-resolve a low-resolution ultra-wide video by utilizing wide-angle videos. Previous RefVSR works used all reference frames of the reference video at each time step for the super-resolution of low-resolution videos. However, computation on higher-resolution images increases runtime and memory consumption, hindering the practical application of RefVSR. To solve this problem, we propose Efficient Reference-based Video Super-Resolution (ERVSR), which exploits a single reference frame to super-resolve all low-resolution video frames. We introduce an attention-based feature alignment module and an aggregation upsampling module that attends to LR features using the correlation between the reference and LR frames. The proposed ERVSR achieves 12× faster speed and 1/4 the memory consumption of previous state-of-the-art RefVSR networks, with competitive performance on the RealMCVSR dataset while using a single reference image.
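A minimal cross-attention sketch of the reference-based alignment idea: tokens of each low-resolution frame attend to tokens of the single shared reference frame. The token layout, dimensions, and residual fusion are assumptions, not ERVSR's actual modules.

```python
import torch
import torch.nn as nn

class ReferenceAttention(nn.Module):
    """Each low-resolution frame's tokens query the single reference frame's
    tokens, pulling high-frequency reference detail into every frame."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lr_tokens, ref_tokens):
        # lr_tokens: (B, N_lr, dim), ref_tokens: (B, N_ref, dim)
        fused, _ = self.attn(query=lr_tokens, key=ref_tokens, value=ref_tokens)
        return lr_tokens + fused   # residual fusion of aligned reference features

attn = ReferenceAttention()
out = attn(torch.randn(1, 256, 64), torch.randn(1, 1024, 64))
print(out.shape)  # torch.Size([1, 256, 64])
```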
Citations: 1
Searching Efficient Neural Architecture with Multi-resolution Fusion Transformer for Appearance-based Gaze Estimation
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00095
Vikrant Nagpure, K. Okuma
Aiming at more accurate appearance-based gaze estimation, a series of recent works propose to use transformers or high-resolution networks in ways that achieve state-of-the-art results, but such works lack the efficiency needed for real-time applications on edge computing devices. In this paper, we propose a compact model that solves gaze estimation precisely and efficiently. The proposed model includes 1) a Neural Architecture Search (NAS)-based multi-resolution feature extractor for extracting feature maps with the global and local information essential for this task, and 2) a novel multi-resolution fusion transformer as the gaze estimation head, which efficiently estimates gaze values by fusing the extracted feature maps. We search our proposed model, called GazeNAS-ETH, on the ETH-XGaze dataset. We confirmed through experiments that GazeNAS-ETH achieves state-of-the-art results on the Gaze360, MPIIFaceGaze, RTGENE, and EYEDIAP datasets, while having only about 1M parameters and using only 0.28 GFLOPs, significantly less than previous state-of-the-art models, making it easier to deploy for real-time applications.
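A rough sketch of fusing feature maps from several resolutions with a small transformer and regressing a 2D gaze direction; the channel sizes, pooling-to-one-token step, and head design are assumptions, not the searched GazeNAS-ETH architecture.

```python
import torch
import torch.nn as nn

class MultiResolutionGazeHead(nn.Module):
    """Project each resolution's feature map to a token, fuse the tokens with
    a small transformer encoder, and regress a 2D gaze direction (pitch, yaw)."""
    def __init__(self, channels=(64, 128, 256), dim=128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in channels])
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 2)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps; each is pooled to one token
        tokens = torch.stack(
            [p(f).mean(dim=(2, 3)) for p, f in zip(self.proj, feats)], dim=1)
        fused = self.encoder(tokens)          # (B, num_resolutions, dim)
        return self.head(fused.mean(dim=1))   # (B, 2) gaze angles

head = MultiResolutionGazeHead()
feats = [torch.randn(2, 64, 28, 28), torch.randn(2, 128, 14, 14), torch.randn(2, 256, 7, 7)]
print(head(feats).shape)  # torch.Size([2, 2])
```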
Citations: 4
Intention-Conditioned Long-Term Human Egocentric Action Anticipation
Pub Date : 2023-01-01 DOI: 10.1109/WACV56688.2023.00599
Esteve Valls Mascaro, Hyemin Ahn, Dongheui Lee
To anticipate how a person will act in the future, it is essential to understand the human intention, since it guides the subject towards a certain action. In this paper, we propose a hierarchical architecture which assumes that a sequence of human actions (low-level) can be driven by the human intention (high-level). Based on this, we address the long-term action anticipation task in egocentric videos. Our framework first extracts this low- and high-level human information from the observed human actions in a video through a Hierarchical Multi-task Multi-Layer Perceptrons Mixer (H3M). Then, we constrain the uncertainty of the future through an Intention-Conditioned Variational Auto-Encoder (I-CVAE) that generates multiple stable predictions of the next actions the observed human might perform. By leveraging human intention as high-level information, we claim that our model is able to anticipate more time-consistent actions in the long term, improving the results over the baseline on the Ego4D dataset. This work achieves state-of-the-art results on the Long-Term Anticipation (LTA) task in Ego4D by providing more plausible anticipated sequences and improving the anticipation scores for nouns and actions. Our work ranked first in both the CVPR@2022 and ECCV@2022 Ego4D LTA Challenges.
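A bare-bones conditional-VAE sketch of the idea of sampling multiple future action sequences conditioned on an intention embedding; the dimensions, layer structure, and one-hot encoding of future actions are assumptions, not I-CVAE itself.

```python
import torch
import torch.nn as nn

class IntentionConditionedCVAE(nn.Module):
    """A latent code conditioned on an intention embedding is decoded into
    logits for a short sequence of future actions; sampling different latents
    yields different plausible futures."""
    def __init__(self, intent_dim=64, latent_dim=16, num_actions=20, horizon=5):
        super().__init__()
        self.horizon, self.num_actions = horizon, num_actions
        self.enc = nn.Linear(intent_dim + horizon * num_actions, 2 * latent_dim)
        self.dec = nn.Linear(intent_dim + latent_dim, horizon * num_actions)

    def forward(self, intent, future_onehot):
        # intent: (B, intent_dim); future_onehot: (B, horizon * num_actions)
        stats = self.enc(torch.cat([intent, future_onehot], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        logits = self.dec(torch.cat([intent, z], dim=-1))
        return logits.view(-1, self.horizon, self.num_actions), mu, logvar

    def sample(self, intent):
        # draw a random latent to generate one possible future action sequence
        z = torch.randn(intent.size(0), self.dec.in_features - intent.size(1))
        logits = self.dec(torch.cat([intent, z], dim=-1))
        return logits.view(-1, self.horizon, self.num_actions).argmax(-1)

model = IntentionConditionedCVAE()
actions = model.sample(torch.randn(2, 64))   # (2, 5) predicted action ids
```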
Citations: 9