
Latest Publications: 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Contextual Proposal Network for Action Localization
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00084
He-Yen Hsieh, Ding-Jie Chen, Tyng-Luh Liu
This paper investigates the problem of Temporal Action Proposal (TAP) generation, which aims to provide a set of high-quality video segments that potentially contain action events located in long, untrimmed videos. With the goal of distilling the available contextual information, we introduce a Contextual Proposal Network (CPN) composed of two context-aware mechanisms. The first mechanism, feature enhancing, integrates an inception-like module with long-range attention to capture multi-scale temporal contexts and yield a robust video segment representation. The second mechanism, boundary scoring, employs bi-directional recurrent neural networks (RNNs) to capture bi-directional temporal contexts that explicitly model the actionness, background, and confidence of proposals. While generating and scoring proposals, these bi-directional temporal contexts help retrieve high-quality proposals with few false positives that cover the video's action instances. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS-14, to demonstrate the effectiveness of the proposed CPN. In particular, our method surpasses state-of-the-art TAP methods by 1.54% AUC on the ActivityNet-1.3 test split and by 0.61% AR@200 on the THUMOS-14 dataset.
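As a rough illustration of the boundary-scoring mechanism described above, the sketch below (in PyTorch, with illustrative layer sizes rather than the authors' actual configuration) runs a bi-directional GRU over snippet features and emits per-timestep actionness, background, and confidence scores.

```python
import torch
import torch.nn as nn

class BoundaryScorer(nn.Module):
    """Bi-directional GRU that scores every temporal position of a video
    feature sequence with actionness, background, and confidence values.
    Hidden sizes and the single-layer design are illustrative choices."""
    def __init__(self, feat_dim=400, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 3)   # actionness, background, confidence

    def forward(self, x):                      # x: (batch, time, feat_dim)
        ctx, _ = self.rnn(x)                   # bi-directional temporal context
        return torch.sigmoid(self.head(ctx))   # (batch, time, 3) scores in [0, 1]

scores = BoundaryScorer()(torch.randn(2, 100, 400))
print(scores.shape)   # torch.Size([2, 100, 3])
```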
Citations: 6
Low-cost Multispectral Scene Analysis with Modality Distillation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00339
Heng Zhang, É. Fromont, S. Lefèvre, Bruno Avignon
Despite its robust performance under various illumination conditions, multispectral scene analysis has not been widely deployed due to two strong practical limitations: 1) thermal cameras, especially high-resolution ones, are much more expensive than conventional visible cameras; 2) the most commonly adopted multispectral architectures, two-stream neural networks, nearly double the inference time of a regular mono-spectral model, which makes them impractical in embedded environments. In this work, we aim to tackle these two limitations by proposing a novel knowledge distillation framework named Modality Distillation (MD). The proposed framework distils the knowledge from a high-thermal-resolution two-stream network with feature-level fusion into a low-thermal-resolution one-stream network with image-level fusion. We show on different multispectral scene analysis benchmarks that our method effectively allows the use of low-resolution thermal sensors with more compact one-stream networks.
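The distillation step can be pictured with a generic feature-distillation objective like the sketch below; the loss weighting, tensor shapes, and the classification task loss are assumptions for illustration, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def modality_distillation_loss(student_feat, teacher_feat, student_logits,
                               labels, alpha=0.5):
    """Generic feature distillation: task loss on the student's predictions
    plus an L2 term pulling student features toward the (detached) teacher
    features. The weighting `alpha` is an illustrative assumption."""
    task = F.cross_entropy(student_logits, labels)
    distill = F.mse_loss(student_feat, teacher_feat.detach())
    return task + alpha * distill

# toy usage with random tensors standing in for network outputs
loss = modality_distillation_loss(torch.randn(4, 256), torch.randn(4, 256),
                                  torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```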
Citations: 1
Nonnegative Low-Rank Tensor Completion via Dual Formulation with Applications to Image and Video Completion
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00412
T. Sinha, Jayadev Naram, Pawan Kumar
Recent approaches to the tensor completion problem have often overlooked the nonnegative structure of the data. We consider the problem of learning a nonnegative low-rank tensor, and using duality theory, we propose a novel factorization of such tensors. The factorization decouples the nonnegative constraints from the low-rank constraints. The resulting problem is an optimization problem on manifolds, and we propose a variant of Riemannian conjugate gradients to solve it. We test the proposed algorithm across various tasks such as colour image inpainting, video completion, and hyperspectral image completion. Experimental results show that the proposed method outperforms many state-of-the-art tensor completion algorithms.
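For reference, the underlying primal problem can be written in the standard masked-recovery form below; the paper instead handles the nonnegativity and low-rank constraints through a dual factorization solved with a Riemannian conjugate-gradient variant.

```latex
\min_{\mathcal{T}} \; \tfrac{1}{2}\,\big\lVert P_{\Omega}(\mathcal{T}) - P_{\Omega}(\mathcal{X}) \big\rVert_F^2
\quad \text{s.t.} \quad \operatorname{rank}(\mathcal{T}) \le r, \quad \mathcal{T} \ge 0
```

Here \(\mathcal{X}\) is the partially observed tensor, \(P_{\Omega}\) keeps only the observed entries, and \(r\) bounds the rank.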
Citations: 7
Fusion Point Pruning for Optimized 2D Object Detection with Radar-Camera Fusion
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00134
Lukas Stäcker, Philipp Heidenreich, J. Rambach, D. Stricker
Object detection is one of the most important perception tasks for advanced driver assistance systems and autonomous driving. Due to its complementary features and moderate cost, radar-camera fusion is of particular interest in the automotive industry, but it comes with the challenge of how to optimally fuse the heterogeneous data sources. To solve this for 2D object detection, we propose two new techniques to project the radar detections onto the image plane, exploiting additional uncertainty information. We also introduce a new technique called fusion point pruning, which automatically finds the best fusion points of radar and image features in the neural network architecture. Combined, these new approaches surpass the state of the art in 2D object detection performance for radar-camera fusion models, evaluated on the nuScenes dataset. We further find that radar-camera fusion is especially beneficial for night scenes.
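The first ingredient, projecting radar detections onto the image plane, amounts to a rigid transform plus a pinhole projection; the sketch below illustrates that step with made-up intrinsics, whereas the paper additionally spreads each projected detection according to its measurement uncertainty.

```python
import numpy as np

def project_radar_to_image(points_radar, T_cam_radar, K):
    """Project 3D radar detections (N, 3) into pixel coordinates using a
    rigid radar-to-camera transform (4x4) and camera intrinsics K (3x3).
    A generic pinhole projection, not the paper's uncertainty-aware variant."""
    n = points_radar.shape[0]
    homog = np.hstack([points_radar, np.ones((n, 1))])   # (N, 4) homogeneous points
    pts_cam = (T_cam_radar @ homog.T).T[:, :3]           # radar frame -> camera frame
    pix = (K @ pts_cam.T).T                              # pinhole projection
    return pix[:, :2] / pix[:, 2:3]                      # divide by depth

# illustrative intrinsics and an identity extrinsic transform
uv = project_radar_to_image(np.array([[2.0, 0.5, 30.0]]), np.eye(4),
                            np.array([[1266.0, 0, 816.0],
                                      [0, 1266.0, 491.0],
                                      [0, 0, 1.0]]))
print(uv)
```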
Citations: 7
Densely-packed Object Detection via Hard Negative-Aware Anchor Attention
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00147
Sungmin Cho, Jinwook Paeng, Junseok Kwon
In this paper, we propose a novel densely-packed object detection method based on an advanced weighted Hausdorff distance (AWHD) and hard negative-aware anchor (HNAA) attention. Densely-packed object detection is more challenging than conventional object detection due to the high object density and small object sizes. To overcome these challenges, the proposed AWHD improves the conventional weighted Hausdorff distance and obtains an accurate center area map. Using this precise center area map, the proposed HNAA attention determines the relative importance of each anchor and imposes a penalty on hard negative anchors. Experimental results demonstrate that our method based on the AWHD and HNAA attention produces accurate densely-packed object detection results and compares favorably with other state-of-the-art detection methods. The code is available online.
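For intuition, the plain averaged Hausdorff distance that the AWHD builds on can be computed as in the sketch below; the paper's advanced weighted variant and the HNAA attention are not reproduced here.

```python
import torch

def averaged_hausdorff(a, b):
    """Symmetric averaged Hausdorff distance between two 2-D point sets
    a: (N, 2) and b: (M, 2). This is the basic distance underlying the
    paper's 'advanced weighted' variant, not the variant itself."""
    d = torch.cdist(a, b)                                  # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

pred_centers = torch.tensor([[10.0, 12.0], [40.0, 41.0]])
gt_centers = torch.tensor([[11.0, 11.0], [39.0, 42.0], [70.0, 70.0]])
print(averaged_hausdorff(pred_centers, gt_centers).item())
```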
Citations: 1
MAPS: Multimodal Attention for Product Similarity
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00304
Nilotpal Das, Aniket Joshi, Promod Yenigalla, Gourav Agrwal
Learning to identify similar products in the e-commerce domain has widespread applications, such as ensuring consistent grouping of products in the catalog and avoiding duplicates in search results. Here, we address the problem of learning product similarity for highly challenging real-world data from the Amazon catalog. We define it as a metric learning problem, where similar products are projected close to each other and dissimilar ones are projected further apart. To this end, we propose a scalable end-to-end multimodal framework for product representation learning in a weakly supervised setting using raw data from the catalog. This includes product images as well as textual attributes such as the product title and category information. The model uses the image as the primary source of information, while the title helps the model focus on relevant regions in the image by ignoring the background clutter. To validate our approach, we created multimodal datasets covering three broad product categories, on which we achieve up to a 10% improvement in precision compared to the state-of-the-art multimodal benchmark. We also incorporate several effective heuristics for training data generation, which further complement the overall training. Additionally, we demonstrate that incorporating the product title makes the model scale effectively across multiple product categories.
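A minimal sketch of the metric-learning formulation, assuming a standard triplet loss over L2-normalized multimodal embeddings (the margin and embedding size are illustrative, not the paper's settings), is shown below.

```python
import torch
import torch.nn.functional as F

def triplet_similarity_loss(anchor, positive, negative, margin=0.2):
    """Plain triplet loss on L2-normalized product embeddings: similar
    products are pulled together, dissimilar ones pushed apart by a margin.
    Embeddings would come from a multimodal (image + title) encoder."""
    a, p, n = (F.normalize(x, dim=-1) for x in (anchor, positive, negative))
    d_ap = (a - p).pow(2).sum(dim=-1)     # squared distance to the similar product
    d_an = (a - n).pow(2).sum(dim=-1)     # squared distance to the dissimilar product
    return F.relu(d_ap - d_an + margin).mean()

loss = triplet_similarity_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```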
Citations: 3
CrossLocate: Cross-modal Large-scale Visual Geo-Localization in Natural Environments using Rendered Modalities
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00225
Jan Tomešek, Martin Čadík, J. Brejcha
We propose a novel approach to visual geo-localization in natural environments. This is a challenging problem due to the vast localization areas, the variable appearance of outdoor environments, and the scarcity of available data. To make research on new approaches possible, we first create two databases containing "synthetic" images of various modalities. These image modalities are rendered from a 3D terrain model and include semantic segmentations, silhouette maps, and depth maps. By combining the rendered database views with existing datasets of photographs (used as "queries" to be localized), we create a unique benchmark for visual geo-localization in natural environments, which contains correspondences between query photographs and rendered database imagery. The distinct ability to match photographs to synthetically rendered databases defines our task as "cross-modal". On top of this benchmark, we provide thorough ablation studies analysing the localization potential of the database image modalities. We find that depth information is the best choice for outdoor localization. Finally, based on our observations, we develop a fully automatic method for large-scale cross-modal localization using image retrieval. We demonstrate its localization performance outdoors across the whole of Switzerland. Our method reveals a large gap between operating within a single image domain (e.g. photographs) and working across domains (e.g. photographs matched to rendered images), as knowledge gained in one setting is not transferable to the other. Moreover, we show that modern localization methods fail when applied to such a cross-modal task and that our method achieves significantly better results than state-of-the-art approaches. The datasets, code and trained models are available on the project website: http://cphoto.fit.vutbr.cz/crosslocate/.
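The retrieval step at the core of such cross-modal localization can be sketched as a nearest-neighbour search between a query-photograph descriptor and descriptors of rendered database views; the descriptor dimensions and the cosine-similarity ranking below are illustrative assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(query_desc, database_desc, k=5):
    """Rank rendered database views (e.g. depth or silhouette renders) by
    cosine similarity to a query-photograph descriptor. Descriptors are
    assumed to come from an embedding network trained so that both
    modalities share a common space."""
    q = F.normalize(query_desc, dim=-1)        # (D,)
    db = F.normalize(database_desc, dim=-1)    # (N, D)
    scores = db @ q                            # cosine similarities, shape (N,)
    return torch.topk(scores, k)

values, indices = retrieve_top_k(torch.randn(256), torch.randn(1000, 256))
print(indices)   # indices of the k best-matching database renders
```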
Citations: 5
Action anticipation using latent goal learning
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00088
Debaditya Roy, Basura Fernando
To get something done, humans perform a sequence of actions dictated by a goal. So, predicting the next action in the sequence becomes easier once we know the goal that guides the entire activity. We present an action anticipation model that uses goal information in an effective manner. Specifically, we use a latent goal representation as a proxy for the "real goal" of the sequence and use this goal information when predicting the next action. We design a model to compute the latent goal representation from the observed video and use it to predict the next action. We also exploit two properties of goals to propose new losses for training the model. First, the effect of the next action should be closer to the latent goal than that of the observed action, a property we term "goal closeness". Second, the latent goal should remain consistent before and after the execution of the next action, which we call "goal consistency". Using this technique, we obtain state-of-the-art action anticipation performance on the scripted datasets 50Salads and Breakfast, which have predefined goals in all their videos. We also evaluate the latent goal-based model on EPIC-KITCHENS55, an unscripted dataset with multiple goals being pursued simultaneously. Even though this is not an ideal setup for using latent goals, our model predicts the next noun better than existing approaches on both seen and unseen kitchens in the test set.
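The two goal-based losses can be sketched directly from their descriptions above: a margin term for goal closeness and a consistency term between the latent goal before and after the next action. Tensor names, the margin, and the distance choices are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def goal_losses(goal_before, goal_after, next_effect, observed_effect, margin=0.1):
    """'Goal closeness': the predicted next action's effect should lie closer
    to the latent goal than the observed action's effect (margin ranking).
    'Goal consistency': the latent goal should stay stable before and after
    the next action."""
    d_next = F.pairwise_distance(next_effect, goal_before)
    d_obs = F.pairwise_distance(observed_effect, goal_before)
    closeness = F.relu(d_next - d_obs + margin).mean()
    consistency = F.mse_loss(goal_after, goal_before)
    return closeness, consistency

c, k = goal_losses(torch.randn(4, 64), torch.randn(4, 64),
                   torch.randn(4, 64), torch.randn(4, 64))
print(c.item(), k.item())
```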
Citations: 11
Mutual Learning of Joint and Separate Domain Alignments for Multi-Source Domain Adaptation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00172
Yuanyuan Xu, Meina Kan, S. Shan, Xilin Chen
Multi-Source Domain Adaptation (MSDA) aims at transferring knowledge from multiple labeled source domains to benefit the task in an unlabeled target domain. The challenges of MSDA lie in mitigating domain gaps and combining information from diverse source domains. In most existing methods, the multiple source domains are aligned to the target domain either jointly or separately. In this work, we consider these two types of methods, i.e., joint and separate domain alignment, to be complementary and propose a mutual learning based alignment network (MLAN) to combine their advantages. Specifically, our method is composed of three components: a joint alignment branch, a separate alignment branch, and a mutual learning objective between them. In the joint alignment branch, the samples from all source domains and the target domain are aligned together with a single domain alignment goal, while in the separate alignment branch, each source domain is individually aligned to the target domain. Finally, by taking advantage of the complementarity of the joint and separate domain alignment mechanisms, mutual learning is used to make the two branches learn collaboratively. Compared with existing methods, our MLAN integrates information from different domain alignment mechanisms and can thus mine rich knowledge from multiple domains for better performance. Experiments on the Domain-Net, Office-31, and Digits-five datasets demonstrate the effectiveness of our method.
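The mutual learning objective between the two branches could, for instance, be a symmetric KL divergence on target-sample predictions, in the spirit of deep mutual learning; the sketch below is a generic version of that idea, with an assumed temperature and class count, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_joint, logits_separate, temperature=1.0):
    """Symmetric KL divergence that lets the joint-alignment and
    separate-alignment branches teach each other on target samples."""
    p = F.log_softmax(logits_joint / temperature, dim=-1)
    q = F.log_softmax(logits_separate / temperature, dim=-1)
    kl_pq = F.kl_div(p, q.exp(), reduction="batchmean")   # KL(separate || joint)
    kl_qp = F.kl_div(q, p.exp(), reduction="batchmean")   # KL(joint || separate)
    return 0.5 * (kl_pq + kl_qp)

print(mutual_learning_loss(torch.randn(4, 31), torch.randn(4, 31)).item())
```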
Citations: 6
Multi-Dimensional Dynamic Model Compression for Efficient Image Super-Resolution
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00355
Zejiang Hou, S. Kung
Modern single-image super-resolution (SR) systems based on convolutional neural networks have achieved substantial progress. However, most SR deep networks are computationally expensive and require excessively large activation memory footprints, impeding their effective deployment on resource-limited devices. Based on the observation that the activation patterns in SR networks exhibit high input-dependency, we propose a Multi-Dimensional Dynamic Model Compression method that can reduce both spatial and channel-wise redundancy in an SR deep network for different input images. To reduce the spatial redundancy, we propose to perform convolution on scaled-down feature maps, where the down-scaling factor is made adaptive to the input image. To reduce the channel-wise redundancy, we introduce a low-cost channel saliency predictor for each convolution that dynamically skips the computation of unimportant channels based on the Gumbel-Softmax. To better capture the feature-map information and facilitate input-adaptive decisions, we employ classic image processing metrics, e.g., Spatial Information, to guide the saliency predictors. The proposed method can be readily applied to a variety of SR deep networks and trained end-to-end with the standard super-resolution loss in combination with a sparsity criterion. Experiments on several benchmarks demonstrate that our method can effectively reduce the FLOPs of both lightweight and non-compact SR models with negligible PSNR loss. Moreover, our compressed models achieve a competitive PSNR-FLOPs Pareto frontier compared with state-of-the-art NAS-based SR methods.
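A minimal sketch of Gumbel-Softmax-based channel gating in this spirit is shown below; the predictor architecture is an assumption, and the paper's spatial down-scaling and guidance by image-processing metrics are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSaliencyGate(nn.Module):
    """Per-channel keep/skip decision driven by Gumbel-Softmax. A simple
    global-pooling + linear predictor stands in for the paper's low-cost
    channel saliency predictor."""
    def __init__(self, channels):
        super().__init__()
        self.predictor = nn.Linear(channels, 2 * channels)   # keep/skip logits per channel

    def forward(self, x):                       # x: (batch, channels, height, width)
        ctx = x.mean(dim=(2, 3))                # global average pooling
        logits = self.predictor(ctx).view(x.size(0), x.size(1), 2)
        gate = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]   # 1 = keep channel
        return x * gate.unsqueeze(-1).unsqueeze(-1)

y = ChannelSaliencyGate(16)(torch.randn(2, 16, 8, 8))
print(y.shape)   # torch.Size([2, 16, 8, 8])
```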
Citations: 3