
Latest publications: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Attention Scaling for Crowd Counting
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00476
Xiaoheng Jiang, Li Zhang, Mingliang Xu, Tianzhu Zhang, Pei Lv, Bing Zhou, Xin Yang, Yanwei Pang
Convolutional Neural Network (CNN) based methods generally take crowd counting as a regression task by outputting crowd densities. They learn the mapping between image contents and crowd density distributions. Though having achieved promising results, these data-driven counting networks are prone to overestimate or underestimate people counts of regions with different density patterns, which degrades the whole count accuracy. To overcome this problem, we propose an approach to alleviate the counting performance differences in different regions. Specifically, our approach consists of two networks named Density Attention Network (DANet) and Attention Scaling Network (ASNet). DANet provides ASNet with attention masks related to regions of different density levels. ASNet first generates density maps and scaling factors and then multiplies them by attention masks to output separate attention-based density maps. These density maps are summed to give the final density map. The attention scaling factors help attenuate the estimation errors in different regions. Furthermore, we present a novel Adaptive Pyramid Loss (APLoss) to hierarchically calculate the estimation losses of sub-regions, which alleviates the training bias. Extensive experiments on four challenging datasets (ShanghaiTech Part A, UCF_CC_50, UCF-QNRF, and WorldExpo'10) demonstrate the superiority of the proposed approach.
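The fusion step described in the abstract — per-level density maps and scaling factors multiplied by attention masks and then summed into a final density map — can be illustrated with a short sketch. The module below is a reconstruction from the abstract alone; the channel width, number of density levels, and use of per-level scalar scales are assumptions, not the authors' ASNet implementation.

```python
import torch
import torch.nn as nn

class AttentionScalingHead(nn.Module):
    """Illustrative sketch of the fusion step: per-level density maps are
    scaled and gated by attention masks (from a DANet-like branch) and
    summed into one density map. Sizes and level count are assumptions."""

    def __init__(self, in_channels=64, num_levels=3):
        super().__init__()
        self.num_levels = num_levels
        # one density-map predictor per density level
        self.density_heads = nn.ModuleList(
            [nn.Conv2d(in_channels, 1, kernel_size=1) for _ in range(num_levels)]
        )
        # one learnable scaling factor per level (a simplification for illustration)
        self.scales = nn.Parameter(torch.ones(num_levels))

    def forward(self, features, attention_masks):
        # features: (B, C, H, W) backbone features
        # attention_masks: (B, num_levels, H, W), soft masks in [0, 1]
        final_density = 0.0
        for k in range(self.num_levels):
            density_k = torch.relu(self.density_heads[k](features))  # (B, 1, H, W)
            mask_k = attention_masks[:, k:k + 1]                      # (B, 1, H, W)
            final_density = final_density + self.scales[k] * density_k * mask_k
        return final_density  # (B, 1, H, W), summed over density levels


if __name__ == "__main__":
    head = AttentionScalingHead(in_channels=64, num_levels=3)
    feats = torch.randn(2, 64, 48, 64)
    masks = torch.rand(2, 3, 48, 64)
    print(head(feats, masks).shape)  # torch.Size([2, 1, 48, 64])
```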
{"title":"Attention Scaling for Crowd Counting","authors":"Xiaoheng Jiang, Li Zhang, Mingliang Xu, Tianzhu Zhang, Pei Lv, Bing Zhou, Xin Yang, Yanwei Pang","doi":"10.1109/cvpr42600.2020.00476","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00476","url":null,"abstract":"Convolutional Neural Network (CNN) based methods generally take crowd counting as a regression task by outputting crowd densities. They learn the mapping between image contents and crowd density distributions. Though having achieved promising results, these data-driven counting networks are prone to overestimate or underestimate people counts of regions with different density patterns, which degrades the whole count accuracy. To overcome this problem, we propose an approach to alleviate the counting performance differences in different regions. Specifically, our approach consists of two networks named Density Attention Network (DANet) and Attention Scaling Network (ASNet). DANet provides ASNet with attention masks related to regions of different density levels. ASNet first generates density maps and scaling factors and then multiplies them by attention masks to output separate attention-based density maps. These density maps are summed to give the final density map. The attention scaling factors help attenuate the estimation errors in different regions. Furthermore, we present a novel Adaptive Pyramid Loss (APLoss) to hierarchically calculate the estimation losses of sub-regions, which alleviates the training bias. Extensive experiments on four challenging datasets (ShanghaiTech Part A, UCF_CC_50, UCF-QNRF, and WorldExpo'10) demonstrate the superiority of the proposed approach.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"165 1","pages":"4705-4714"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76752207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 183
Improving Action Segmentation via Graph-Based Temporal Reasoning
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01404
Yifei Huang, Yusuke Sugano, Yoichi Sato
Temporal relations among multiple action segments play an important role in action segmentation especially when observations are limited (e.g., actions are occluded by other objects or happen outside a field of view). In this paper, we propose a network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans. We model the relations by using two Graph Convolution Networks (GCNs) where each node represents an action segment. The two graphs have different edge properties to account for boundary regression and classification tasks, respectively. By applying graph convolution, we can update each node's representation based on its relation with neighboring nodes. The updated representation is then used for improved action segmentation. We evaluate our model on the challenging egocentric datasets namely EGTEA and EPIC-Kitchens, where actions may be partially observed due to the viewpoint restriction. The results show that our proposed GTRM outperforms state-of-the-art action segmentation models by a large margin. We also demonstrate the effectiveness of our model on two third-person video datasets, the 50Salads dataset and the Breakfast dataset.
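To make the graph-based reasoning concrete: each node holds the feature of one action segment, edges connect temporally related segments, and a graph convolution refreshes each node from its neighbours. The sketch below is a generic single GCN layer over segment nodes under assumed dimensions and a simple temporal-window adjacency; it mirrors the idea rather than the GTRM architecture itself.

```python
import torch
import torch.nn as nn

def temporal_adjacency(num_segments, window=1):
    """Adjacency connecting each action segment to its temporal neighbours
    (plus self-loops); the window size is an assumption for illustration."""
    A = torch.eye(num_segments)
    for i in range(num_segments):
        for j in range(max(0, i - window), min(num_segments, i + window + 1)):
            A[i, j] = 1.0
    # symmetric normalisation: D^{-1/2} A D^{-1/2}
    d = A.sum(dim=1)
    D_inv_sqrt = torch.diag(d.pow(-0.5))
    return D_inv_sqrt @ A @ D_inv_sqrt

class SegmentGCNLayer(nn.Module):
    """One graph-convolution update over segment nodes: each node's feature
    is refreshed from its neighbours (feature dimension assumed)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, A_norm):
        # node_feats: (num_segments, in_dim)
        return torch.relu(self.linear(A_norm @ node_feats))

if __name__ == "__main__":
    segments = torch.randn(6, 128)      # 6 action segments, 128-d features
    A = temporal_adjacency(6, window=1)
    layer = SegmentGCNLayer(128, 128)
    print(layer(segments, A).shape)     # torch.Size([6, 128])
```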
{"title":"Improving Action Segmentation via Graph-Based Temporal Reasoning","authors":"Yifei Huang, Yusuke Sugano, Yoichi Sato","doi":"10.1109/cvpr42600.2020.01404","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01404","url":null,"abstract":"Temporal relations among multiple action segments play an important role in action segmentation especially when observations are limited (e.g., actions are occluded by other objects or happen outside a field of view). In this paper, we propose a network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans. We model the relations by using two Graph Convolution Networks (GCNs) where each node represents an action segment. The two graphs have different edge properties to account for boundary regression and classification tasks, respectively. By applying graph convolution, we can update each node's representation based on its relation with neighboring nodes. The updated representation is then used for improved action segmentation. We evaluate our model on the challenging egocentric datasets namely EGTEA and EPIC-Kitchens, where actions may be partially observed due to the viewpoint restriction. The results show that our proposed GTRM outperforms state-of-the-art action segmentation models by a large margin. We also demonstrate the effectiveness of our model on two third-person video datasets, the 50Salads dataset and the Breakfast dataset.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"61 1","pages":"14021-14031"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78121801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 92
Multi-Modality Cross Attention Network for Image and Sentence Matching
Pub Date : 2020-06-01 DOI: 10.1109/CVPR42600.2020.01095
Xiaoyan Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu
The key of image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods make use of only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. Different from them, in this work, we propose a novel MultiModality Cross Attention (MMCA) Network for image and sentence matching by jointly modeling the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
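The inter-modality part of the abstract — image regions attending over sentence words — follows the standard cross-attention pattern. A minimal single-head sketch is given below; the dimensions, residual fusion, and single-head form are assumptions and it is not the authors' MMCA block.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic cross-attention between image regions and sentence words;
    a sketch of inter-modality modelling, not the MMCA architecture."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, regions, words):
        # regions: (B, R, dim) region features; words: (B, W, dim) word features
        attn = torch.softmax(
            self.q(regions) @ self.k(words).transpose(1, 2) * self.scale, dim=-1
        )
        attended = attn @ self.v(words)   # each region attends over all words
        return regions + attended         # residual fusion of the two modalities

if __name__ == "__main__":
    mod = CrossModalAttention(dim=256)
    out = mod(torch.randn(2, 36, 256), torch.randn(2, 12, 256))
    print(out.shape)                      # torch.Size([2, 36, 256])
```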
{"title":"Multi-Modality Cross Attention Network for Image and Sentence Matching","authors":"Xiaoyan Wei, Tianzhu Zhang, Yan Li, Yongdong Zhang, Feng Wu","doi":"10.1109/CVPR42600.2020.01095","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01095","url":null,"abstract":"The key of image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods make use of only the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task. Different from them, in this work, we propose a novel MultiModality Cross Attention (MMCA) Network for image and sentence matching by jointly modeling the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism, which is able to exploit not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words to complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks including Flickr30K and MS-COCO demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"33 1","pages":"10938-10947"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78215716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 173
Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00511
Ruixu Liu, Ju Shen, He Wang, Chen Chen, S. Cheung, V. Asari
We propose a novel attention-based framework for 3D human pose estimation from a monocular video. Despite the general success of end-to-end deep learning paradigms, our approach is based on two key observations: (1) temporal incoherence and jitter are often yielded from a single frame prediction; (2) error rate can be remarkably reduced by increasing the receptive field in a video. Therefore, we design an attentional mechanism to adaptively identify significant frames and tensor outputs from each deep neural net layer, leading to a more optimal estimation. To achieve large temporal receptive fields, multi-scale dilated convolutions are employed to model long-range dependencies among frames. The architecture is straightforward to implement and can be flexibly adopted for real-time applications. Any off-the-shelf 2D pose estimation system, e.g. Mocap libraries, can be easily integrated in an ad-hoc fashion. We both quantitatively and qualitatively evaluate our method on various standard benchmark datasets (e.g. Human3.6M, HumanEva). Our method considerably outperforms all the state-of-the-art algorithms up to 8% error reduction (average mean per joint position error: 34.7) as compared to the best-reported results. Code is available at: (https://github.com/lrxjason/Attention3DHumanPose)
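The two ingredients the abstract names — multi-scale dilated convolutions for a large temporal receptive field and an attention over frames — can be sketched as below. Kernel sizes, dilations, channel widths, and the joint layout are assumptions for illustration, not the released model (see the linked repository for the actual code).

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Sketch: multi-scale dilated 1-D convolutions over a window of per-frame
    2D poses, plus a softmax attention that weights the frames before pooling.
    All hyper-parameters here are assumptions."""
    def __init__(self, in_dim=34, hidden=128, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, hidden, kernel_size=3, padding=d, dilation=d) for d in dilations]
        )
        self.attn = nn.Conv1d(hidden * len(dilations), 1, kernel_size=1)
        self.out = nn.Linear(hidden * len(dilations), 51)   # e.g. 17 joints x 3 coordinates

    def forward(self, pose_seq):
        # pose_seq: (B, T, in_dim) flattened 2D joints per frame
        x = pose_seq.transpose(1, 2)                          # (B, in_dim, T)
        feats = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        weights = torch.softmax(self.attn(feats), dim=-1)     # (B, 1, T) frame importance
        pooled = (feats * weights).sum(dim=-1)                # attention-weighted temporal pooling
        return self.out(pooled)                               # 3D pose for the centre frame

if __name__ == "__main__":
    block = TemporalAttentionBlock()
    print(block(torch.randn(2, 27, 34)).shape)                # torch.Size([2, 51])
```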
{"title":"Attention Mechanism Exploits Temporal Contexts: Real-Time 3D Human Pose Reconstruction","authors":"Ruixu Liu, Ju Shen, He Wang, Chen Chen, S. Cheung, V. Asari","doi":"10.1109/cvpr42600.2020.00511","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00511","url":null,"abstract":"We propose a novel attention-based framework for 3D human pose estimation from a monocular video. Despite the general success of end-to-end deep learning paradigms, our approach is based on two key observations: (1) temporal incoherence and jitter are often yielded from a single frame prediction; (2) error rate can be remarkably reduced by increasing the receptive field in a video. Therefore, we design an attentional mechanism to adaptively identify significant frames and tensor outputs from each deep neural net layer, leading to a more optimal estimation. To achieve large temporal receptive fields, multi-scale dilated convolutions are employed to model long-range dependencies among frames. The architecture is straightforward to implement and can be flexibly adopted for real-time applications. Any off-the-shelf 2D pose estimation system, e.g. Mocap libraries, can be easily integrated in an ad-hoc fashion. We both quantitatively and qualitatively evaluate our method on various standard benchmark datasets (e.g. Human3.6M, HumanEva). Our method considerably outperforms all the state-of-the-art algorithms up to 8% error reduction (average mean per joint position error: 34.7) as compared to the best-reported results. Code is available at: (https://github.com/lrxjason/Attention3DHumanPose)","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"23 1","pages":"5063-5072"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77955388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 112
Noise-Aware Fully Webly Supervised Object Detection
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01134
Yunhang Shen, Rongrong Ji, Zhiwei Chen, Xiaopeng Hong, Feng Zheng, Jianzhuang Liu, Mingliang Xu, Q. Tian
We investigate the emerging task of learning object detectors with sole image-level labels on the web without requiring any other supervision like precise annotations or additional images from well-annotated benchmark datasets. Such a task, termed as fully webly supervised object detection, is extremely challenging, since image-level labels on the web are always noisy, leading to poor performance of the learned detectors. In this work, we propose an end-to-end framework to jointly learn webly supervised detectors and reduce the negative impact of noisy labels. Such noise is heterogeneous, which is further categorized into two types, namely background noise and foreground noise. Regarding the background noise, we propose a residual learning structure incorporated with weakly supervised detection, which decomposes background noise and models clean data. To explicitly learn the residual feature between clean data and noisy labels, we further propose a spatially-sensitive entropy criterion, which exploits the conditional distribution of detection results to estimate the confidence of background categories being noise. Regarding the foreground noise, a bagging-mixup learning is introduced, which suppresses foreground noisy signals from incorrectly labelled images, whilst maintaining the diversity of training data. We evaluate the proposed approach on popular benchmark datasets by training detectors on web images, which are retrieved by the corresponding category tags from photo-sharing sites. Extensive experiments show that our method achieves significant improvements over the state-of-the-art methods.
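For readers unfamiliar with the building block behind "bagging-mixup", the sketch below shows plain mixup applied to web images with one-hot image-level tags: convex combinations of inputs and labels soften the impact of incorrectly labelled samples. It only illustrates that underlying mechanism; the paper's bagging procedure and hyper-parameters are not reproduced, and alpha is an assumed value.

```python
import torch

def mixup_batch(images, labels, alpha=0.2):
    """Standard mixup: blend pairs of images and their (soft) labels.
    Shown as the noise-smoothing ingredient bagging-mixup builds on;
    alpha is an assumed hyper-parameter."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]   # soft multi-label targets
    return mixed_images, mixed_labels

if __name__ == "__main__":
    imgs = torch.randn(8, 3, 224, 224)
    lbls = torch.zeros(8, 20)
    lbls[torch.arange(8), torch.randint(0, 20, (8,))] = 1.0    # one-hot image-level web tags
    mi, ml = mixup_batch(imgs, lbls)
    print(mi.shape, ml.shape)
```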
{"title":"Noise-Aware Fully Webly Supervised Object Detection","authors":"Yunhang Shen, Rongrong Ji, Zhiwei Chen, Xiaopeng Hong, Feng Zheng, Jianzhuang Liu, Mingliang Xu, Q. Tian","doi":"10.1109/cvpr42600.2020.01134","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01134","url":null,"abstract":"We investigate the emerging task of learning object detectors with sole image-level labels on the web without requiring any other supervision like precise annotations or additional images from well-annotated benchmark datasets. Such a task, termed as fully webly supervised object detection, is extremely challenging, since image-level labels on the web are always noisy, leading to poor performance of the learned detectors. In this work, we propose an end-to-end framework to jointly learn webly supervised detectors and reduce the negative impact of noisy labels. Such noise is heterogeneous, which is further categorized into two types, namely background noise and foreground noise. Regarding the background noise, we propose a residual learning structure incorporated with weakly supervised detection, which decomposes background noise and models clean data. To explicitly learn the residual feature between clean data and noisy labels, we further propose a spatially-sensitive entropy criterion, which exploits the conditional distribution of detection results to estimate the confidence of background categories being noise. Regarding the foreground noise, a bagging-mixup learning is introduced, which suppresses foreground noisy signals from incorrectly labelled images, whilst maintaining the diversity of training data. We evaluate the proposed approach on popular benchmark datasets by training detectors on web images, which are retrieved by the corresponding category tags from photo-sharing sites. Extensive experiments show that our method achieves significant improvements over the state-of-the-art methods.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"15 1","pages":"11323-11332"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76256887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
Don’t Hit Me! Glass Detection in Real-World Scenes
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00374
Haiyang Mei, Xin Yang, Yang Wang, Yu-An Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, Rynson W. H. Lau
Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass, and the content within the glass region is typically similar to those behind it. In this paper, we propose an important problem of detecting glass from a single RGB image. To address this problem, we construct a large-scale glass detection dataset (GDD) and design a glass detection network, called GDNet, which explores abundant contextual cues for robust glass detection with a novel large-field contextual feature integration (LCFI) module. Extensive experiments demonstrate that the proposed method achieves more superior glass detection results on our GDD test set than state-of-the-art methods fine-tuned for glass detection.
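Because glass regions look like whatever lies behind them, the abstract argues for gathering context over a large field. The sketch below shows one common way to do that — parallel dilated convolutions fused by a 1x1 convolution — as a stand-in for the idea; the actual LCFI module differs, and the branch count and dilation rates here are assumptions.

```python
import torch
import torch.nn as nn

class LargeFieldContext(nn.Module):
    """Illustrative large-field context block: parallel convolutions with
    increasing dilation gather context from progressively larger fields and
    are fused by a 1x1 convolution. Not the paper's LCFI design."""
    def __init__(self, channels=64, dilations=(1, 4, 8, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=d, dilation=d) for d in dilations]
        )
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        ctx = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return torch.relu(self.fuse(ctx)) + x   # residual connection keeps local detail

if __name__ == "__main__":
    m = LargeFieldContext(channels=64)
    print(m(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```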
{"title":"Don’t Hit Me! Glass Detection in Real-World Scenes","authors":"Haiyang Mei, Xin Yang, Yang Wang, Yu-An Liu, Shengfeng He, Qiang Zhang, Xiaopeng Wei, Rynson W. H. Lau","doi":"10.1109/cvpr42600.2020.00374","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00374","url":null,"abstract":"Glass is very common in our daily life. Existing computer vision systems neglect it and thus may have severe consequences, e.g., a robot may crash into a glass wall. However, sensing the presence of glass is not straightforward. The key challenge is that arbitrary objects/scenes can appear behind the glass, and the content within the glass region is typically similar to those behind it. In this paper, we propose an important problem of detecting glass from a single RGB image. To address this problem, we construct a large-scale glass detection dataset (GDD) and design a glass detection network, called GDNet, which explores abundant contextual cues for robust glass detection with a novel large-field contextual feature integration (LCFI) module. Extensive experiments demonstrate that the proposed method achieves more superior glass detection results on our GDD test set than state-of-the-art methods fine-tuned for glass detection.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"97 1","pages":"3684-3693"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73285355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
Taking a Deeper Look at Co-Salient Object Detection
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.00299
Deng-Ping Fan, Zheng Lin, Ge-Peng Ji, Dingwen Zhang, H. Fu, Ming-Ming Cheng
Co-salient object detection (CoSOD) is a newly emerging and rapidly growing branch of salient object detection (SOD), which aims to detect the co-occurring salient objects in multiple images. However, existing CoSOD datasets often have a serious data bias, which assumes that each group of images contains salient objects of similar visual appearances. This bias results in the ideal settings and the effectiveness of the models, trained on existing datasets, may be impaired in real-life situations, where the similarity is usually semantic or conceptual. To tackle this issue, we first collect a new high-quality dataset, named CoSOD3k, which contains 3,316 images divided into 160 groups with multiple level annotations, i.e., category, bounding box, object, and instance levels. CoSOD3k makes a significant leap in terms of diversity, difficulty and scalability, benefiting related vision tasks. Besides, we comprehensively summarize 34 cutting-edge algorithms, benchmarking 19 of them over four existing CoSOD datasets (MSRC, iCoSeg, Image Pair and CoSal2015) and our CoSOD3k with a total of ∼61K images (largest scale), and reporting group-level performance analysis. Finally, we discuss the challenge and future work of CoSOD. Our study would give a strong boost to growth in the CoSOD community. Benchmark toolbox and results are available on our project page.
{"title":"Taking a Deeper Look at Co-Salient Object Detection","authors":"Deng-Ping Fan, Zheng Lin, Ge-Peng Ji, Dingwen Zhang, H. Fu, Ming-Ming Cheng","doi":"10.1109/cvpr42600.2020.00299","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00299","url":null,"abstract":"Co-salient object detection (CoSOD) is a newly emerging and rapidly growing branch of salient object detection (SOD), which aims to detect the co-occurring salient objects in multiple images. However, existing CoSOD datasets often have a serious data bias, which assumes that each group of images contains salient objects of similar visual appearances. This bias results in the ideal settings and the effectiveness of the models, trained on existing datasets, may be impaired in real-life situations, where the similarity is usually semantic or conceptual. To tackle this issue, we first collect a new high-quality dataset, named CoSOD3k, which contains 3,316 images divided into 160 groups with multiple level annotations, i.e., category, bounding box, object, and instance levels. CoSOD3k makes a significant leap in terms of diversity, difficulty and scalability, benefiting related vision tasks. Besides, we comprehensively summarize 34 cutting-edge algorithms, benchmarking 19 of them over four existing CoSOD datasets (MSRC, iCoSeg, Image Pair and CoSal2015) and our CoSOD3k with a total of ∼61K images (largest scale), and reporting group-level performance analysis. Finally, we discuss the challenge and future work of CoSOD. Our study would give a strong boost to growth in the CoSOD community. Benchmark toolbox and results are available on our project page.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"12 1","pages":"2916-2926"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73963587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 60
Image Search With Text Feedback by Visiolinguistic Attention Learning
Pub Date : 2020-06-01 DOI: 10.1109/CVPR42600.2020.00307
Yanbei Chen, S. Gong, Loris Bazzani
Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged in a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentive to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various text feedbacks, including attribute-like and natural language descriptions.
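The core idea — a language-conditioned block plugged into a CNN that selectively preserves or transforms visual features — can be sketched as follows. This is a minimal single-layer approximation with assumed dimensions and a sigmoid gate; it mirrors the concept of a composite transformer but is not the VAL block itself.

```python
import torch
import torch.nn as nn

class TextConditionedVisualBlock(nn.Module):
    """Sketch: spatial positions of a CNN feature map attend over word
    features, and the result gates how much of the original feature is
    preserved vs. transformed. Not the authors' composite transformer."""
    def __init__(self, dim=256):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.scale = dim ** -0.5

    def forward(self, feat_map, words):
        # feat_map: (B, C, H, W) CNN features; words: (B, W, C) text features
        B, C, H, W = feat_map.shape
        tokens = feat_map.flatten(2).transpose(1, 2)   # (B, H*W, C)
        attn = torch.softmax(
            self.q(tokens) @ self.k(words).transpose(1, 2) * self.scale, dim=-1
        )
        lang_ctx = attn @ self.v(words)                # language context per position
        g = self.gate(lang_ctx)                        # how much to preserve vs. transform
        out = g * tokens + (1 - g) * lang_ctx
        return out.transpose(1, 2).reshape(B, C, H, W)

if __name__ == "__main__":
    blk = TextConditionedVisualBlock(dim=256)
    print(blk(torch.randn(2, 256, 7, 7), torch.randn(2, 6, 256)).shape)
```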
{"title":"Image Search With Text Feedback by Visiolinguistic Attention Learning","authors":"Yanbei Chen, S. Gong, Loris Bazzani","doi":"10.1109/CVPR42600.2020.00307","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00307","url":null,"abstract":"Image search with text feedback has promising impacts in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires the synergistic understanding of both image and text. In this work, we tackle this task by a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged in a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is incentive to encapsulate the multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various text feedbacks, including attribute-like and natural language descriptions.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"42 1","pages":"2998-3008"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81617299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 109
Estimating Low-Rank Region Likelihood Maps
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01379
G. Csurka, Z. Kato, Andor Juhasz, M. Humenberger
Low-rank regions capture geometrically meaningful structures in an image which encompass typical local features such as edges, corners and all kinds of regular, symmetric, often repetitive patterns, that are commonly found in man-made environment. While such patterns are challenging current state-of-the-art feature correspondence methods, the recovered homography of a low-rank texture readily provides 3D structure with respect to a 3D plane, without any prior knowledge of the visual information on that plane. However, the automatic and efficient detection of the broad class of low-rank regions is unsolved. Herein, we propose a novel self-supervised low-rank region detection deep network that predicts a low-rank likelihood map from an image. The evaluation of our method on real-world datasets shows not only that it reliably predicts low-rank regions in the image similarly to our baseline method, but thanks to the data augmentations used in the training phase it generalizes well to difficult cases (e.g. day/night lighting, low contrast, underexposure) where the baseline prediction fails.
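To make "low-rank" tangible: a regular, repetitive texture such as tiles or window grids concentrates its spectral energy in a few singular values, while unstructured clutter does not. The toy score below illustrates that property on a grayscale patch; it is only a hand-crafted cue for intuition, not the paper's learned detector, and the rank cut-off is an assumption.

```python
import numpy as np

def low_rank_score(patch, rank=3):
    """Toy low-rank cue: fraction of spectral energy in the top singular
    values of a grayscale patch. Repetitive textures score near 1,
    random clutter lower. Not the paper's method."""
    u, s, vt = np.linalg.svd(patch.astype(np.float64), full_matrices=False)
    return float(s[:rank].sum() / (s.sum() + 1e-8))

if __name__ == "__main__":
    stripes = np.tile(np.sin(np.linspace(0, 8 * np.pi, 64)), (64, 1))  # rank-1 pattern
    noise = np.random.rand(64, 64)
    print(round(low_rank_score(stripes), 3), round(low_rank_score(noise), 3))
```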
{"title":"Estimating Low-Rank Region Likelihood Maps","authors":"G. Csurka, Z. Kato, Andor Juhasz, M. Humenberger","doi":"10.1109/cvpr42600.2020.01379","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01379","url":null,"abstract":"Low-rank regions capture geometrically meaningful structures in an image which encompass typical local features such as edges, corners and all kinds of regular, symmetric, often repetitive patterns, that are commonly found in man-made environment. While such patterns are challenging current state-of-the-art feature correspondence methods, the recovered homography of a low-rank texture readily provides 3D structure with respect to a 3D plane, without any prior knowledge of the visual information on that plane. However, the automatic and efficient detection of the broad class of low-rank regions is unsolved. Herein, we propose a novel self-supervised low-rank region detection deep network that predicts a low-rank likelihood map from an image. The evaluation of our method on real-world datasets shows not only that it reliably predicts low-rank regions in the image similarly to our baseline method, but thanks to the data augmentations used in the training phase it generalizes well to difficult cases (e.g. day/night lighting, low contrast, underexposure) where the baseline prediction fails.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"89 1","pages":"13773-13782"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85143189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Non-Local Neural Networks With Grouped Bilinear Attentional Transforms
Pub Date : 2020-06-01 DOI: 10.1109/cvpr42600.2020.01182
Lu Chi, Zehuan Yuan, Yadong Mu, Changhu Wang
Modeling spatial or temporal long-range dependency plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel size. Recently, a number of non-local operators (such as self-attention based) have been devised. They are typically generic and can be plugged into many existing network pipelines for globally computing among any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of human visual system, which can quickly attend to important local parts in sight and suppress other less-relevant information. The core of our method is learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are three-folds: first, BA-Transform is versatile to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions. Each BA-Transform is learned in a data-adaptive way; Secondly, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels; Thirdly, many existing non-local operators are computation-intensive. The proposed BA-Transform is implemented by simple matrix multiplication and admits better efficacy. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image / video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvement by large margins.
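The abstract states that the transform is grouped over feature channels and realised by plain matrix multiplication. The sketch below shows one way such a grouped bilinear transform can look: each channel group gets a pair of data-dependent matrices that left- and right-multiply its HxW maps (Y = A X B). The parameterisation of A and B, the fixed spatial size, and the residual connection are assumptions, not the paper's BA-Transform.

```python
import torch
import torch.nn as nn

class GroupedBilinearTransform(nn.Module):
    """Sketch of a grouped bilinear transform via matrix multiplication:
    per channel group, matrices A and B act on the spatial dimensions of
    its feature maps. Illustrative only; not the BA-Transform itself."""
    def __init__(self, channels=64, groups=4, height=14, width=14):
        super().__init__()
        self.groups = groups
        self.h, self.w = height, width
        # predict the two transform matrices per group from globally pooled features
        self.make_A = nn.Linear(channels // groups, height * height)
        self.make_B = nn.Linear(channels // groups, width * width)

    def forward(self, x):
        # x: (B, C, H, W) with H, W matching the module's fixed size
        B, C, H, W = x.shape
        xg = x.view(B, self.groups, C // self.groups, H, W)
        pooled = xg.mean(dim=(3, 4))                     # (B, G, C/G)
        A = self.make_A(pooled).view(B, self.groups, 1, H, H)
        Bm = self.make_B(pooled).view(B, self.groups, 1, W, W)
        y = A @ xg @ Bm                                  # broadcast over channels in each group
        return y.reshape(B, C, H, W) + x                 # residual keeps the original signal

if __name__ == "__main__":
    m = GroupedBilinearTransform(channels=64, groups=4, height=14, width=14)
    print(m(torch.randn(2, 64, 14, 14)).shape)           # torch.Size([2, 64, 14, 14])
```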
{"title":"Non-Local Neural Networks With Grouped Bilinear Attentional Transforms","authors":"Lu Chi, Zehuan Yuan, Yadong Mu, Changhu Wang","doi":"10.1109/cvpr42600.2020.01182","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01182","url":null,"abstract":"Modeling spatial or temporal long-range dependency plays a key role in deep neural networks. Conventional dominant solutions include recurrent operations on sequential data or deeply stacking convolutional layers with small kernel size. Recently, a number of non-local operators (such as self-attention based) have been devised. They are typically generic and can be plugged into many existing network pipelines for globally computing among any two neurons in a feature map. This work proposes a novel non-local operator. It is inspired by the attention mechanism of human visual system, which can quickly attend to important local parts in sight and suppress other less-relevant information. The core of our method is learnable and data-adaptive bilinear attentional transform (BA-Transform), whose merits are three-folds: first, BA-Transform is versatile to model a wide spectrum of local or global attentional operations, such as emphasizing specific local regions. Each BA-Transform is learned in a data-adaptive way; Secondly, to address the discrepancy among features, we further design grouped BA-Transforms, which essentially apply different attentional operations to different groups of feature channels; Thirdly, many existing non-local operators are computation-intensive. The proposed BA-Transform is implemented by simple matrix multiplication and admits better efficacy. For empirical evaluation, we perform comprehensive experiments on two large-scale benchmarks, ImageNet and Kinetics, for image / video classification respectively. The achieved accuracies and various ablation experiments consistently demonstrate significant improvement by large margins.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"48 1","pages":"11801-11810"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83969905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16