RaScaNet: Learning Tiny Models by Raster-Scanning Images
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01346
Jaehyoung Yoo, Dongwook Lee, Changyong Son, S. Jung, ByungIn Yoo, Changkyu Choi, Jae-Joon Han, Bohyung Han
Deploying deep convolutional neural networks on ultra-low-power systems is challenging due to their extremely limited resources. In particular, memory becomes a bottleneck because such systems place a hard limit on the size of on-chip memory. Since the peak-memory explosion in the lower layers is critical even in tiny models, the input image must be downscaled at the cost of accuracy. To overcome this drawback, we propose a novel Raster-Scanning Network, named RaScaNet, inspired by raster scanning in image sensors. RaScaNet reads only a few rows of pixels at a time using a convolutional neural network and then sequentially learns a representation of the whole image using a recurrent neural network. The proposed method operates on an ultra-low-power system without input size reduction; it requires 15.9–24.3× smaller peak memory and 5.3–12.9× smaller weight memory than state-of-the-art tiny models. Moreover, RaScaNet fully exploits the on-chip SRAM and cache memory of the system, as the sum of the peak memory and the weight memory does not exceed 60 KB, improving the power efficiency of the system. In our experiments, we demonstrate the binary classification performance of RaScaNet on the Visual Wake Words and Pascal VOC datasets.
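As a rough illustration of the raster-scanning idea described above, the sketch below pairs a small per-strip CNN with a GRU that accumulates the whole-image representation a few rows at a time. The layer widths, strip height, and names are assumptions made for this example; it is not the authors' RaScaNet, and an on-device version would stream strips from the sensor instead of splitting a full image tensor.

```python
import torch
import torch.nn as nn

class RasterScanClassifier(nn.Module):
    def __init__(self, rows_per_strip=4, hidden=128):
        super().__init__()
        self.rows_per_strip = rows_per_strip
        # Small CNN applied to one strip of rows at a time (keeps peak memory low).
        self.strip_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one 32-d vector per strip
        )
        # RNN sequentially accumulates the whole-image representation.
        self.rnn = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # binary-classification logit

    def forward(self, image):  # image: (B, 3, H, W)
        strips = image.split(self.rows_per_strip, dim=2)        # a few rows at a time
        feats = [self.strip_cnn(s).flatten(1) for s in strips]  # list of (B, 32)
        seq = torch.stack(feats, dim=1)                         # (B, num_strips, 32)
        _, h = self.rnn(seq)
        return self.head(h[-1])                                 # (B, 1)

logit = RasterScanClassifier()(torch.randn(2, 3, 96, 96))
```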
{"title":"RaScaNet: Learning Tiny Models by Raster-Scanning Images","authors":"Jaehyoung Yoo, Dongwook Lee, Changyong Son, S. Jung, ByungIn Yoo, Changkyu Choi, Jae-Joon Han, Bohyung Han","doi":"10.1109/CVPR46437.2021.01346","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01346","url":null,"abstract":"Deploying deep convolutional neural networks on ultra-low power systems is challenging due to the extremely limited resources. Especially, the memory becomes a bottleneck as the systems put a hard limit on the size of on-chip memory. Because peak memory explosion in the lower layers is critical even in tiny models, the size of an input image should be reduced with sacrifice in accuracy. To overcome this drawback, we propose a novel Raster-Scanning Network, named RaScaNet, inspired by raster-scanning in image sensors. RaScaNet reads only a few rows of pixels at a time using a convolutional neural network and then sequentially learns the representation of the whole image using a recurrent neural network. The proposed method operates on an ultra-low power system without input size reduction; it requires 15.9–24.3× smaller peak memory and 5.3–12.9× smaller weight memory than the state-of-the-art tiny models. Moreover, RaScaNet fully exploits on-chip SRAM and cache memory of the system as the sum of the peak memory and the weight memory does not exceed 60 KB, improving the power efficiency of the system. In our experiments, we demonstrate the binary classification performance of RaScaNet on Visual Wake Words and Pascal VOC datasets.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133555721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Spatial-Semantic Relationship for Facial Attribute Recognition with Limited Labeled Data
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01174
Y. Shu, Yan Yan, Si Chen, Jing-Hao Xue, Chunhua Shen, Hanzi Wang
Recent advances in deep learning have demonstrated excellent results for Facial Attribute Recognition (FAR), typically trained with large-scale labeled data. However, in many real-world FAR applications only limited labeled data are available, leading to a marked deterioration in performance for most existing deep-learning-based FAR methods. To address this problem, we propose a method termed Spatial-Semantic Patch Learning (SSPL). The training of SSPL involves two stages. First, three auxiliary tasks, consisting of a Patch Rotation Task (PRT), a Patch Segmentation Task (PST), and a Patch Classification Task (PCT), are jointly developed to learn the spatial-semantic relationship from large-scale unlabeled facial data, yielding a powerful pre-trained model. In particular, PRT exploits the spatial information of facial images in a self-supervised learning manner, while PST and PCT respectively capture the pixel-level and image-level semantic information of facial images based on a facial parsing model. Second, the spatial-semantic knowledge learned from the auxiliary tasks is transferred to the FAR task, so that only a limited amount of labeled data is required to fine-tune the pre-trained model. Extensive experiments and studies substantiate that our method achieves superior performance compared with state-of-the-art methods.
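To make the two-stage setup concrete, here is a minimal sketch of a shared backbone with the three auxiliary heads (PRT, PST, PCT) plus an attribute head for the second stage. The backbone, head sizes, and number of parsing classes are assumptions for illustration, not the authors' SSPL architecture.

```python
import torch
import torch.nn as nn

class SSPLSketch(nn.Module):
    """Shared backbone with three auxiliary heads (stage 1) and a FAR head (stage 2)."""
    def __init__(self, n_parsing=11, n_attrs=40):  # class counts are assumptions
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.rot_head = nn.Linear(64, 4)              # PRT: predict one of 4 patch rotations
        self.seg_head = nn.Conv2d(64, n_parsing, 1)   # PST: pixel-level parsing labels
        self.cls_head = nn.Linear(64, n_parsing)      # PCT: image-level parsing labels
        self.attr_head = nn.Linear(64, n_attrs)       # stage-2 attribute classifier

    def forward(self, x):
        f = self.backbone(x)              # (B, 64, H, W)
        g = f.mean(dim=(2, 3))            # globally pooled features
        return self.rot_head(g), self.seg_head(f), self.cls_head(g), self.attr_head(g)

# Stage 1: minimize the sum of the three auxiliary losses on unlabeled faces
# (parsing pseudo-labels would come from a facial parsing model).
# Stage 2: fine-tune the backbone and attr_head on the small labeled FAR set.
out = SSPLSketch()(torch.randn(2, 3, 64, 64))
```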
{"title":"Learning Spatial-Semantic Relationship for Facial Attribute Recognition with Limited Labeled Data","authors":"Y. Shu, Yan Yan, Si Chen, Jing-Hao Xue, Chunhua Shen, Hanzi Wang","doi":"10.1109/CVPR46437.2021.01174","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01174","url":null,"abstract":"Recent advances in deep learning have demonstrated excellent results for Facial Attribute Recognition (FAR), typically trained with large-scale labeled data. However, in many real-world FAR applications, only limited labeled data are available, leading to remarkable deterioration in performance for most existing deep learning-based FAR methods. To address this problem, here we propose a method termed Spatial-Semantic Patch Learning (SSPL). The training of SSPL involves two stages. First, three auxiliary tasks, consisting of a Patch Rotation Task (PRT), a Patch Segmentation Task (PST), and a Patch Classification Task (PCT), are jointly developed to learn the spatial-semantic relationship from large-scale unlabeled facial data. We thus obtain a powerful pre-trained model. In particular, PRT exploits the spatial information of facial images in a self-supervised learning manner. PST and PCT respectively capture the pixel-level and image-level semantic information of facial images based on a facial parsing model. Second, the spatial-semantic knowledge learned from auxiliary tasks is transferred to the FAR task. By doing so, it enables that only a limited number of labeled data are required to fine-tune the pre-trained model. We achieve superior performance compared with state-of-the-art methods, as substantiated by extensive experiments and studies.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133568619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
4D Hyperspectral Photoacoustic Data Restoration with Reliability Analysis
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00457
Weihang Liao, Art Subpa-Asa, Yinqiang Zheng, Imari Sato
Hyperspectral photoacoustic (HSPA) spectroscopy is an emerging bi-modal imaging technology that can reveal the wavelength-dependent absorption distribution of the interior of a 3D volume. However, HSPA devices have to scan an object exhaustively in the spatial and spectral domains, and the acquired data tend to suffer from complex noise. This time-consuming scanning process and the noise severely affect the usability of HSPA. It is therefore critical to examine the feasibility of restoring 4D HSPA data from an incomplete and noisy observation. In this work, we present a data reliability analysis for the depth and spectral domains. On the basis of this analysis, we explore the inherent data correlations and develop a restoration algorithm to recover 4D HSPA cubes. Experiments on real data verify that the proposed method achieves satisfactory restoration results.
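As a toy illustration of restoring an incomplete, noisy 4D cube by exploiting correlations, the sketch below fits the missing entries with a data-fidelity term on the sampled positions plus a smoothness prior along the spectral axis. The optimizer, prior, and hyperparameters are assumptions; this is not the reliability-guided algorithm proposed in the paper.

```python
import torch

def restore(observed, mask, iters=300, lam=0.1, lr=0.05):
    """observed, mask: (X, Y, Z, L) tensors; mask is 1 where a sample was acquired."""
    x = observed.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        fidelity = ((x - observed)[mask.bool()] ** 2).mean()   # fit the observed samples
        spectral_tv = (x[..., 1:] - x[..., :-1]).abs().mean()  # spectral-smoothness prior
        (fidelity + lam * spectral_tv).backward()
        opt.step()
    return x.detach()

cube = torch.randn(8, 8, 8, 16)
restored = restore(cube, (torch.rand_like(cube) > 0.5).float())
```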
{"title":"4D Hyperspectral Photoacoustic Data Restoration with Reliability Analysis","authors":"Weihang Liao, Art Subpa-Asa, Yinqiang Zheng, Imari Sato","doi":"10.1109/CVPR46437.2021.00457","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00457","url":null,"abstract":"Hyperspectral photoacoustic (HSPA) spectroscopy is an emerging bi-modal imaging technology that is able to show the wavelength-dependent absorption distribution of the interior of a 3D volume. However, HSPA devices have to scan an object exhaustively in the spatial and spectral domains; and the acquired data tend to suffer from complex noise. This time-consuming scanning process and noise severely affects the usability of HSPA. It is therefore critical to examine the feasibility of 4D HSPA data restoration from an in-complete and noisy observation. In this work, we present a data reliability analysis for the depth and spectral domain. On the basis of this analysis, we explore the inherent data correlations and develop a restoration algorithm to recover 4D HSPA cubes. Experiments on real data verify that the proposed method achieves satisfactory restoration results.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133169355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00567
Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, A. Bewley, Xiao Zhang, C. Sminchisescu, Drago Anguelov
The detection of 3D objects from LiDAR data is a critical component of most autonomous driving systems. Safe, high-speed driving requires larger detection ranges, which are enabled by new LiDARs; these larger ranges in turn require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN) – a simple, efficient, and accurate 3D object detector – to tackle real-time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images result in significantly fewer selected foreground points, enabling the later sparse convolutions in RSN to operate efficiently. Combining features from the range image further enhances detection accuracy. RSN runs at more than 60 frames per second on a 150 m × 150 m detection region on the Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of November 2020, RSN is ranked first on the WOD leaderboard under the APH/LEVEL_1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.
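The foreground-selection stage can be sketched as a lightweight 2D CNN over the range image followed by gathering the 3D points whose foreground probability exceeds a threshold; only those points would then be voxelized and fed to the sparse-convolution detection head, which is omitted here. The network shape and threshold are assumptions for illustration, not the RSN implementation.

```python
import torch
import torch.nn as nn

class ForegroundSelector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))                   # per-pixel foreground logit

    def forward(self, range_image, points_xyz, score_thresh=0.5):
        # range_image: (B, 1, H, W); points_xyz: (B, H, W, 3) per-pixel 3D points
        prob = torch.sigmoid(self.net(range_image)).squeeze(1)   # (B, H, W)
        keep = prob > score_thresh
        # Only the kept points would be voxelized and passed to the sparse convolutions.
        return [points_xyz[b][keep[b]] for b in range(range_image.size(0))]

fg_points = ForegroundSelector()(torch.rand(1, 1, 64, 256), torch.randn(1, 64, 256, 3))
```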
{"title":"RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection","authors":"Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, A. Bewley, Xiao Zhang, C. Sminchisescu, Drago Anguelov","doi":"10.1109/CVPR46437.2021.00567","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00567","url":null,"abstract":"The detection of 3D objects from LiDAR data is a critical component in most autonomous driving systems. Safe, high speed driving needs larger detection ranges, which are enabled by new LiDARs. These larger detection ranges require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN) – a simple, efficient, and accurate 3D object detector – in order to tackle real time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images results in significantly fewer selected foreground points, thus enabling the later sparse convolutions in RSN to efficiently operate. Combining features from the range image further enhance detection accuracy. RSN runs at more than 60 frames per second on a 150m × 150m detection region on Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of 11/2020, RSN is ranked first in the WOD leaderboard based on the APH/LEVEL_1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115404549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Progressive Unsupervised Learning for Visual Object Tracking
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00301
Wu, Jia Wan, Antoni B. Chan
In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. Specifically, we first learn a background discrimination (BD) model that effectively distinguishes an object from the background in a contrastive-learning manner. We then employ the BD model to progressively mine temporally corresponding patches (i.e., patches connected by a track) in sequential frames. As the BD model is imperfect, the mined patch pairs are noisy; we therefore propose a noise-robust loss function to learn temporal correspondences more effectively from this noisy data. We use the proposed noise-robust loss to train the backbone networks of Siamese trackers. Without online fine-tuning or adaptation, our unsupervised real-time Siamese trackers outperform state-of-the-art unsupervised deep trackers and achieve results competitive with supervised baselines.
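As an illustration of training on noisy mined pairs, the sketch below down-weights pairs with large per-pair loss, a generic robust-reweighting scheme. It is offered only to show the shape of a noise-robust objective; it is not the specific loss function proposed in this paper.

```python
import torch
import torch.nn.functional as F

def robust_pair_loss(sim, labels, temperature=0.1):
    """sim: (N,) similarities of mined patch pairs; labels: (N,) in {0, 1}."""
    per_pair = F.binary_cross_entropy_with_logits(
        sim / temperature, labels.float(), reduction="none")
    weights = torch.exp(-per_pair.detach())  # likely-noisy (high-loss) pairs get small weight
    return (weights * per_pair).sum() / weights.sum()

loss = robust_pair_loss(torch.randn(32), torch.randint(0, 2, (32,)))
```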
{"title":"Progressive Unsupervised Learning for Visual Object Tracking","authors":"Wu, Jia Wan, Antoni B. Chan","doi":"10.1109/CVPR46437.2021.00301","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00301","url":null,"abstract":"In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. Specifically, we first learn a background discrimination (BD) model that effectively distinguishes an object from back-ground in a contrastive learning way. We then employ the BD model to progressively mine temporal corresponding patches (i.e., patches connected by a track) in sequential frames. As the BD model is imperfect and thus the mined patch pairs are noisy, we propose a noise-robust loss function to more effectively learn temporal correspondences from this noisy data. We use the proposed noise robust loss to train backbone networks of Siamese trackers. Without online fine-tuning or adaptation, our unsupervised real-time Siamese trackers can outperform state-of-the-art unsupervised deep trackers and achieve competitive results to the supervised baselines.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115641263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient Forward-Propagation for Large-Scale Temporal Video Modelling
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00913
Mateusz Malinowski, Dimitrios Vytiniotis, G. Swirszcz, Viorica Patraucean, J. Carreira
How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.
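The skip-connection flavor of temporal integration can be illustrated with a toy unit that adds the previous step's (detached) features to the current ones; the gradient forward-propagation machinery of Sideways/Skip-Sideways itself is not reproduced here, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSkipUnit(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x_t, carry):
        h_t = torch.relu(self.layer(x_t)) + carry  # skip connection over the time axis
        return h_t, h_t.detach()                   # pass stale features to the next step

unit = TemporalSkipUnit()
carry = torch.zeros(1, 64)
for x_t in torch.randn(5, 1, 64):                  # a short stream of frame features
    h_t, carry = unit(x_t, carry)
```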
{"title":"Gradient Forward-Propagation for Large-Scale Temporal Video Modelling","authors":"Mateusz Malinowski, Dimitrios Vytiniotis, G. Swirszcz, Viorica Patraucean, J. Carreira","doi":"10.1109/CVPR46437.2021.00913","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00913","url":null,"abstract":"How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"105 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115761445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01357
Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, C. Duong, M. Tran, Khoa Luu
Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging range of real-world applications. Despite a large number of existing works, solving the data association problem in any MC-MOT pipeline remains arguably one of the most challenging tasks. Developing a robust MC-MOT system is still highly challenging due to many practical issues such as inconsistent lighting conditions, varying object movement patterns, and trajectory occlusions of objects between cameras. To address these problems, this work proposes a new Dynamic Graph Model with Link Prediction (DyGLIP) approach to solve the data association task. Compared to existing methods, our new model offers several advantages, including better feature representations and the ability to recover from lost tracks during camera transitions. Moreover, our model works gracefully regardless of the overlapping ratios between cameras. Experimental results show that we outperform existing MC-MOT algorithms by a large margin on several practical datasets. Notably, our model works favorably in online settings but can be extended to an incremental approach for large-scale datasets.
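As a simplified illustration of link prediction for cross-camera association, the sketch below scores every candidate link between track embeddings from two cameras with a small MLP and thresholds the scores. The scorer and the naive thresholding are assumptions for illustration; DyGLIP's dynamic-graph attention and feature updates are not reproduced.

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats_a, feats_b):
        # feats_a: (Na, dim) track embeddings from camera A; feats_b: (Nb, dim) from camera B
        na, nb = feats_a.size(0), feats_b.size(0)
        pairs = torch.cat([feats_a.unsqueeze(1).expand(na, nb, -1),
                           feats_b.unsqueeze(0).expand(na, nb, -1)], dim=-1)
        return self.mlp(pairs).squeeze(-1)          # (Na, Nb) link logits

scores = LinkScorer()(torch.randn(4, 256), torch.randn(5, 256))
matches = scores.sigmoid() > 0.5                    # naive association by thresholding
```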
{"title":"DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking","authors":"Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, C. Duong, M. Tran, Khoa Luu","doi":"10.1109/CVPR46437.2021.01357","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01357","url":null,"abstract":"Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging applicability in several real-world applications. Despite a large number of existing works, solving the data association problem in any MC-MOT pipeline is arguably one of the most challenging tasks. Developing a robust MC-MOT system, however, is still highly challenging due to many practical issues such as inconsistent lighting conditions, varying object movement patterns, or the trajectory occlusions of the objects between the cameras. To address these problems, this work, therefore, proposes a new Dynamic Graph Model with Link Prediction (DyGLIP) approach 1 to solve the data association task. Compared to existing methods, our new model offers several advantages, including better feature representations and the ability to recover from lost tracks during camera transitions. Moreover, our model works gracefully regardless of the overlapping ratios between the cameras. Experimental results show that we out-perform existing MC-MOT algorithms by a large margin on several practical datasets. Notably, our model works favor-ably on online settings but can be extended to an incremental approach for large-scale datasets.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116354525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based High-order Relation Modeling for Long-term Action Recognition
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00887
Jiaming Zhou, Kun-Yu Lin, Haoxin Li, Weishi Zheng
Long-term actions involve many important visual concepts, e.g., objects, motions, and sub-actions, and there are various relations among these concepts, which we call basic relations. These basic relations jointly influence each other during the temporal evolution of long-term actions, forming the high-order relations that are essential for long-term action recognition. In this paper, we propose a Graph-based High-order Relation Modeling (GHRM) module to exploit these high-order relations for long-term action recognition. In GHRM, each basic relation is modeled by a graph in which each node represents a segment of a long video. Moreover, when modeling each basic relation, GHRM incorporates information from all the other basic relations, so the high-order relations in long-term actions can be fully exploited. To better exploit the high-order relations along the time dimension, we design a GHRM layer consisting of a Temporal-GHRM branch and a Semantic-GHRM branch, which model local temporal high-order relations and global semantic high-order relations, respectively. Experimental results on three long-term action recognition datasets, namely Breakfast, Charades, and MultiThumos, demonstrate the effectiveness of our model.
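A single round of graph reasoning over video segments can be sketched as message passing over a similarity-based adjacency, as below. The similarity adjacency, dimensions, and residual update are assumptions for illustration and do not reproduce the Temporal-/Semantic-GHRM branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentGraphLayer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, segs):                         # segs: (B, T, dim), T segments per video
        # Adjacency from pairwise segment similarity, then one message-passing step.
        adj = F.softmax(segs @ segs.transpose(1, 2) / segs.size(-1) ** 0.5, dim=-1)
        return segs + F.relu(self.proj(adj @ segs))  # residual graph update

out = SegmentGraphLayer()(torch.randn(2, 8, 512))
```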
{"title":"Graph-based High-order Relation Modeling for Long-term Action Recognition","authors":"Jiaming Zhou, Kun-Yu Lin, Haoxin Li, Weishi Zheng","doi":"10.1109/CVPR46437.2021.00887","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00887","url":null,"abstract":"Long-term actions involve many important visual concepts, e.g., objects, motions, and sub-actions, and there are various relations among these concepts, which we call basic relations. These basic relations will jointly affect each other during the temporal evolution of long-term actions, which forms the high-order relations that are essential for long-term action recognition. In this paper, we propose a Graph-based High-order Relation Modeling (GHRM) module to exploit the high-order relations in the long-term actions for long-term action recognition. In GHRM, each basic relation in the long-term actions will be modeled by a graph, where each node represents a segment in a long video. Moreover, when modeling each basic relation, the information from all the other basic relations will be incorporated by GHRM, and thus the high-order relations in the long-term actions can be well exploited. To better exploit the high-order relations along the time dimension, we design a GHRM-layer consisting of a Temporal-GHRM branch and a Semantic-GHRM branch, which aims to model the local temporal high-order relations and global semantic high-order relations. The experimental results on three long-term action recognition datasets, namely, Breakfast, Charades, and MultiThumos, demonstrate the effectiveness of our model.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117190786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Relevance-CAM: Your Model Already Knows Where to Look
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01470
J. Lee, Sewon Kim, I. Park, Taejoon Eo, D. Hwang
As neural networks are applied to an ever-widening range of fields, the ability to explain deep learning models is becoming increasingly important. In particular, before practical deployment, it is crucial to analyze a model's inference and the process by which it generates its results. A common family of explanation methods is based on Class Activation Mapping (CAM), which is often used to understand the last layer of the convolutional neural networks popular in computer vision. In this paper, we propose a novel CAM method named Relevance-weighted Class Activation Mapping (Relevance-CAM) that utilizes Layer-wise Relevance Propagation to obtain the weighting components. This makes the explanation map faithful and robust to the shattered-gradient problem, a shared weakness of gradient-based CAM methods that causes noisy saliency maps for intermediate layers. Therefore, our proposed method can better explain a model by correctly analyzing the intermediate layers as well as the last convolutional layer. In this paper, we visualize how each layer of popular image processing models extracts class-specific features using Relevance-CAM, evaluate its localization ability, and show why gradient-based CAM cannot be used to explain intermediate layers, as demonstrated by experiments on the weighting component. Relevance-CAM outperforms other CAM-based methods in recognition and localization evaluation at layers of any depth. The source code is available at: https://github.com/mongeoroo/Relevance-CAM
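The final combination step shared by CAM-style methods can be sketched as a relevance-weighted sum of channel activation maps; in Relevance-CAM the per-channel weights would come from Layer-wise Relevance Propagation, whose computation is omitted here and simply assumed as an input.

```python
import torch
import torch.nn.functional as F

def weighted_cam(activations, channel_weights, out_size):
    """activations: (C, H, W) feature maps; channel_weights: (C,) per-channel relevance."""
    cam = (channel_weights.view(-1, 1, 1) * activations).sum(0)
    cam = torch.relu(cam)
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]

saliency = weighted_cam(torch.randn(64, 14, 14), torch.randn(64), (224, 224))
```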
{"title":"Relevance-CAM: Your Model Already Knows Where to Look","authors":"J. Lee, Sewon Kim, I. Park, Taejoon Eo, D. Hwang","doi":"10.1109/CVPR46437.2021.01470","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01470","url":null,"abstract":"With increasing fields of application for neural networks and the development of neural networks, the ability to explain deep learning models is also becoming increasingly important. Especially, prior to practical applications, it is crucial to analyze a model’s inference and the process of generating the results. A common explanation method is Class Activation Mapping(CAM) based method where it is often used to understand the last layer of the convolutional neural networks popular in the field of Computer Vision. In this paper, we propose a novel CAM method named Relevance-weighted Class Activation Mapping(Relevance-CAM) that utilizes Layer-wise Relevance Propagation to obtain the weighting components. This allows the explanation map to be faithful and robust to the shattered gradient problem, a shared problem of the gradient based CAM methods that causes noisy saliency maps for intermediate layers. Therefore, our proposed method can better explain a model by correctly analyzing the intermediate layers as well as the last convolutional layer. In this paper, we visualize how each layer of the popular image processing models extracts class specific features using Relevance-CAM, evaluate the localization ability, and show why the gradient based CAM cannot be used to explain the intermediate layers, proven by experimenting the weighting component. Relevance-CAM outperforms other CAM-based methods in recognition and localization evaluation in layers of any depth. The source code is available at: https://github.com/mongeoroo/Relevance-CAM","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124420105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based High-Order Relation Discovery for Fine-grained Recognition
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01483
Yifan Zhao, Ke Yan, Feiyue Huang, Jia Li
Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects. Most existing works tend to amplify discriminative part regions with attention mechanisms. Besides their unstable performance under complex backgrounds, such methods leave the intrinsic interrelationships between different semantic features largely unexplored. To this end, we propose an effective graph-based relation discovery approach to build a contextual understanding of high-order relationships. In our approach, a high-dimensional feature bank is first formed and jointly regularized with semantic- and positional-aware high-order constraints, endowing the feature representations with rich attributes. Second, to overcome the curse of dimensionality, we propose a graph-based semantic grouping strategy that embeds this high-order tensor bank into a low-dimensional space. Meanwhile, a group-wise learning strategy is proposed to regularize the features around the cluster embedding centers. With the collaborative learning of these three modules, our approach is able to grasp stronger contextual details of fine-grained objects. Experimental evidence demonstrates that our approach achieves new state-of-the-art results on four widely used fine-grained object recognition benchmarks.
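As a rough sketch of grouping a high-dimensional feature bank into a small number of semantic group embeddings, the example below uses a learned soft assignment followed by a low-dimensional projection. The dimensions and assignment scheme are assumptions for illustration, not the paper's grouping module or its high-order constraints.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGrouping(nn.Module):
    def __init__(self, in_dim=2048, n_groups=8, out_dim=256):
        super().__init__()
        self.assign = nn.Linear(in_dim, n_groups)   # soft group assignments
        self.embed = nn.Linear(in_dim, out_dim)     # low-dimensional embedding

    def forward(self, bank):                         # bank: (B, N, in_dim) feature bank
        a = F.softmax(self.assign(bank), dim=1)      # normalize assignments over the N features
        return a.transpose(1, 2) @ self.embed(bank)  # (B, n_groups, out_dim) group embeddings

groups = SemanticGrouping()(torch.randn(2, 49, 2048))
```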
{"title":"Graph-based High-Order Relation Discovery for Fine-grained Recognition","authors":"Yifan Zhao, Ke Yan, Feiyue Huang, Jia Li","doi":"10.1109/CVPR46437.2021.01483","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01483","url":null,"abstract":"Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects. Most of the existing works tend to amplify discriminative part regions with attention mechanisms. Besides its unstable performance under complex backgrounds, the intrinsic interrelationship between different semantic features is less explored. Toward this end, we propose an effective graph-based relation discovery approach to build a contextual understanding of high-order relationships. In our approach, a high-dimensional feature bank is first formed and jointly regularized with semantic- and positional-aware high-order constraints, endowing rich attributes to feature representations. Second, to overcome the high-dimension curse, we propose a graph-based semantic grouping strategy to embed this high-order tensor bank into a low-dimensional space. Meanwhile, a group-wise learning strategy is proposed to regularize the features focusing on the cluster embedding center. With the collaborative learning of three modules, our module is able to grasp the stronger contextual details of fine-grained objects. Experimental evidence demonstrates our approach achieves new state-of-the-art on 4 widely-used fine-grained object recognition benchmarks.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125795889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}