RaScaNet: Learning Tiny Models by Raster-Scanning Images
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01346
Jaehyoung Yoo, Dongwook Lee, Changyong Son, S. Jung, ByungIn Yoo, Changkyu Choi, Jae-Joon Han, Bohyung Han
Deploying deep convolutional neural networks on ultra-low-power systems is challenging due to their extremely limited resources. In particular, memory becomes a bottleneck because such systems place a hard limit on the size of on-chip memory. Since the peak-memory explosion in the lower layers is critical even in tiny models, the input image must be downscaled at the cost of accuracy. To overcome this drawback, we propose a novel Raster-Scanning Network, named RaScaNet, inspired by raster scanning in image sensors. RaScaNet reads only a few rows of pixels at a time using a convolutional neural network and then sequentially learns a representation of the whole image using a recurrent neural network. The proposed method operates on an ultra-low-power system without input size reduction; it requires 15.9–24.3× smaller peak memory and 5.3–12.9× smaller weight memory than state-of-the-art tiny models. Moreover, RaScaNet fully exploits the on-chip SRAM and cache memory of the system, as the sum of the peak memory and the weight memory does not exceed 60 KB, improving the power efficiency of the system. In our experiments, we demonstrate the binary classification performance of RaScaNet on the Visual Wake Words and Pascal VOC datasets.
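As a rough illustration of the raster-scanning idea described above, the sketch below pairs a small per-strip CNN with a GRU that accumulates the whole-image representation a few rows at a time. The layer widths, strip height, and names are assumptions made for this example; it is not the authors' RaScaNet, and an on-device version would stream strips from the sensor instead of splitting a full image tensor.

```python
import torch
import torch.nn as nn

class RasterScanClassifier(nn.Module):
    def __init__(self, rows_per_strip=4, hidden=128):
        super().__init__()
        self.rows_per_strip = rows_per_strip
        # Small CNN applied to one strip of rows at a time (keeps peak memory low).
        self.strip_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # one 32-d vector per strip
        )
        # RNN sequentially accumulates the whole-image representation.
        self.rnn = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # binary-classification logit

    def forward(self, image):  # image: (B, 3, H, W)
        strips = image.split(self.rows_per_strip, dim=2)        # a few rows at a time
        feats = [self.strip_cnn(s).flatten(1) for s in strips]  # list of (B, 32)
        seq = torch.stack(feats, dim=1)                         # (B, num_strips, 32)
        _, h = self.rnn(seq)
        return self.head(h[-1])                                 # (B, 1)

logit = RasterScanClassifier()(torch.randn(2, 3, 96, 96))
```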
{"title":"RaScaNet: Learning Tiny Models by Raster-Scanning Images","authors":"Jaehyoung Yoo, Dongwook Lee, Changyong Son, S. Jung, ByungIn Yoo, Changkyu Choi, Jae-Joon Han, Bohyung Han","doi":"10.1109/CVPR46437.2021.01346","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01346","url":null,"abstract":"Deploying deep convolutional neural networks on ultra-low power systems is challenging due to the extremely limited resources. Especially, the memory becomes a bottleneck as the systems put a hard limit on the size of on-chip memory. Because peak memory explosion in the lower layers is critical even in tiny models, the size of an input image should be reduced with sacrifice in accuracy. To overcome this drawback, we propose a novel Raster-Scanning Network, named RaScaNet, inspired by raster-scanning in image sensors. RaScaNet reads only a few rows of pixels at a time using a convolutional neural network and then sequentially learns the representation of the whole image using a recurrent neural network. The proposed method operates on an ultra-low power system without input size reduction; it requires 15.9–24.3× smaller peak memory and 5.3–12.9× smaller weight memory than the state-of-the-art tiny models. Moreover, RaScaNet fully exploits on-chip SRAM and cache memory of the system as the sum of the peak memory and the weight memory does not exceed 60 KB, improving the power efficiency of the system. In our experiments, we demonstrate the binary classification performance of RaScaNet on Visual Wake Words and Pascal VOC datasets.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133555721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Spatial-Semantic Relationship for Facial Attribute Recognition with Limited Labeled Data
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01174
Y. Shu, Yan Yan, Si Chen, Jing-Hao Xue, Chunhua Shen, Hanzi Wang
Recent advances in deep learning have demonstrated excellent results for Facial Attribute Recognition (FAR), typically trained with large-scale labeled data. However, in many real-world FAR applications only limited labeled data are available, leading to a marked deterioration in performance for most existing deep-learning-based FAR methods. To address this problem, we propose a method termed Spatial-Semantic Patch Learning (SSPL). The training of SSPL involves two stages. First, three auxiliary tasks, consisting of a Patch Rotation Task (PRT), a Patch Segmentation Task (PST), and a Patch Classification Task (PCT), are jointly developed to learn the spatial-semantic relationship from large-scale unlabeled facial data, yielding a powerful pre-trained model. In particular, PRT exploits the spatial information of facial images in a self-supervised learning manner, while PST and PCT respectively capture the pixel-level and image-level semantic information of facial images based on a facial parsing model. Second, the spatial-semantic knowledge learned from the auxiliary tasks is transferred to the FAR task, so that only a limited amount of labeled data is required to fine-tune the pre-trained model. Extensive experiments and studies substantiate that our method achieves superior performance compared with state-of-the-art methods.
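To make the two-stage setup concrete, here is a minimal sketch of a shared backbone with the three auxiliary heads (PRT, PST, PCT) plus an attribute head for the second stage. The backbone, head sizes, and number of parsing classes are assumptions for illustration, not the authors' SSPL architecture.

```python
import torch
import torch.nn as nn

class SSPLSketch(nn.Module):
    """Shared backbone with three auxiliary heads (stage 1) and a FAR head (stage 2)."""
    def __init__(self, n_parsing=11, n_attrs=40):  # class counts are assumptions
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.rot_head = nn.Linear(64, 4)              # PRT: predict one of 4 patch rotations
        self.seg_head = nn.Conv2d(64, n_parsing, 1)   # PST: pixel-level parsing labels
        self.cls_head = nn.Linear(64, n_parsing)      # PCT: image-level parsing labels
        self.attr_head = nn.Linear(64, n_attrs)       # stage-2 attribute classifier

    def forward(self, x):
        f = self.backbone(x)              # (B, 64, H, W)
        g = f.mean(dim=(2, 3))            # globally pooled features
        return self.rot_head(g), self.seg_head(f), self.cls_head(g), self.attr_head(g)

# Stage 1: minimize the sum of the three auxiliary losses on unlabeled faces
# (parsing pseudo-labels would come from a facial parsing model).
# Stage 2: fine-tune the backbone and attr_head on the small labeled FAR set.
out = SSPLSketch()(torch.randn(2, 3, 64, 64))
```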
{"title":"Learning Spatial-Semantic Relationship for Facial Attribute Recognition with Limited Labeled Data","authors":"Y. Shu, Yan Yan, Si Chen, Jing-Hao Xue, Chunhua Shen, Hanzi Wang","doi":"10.1109/CVPR46437.2021.01174","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01174","url":null,"abstract":"Recent advances in deep learning have demonstrated excellent results for Facial Attribute Recognition (FAR), typically trained with large-scale labeled data. However, in many real-world FAR applications, only limited labeled data are available, leading to remarkable deterioration in performance for most existing deep learning-based FAR methods. To address this problem, here we propose a method termed Spatial-Semantic Patch Learning (SSPL). The training of SSPL involves two stages. First, three auxiliary tasks, consisting of a Patch Rotation Task (PRT), a Patch Segmentation Task (PST), and a Patch Classification Task (PCT), are jointly developed to learn the spatial-semantic relationship from large-scale unlabeled facial data. We thus obtain a powerful pre-trained model. In particular, PRT exploits the spatial information of facial images in a self-supervised learning manner. PST and PCT respectively capture the pixel-level and image-level semantic information of facial images based on a facial parsing model. Second, the spatial-semantic knowledge learned from auxiliary tasks is transferred to the FAR task. By doing so, it enables that only a limited number of labeled data are required to fine-tune the pre-trained model. We achieve superior performance compared with state-of-the-art methods, as substantiated by extensive experiments and studies.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133568619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
4D Hyperspectral Photoacoustic Data Restoration with Reliability Analysis
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00457
Weihang Liao, Art Subpa-Asa, Yinqiang Zheng, Imari Sato
Hyperspectral photoacoustic (HSPA) spectroscopy is an emerging bi-modal imaging technology that can reveal the wavelength-dependent absorption distribution of the interior of a 3D volume. However, HSPA devices have to scan an object exhaustively in the spatial and spectral domains, and the acquired data tend to suffer from complex noise. This time-consuming scanning process and the noise severely affect the usability of HSPA. It is therefore critical to examine the feasibility of restoring 4D HSPA data from an incomplete and noisy observation. In this work, we present a data reliability analysis for the depth and spectral domains. On the basis of this analysis, we explore the inherent data correlations and develop a restoration algorithm to recover 4D HSPA cubes. Experiments on real data verify that the proposed method achieves satisfactory restoration results.
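As a toy illustration of restoring an incomplete, noisy 4D cube by exploiting correlations, the sketch below fits the missing entries with a data-fidelity term on the sampled positions plus a smoothness prior along the spectral axis. The optimizer, prior, and hyperparameters are assumptions; this is not the reliability-guided algorithm proposed in the paper.

```python
import torch

def restore(observed, mask, iters=300, lam=0.1, lr=0.05):
    """observed, mask: (X, Y, Z, L) tensors; mask is 1 where a sample was acquired."""
    x = observed.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        fidelity = ((x - observed)[mask.bool()] ** 2).mean()   # fit the observed samples
        spectral_tv = (x[..., 1:] - x[..., :-1]).abs().mean()  # spectral-smoothness prior
        (fidelity + lam * spectral_tv).backward()
        opt.step()
    return x.detach()

cube = torch.randn(8, 8, 8, 16)
restored = restore(cube, (torch.rand_like(cube) > 0.5).float())
```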
{"title":"4D Hyperspectral Photoacoustic Data Restoration with Reliability Analysis","authors":"Weihang Liao, Art Subpa-Asa, Yinqiang Zheng, Imari Sato","doi":"10.1109/CVPR46437.2021.00457","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00457","url":null,"abstract":"Hyperspectral photoacoustic (HSPA) spectroscopy is an emerging bi-modal imaging technology that is able to show the wavelength-dependent absorption distribution of the interior of a 3D volume. However, HSPA devices have to scan an object exhaustively in the spatial and spectral domains; and the acquired data tend to suffer from complex noise. This time-consuming scanning process and noise severely affects the usability of HSPA. It is therefore critical to examine the feasibility of 4D HSPA data restoration from an in-complete and noisy observation. In this work, we present a data reliability analysis for the depth and spectral domain. On the basis of this analysis, we explore the inherent data correlations and develop a restoration algorithm to recover 4D HSPA cubes. Experiments on real data verify that the proposed method achieves satisfactory restoration results.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133169355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00567
Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, A. Bewley, Xiao Zhang, C. Sminchisescu, Drago Anguelov
The detection of 3D objects from LiDAR data is a critical component of most autonomous driving systems. Safe, high-speed driving requires larger detection ranges, which are enabled by new LiDARs; these larger ranges in turn require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN) – a simple, efficient, and accurate 3D object detector – to tackle real-time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images result in significantly fewer selected foreground points, enabling the later sparse convolutions in RSN to operate efficiently. Combining features from the range image further enhances detection accuracy. RSN runs at more than 60 frames per second on a 150 m × 150 m detection region on the Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of November 2020, RSN is ranked first on the WOD leaderboard under the APH/LEVEL_1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.
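The foreground-selection stage can be sketched as a lightweight 2D CNN over the range image followed by gathering the 3D points whose foreground probability exceeds a threshold; only those points would then be voxelized and fed to the sparse-convolution detection head, which is omitted here. The network shape and threshold are assumptions for illustration, not the RSN implementation.

```python
import torch
import torch.nn as nn

class ForegroundSelector(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))                   # per-pixel foreground logit

    def forward(self, range_image, points_xyz, score_thresh=0.5):
        # range_image: (B, 1, H, W); points_xyz: (B, H, W, 3) per-pixel 3D points
        prob = torch.sigmoid(self.net(range_image)).squeeze(1)   # (B, H, W)
        keep = prob > score_thresh
        # Only the kept points would be voxelized and passed to the sparse convolutions.
        return [points_xyz[b][keep[b]] for b in range(range_image.size(0))]

fg_points = ForegroundSelector()(torch.rand(1, 1, 64, 256), torch.randn(1, 64, 256, 3))
```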
{"title":"RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection","authors":"Pei Sun, Weiyue Wang, Yuning Chai, Gamaleldin F. Elsayed, A. Bewley, Xiao Zhang, C. Sminchisescu, Drago Anguelov","doi":"10.1109/CVPR46437.2021.00567","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00567","url":null,"abstract":"The detection of 3D objects from LiDAR data is a critical component in most autonomous driving systems. Safe, high speed driving needs larger detection ranges, which are enabled by new LiDARs. These larger detection ranges require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN) – a simple, efficient, and accurate 3D object detector – in order to tackle real time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images results in significantly fewer selected foreground points, thus enabling the later sparse convolutions in RSN to efficiently operate. Combining features from the range image further enhance detection accuracy. RSN runs at more than 60 frames per second on a 150m × 150m detection region on Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of 11/2020, RSN is ranked first in the WOD leaderboard based on the APH/LEVEL_1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115404549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Progressive Unsupervised Learning for Visual Object Tracking
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00301
Wu, Jia Wan, Antoni B. Chan
In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. Specifically, we first learn a background discrimination (BD) model that effectively distinguishes an object from the background in a contrastive-learning manner. We then employ the BD model to progressively mine temporally corresponding patches (i.e., patches connected by a track) in sequential frames. As the BD model is imperfect, the mined patch pairs are noisy; we therefore propose a noise-robust loss function to learn temporal correspondences more effectively from this noisy data. We use the proposed noise-robust loss to train the backbone networks of Siamese trackers. Without online fine-tuning or adaptation, our unsupervised real-time Siamese trackers outperform state-of-the-art unsupervised deep trackers and achieve results competitive with supervised baselines.
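As an illustration of training on noisy mined pairs, the sketch below down-weights pairs with large per-pair loss, a generic robust-reweighting scheme. It is offered only to show the shape of a noise-robust objective; it is not the specific loss function proposed in this paper.

```python
import torch
import torch.nn.functional as F

def robust_pair_loss(sim, labels, temperature=0.1):
    """sim: (N,) similarities of mined patch pairs; labels: (N,) in {0, 1}."""
    per_pair = F.binary_cross_entropy_with_logits(
        sim / temperature, labels.float(), reduction="none")
    weights = torch.exp(-per_pair.detach())  # likely-noisy (high-loss) pairs get small weight
    return (weights * per_pair).sum() / weights.sum()

loss = robust_pair_loss(torch.randn(32), torch.randint(0, 2, (32,)))
```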
{"title":"Progressive Unsupervised Learning for Visual Object Tracking","authors":"Wu, Jia Wan, Antoni B. Chan","doi":"10.1109/CVPR46437.2021.00301","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00301","url":null,"abstract":"In this paper, we propose a progressive unsupervised learning (PUL) framework, which entirely removes the need for annotated training videos in visual tracking. Specifically, we first learn a background discrimination (BD) model that effectively distinguishes an object from back-ground in a contrastive learning way. We then employ the BD model to progressively mine temporal corresponding patches (i.e., patches connected by a track) in sequential frames. As the BD model is imperfect and thus the mined patch pairs are noisy, we propose a noise-robust loss function to more effectively learn temporal correspondences from this noisy data. We use the proposed noise robust loss to train backbone networks of Siamese trackers. Without online fine-tuning or adaptation, our unsupervised real-time Siamese trackers can outperform state-of-the-art unsupervised deep trackers and achieve competitive results to the supervised baselines.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115641263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient Forward-Propagation for Large-Scale Temporal Video Modelling
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00913
Mateusz Malinowski, Dimitrios Vytiniotis, G. Swirszcz, Viorica Patraucean, J. Carreira
How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.
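The skip-connection flavor of temporal integration can be illustrated with a toy unit that adds the previous step's (detached) features to the current ones; the gradient forward-propagation machinery of Sideways/Skip-Sideways itself is not reproduced here, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TemporalSkipUnit(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x_t, carry):
        h_t = torch.relu(self.layer(x_t)) + carry  # skip connection over the time axis
        return h_t, h_t.detach()                   # pass stale features to the next step

unit = TemporalSkipUnit()
carry = torch.zeros(1, 64)
for x_t in torch.randn(5, 1, 64):                  # a short stream of frame features
    h_t, carry = unit(x_t, carry)
```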
{"title":"Gradient Forward-Propagation for Large-Scale Temporal Video Modelling","authors":"Mateusz Malinowski, Dimitrios Vytiniotis, G. Swirszcz, Viorica Patraucean, J. Carreira","doi":"10.1109/CVPR46437.2021.00913","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00913","url":null,"abstract":"How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"105 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115761445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01357
Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, C. Duong, M. Tran, Khoa Luu
Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging range of real-world applications. Despite a large number of existing works, solving the data association problem in any MC-MOT pipeline remains arguably one of the most challenging tasks. Developing a robust MC-MOT system is still highly challenging due to many practical issues such as inconsistent lighting conditions, varying object movement patterns, and trajectory occlusions of objects between cameras. To address these problems, this work proposes a new Dynamic Graph Model with Link Prediction (DyGLIP) approach to solve the data association task. Compared to existing methods, our new model offers several advantages, including better feature representations and the ability to recover from lost tracks during camera transitions. Moreover, our model works gracefully regardless of the overlapping ratios between cameras. Experimental results show that we outperform existing MC-MOT algorithms by a large margin on several practical datasets. Notably, our model works favorably in online settings but can be extended to an incremental approach for large-scale datasets.
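As a simplified illustration of link prediction for cross-camera association, the sketch below scores every candidate link between track embeddings from two cameras with a small MLP and thresholds the scores. The scorer and the naive thresholding are assumptions for illustration; DyGLIP's dynamic-graph attention and feature updates are not reproduced.

```python
import torch
import torch.nn as nn

class LinkScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, feats_a, feats_b):
        # feats_a: (Na, dim) track embeddings from camera A; feats_b: (Nb, dim) from camera B
        na, nb = feats_a.size(0), feats_b.size(0)
        pairs = torch.cat([feats_a.unsqueeze(1).expand(na, nb, -1),
                           feats_b.unsqueeze(0).expand(na, nb, -1)], dim=-1)
        return self.mlp(pairs).squeeze(-1)          # (Na, Nb) link logits

scores = LinkScorer()(torch.randn(4, 256), torch.randn(5, 256))
matches = scores.sigmoid() > 0.5                    # naive association by thresholding
```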
{"title":"DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking","authors":"Kha Gia Quach, Pha Nguyen, Huu Le, Thanh-Dat Truong, C. Duong, M. Tran, Khoa Luu","doi":"10.1109/CVPR46437.2021.01357","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01357","url":null,"abstract":"Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging applicability in several real-world applications. Despite a large number of existing works, solving the data association problem in any MC-MOT pipeline is arguably one of the most challenging tasks. Developing a robust MC-MOT system, however, is still highly challenging due to many practical issues such as inconsistent lighting conditions, varying object movement patterns, or the trajectory occlusions of the objects between the cameras. To address these problems, this work, therefore, proposes a new Dynamic Graph Model with Link Prediction (DyGLIP) approach 1 to solve the data association task. Compared to existing methods, our new model offers several advantages, including better feature representations and the ability to recover from lost tracks during camera transitions. Moreover, our model works gracefully regardless of the overlapping ratios between the cameras. Experimental results show that we out-perform existing MC-MOT algorithms by a large margin on several practical datasets. Notably, our model works favor-ably on online settings but can be extended to an incremental approach for large-scale datasets.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116354525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based High-order Relation Modeling for Long-term Action Recognition
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.00887
Jiaming Zhou, Kun-Yu Lin, Haoxin Li, Weishi Zheng
Long-term actions involve many important visual concepts, e.g., objects, motions, and sub-actions, and there are various relations among these concepts, which we call basic relations. These basic relations jointly influence each other during the temporal evolution of long-term actions, forming the high-order relations that are essential for long-term action recognition. In this paper, we propose a Graph-based High-order Relation Modeling (GHRM) module to exploit these high-order relations for long-term action recognition. In GHRM, each basic relation is modeled by a graph in which each node represents a segment of a long video. Moreover, when modeling each basic relation, GHRM incorporates information from all the other basic relations, so the high-order relations in long-term actions can be fully exploited. To better exploit the high-order relations along the time dimension, we design a GHRM layer consisting of a Temporal-GHRM branch and a Semantic-GHRM branch, which model local temporal high-order relations and global semantic high-order relations, respectively. Experimental results on three long-term action recognition datasets, namely Breakfast, Charades, and MultiThumos, demonstrate the effectiveness of our model.
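A single round of graph reasoning over video segments can be sketched as message passing over a similarity-based adjacency, as below. The similarity adjacency, dimensions, and residual update are assumptions for illustration and do not reproduce the Temporal-/Semantic-GHRM branches.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentGraphLayer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, segs):                         # segs: (B, T, dim), T segments per video
        # Adjacency from pairwise segment similarity, then one message-passing step.
        adj = F.softmax(segs @ segs.transpose(1, 2) / segs.size(-1) ** 0.5, dim=-1)
        return segs + F.relu(self.proj(adj @ segs))  # residual graph update

out = SegmentGraphLayer()(torch.randn(2, 8, 512))
```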
{"title":"Graph-based High-order Relation Modeling for Long-term Action Recognition","authors":"Jiaming Zhou, Kun-Yu Lin, Haoxin Li, Weishi Zheng","doi":"10.1109/CVPR46437.2021.00887","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.00887","url":null,"abstract":"Long-term actions involve many important visual concepts, e.g., objects, motions, and sub-actions, and there are various relations among these concepts, which we call basic relations. These basic relations will jointly affect each other during the temporal evolution of long-term actions, which forms the high-order relations that are essential for long-term action recognition. In this paper, we propose a Graph-based High-order Relation Modeling (GHRM) module to exploit the high-order relations in the long-term actions for long-term action recognition. In GHRM, each basic relation in the long-term actions will be modeled by a graph, where each node represents a segment in a long video. Moreover, when modeling each basic relation, the information from all the other basic relations will be incorporated by GHRM, and thus the high-order relations in the long-term actions can be well exploited. To better exploit the high-order relations along the time dimension, we design a GHRM-layer consisting of a Temporal-GHRM branch and a Semantic-GHRM branch, which aims to model the local temporal high-order relations and global semantic high-order relations. The experimental results on three long-term action recognition datasets, namely, Breakfast, Charades, and MultiThumos, demonstrate the effectiveness of our model.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117190786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Relevance-CAM: Your Model Already Knows Where to Look
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01470
J. Lee, Sewon Kim, I. Park, Taejoon Eo, D. Hwang
As neural networks are applied to an ever-widening range of fields, the ability to explain deep learning models is becoming increasingly important. In particular, before practical deployment, it is crucial to analyze a model's inference and the process by which it generates its results. A common family of explanation methods is based on Class Activation Mapping (CAM), which is often used to understand the last layer of the convolutional neural networks popular in computer vision. In this paper, we propose a novel CAM method named Relevance-weighted Class Activation Mapping (Relevance-CAM) that utilizes Layer-wise Relevance Propagation to obtain the weighting components. This makes the explanation map faithful and robust to the shattered-gradient problem, a shared weakness of gradient-based CAM methods that causes noisy saliency maps for intermediate layers. Therefore, our proposed method can better explain a model by correctly analyzing the intermediate layers as well as the last convolutional layer. In this paper, we visualize how each layer of popular image processing models extracts class-specific features using Relevance-CAM, evaluate its localization ability, and show why gradient-based CAM cannot be used to explain intermediate layers, as demonstrated by experiments on the weighting component. Relevance-CAM outperforms other CAM-based methods in recognition and localization evaluation at layers of any depth. The source code is available at: https://github.com/mongeoroo/Relevance-CAM
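The final combination step shared by CAM-style methods can be sketched as a relevance-weighted sum of channel activation maps; in Relevance-CAM the per-channel weights would come from Layer-wise Relevance Propagation, whose computation is omitted here and simply assumed as an input.

```python
import torch
import torch.nn.functional as F

def weighted_cam(activations, channel_weights, out_size):
    """activations: (C, H, W) feature maps; channel_weights: (C,) per-channel relevance."""
    cam = (channel_weights.view(-1, 1, 1) * activations).sum(0)
    cam = torch.relu(cam)
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)[0, 0]
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]

saliency = weighted_cam(torch.randn(64, 14, 14), torch.randn(64), (224, 224))
```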
{"title":"Relevance-CAM: Your Model Already Knows Where to Look","authors":"J. Lee, Sewon Kim, I. Park, Taejoon Eo, D. Hwang","doi":"10.1109/CVPR46437.2021.01470","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01470","url":null,"abstract":"With increasing fields of application for neural networks and the development of neural networks, the ability to explain deep learning models is also becoming increasingly important. Especially, prior to practical applications, it is crucial to analyze a model’s inference and the process of generating the results. A common explanation method is Class Activation Mapping(CAM) based method where it is often used to understand the last layer of the convolutional neural networks popular in the field of Computer Vision. In this paper, we propose a novel CAM method named Relevance-weighted Class Activation Mapping(Relevance-CAM) that utilizes Layer-wise Relevance Propagation to obtain the weighting components. This allows the explanation map to be faithful and robust to the shattered gradient problem, a shared problem of the gradient based CAM methods that causes noisy saliency maps for intermediate layers. Therefore, our proposed method can better explain a model by correctly analyzing the intermediate layers as well as the last convolutional layer. In this paper, we visualize how each layer of the popular image processing models extracts class specific features using Relevance-CAM, evaluate the localization ability, and show why the gradient based CAM cannot be used to explain the intermediate layers, proven by experimenting the weighting component. Relevance-CAM outperforms other CAM-based methods in recognition and localization evaluation in layers of any depth. The source code is available at: https://github.com/mongeoroo/Relevance-CAM","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124420105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Graph-based High-Order Relation Discovery for Fine-grained Recognition
Pub Date: 2021-06-01  DOI: 10.1109/CVPR46437.2021.01483
Yifan Zhao, Ke Yan, Feiyue Huang, Jia Li
Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects. Most existing works tend to amplify discriminative part regions with attention mechanisms. Besides their unstable performance under complex backgrounds, such methods leave the intrinsic interrelationships between different semantic features largely unexplored. To this end, we propose an effective graph-based relation discovery approach to build a contextual understanding of high-order relationships. In our approach, a high-dimensional feature bank is first formed and jointly regularized with semantic- and positional-aware high-order constraints, endowing the feature representations with rich attributes. Second, to overcome the curse of dimensionality, we propose a graph-based semantic grouping strategy that embeds this high-order tensor bank into a low-dimensional space. Meanwhile, a group-wise learning strategy is proposed to regularize the features around the cluster embedding centers. With the collaborative learning of these three modules, our approach is able to grasp stronger contextual details of fine-grained objects. Experimental evidence demonstrates that our approach achieves new state-of-the-art results on four widely used fine-grained object recognition benchmarks.
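As a rough sketch of grouping a high-dimensional feature bank into a small number of semantic group embeddings, the example below uses a learned soft assignment followed by a low-dimensional projection. The dimensions and assignment scheme are assumptions for illustration, not the paper's grouping module or its high-order constraints.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticGrouping(nn.Module):
    def __init__(self, in_dim=2048, n_groups=8, out_dim=256):
        super().__init__()
        self.assign = nn.Linear(in_dim, n_groups)   # soft group assignments
        self.embed = nn.Linear(in_dim, out_dim)     # low-dimensional embedding

    def forward(self, bank):                         # bank: (B, N, in_dim) feature bank
        a = F.softmax(self.assign(bank), dim=1)      # normalize assignments over the N features
        return a.transpose(1, 2) @ self.embed(bank)  # (B, n_groups, out_dim) group embeddings

groups = SemanticGrouping()(torch.randn(2, 49, 2048))
```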
{"title":"Graph-based High-Order Relation Discovery for Fine-grained Recognition","authors":"Yifan Zhao, Ke Yan, Feiyue Huang, Jia Li","doi":"10.1109/CVPR46437.2021.01483","DOIUrl":"https://doi.org/10.1109/CVPR46437.2021.01483","url":null,"abstract":"Fine-grained object recognition aims to learn effective features that can identify the subtle differences between visually similar objects. Most of the existing works tend to amplify discriminative part regions with attention mechanisms. Besides its unstable performance under complex backgrounds, the intrinsic interrelationship between different semantic features is less explored. Toward this end, we propose an effective graph-based relation discovery approach to build a contextual understanding of high-order relationships. In our approach, a high-dimensional feature bank is first formed and jointly regularized with semantic- and positional-aware high-order constraints, endowing rich attributes to feature representations. Second, to overcome the high-dimension curse, we propose a graph-based semantic grouping strategy to embed this high-order tensor bank into a low-dimensional space. Meanwhile, a group-wise learning strategy is proposed to regularize the features focusing on the cluster embedding center. With the collaborative learning of three modules, our module is able to grasp the stronger contextual details of fine-grained objects. Experimental evidence demonstrates our approach achieves new state-of-the-art on 4 widely-used fine-grained object recognition benchmarks.","PeriodicalId":339646,"journal":{"name":"2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125795889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}