Monocular Depth Estimation with Adaptive Geometric Attention
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00069
Taher Naderi, Amir Sadovnik, J. Hayward, Hairong Qi
Single image depth estimation is an ill-posed problem: it is not mathematically possible to uniquely recover the third dimension (depth) from a single 2D image. Hence, additional constraints need to be incorporated to regularize the solution space. In this paper, we explore the idea of constraining the model by taking advantage of the similarity between the RGB image and the corresponding depth map at the geometric edges of the 3D scene for more accurate depth estimation. We propose a general lightweight adaptive geometric attention module that uses the cross-correlation between the encoder and the decoder as a measure of this similarity. More precisely, we use the cosine similarity between the local embedded features in the encoder and the decoder at each spatial point. The proposed module, together with the encoder-decoder network, is trained in an end-to-end fashion and achieves superior or competitive performance compared with other state-of-the-art methods. In addition, adding our module to the base encoder-decoder model increases the parameter count by only 0.03% (a fraction of 0.0003). Therefore, the module can be added to any base encoder-decoder network, without changing its structure, to address the task at hand.
{"title":"Monocular Depth Estimation with Adaptive Geometric Attention","authors":"Taher Naderi, Amir Sadovnik, J. Hayward, Hairong Qi","doi":"10.1109/WACV51458.2022.00069","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00069","url":null,"abstract":"ingle image depth estimation is an ill-posed problem. That is, it is not mathematically possible to uniquely estimate the 3rd dimension (or depth) from a single 2D image. Hence, additional constraints need to be incorporated in order to regulate the solution space. In this paper, we explore the idea of constraining the model by taking advantage of the similarity between the RGB image and the corresponding depth map at the geometric edges of the 3D scene for more accurate depth estimation. We propose a general light-weight adaptive geometric attention module that uses the cross-correlation between the encoder and the decoder as a measure of this similarity. More precisely, we use the cosine similarity between the local embedded features in the encoder and the decoder at each spatial point. The proposed module along with the encoder-decoder network is trained in an end-to-end fashion and achieves superior and competitive performance in comparison with other state-of-the-art methods. In addition, adding our module to the base encoder-decoder model adds only an additional 0.03% (or 0.0003) parameters. Therefore, this module can be added to any base encoder-decoder network without changing its structure to address any task at hand.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131051022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-Field Phenotyping Based on Crop Leaf and Plant Instance Segmentation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00302
J. Weyler, Federico Magistri, Peter Seitz, J. Behley, C. Stachniss
A detailed analysis of a plant’s phenotype in real field conditions is critical for plant scientists and breeders to understand plant function. In contrast to traditional phenotyping performed manually, vision-based systems have the potential for an objective and automated assessment with high spatial and temporal resolution. One objective of such systems is to detect and segment the individual leaves of each plant, since this information correlates with the growth stage and provides phenotypic traits such as leaf count, coverage, and size. In this paper, we propose a vision-based approach that performs instance segmentation of individual crop leaves and associates each with its corresponding crop plant in real fields. This enables us to compute relevant basic phenotypic traits on a per-plant level. We employ a convolutional neural network and operate directly on drone imagery. The network generates two different representations of the input image that we utilize to cluster individual crop leaf and plant instances. We propose a novel method to compute clustering regions based on our network's predictions that achieves high accuracy. Furthermore, we compare to other state-of-the-art approaches and show that our system achieves superior performance. The source code of our approach is available online.
{"title":"In-Field Phenotyping Based on Crop Leaf and Plant Instance Segmentation","authors":"J. Weyler, Federico Magistri, Peter Seitz, J. Behley, C. Stachniss","doi":"10.1109/WACV51458.2022.00302","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00302","url":null,"abstract":"A detailed analysis of a plant’s phenotype in real field conditions is critical for plant scientists and breeders to understand plant function. In contrast to traditional phenotyping performed manually, vision-based systems have the potential for an objective and automated assessment with high spatial and temporal resolution. One of such systems’ objectives is to detect and segment individual leaves of each plant since this information correlates to the growth stage and provides phenotypic traits, such as leaf count, cover-age, and size. In this paper, we propose a vision-based approach that performs instance segmentation of individual crop leaves and associates each with its corresponding crop plant in real fields. This enables us to compute relevant basic phenotypic traits on a per-plant level. We employ a convolutional neural network and operate directly on drone imagery. The network generates two different representations of the input image that we utilize to cluster individual crop leaf and plant instances. We propose a novel method to compute clustering regions based on our network’s predictions that achieves high accuracy. Furthermore, we com-pare to other state-of-the-art approaches and show that our system achieves superior performance. The source code of our approach is available 1.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126777598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Natural Language Video Moment Localization Through Query-Controlled Temporal Convolution
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00258
Lingyu Zhang, R. Radke
The goal of natural language video moment localization is to locate a short segment of a long, untrimmed video that corresponds to a description presented as natural text. The description may contain several pieces of key information, including subjects/objects, sequential actions, and locations. Here, we propose a novel video moment localization framework based on the convolutional response between multimodal signals, i.e., the video sequence, the text query, and subtitles for the video if they are available. We emphasize the role of the language sequence as a query about the video content by converting the query sentence into a boundary detector with a given filter kernel size and stride. We convolve the video sequence with the query detector to locate the start and end boundaries of the target video segment. When subtitles are available, we blend the boundary heatmaps from the visual and subtitle branches using an LSTM to capture asynchronous dependencies across the two modalities in the video. We perform extensive experiments on the TVR, Charades-STA, and TACoS benchmark datasets, demonstrating that our model achieves state-of-the-art results on all three.
{"title":"Natural Language Video Moment Localization Through Query-Controlled Temporal Convolution","authors":"Lingyu Zhang, R. Radke","doi":"10.1109/WACV51458.2022.00258","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00258","url":null,"abstract":"The goal of natural language video moment localization is to locate a short segment of a long, untrimmed video that corresponds to a description presented as natural text. The description may contain several pieces of key information, including subjects/objects, sequential actions, and locations. Here, we propose a novel video moment localization framework based on the convolutional response between multimodal signals, i.e., the video sequence, the text query, and subtitles for the video if they are available. We emphasize the effect of the language sequence as a query about the video content, by converting the query sentence into a boundary detector with a filter kernel size and stride. We convolve the video sequence with the query detector to locate the start and end boundaries of the target video segment. When subtitles are available, we blend the boundary heatmaps from the visual and subtitle branches together using an LSTM to capture asynchronous dependencies across two modalities in the video. We perform extensive experiments on the TVR, Charades-STA, and TACoS benchmark datasets, demonstrating that our model achieves state-of-the-art results on all three.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"65 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123187798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Uncertainty Learning towards Unsupervised Deformable Medical Image Registration
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00162
Xuan Gong, Luckyson Khaidem, Wentao Zhu, Baochang Zhang, D. Doermann
Uncertainty estimation in medical image registration enables surgeons to evaluate the operative risk based on the trustworthiness of the registered image data and is thus of paramount importance for practical clinical applications. Despite the recent promising results obtained with deep unsupervised learning-based registration methods, reasoning about the uncertainty of unsupervised registration models remains largely unexplored. In this work, we propose a predictive module to learn the registration and the uncertainty in correspondence simultaneously. Our framework introduces empirical randomness and registration-error-based uncertainty prediction. We systematically assess performance on two MRI datasets with different ensemble paradigms. Experimental results highlight that our proposed framework significantly improves registration accuracy and uncertainty estimation compared with the baseline.
{"title":"Uncertainty Learning towards Unsupervised Deformable Medical Image Registration","authors":"Xuan Gong, Luckyson Khaidem, Wentao Zhu, Baochang Zhang, D. Doermann","doi":"10.1109/WACV51458.2022.00162","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00162","url":null,"abstract":"Uncertainty estimation in medical image registration enables surgeons to evaluate the operative risk based on the trustworthiness of the registered image data thus of paramount importance for practical clinical applications. Despite the recent promising results obtained with deep unsupervised learning-based registration methods, reasoning about uncertainty of unsupervised registration models remains largely unexplored. In this work, we propose a predictive module to learn the registration and uncertainty in correspondence simultaneously. Our framework introduces empirical randomness and registration error based uncertainty prediction. We systematically assess the performances on two MRI datasets with different ensemble paradigms. Experimental results highlight that our proposed framework significantly improves the registration accuracy and uncertainty compared with the baseline.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116939386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global Assists Local: Effective Aerial Representations for Field of View Constrained Image Geo-Localization
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00275
Royston Rodrigues, Masahiro Tani
When we humans recognize places from images, we not only reason about the objects that are visible but also think about the landmarks that might surround them. Current place recognition approaches lack the ability to go beyond the objects present in the image and hence fall short of understanding the scene completely. In this paper, we take a step towards holistic scene understanding. We address the problem of image geo-localization by retrieving corresponding aerial views from a large database of geotagged aerial imagery. One of the main challenges in tackling this problem is the limited Field of View (FoV) of query images, which need to be matched to aerial views that contain 360° FoV details. The state-of-the-art method DSM-Net [19] tackles this challenge by matching aerial images locally within fixed FoV sectors. We show that local matching limits complete scene understanding and is inadequate when partial buildings are visible in query images or when local sectors of aerial images are covered by dense trees. Our approach considers both local and global properties of aerial images and hence is robust to such conditions. Experiments on standard benchmarks demonstrate that the proposed approach improves the top-1% image recall rate on the CVACT [9] dataset from 57.08% to 77.19% and from 61.20% to 75.21% on the CVUSA [28] dataset for a 70° FoV. We also achieve state-of-the-art results for a 90° FoV on both the CVACT [9] and CVUSA [28] datasets, demonstrating the effectiveness of our proposed method.
{"title":"Global Assists Local: Effective Aerial Representations for Field of View Constrained Image Geo-Localization","authors":"Royston Rodrigues, Masahiro Tani","doi":"10.1109/WACV51458.2022.00275","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00275","url":null,"abstract":"When we humans recognize places from images, we not only infer about the objects that are available but even think about landmarks that might be surrounding it. Current place recognition approaches lack the ability to go beyond objects that are available in the image and hence miss out on understanding the scene completely. In this paper, we take a step towards holistic scene understanding. We address the problem of image geo-localization by retrieving corresponding aerial views from a large database of geotagged aerial imagery. One of the main challenges in tackling this problem is the limited Field of View (FoV) nature of query images which needs to be matched to aerial views which contain 360°FoV details. State-of-the-art method DSM-Net [19] tackles this challenge by matching aerial images locally within fixed FoV sectors. We show that local matching limits complete scene understanding and is inadequate when partial buildings are visible in query images or when local sectors of aerial images are covered by dense trees. Our approach considers both local and global properties of aerial images and hence is robust to such conditions. Experiments on standard benchmarks demonstrates that the proposed approach improves top-1% image recall rate on the CVACT [9] data-set from 57.08% to 77.19% and from 61.20% to 75.21% on the CVUSA [28] data-set for 70°FoV. We also achieve state-of-the art results for 90°FoV on both CVACT [9] and CVUSA [28] data-sets demonstrating the effectiveness of our proposed method.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114900254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixed-dual-head Meets Box Priors: A Robust Framework for Semi-supervised Segmentation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00265
Chenshu Chen, Tangyou Liu, Wenming Tan, Shiliang Pu
As it is costly to densely annotate large-scale datasets for supervised semantic segmentation, extensive semi-supervised methods have been proposed. However, the accuracy, stability, and flexibility of existing methods are still far from satisfactory. In this paper, we propose an effective and flexible framework for semi-supervised semantic segmentation that uses a small set of fully labeled images and a set of weakly labeled images with bounding box labels. In our framework, position and class priors are designed to guide the annotation network to predict accurate pseudo masks for weakly labeled images, which are then used to train the segmentation network. We also propose a mixed-dual-head training method to reduce the interference of label noise while making the training process more stable. Experiments on PASCAL VOC 2012 show that our method achieves state-of-the-art performance and can achieve competitive results even with very few fully labeled images. Furthermore, the performance can be further boosted with extra weakly labeled images from the COCO dataset.
{"title":"Mixed-dual-head Meets Box Priors: A Robust Framework for Semi-supervised Segmentation","authors":"Chenshu Chen, Tangyou Liu, Wenming Tan, Shiliang Pu","doi":"10.1109/WACV51458.2022.00265","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00265","url":null,"abstract":"As it is costly to densely annotate large scale datasets for supervised semantic segmentation, extensive semi-supervised methods have been proposed. However, the accuracy, stability and flexibility of existing methods are still far from satisfactory. In this paper, we propose an effective and flexible framework for semi-supervised semantic segmentation using a small set of fully labeled images and a set of weakly labeled images with bounding box labels. In our framework, position and class priors are designed to guide the annotation network to predict accurate pseudo masks for weakly labeled images, which are used to train the segmentation network. We also propose a mixed-dual-head training method to reduce the interference of label noise while enabling the training process more stable. Experiments on PASCAL VOC 2012 show that our method achieves state-of-the-art performance and can achieve competitive results even with very few fully labeled images. Furthermore, the performance can be further boosted with extra weakly labeled images from COCO dataset.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121848480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shape-coded ArUco: Fiducial Marker for Bridging 2D and 3D Modalities
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00237
Lilika Makabe, Hiroaki Santo, Fumio Okura, Y. Matsushita
We introduce a fiducial marker for the registration of two-dimensional (2D) images and untextured three-dimensional (3D) shapes that are recorded by commodity laser scanners. Specifically, we design a 3D version of the ArUco marker that retains exactly the same appearance as its 2D counterpart from any viewpoint above the marker but additionally contains shape information. The shape-coded ArUco can naturally work with off-the-shelf ArUco marker detectors in the 2D image domain. For the 3D domain, we develop a method for detecting the marker in an untextured 3D point cloud. Experiments demonstrate accurate 2D-3D registration using our shape-coded ArUco markers in comparison to baseline methods.
{"title":"Shape-coded ArUco: Fiducial Marker for Bridging 2D and 3D Modalities","authors":"Lilika Makabe, Hiroaki Santo, Fumio Okura, Y. Matsushita","doi":"10.1109/WACV51458.2022.00237","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00237","url":null,"abstract":"We introduce a fiducial marker for the registration of two-dimensional (2D) images and untextured three-dimensional (3D) shapes that are recorded by commodity laser scanners. Specifically, we design a 3D-version of the ArUco marker that retains exactly the same appearance as its 2D counterpart from any viewpoint above the marker but contains shape information. The shape-coded ArUco can naturally work with off-the-shelf ArUco marker detectors in the 2D image domain. For the 3D domain, we develop a method for detecting the marker in an untextured 3D point cloud. Experiments demonstrate accurate 2D-3D registration using our shape-coded ArUco markers in comparison to baseline methods.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116605814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VCSeg: Virtual Camera Adaptation for Road Segmentation
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00203
Gong Cheng, J. Elder
Domain shift limits generalization in many problem domains. For road segmentation, one of the principal causes of domain shift is variation in the geometric camera parameters, which results in misregistration of scene structure between images. To address this issue, we decompose the shift into two components: between-camera shift and within-camera shift. To handle between-camera shift, we assume that average camera parameters are known or can be estimated and use this knowledge to rectify both source and target domain images to a standard virtual camera model. To handle within-camera shift, we use estimates of road vanishing points to correct for shifts in camera pan and tilt. While this approach improves alignment, it produces gaps in the virtual image that complicate network training. To solve this problem, we introduce a novel projective image completion method that fills these gaps in a plausible way. Using five diverse and challenging road segmentation datasets, we demonstrate that our virtual camera method dramatically improves road segmentation performance when generalizing across cameras, and we propose that it be integrated as a standard component of road segmentation systems to improve generalization.
{"title":"VCSeg: Virtual Camera Adaptation for Road Segmentation","authors":"Gong Cheng, J. Elder","doi":"10.1109/WACV51458.2022.00203","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00203","url":null,"abstract":"Domain shift limits generalization in many problem domains. For road segmentation, one of the principal causes of domain shift is variation in the geometric camera parameters, which results in misregistration of scene structure between images. To address this issue, we decompose the shift into two components: Between-camera shift and within-camera shift. To handle between-camera shift, we assume that average camera parameters are known or can be estimated and use this knowledge to rectify both source and target domain images to a standard virtual camera model. To handle within-camera shift, we use estimates of road vanishing points to correct for shifts in camera pan and tilt. While this approach improves alignment, it produces gaps in the virtual image that complicates network training. To solve this problem, we introduce a novel projective image completion method that fills these gaps in a plausible way. Using five diverse and challenging road segmentation datasets, we demonstrate that our virtual camera method dramatically improves road segmentation performance when generalizing across cameras, and propose that this be integrated as a standard component of road segmentation systems to improve generalization.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115069646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-branch Neural Networks for Video Anomaly Detection in Adverse Lighting and Weather Conditions
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00308
Sam Leroux, Bo Li, P. Simoens
Automated anomaly detection in surveillance videos has attracted much interest as it provides a scalable alternative to manual monitoring. Most existing approaches achieve good performance on clean benchmark datasets recorded in well-controlled environments. However, detecting anomalies is much more challenging in the real world. Adverse weather conditions like rain or changing brightness levels cause a significant shift in the input data distribution, which in turn can lead to the detector model incorrectly reporting high anomaly scores. Additionally, surveillance cameras are usually deployed in evolving environments such as a city street whose appearance changes over time because of seasons or roadworks. The anomaly detection model will need to be updated periodically to deal with these issues. In this paper, we introduce a multi-branch model that is equipped with a trainable preprocessing step and multiple identical branches for detecting anomalies during day and night as well as in sunny and rainy conditions. We experimentally validate our approach on a distorted version of the Avenue dataset and provide qualitative results on real-world surveillance camera data. Experimental results show that our method outperforms existing methods in terms of detection accuracy while being faster and more robust on scenes with varying visibility.
{"title":"Multi-branch Neural Networks for Video Anomaly Detection in Adverse Lighting and Weather Conditions","authors":"Sam Leroux, Bo Li, P. Simoens","doi":"10.1109/WACV51458.2022.00308","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00308","url":null,"abstract":"Automated anomaly detection in surveillance videos has attracted much interest as it provides a scalable alternative to manual monitoring. Most existing approaches achieve good performance on clean benchmark datasets recorded in well-controlled environments. However, detecting anomalies is much more challenging in the real world. Adverse weather conditions like rain or changing brightness levels cause a significant shift in the input data distribution, which in turn can lead to the detector model incorrectly reporting high anomaly scores. Additionally, surveillance cameras are usually deployed in evolving environments such as a city street of which the appearance changes over time because of seasonal changes or roadworks. The anomaly detection model will need to be updated periodically to deal with these issues. In this paper, we introduce a multi-branch model that is equipped with a trainable preprocessing step and multiple identical branches for detecting anomalies during day and night as well as in sunny and rainy conditions. We experimentally validate our approach on a distorted version of the Avenue dataset and provide qualitative results on real-world surveillance camera data. Experimental results show that our method outperforms the existing methods in terms of detection accuracy while being faster and more robust on scenes with varying visibility.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123512657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning from the CNN-based Compressed Domain
Pub Date: 2022-01-01 | DOI: 10.1109/WACV51458.2022.00405
Zhenzhen Wang, Minghai Qin, Yen-Kuang Chen
Images are transmitted or stored in their compressed form, and most AI tasks are performed on the reconstructed domain. Convolutional neural network (CNN)-based image compression and reconstruction is growing rapidly, and it matches or surpasses state-of-the-art heuristic image compression methods such as JPEG or BPG. A major limitation of CNN-based image compression is the computational complexity of compression and reconstruction. Therefore, learning from the compressed domain is desirable to avoid the computation and latency caused by reconstruction. In this paper, we show that learning from the compressed domain can achieve comparable or even better accuracy than learning from the reconstructed domain. At a high compression rate of 0.098 bpp, for example, the proposed compression-learning system achieves an absolute accuracy boost of over 3% over the traditional compression-reconstruction-learning flow. The improvement is achieved by optimizing the compression-learning system for original-sized instead of standardized (e.g., 224x224) images, which is crucial in practice since real-world images fed into the system have different sizes. We also propose an efficient model-free entropy estimation method and a criterion to learn from a selected subset of features in the compressed domain to further reduce the transmission and computation cost without accuracy degradation.
{"title":"Learning from the CNN-based Compressed Domain","authors":"Zhenzhen Wang, Minghai Qin, Yen-Kuang Chen","doi":"10.1109/WACV51458.2022.00405","DOIUrl":"https://doi.org/10.1109/WACV51458.2022.00405","url":null,"abstract":"Images are transmitted or stored in their compressed form and most of the AI tasks are performed from the re-constructed domain. Convolutional neural network (CNN)-based image compression and reconstruction is growing rapidly and it achieves or surpasses the state-of-the-art heuristic image compression methods, such as JPEG or BPG. A major limitation of the application of the CNN-based image compression is on the computation complexity during compression and reconstruction. Therefore, learning from the compressed domain is desirable to avoid the computation and latency caused by reconstruction. In this paper, we show that learning from the compressed domain can achieve comparative or even better accuracy than from the reconstructed domain. At a high compression rate of 0.098 bpp, for example, the proposed compression-learning system has over 3% absolute accuracy boost over the traditional compression-reconstruction-learning flow. The improvement is achieved by optimizing the compression-learning system targeting original-sized instead of standardized (e.g., 224x224) images, which is crucial in practice since real-world images into the system have different sizes. We also propose an efficient model-free entropy estimation method and a criterion to learn from a selected subset of features in the compressed domain to further re-duce the transmission and computation cost without accuracy degradation.","PeriodicalId":297092,"journal":{"name":"2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122868933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}