Oriented Splits Network to Distill Background for Vehicle Re-Identification
Pub Date: 2021-11-16 | DOI: 10.1109/AVSS52988.2021.9663832
A. Munir, N. Martinel, C. Micheloni
Vehicle re-identification (re-id) is a challenging task due to the high intra-class and low inter-class variations in the visual data acquired from monitoring camera networks. Unique and discriminative feature representations are needed to overcome several sources of variation, including color, illumination, orientation, background and occlusion. The varying orientations of the vehicles in the images prevent learned models from capturing the multiple parts of a vehicle and the relationships between them. Combining global and partial features is one way to improve the discriminative learning of deep models. Leveraging such solutions, we propose an Oriented Splits Network (OSN) for end-to-end learning of multiple part features along with global features to form a strong descriptor for vehicle re-identification. To capture the orientation variability of the vehicles, the proposed network partitions the images into several oriented stripes to obtain a local descriptor for each part/region. This scheme is then exploited by a camera-based feature distillation (CBD) training strategy to remove background features: these are filtered out from the oriented vehicle representations, yielding a much stronger and more distinctive representation of the vehicles. We perform experiments on two benchmark vehicle re-id datasets to verify the performance of the proposed approach, which show that the proposed solution outperforms the state of the art by a margin.
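The part-plus-global descriptor idea can be illustrated with a minimal PyTorch sketch. The module names, the number of stripes and the split direction are illustrative assumptions, not the authors' implementation; the oriented partitioning and the CBD loss are not reproduced here.

```python
import torch
import torch.nn as nn

class OrientedSplitsSketch(nn.Module):
    """Toy sketch: global + per-stripe descriptors from a CNN feature map.

    Hypothetical module, not the authors' code: `num_stripes` and the
    width-wise split are simplifying assumptions.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, num_stripes: int = 4):
        super().__init__()
        self.backbone = backbone          # e.g. a ResNet trunk returning (B, C, H, W)
        self.num_stripes = num_stripes
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.ModuleList(
            [nn.Linear(feat_dim, 256) for _ in range(num_stripes + 1)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fmap = self.backbone(x)                       # (B, C, H, W)
        descs = [self.global_pool(fmap).flatten(1)]   # global descriptor
        stripes = torch.chunk(fmap, self.num_stripes, dim=3)  # split along width
        descs += [s.mean(dim=(2, 3)) for s in stripes]        # one descriptor per stripe
        # project each part and concatenate into the final re-id descriptor
        return torch.cat([e(d) for e, d in zip(self.embed, descs)], dim=1)
```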
{"title":"Oriented Splits Network to Distill Background for Vehicle Re-Identification","authors":"A. Munir, N. Martinel, C. Micheloni","doi":"10.1109/AVSS52988.2021.9663832","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663832","url":null,"abstract":"Vehicle re-identification (re-id) is a challenging task due to the presence of high intra-class and low inter-class variations in the visual data acquired from monitoring camera networks. Unique and discriminative feature representations are needed to overcome the existence of several variations including color, illumination, orientation, background and occlusion. The orientations of the vehicles in the images make the learned models unable to learn multiple parts of the vehicle and relationship between them. The combination of global and partial features is one of the solutions to improve the discriminative learning of deep learning models. Leveraging on such solutions, we propose an Oriented Splits Network (OSN) for an end to end learning of multiple features along with global features to form a strong descriptor for vehicle re-identification. To capture the orientation variability of the vehicles, the proposed network introduces a partition of the images into several oriented stripes to obtain local descriptors for each part/region. Such a scheme is therefore exploited by a camera based feature distillation (CBD) training strategy to remove the background features. These are filtered out from oriented vehicles representations which yield to a much stronger unique representation of the vehicles. We perform experiments on two benchmark vehicle re-id datasets to verify the performance of the proposed approach which show that the proposed solution achieves better result with respect to the state of the art with margin.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125907407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation
Pub Date: 2021-10-15 | DOI: 10.1109/AVSS52988.2021.9663793
Dhruv Agarwal, Tanay Agrawal, Laura M. Ferrari, François Brémond
Multimodal Deep Learning has garnered much interest, and transformers have triggered novel approaches thanks to the cross-attention mechanism. Here we propose an approach to deal with two key existing challenges: the high computational resources demanded and the issue of missing modalities. We introduce, for the first time, the concept of knowledge distillation in transformers to use only one modality at inference time. We report a full study analyzing multiple student-teacher configurations, the levels at which distillation is applied, and different methodologies. With the best configuration, we improved the state-of-the-art accuracy by 3%, reduced the number of parameters by 2.5 times and reduced the inference time by 22%. Such a performance-computation tradeoff can be exploited in many applications, and we aim to open a new research area where the deployment of complex models with limited resources is demanded.
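A minimal sketch of the distillation idea follows, under assumed interfaces and a plain feature-plus-logit distillation loss (the paper studies several configurations and distillation levels, not reproduced here): a frozen multimodal teacher supervises a unimodal student so that only one modality is needed at inference.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, rgb, other_modality, labels, alpha=0.5):
    """One training step with hypothetical feature + logit distillation.

    `teacher(rgb, other_modality)` and `student(rgb)` are assumed to return
    (features, logits); the actual method applies distillation at several levels.
    """
    with torch.no_grad():                      # teacher is frozen
        t_feat, t_logits = teacher(rgb, other_modality)
    s_feat, s_logits = student(rgb)            # unimodal student

    task_loss = F.cross_entropy(s_logits, labels)
    kd_loss = F.mse_loss(s_feat, t_feat) + F.kl_div(
        F.log_softmax(s_logits, dim=1), F.softmax(t_logits, dim=1),
        reduction="batchmean",
    )
    return task_loss + alpha * kd_loss
```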
{"title":"From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation","authors":"Dhruv Agarwal, Tanay Agrawal, Laura M. Ferrari, Franccois Bremond","doi":"10.1109/AVSS52988.2021.9663793","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663793","url":null,"abstract":"Multimodal Deep Learning has garnered much interest, and transformers have triggered novel approaches, thanks to the cross-attention mechanism. Here we propose an approach to deal with two key existing challenges: the high computational resource demanded and the issue of missing modalities. We introduce for the first time the concept of knowledge distillation in transformers to use only one modality at inference time. We report a full study analyzing multiple student-teacher configurations, levels at which distillation is applied, and different methodologies. With the best configuration, we improved the state-of-the-art accuracy by 3%, we reduced the number of parameters by 2.5 times and the inference time by 22%. Such performance-computation tradeoff can be exploited in many applications and we aim at opening a new research area where the deployment of complex models with limited resources is demanded","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"28 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131860250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Temporal 3D Human Pose Estimation with Pseudo-Labels
Pub Date: 2021-10-14 | DOI: 10.1109/AVSS52988.2021.9663755
Arij Bouazizi, U. Kressel, Vasileios Belagiannis
We present a simple yet effective approach for self-supervised 3D human pose estimation. Unlike prior work, we exploit temporal information in addition to multi-view self-supervision. During training, we rely on triangulating 2D body pose estimates from a multi-view camera system. A temporal convolutional neural network is trained with the generated 3D ground truth and a geometric multi-view consistency loss, imposing geometrical constraints on the predicted 3D body skeleton. During inference, our model receives a sequence of 2D body pose estimates from a single view and predicts the 3D body pose for each of them. An extensive evaluation shows that our method achieves state-of-the-art performance on the Human3.6M and MPI-INF-3DHP benchmarks. Our code and models are publicly available at https://github.com/vru2020/TM_HPE/.
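The training objective can be sketched roughly as follows. The shapes, the weighting factor and the way views are aligned (rotating every prediction into a common frame with known camera rotations) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pose_training_loss(model, pose2d_seq_views, pseudo3d, cam_rotations):
    """Sketch of an assumed objective: supervise with triangulated 3D
    pseudo-labels and encourage predictions from different views to agree
    after rotation to a common frame.

    Shapes (illustrative): pose2d_seq_views (V, B, T, J, 2),
    pseudo3d (B, J, 3), cam_rotations (V, 3, 3) camera-to-world rotations.
    """
    preds = [model(seq) for seq in pose2d_seq_views]      # each (B, J, 3)
    sup_loss = sum(F.mse_loss(p, pseudo3d) for p in preds) / len(preds)

    # multi-view consistency: compare all predictions in a common frame
    world = [p @ R.T for p, R in zip(preds, cam_rotations)]
    cons_loss = sum(F.mse_loss(w, world[0]) for w in world[1:]) / (len(world) - 1)
    return sup_loss + 0.1 * cons_loss                     # weight is an assumption
```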
{"title":"Learning Temporal 3D Human Pose Estimation with Pseudo-Labels","authors":"Arij Bouazizi, U. Kressel, Vasileios Belagiannis","doi":"10.1109/AVSS52988.2021.9663755","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663755","url":null,"abstract":"We present a simple, yet effective, approach for self-supervised 3D human pose estimation. Unlike the prior work, we explore the temporal information next to the multi-view self-supervision. During training, we rely on triangulating 2D body pose estimates of a multiple-view camera system. A temporal convolutional neural network is trained with the generated 3D ground-truth and the geometric multi-view consistency loss, imposing geometrical constraints on the predicted 3D body skeleton. During inference, our model receives a sequence of 2D body pose estimates from a single-view to predict the 3D body pose for each of them. An extensive evaluation shows that our method achieves state-of-the-art performance in the Human3.6M and MPI-INF-3DHP benchmarks. Our code and models are publicly available at https://github.com/vru2020/TM_HPE/.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122130933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FLAME: Facial Landmark Heatmap Activated Multimodal Gaze Estimation
Pub Date: 2021-10-10 | DOI: 10.1109/AVSS52988.2021.9663816
Neelabh Sinha, Michal Balazia, F. Brémond
3D gaze estimation is the task of predicting a person's line of sight in 3D space. Person-independent models lack precision due to anatomical differences between subjects, whereas person-specific calibrated techniques impose strict constraints on scalability. To overcome these issues, we propose a novel technique, Facial Landmark Heatmap Activated Multimodal Gaze Estimation (FLAME), which combines eye anatomical information using eye landmark heatmaps to obtain precise gaze estimation without any person-specific calibration. Our evaluation demonstrates competitive performance, with about a 10% improvement on the benchmark datasets ColumbiaGaze and EYEDIAP. We also conduct an ablation study to validate our method.
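One way to read "heatmap activated" is as an attention-like modulation of eye-image features by landmark heatmaps; the sketch below shows that reading. The fusion by element-wise gating, the module names and the assumption that heatmaps share the feature map's spatial size are illustrative, not FLAME's actual architecture.

```python
import torch
import torch.nn as nn

class LandmarkActivatedGazeSketch(nn.Module):
    """Toy sketch of landmark-heatmap activation (assumed fusion scheme)."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_landmarks: int = 6):
        super().__init__()
        self.backbone = backbone                        # eye-image encoder -> (B, C, H, W)
        self.heatmap_proj = nn.Conv2d(num_landmarks, feat_dim, kernel_size=1)
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(3))  # 3D gaze vector

    def forward(self, eye_img, landmark_heatmaps):
        # assumes landmark_heatmaps are resized to the feature map's H x W
        feats = self.backbone(eye_img)                             # (B, C, H, W)
        attn = torch.sigmoid(self.heatmap_proj(landmark_heatmaps))  # (B, C, H, W)
        return self.head(feats * attn)                 # landmark-activated features -> gaze
```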
{"title":"FLAME: Facial Landmark Heatmap Activated Multimodal Gaze Estimation","authors":"Neelabh Sinha, Michal Balazia, F. Brémond","doi":"10.1109/AVSS52988.2021.9663816","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663816","url":null,"abstract":"3D gaze estimation is about predicting the line of sight of a person in 3D space. Person-independent models for the same lack precision due to anatomical differences of subjects, whereas person-specific calibrated techniques add strict constraints on scalability. To overcome these issues, we propose a novel technique, Facial Landmark Heatmap Activated Multimodal Gaze Estimation (FLAME), as a way of combining eye anatomical information using eye land-mark heatmaps to obtain precise gaze estimation without any person-specific calibration. Our evaluation demonstrates a competitive performance of about 10% improvement on benchmark datasets ColumbiaGaze and EYEDIAP. We also conduct an ablation study to validate our method.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116319537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ZSpeedL - Evaluating the Performance of Zero-Shot Learning Methods using Low-Power Devices
Pub Date: 2021-10-09 | DOI: 10.1109/AVSS52988.2021.9663762
Cristiano Patrício, J. Neves
The recognition of unseen objects from a semantic representation or textual description, usually denoted as zero-shot learning (ZSL), lends itself to real-world scenarios better than traditional object recognition. Nevertheless, no work has evaluated the feasibility of deploying zero-shot learning approaches in these scenarios, particularly when using low-power devices. In this paper, we provide the first benchmark on the inference time of zero-shot learning, comprising an evaluation of state-of-the-art approaches regarding their speed/accuracy trade-off. An analysis of the processing time of the different phases of the ZSL inference stage reveals that visual feature extraction is the major bottleneck in this paradigm, but we show that lightweight networks can dramatically reduce the overall inference time without reducing the accuracy obtained with the de facto ResNet101 architecture. This benchmark also evaluates how different ZSL approaches perform on low-power devices, and how the visual feature extraction phase could be optimized on this hardware. To foster the research and deployment of ZSL systems capable of operating in real-world scenarios, we release the evaluation framework used in this benchmark (https://github.com/CristianoPatricio/zsl-methods).
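The core measurement, timing the visual feature-extraction phase for different backbones, can be reproduced with a simple harness such as the one below. This is an illustrative script, not the released framework; the backbone choices and run counts are assumptions.

```python
import time
import torch
import torchvision.models as models

def time_feature_extraction(backbone, batch, n_runs=50, device="cpu"):
    """Rough per-image timing of the visual feature-extraction phase."""
    backbone = backbone.to(device).eval()
    batch = batch.to(device)
    with torch.no_grad():
        for _ in range(5):                 # warm-up runs
            backbone(batch)
        start = time.perf_counter()
        for _ in range(n_runs):
            backbone(batch)
        return (time.perf_counter() - start) / n_runs

batch = torch.randn(1, 3, 224, 224)
for name, net in [("resnet101", models.resnet101()),
                  ("mobilenet_v2", models.mobilenet_v2())]:
    print(name, f"{time_feature_extraction(net, batch) * 1e3:.1f} ms/image")
```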
{"title":"ZSpeedL - Evaluating the Performance of Zero-Shot Learning Methods using Low-Power Devices","authors":"Cristiano Patr'icio, J. Neves","doi":"10.1109/AVSS52988.2021.9663762","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663762","url":null,"abstract":"The recognition of unseen objects from a semantic representation or textual description, usually denoted as zero-shot learning, is more prone to be used in real-world scenarios when compared to traditional object recognition. Nevertheless, no work has evaluated the feasibility of deploying zero-shot learning approaches in these scenarios, particularly when using low-power devices. In this paper, we provide the first benchmark on the inference time of zero-shot learning, comprising an evaluation of state-of-the-art approaches regarding their speed/accuracy trade-off. An analysis to the processing time of the different phases of the ZSL inference stage reveals that visual feature extraction is the major bottleneck in this paradigm, but, we show that lightweight networks can dramatically reduce the overall inference time without reducing the accuracy obtained by the de facto ResNet101 architecture. Also, this benchmark evaluates how different ZSL approaches perform in low-power devices, and how the visual feature extraction phase could be optimized in this hardware. To foster the research and deployment of ZSL systems capable of operating in real-world scenarios, we release the evaluation framework used in this benchmark(https://github.com/CristianoPatricio/zsl-methods).","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117263157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPNet: Cross-Parallel Network for Efficient Anomaly Detection
Pub Date: 2021-08-10 | DOI: 10.1109/AVSS52988.2021.9663798
Youngsaeng Jin, Jonghwan Hong, D. Han, Hanseok Ko
Anomaly detection in video streams is a challenging problem because of the scarcity of abnormal events and the difficulty of accurately annotating them. To alleviate these issues, unsupervised learning-based prediction methods have previously been applied. These approaches train the model with only normal events and predict a future frame from a sequence of preceding frames using encoder-decoder architectures, so that they yield small prediction errors on normal events but large errors on abnormal events. Such architectures, however, come with a computational burden, while some anomaly detection tasks require low computational cost without sacrificing performance. In this paper, the Cross-Parallel Network (CPNet) is proposed for efficient anomaly detection, minimizing computation without a drop in performance. It consists of N smaller parallel U-Nets, each designed to handle a single input frame, which makes the computation significantly more efficient. Additionally, an inter-network shift module is incorporated to capture the temporal relationships among sequential frames and enable more accurate future predictions. The quantitative results show that our model requires less computational cost than the baseline U-Net while delivering equivalent anomaly detection performance.
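A rough sketch of the parallel-branch idea, with a simple channel-shift exchange standing in for the inter-network shift module. The branch factory, shift ratio and the shift direction are assumptions, and a real model would decode the exchanged features into the predicted future frame.

```python
import torch
import torch.nn as nn

class CrossParallelSketch(nn.Module):
    """Toy sketch: N parallel per-frame networks plus a channel-shift
    exchange between neighbouring branches (illustrative stand-in for the
    inter-network shift module, not the CPNet implementation)."""

    def __init__(self, make_branch, num_frames: int = 4, shift_ratio: float = 0.125):
        super().__init__()
        self.branches = nn.ModuleList([make_branch() for _ in range(num_frames)])
        self.shift_ratio = shift_ratio

    def forward(self, frames):                      # frames: list of (B, C, H, W)
        feats = [branch(f) for branch, f in zip(self.branches, frames)]
        k = int(feats[0].shape[1] * self.shift_ratio)
        shifted = []
        for i, f in enumerate(feats):
            prev = feats[i - 1] if i > 0 else torch.zeros_like(f)
            # borrow the first k channels from the previous branch
            shifted.append(torch.cat([prev[:, :k], f[:, k:]], dim=1))
        return shifted   # a decoder would turn these into the future-frame prediction
```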
{"title":"CPNet: Cross-Parallel Network for Efficient Anomaly Detection","authors":"Youngsaeng Jin, Jonghwan Hong, D. Han, Hanseok Ko","doi":"10.1109/AVSS52988.2021.9663798","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663798","url":null,"abstract":"Anomaly detection in video streams is a challenging problem because of the scarcity of abnormal events and the difficulty of accurately annotating them. To alleviate these issues, unsupervised learning-based prediction methods have been previously applied. These approaches train the model with only normal events and predict a future frame from a sequence of preceding frames by use of encoder-decoder architectures so that they result in small prediction errors on normal events but large errors on abnormal events. The architecture, however, comes with the computational burden as some anomaly detection tasks require low computational cost without sacrificing performance. In this paper, Cross-Parallel Network (CPNet) for efficient anomaly detection is proposed here to minimize computations without performance drops. It consists of N smaller parallel U-Net, each of which is designed to handle a single input frame, to make the calculations significantly more efficient. Additionally, an inter-network shift module is incorporated to capture temporal relationships among sequential frames to enable more accurate future predictions. The quantitative results show that our model requires less computational cost than the baseline U-Net while delivering equivalent performance in anomaly detection.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130991693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-grained anomaly detection via multi-task self-supervision
Pub Date: 2021-04-20 | DOI: 10.1109/AVSS52988.2021.9663783
Loic Jezequel, Ngoc-Son Vu, Jean Beaudet, A. Histace
Detecting anomalies using deep learning has become a major challenge over the last years, and is becoming increasingly promising in several fields. The introduction of self-supervised learning has greatly helped many methods, including anomaly detection, where simple geometric-transformation recognition tasks are used. However, these methods do not perform well on fine-grained problems, since they lack finer features. By combining both high-scale shape features and low-scale fine features in a multi-task framework, our method greatly improves fine-grained anomaly detection. It outperforms the state of the art with up to a 31% relative error reduction, measured with AUROC, on various anomaly detection problems including one-vs-all, out-of-distribution detection and face presentation attack detection.
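A minimal sketch of a multi-task self-supervised objective in this spirit: one coarse geometric-transformation task and one fine-grained task share a backbone. The specific tasks and heads here are assumptions, not the paper's configuration.

```python
import torch.nn.functional as F

def multitask_ssl_loss(backbone, coarse_head, fine_head, x, rot_labels, patch_labels):
    """Sketch of an assumed multi-task objective: a high-scale task
    (rotation recognition) and a low-scale fine-grained task (here a
    hypothetical patch-permutation prediction). At test time, how poorly
    these tasks are solved can serve as the anomaly score."""
    feats = backbone(x)
    loss_coarse = F.cross_entropy(coarse_head(feats), rot_labels)
    loss_fine = F.cross_entropy(fine_head(feats), patch_labels)
    return loss_coarse + loss_fine
```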
{"title":"Fine-grained anomaly detection via multi-task self-supervision","authors":"Loic Jezequel, Ngoc-Son Vu, Jean Beaudet, A. Histace","doi":"10.1109/AVSS52988.2021.9663783","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663783","url":null,"abstract":"Detecting anomalies using deep learning has become a major challenge over the last years, and is becoming increasingly promising in several fields. The introduction of self-supervised learning has greatly helped many methods including anomaly detection where simple geometric transformation recognition tasks are used. However these methods do not perform well on fine-grained problems since they lack finer features. By combining both high-scale shape features and low-scale fine features in a multi-task framework, our method greatly improves fine-grained anomaly detection. It outperforms state-of-the-art with up to 31% relative error reduction measured with AUROC on various anomaly detection problems including one-vs-all, out-of-distribution detection and face presentation attack detection.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121821770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MultAV: Multiplicative Adversarial Videos
Pub Date: 2020-09-17 | DOI: 10.1109/AVSS52988.2021.9663769
Shao-Yuan Lo, Vishal M. Patel
The majority of adversarial machine learning research focuses on additive attacks, which add adversarial perturbation to the input data. On the other hand, unlike in image recognition, only a handful of attack approaches have been explored in the video domain. In this paper, we propose a novel attack method against video recognition models, Multiplicative Adversarial Videos (MultAV), which imposes a perturbation on video data by multiplication. MultAV has a different noise distribution from its additive counterparts and thus challenges defense methods tailored to resisting additive adversarial attacks. Moreover, it can be generalized not only to $\ell_p$-norm attacks with a new adversary constraint called the ratio bound, but also to different types of physically realizable attacks. Experimental results show that a model adversarially trained against additive attacks is less robust to MultAV.
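A minimal sketch of a multiplicative, PGD-style attack follows: the input is perturbed as x * m and the factor m is clamped to [1/r, r], which is one plausible reading of the ratio-bound constraint. The step size, iteration count and clamping scheme are assumptions, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

def multiplicative_attack(model, video, labels, ratio_bound=1.2, steps=10, step_size=0.02):
    """Sketch of a multiplicative adversarial attack (not the authors' code):
    perturb the input as video * m with m kept inside [1/ratio_bound, ratio_bound]."""
    m = torch.ones_like(video, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(video * m), labels)
        grad, = torch.autograd.grad(loss, m)
        with torch.no_grad():
            m += step_size * grad.sign()                   # ascend the classification loss
            m.clamp_(1.0 / ratio_bound, ratio_bound)       # ratio-bound constraint
    return (video * m).detach()
```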
{"title":"MultAV: Multiplicative Adversarial Videos","authors":"Shao-Yuan Lo, Vishal M. Patel","doi":"10.1109/AVSS52988.2021.9663769","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663769","url":null,"abstract":"The majority of adversarial machine learning research focuses on additive attacks, which add adversarial perturbation to input data. On the other hand, unlike image recognition problems, only a handful of attack approaches have been explored in the video domain. In this paper, we propose a novel attack method against video recognition models, Multiplicative Adversarial Videos (MultAV), which imposes perturbation on video data by multiplication. MultAV has different noise distributions to the additive counterparts and thus challenges the defense methods tailored to resisting additive adversarial attacks. Moreover, it can be generalized to not only $ell_{p}$-norm attacks with a new adversary constraint called ratio bound, but also different types of physically realizable attacks. Experimental results show that the model adversarially trained against additive attack is less robust to MultAV.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129785804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}