Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663841
Single-Stage UAV Detection and Classification with YOLOV5: Mosaic Data Augmentation and PANet
Fardad Dadboud, Vaibhav Patel, Varun Mehta, M. Bolic, I. Mantegh
For the Drone-vs-Bird Detection Challenge, held in conjunction with the 4th International Workshop on Small-Drone Surveillance, Detection and Counteraction Techniques at IEEE AVSS 2021, we proposed a YOLOv5-based object detection model for small UAV detection and classification. YOLOv5 leverages a PANet neck and mosaic augmentation, both of which help improve the detection of small objects. To train the model, we combined the challenge dataset with a publicly available air-to-air UAV dataset featuring complex backgrounds and lighting conditions. The proposed approach achieved 0.96 recall, 0.98 $mAP_{0.5}$, and 0.71 $mAP_{0.5:0.95}$ on a 10% subset randomly sampled from the combined dataset.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663791
Person Localisation under Fragmented Occlusion
R. Pflugfelder, Jonas Auer
Occlusion is a fundamental challenge in object recognition. Fragmented occlusion is much more challenging than ordinary partial occlusion and occurs in natural environments such as forests. Little is known in computer vision about object recognition under fragmented occlusion. Interestingly, human vision has explored this problem far more, as the human visual system adapted to fragmented occlusion during the period when mankind occupied rainforests. A motivating example of fragmented occlusion is object detection through foliage, which is an essential requirement in green border surveillance. Instead of detection, this paper studies the simpler problem of person localisation. A neural-network-based method achieves a precision above 90% on new image sequences capturing the problem. This is made possible by two observations: (i) fragmented occlusion is unsolvable in single images without temporal information, and (ii) colour quantisation and colour swapping are essential to force the network, during training, to learn from the temporal information available in the spatiotemporal data.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663775
DSA-PR: Discrete Soft Biometric Attribute-Based Person Retrieval in Surveillance Videos
Hiren Galiyawala, M. Raval, Dhyey Savaliya
Physical characteristics, or soft biometrics, are visually perceptible aspects of the human body. Noticeable attributes such as build, height, complexion, and clothing help in developing human surveillance systems. This paper proposes Discrete Soft biometric Attribute-based Person Retrieval (DSA-PR), which retrieves a person from video using height, gender, torso (clothes) color-1, torso color-2, and torso (clothes) type given in a textual query. DSA-PR uses Mask R-CNN for semantic segmentation and ResNet-50 for attribute classification. Height is estimated using the Tsai camera calibration method. DSA-PR weighs the attributes and fuses their probabilities to generate a final score for each detected person. The proposed approach achieves an average Intersection-over-Union (IoU) of 0.602 and a retrieval rate of 0.808 at IoU $\ge$ 0.4 on the AVSS Challenge II dataset, which is 5.8% and 2.02% above the respective state-of-the-art results.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663759
Track Boosting and Synthetic Data Aided Drone Detection
F. C. Akyon, Ogulcan Eryuksel, Kamil Anil Ozfuttu, S. Altinuc
As drone usage increases with lower costs and improved drone technology, drone detection emerges as a vital object detection task. However, detecting distant drones under unfavorable conditions, namely weak contrast, long range, and low visibility, requires effective algorithms. Our method approaches the drone detection problem by fine-tuning a YOLOv5 model on real and synthetically generated data and applying a Kalman-based object tracker to boost detection confidence. Our results indicate that augmenting the real data with an optimal subset of synthetic data can increase performance. Moreover, temporal information gathered by object tracking can increase performance further.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663830
Bayesian Personalized-Wardrobe Model (BP-WM) for Long-Term Person Re-Identification
K. Lee, Nishant Sankaran, D. Mohan, Kenny Davila, Dennis Fedorishin, S. Setlur, V. Govindaraju
Long-term surveillance applications often involve re-identifying individuals over several days. The task is made even more challenging by changes in appearance features, such as clothing, over a time span of days or longer. In this paper, we propose a novel approach called the Bayesian Personalized-Wardrobe Model (BP-WM) for long-term person re-identification (re-ID), which applies Bayesian Personalized Ranking (BPR) to clothing features extracted from video sequences. In contrast to previous long-term person re-ID works, we exploit the fact that people typically choose their attire based on personal preference, so a person's chosen wardrobe can serve as a soft biometric to distinguish identities in the long term. We evaluate the performance of the proposed BP-WM on the extended Indoor Long-term Re-identification Wardrobe (ILRW) dataset. Experimental results show that our method achieves state-of-the-art performance and that BP-WM is a reliable soft biometric for person re-identification.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663738
Far-Sighted BiSeNet V2 for Real-time Semantic Segmentation
Te-Wei Chen, Yen-Ting Huang, W. Liao
Real-time semantic segmentation is one of the most investigated areas in computer vision. In this paper, we focus on improving the performance of BiSeNet V2 by modifying its architecture. BiSeNet V2 is a two-branch segmentation model designed to extract semantic information from high-level feature maps and detailed information from low-level feature maps. The proposed enhancement remains lightweight and real-time while making two main modifications: enlarging the contextual information and removing the constraint imposed by fixed-size convolutional kernels. Specifically, additional modules called dilated strip pooling (DSP) and dilated mixed pooling (DMP) are appended to the original BiSeNet V2 model to form the far-sighted BiSeNet V2. The DSP and DMP blocks are adapted from modules proposed in SPNet, with extra branches composed of dilated convolutions to provide larger receptive fields. The proposed far-sighted BiSeNet V2 improves accuracy from 73.4% to 76.0% while running at 94 FPS on an Nvidia 1080Ti. Moreover, the proposed dilated mixed pooling block matches the performance of a model with two mixed pooling modules while using only 2/3 of the parameters.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663833
FlagDetSeg: Multi-Nation Flag Detection and Segmentation in the Wild
Shou-Fang Wu, Ming-Ching Chang, Siwei Lyu, Cheng-Shih Wong, Ashok Pandey, Po-Chi Su
We present a simple and effective approach for in-the-wild multi-nation flag instance segmentation based on data augmentation and Mask R-CNN with PointRend. To the best of our knowledge, this is the first multi-nation flag detection work that incorporates recent deep object detection and releases its code and dataset for public use. Flag images with binary segmentation are collected from public-domain sources, including Open Images V6, and annotated for up to 225 countries. Additional flag images are generated from template flags with cropping, warping, masking, and color adaptation to hallucinate realistic-looking flags for training and testing. Data augmentation is performed by fusing and transforming the segmented flags onto natural image backgrounds to synthesize new images. To cope with the large variability of flags and the lack of authentic annotated flags, we combine the trained binary Mask R-CNN segmentation weights with the new multi-nation classifier for fine-tuning. For evaluation, the proposed model is compared with other popular detectors and instance segmentation methods, including YOLACT++. Results show the efficacy of the proposed approach.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663739
A comprehensive maritime benchmark dataset for detection, tracking and threat recognition
J. L. Patino, Tom Cane, J. Ferryman
This paper describes a new multimodal maritime dataset recorded with a multispectral suite of sensors, including AIS, GPS, radar, and visible and thermal cameras. The visible and thermal cameras are mounted on the vessel itself, and surveillance is performed around the vessel in order to protect it from piracy at sea. The dataset consists of a series of acted scenarios that simulate attacks on the vessel by small, fast-moving boats ('skiffs'). The scenarios are inspired by real piracy incidents at sea and present a range of technical challenges for the different stages of an automated surveillance system: object detection, object tracking, and event recognition (in this case, threats towards the vessel). The dataset can thus be employed for training and testing at several stages of a threat detection and classification system. We also present baseline results that can be used for benchmarking algorithms performing these tasks. This new dataset fills the lack of publicly available datasets for the development and testing of maritime surveillance applications.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663809
Learning Sequential Visual Appearance Transformation for Online Multi-Object Tracking
Itziar Sagastiberri, Noud van de Gevel, Jorge García, O. Otaegui
Recent online multi-object tracking approaches combine single-object trackers, which capture object motion, with affinity networks, which associate objects by their appearance. These affinity networks often build on complex feature representations (re-ID embeddings) or sophisticated scoring functions whose objective is to match current detections with previous tracklets, i.e., short-term appearance information. However, drastic appearance changes along object trajectories acquired by omnidirectional cameras degrade performance, because affinity networks ignore the variation of long-term appearance information. In this paper, we handle appearance changes in a coherent way by proposing a novel affinity model that predicts the new visual appearance of an object from its long-term appearance information. Our affinity model uses a convolutional LSTM encoder-decoder architecture to learn the space-time appearance transformation between consecutive re-ID feature representations along the object trajectory. Experimental results show that it achieves promising performance on several multi-object tracking datasets containing omnidirectional cameras.
Pub Date: 2021-11-16  DOI: 10.1109/AVSS52988.2021.9663742
Moving-Object-Aware Anomaly Detection in Surveillance Videos
Chun-Lung Yang, Tsung-Hsuan Wu, S. Lai
Video anomaly detection plays a crucial role in automatically detecting abnormal actions or events in surveillance video, which helps protect public safety. Deep learning techniques have been extensively employed and have recently achieved excellent anomaly detection results. However, previous image-reconstruction-based models did not fully exploit foreground object regions for video anomaly detection. Some recent works apply pre-trained object detectors to provide local context in the video surveillance scenario, but these methods require prior knowledge of the object types involved in anomalies, which contradicts the problem setting of unsupervised anomaly detection. In this paper, we propose a novel framework that learns to predict moving-object features with a convolutional autoencoder architecture. We train our anomaly detector to be aware of moving-object regions in a scene without using an object detector or requiring prior knowledge of specific object classes for the anomaly. The appearance and motion features of moving-object regions provide comprehensive information about moving foreground objects for unsupervised learning of the video anomaly detector. In addition, the proposed latent representation learning scheme encourages the convolutional autoencoder to learn a more convergent latent representation for normal training data, so that anomalous data exhibit clearly different representations. We also propose a novel anomaly scoring method based on the feature prediction errors of moving foreground object regions and the regularity of the latent representation. Our experiments demonstrate that the proposed approach achieves competitive results compared with state-of-the-art methods on three public datasets for video anomaly detection.