PIDLNet: A Physics-Induced Deep Learning Network for Characterization of Crowd Videos
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663817
S. Behera, T. K. Vijay, H. M. Kausik, D. P. Dogra
Human visual perception of crowd gatherings can provide valuable information about behavioral movements. Empirical analysis of visual perception of orderly moving crowds has revealed that such movements are often structured in nature, with a relatively higher order parameter and lower entropy than unstructured crowds. This paper proposes a Physics-Induced Deep Learning Network (PIDLNet), a deep learning framework trained on conventional 3D convolutional features combined with physics-based features. We compute frame-level entropy and order parameter from the motion flows extracted from the crowd videos. These features are then integrated with the 3D convolutional features at a later stage of the feature extraction pipeline to aid the crowd characterization process. Experiments reveal that the proposed network can characterize video segments depicting crowd movements with accuracy as high as 91.63%, and we obtain an overall AUC of 0.9913 on a highly challenging publicly available video dataset. The method outperforms existing deep-learning frameworks and conventional crowd characterization frameworks by a notable margin.
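As a rough illustration of the physics-based features mentioned in the abstract, the sketch below computes a per-frame order parameter (mean resultant length of unit velocity vectors) and a flow-direction entropy from a dense optical-flow field. The histogram binning, thresholds, and function names are illustrative assumptions, not PIDLNet's exact formulation.

```python
import numpy as np

def order_parameter(flow, eps=1e-6):
    """Mean resultant length of unit velocity vectors: ~1 for ordered motion, ~0 for disordered."""
    v = flow.reshape(-1, 2)                        # (H*W, 2) per-pixel displacement vectors
    mag = np.linalg.norm(v, axis=1)
    moving = v[mag > eps] / mag[mag > eps, None]   # keep moving pixels, normalize to unit vectors
    if len(moving) == 0:
        return 0.0
    return float(np.linalg.norm(moving.mean(axis=0)))

def flow_entropy(flow, bins=16, eps=1e-6):
    """Shannon entropy (bits) of the flow-direction histogram: low for structured crowds."""
    v = flow.reshape(-1, 2)
    mag = np.linalg.norm(v, axis=1)
    angles = np.arctan2(v[mag > eps, 1], v[mag > eps, 0])
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# A structured (uniform) flow field versus a random one.
uniform = np.tile([1.0, 0.0], (64, 64, 1))         # every pixel moves the same way
random_ = np.random.randn(64, 64, 2)
print(order_parameter(uniform), flow_entropy(uniform))   # ~1.0 order, ~0 bits
print(order_parameter(random_), flow_entropy(random_))   # ~0.0 order, ~4 bits
```

In the paper these per-frame values are concatenated with 3D convolutional features later in the network; the exact fusion point is not reproduced here.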
{"title":"PIDLNet: A Physics-Induced Deep Learning Network for Characterization of Crowd Videos","authors":"S. Behera, T. K. Vijay, H. M. Kausik, D. P. Dogra","doi":"10.1109/AVSS52988.2021.9663817","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663817","url":null,"abstract":"Human visual perception regarding crowd gatherings can provide valuable information about behavioral movements. Empirical analysis on visual perception about orderly moving crowds has revealed that such movements are often structured in nature with relatively higher order parameter and lower entropy as compared to unstructured crowd, and vice-versa. This paper proposes a Physics-Induced Deep Learning Network (PIDLNet), a deep learning framework trained on conventional 3D convolutional features combined with physics-based features. We have computed frame-level entropy and order parameter from the motion flows extracted from the crowd videos. These features are then integrated with the 3D convolutional features at a later stage in the feature extraction pipeline to aid in the crowd characterization process. Experiments reveal that the proposed network can characterize video segments depicting crowd movements with accuracy as high as 91.63%. We have obtained overall AUC of 0.9913 on highly challenging publicly available video dataset. The method outperforms existing deep-learning frameworks and conventional crowd characterization frameworks by a notable margin.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"308 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124388685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Virtual Inductive Loop: Real time video analytics for vehicular access control
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663748
N. Ramanathan, Allison Beach, R. Hastings, Weihong Yin, Sima Taheri, P. Brewer, Dana Eubanks, Kyoung-Jin Park, Hongli Deng, Zhong Zhang, Donald Madden, Gang Qian, Amit Mistry, Huiping Li
Automated access control entails automatically detecting incoming vehicles in real time and allowing access only to authorized vehicles. Access control systems typically adopt one or more sensors, such as inductive loops, light array sensors, and wireless magnetometers, to detect vehicles at access points. This paper provides a detailed account of a real-time video analytics system named the "Virtual Inductive Loop" (VIL), which we developed as an alternative, cost-efficient solution for access control. The VIL system achieves precision and recall rates over 98%, performs on par with current systems in latency toward detecting event onset, and further adds a suite of additional capabilities to access control systems, such as vehicle classification, tailgate detection, and unusual event detection. The system was tested in live conditions at different sites at a Naval Facility in the United States over a two-year period. The project was funded by the Office of Naval Research (#N00014-17-C-7030).
{"title":"Virtual Inductive Loop: Real time video analytics for vehicular access control","authors":"N. Ramanathan, Allison Beach, R. Hastings, Weihong Yin, Sima Taheri, P. Brewer, Dana Eubanks, Kyoung-Jin Park, Hongli Deng, Zhong Zhang, Donald Madden, Gang Qian, Amit Mistry, Huiping Li","doi":"10.1109/AVSS52988.2021.9663748","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663748","url":null,"abstract":"Automated access control entails automatically detecting incoming vehicles in real-time and allowing access only to authorized vehicles. Access control systems typically adopt one or more sensors such as inductive loops, light array sensors, wireless magnetometers in detecting vehicles at access points. This paper 1 provides a detailed account on a real-time video analytics system named the “ Virtual Inductive Loop ” (VIL), that we developed as an alternative cost-efficient solution for access control. The VIL system poses precision and recall rates over 98%, performs on par with current systems in latency towards detecting event onset and further adds a suite of additional capabilities to access control systems such as vehicle classification, tailgate detection and unusual event detection. The system was tested in live conditions in different site at a Naval Facility in the United States over a two year period. The project was funded by the Office of Naval Research (#N000l4-l7-C-7030).","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125531081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Sample Weighting and Score Aggregation Method for Multi-query Object Matching
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663848
Jangwon Lee, Gang Qian, Allison Beach
In this paper, we propose a simple and effective method to properly assign weights to query samples and to compute aggregated matching scores using these weights in multi-query object matching. Multi-query object matching commonly arises in many real-life problems, such as finding suspicious objects in surveillance videos. In this problem, a query object is represented by multiple samples, and the matching candidates in a database are ranked according to their similarities to these query samples. In this context, query samples are not equally effective at finding the target object in the database, so one of the key challenges is how to measure the effectiveness of each query sample at finding the correct matching object. So far, however, very little attention has been paid to this issue. We therefore propose a simple but effective measure, Inverse Model Frequency (IMF), of the matching effectiveness of query samples. Furthermore, we introduce a new score aggregation method to boost object matching performance given multiple queries. We tested the proposed method on vehicle re-identification and image retrieval tasks. Our approach achieves state-of-the-art matching accuracy on two vehicle re-identification datasets (VehicleID/VeRi-776) and two image retrieval datasets (the original and revisited Oxford/Paris). The proposed approach can seamlessly plug into many existing multi-query object matching approaches to further boost their performance with minimal effort.
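The abstract does not spell out the IMF formula, so the sketch below implements one plausible IDF-style reading of it: a query sample that matches many gallery items is treated as less discriminative and down-weighted before score aggregation. The threshold `tau`, the log weighting, and the function names are assumptions for illustration only.

```python
import numpy as np

def imf_weights(sim, tau=0.5):
    """IDF-style weights over query samples (an assumed analogue, not the paper's exact formula).

    sim: (Q, G) similarity matrix between Q query samples and G gallery items.
    A query sample matching many gallery items above `tau` gets a lower weight.
    """
    matches = (sim > tau).sum(axis=1) + 1            # +1 avoids division by zero
    w = np.log(sim.shape[1] / matches)
    w = np.clip(w, 0.0, None)
    return w / max(w.max(), 1e-6)

def aggregate_scores(sim, weights):
    """Weighted aggregation of per-query similarities into one score per gallery item."""
    return (weights[:, None] * sim).sum(axis=0) / (weights.sum() + 1e-6)

rng = np.random.default_rng(0)
sim = rng.uniform(0, 1, size=(4, 100))               # 4 query samples, 100 gallery items
w = imf_weights(sim)
ranking = np.argsort(-aggregate_scores(sim, w))      # best-matching gallery items first
print(w, ranking[:5])
```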
{"title":"A Sample Weighting and Score Aggregation Method for Multi-query Object Matching","authors":"Jangwon Lee, Gang Qian, Allison Beach","doi":"10.1109/AVSS52988.2021.9663848","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663848","url":null,"abstract":"In this paper, we propose a simple and effective method to properly assign weights to the query samples and compute aggregated matching scores using these weights in multi-query object matching. Multi-query object matching commonly exists in many real-life problems such as finding suspicious objects in surveillance videos. In this problem, a query object is represented by multiple samples and the matching candidates in a database are ranked according to their similarities to these query samples. In this context, query samples are not equally effective to find the target object in the database, thus one of the key challenges is how to measure the effectiveness of each query to find the correct matching object. So far, however, very little attention has been paid to address this issue. Therefore, we propose a simple but effective way, Inverse Model Frequency (IMF), to measure of matching effectiveness of query samples. Furthermore, we introduce a new score aggregation method to boost the object matching performance given multiple queries. We tested the proposed method for vehicle re-identification and image retrieval tasks. Our proposed approach achieves state-of-the-art matching accuracy on two vehicle re-identification datasets (VehicleID/VeRi-776) and two image retrieval datasets (the original & revisited Oxford/Paris). The proposed approach can seamlessly plug into many existing multi-query object matching approaches to further boost their performance with minimal effort.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127183895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hazardous Events Detection in Automatic Train Doors Vicinity Using Deep Neural Networks
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663863
Olivier Laurendin, S. Ambellouis, A. Fleury, Ankur Mahtani, Sanaa Chafik, Clément Strauss
In the field of train transportation, personal injuries due to automatic train doors are still a common occurrence. This paper aims at implementing a computer vision solution, as part of a safety detection system, to identify hazardous events related to automatic doors in order to reduce their occurrence and severity. Deep anomaly detection algorithms are often applied to CCTV video feeds to identify such hazardous events. However, the anomalous events identified by those algorithms are often simpler than the most common occurrences in transport environments, hindering their widespread usage. Since such events are of quite a diverse nature and no dataset featuring them exists, we create a specifically tailored dataset composed of real-case scenarios of hazardous events near train doors. We then study an anomaly detection algorithm from the literature on this dataset and propose a set of modifications to better adapt it to our railway context and to subsequently ease its application to a wider range of use cases.
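The specific algorithm adapted from the literature is not named in the abstract. As a generic illustration of the reconstruction-based family of deep anomaly detectors commonly applied to CCTV feeds, the sketch below scores door-area frames by autoencoder reconstruction error; the architecture, input size, and thresholding rule are assumptions.

```python
import torch
import torch.nn as nn

class FrameAutoencoder(nn.Module):
    """Toy convolutional autoencoder; frames that reconstruct poorly are flagged as anomalous."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))

def anomaly_scores(model, frames):
    """Per-frame mean squared reconstruction error, used as the anomaly score."""
    model.eval()
    with torch.no_grad():
        recon = model(frames)
        return ((frames - recon) ** 2).mean(dim=(1, 2, 3))

# Example: score a batch of 8 grayscale 64x64 crops of the door area.
frames = torch.rand(8, 1, 64, 64)
model = FrameAutoencoder()
scores = anomaly_scores(model, frames)
print((scores > scores.mean() + 2 * scores.std()).nonzero())   # assumed simple outlier threshold
```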
{"title":"Hazardous Events Detection in Automatic Train Doors Vicinity Using Deep Neural Networks","authors":"Olivier Laurendin, S. Ambellouis, A. Fleury, Ankur Mahtani, Sanaa Chafik, Clément Strauss","doi":"10.1109/AVSS52988.2021.9663863","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663863","url":null,"abstract":"In the field of train transportation, personal injuries due to train automatic doors are still a common occurrence. This paper aims at implementing a computer vision solution as part of a safety detection system to identify automatic doors-related hazardous events to reduce their occurrence and their severity. Deep anomaly detection algorithms are often applied on CCTV video feeds to identify such hazardous events. However, the anomalous events identified by those algorithms are often simpler than most common occurrences in transport environments, hindering their widespread usage. Since such events are of quite a diverse nature and no dataset featuring them exist, we create a specilically-tailored dataset composed of real-case scenarios of hazardous events near train doors. We then study an anomaly detection algorithm from the literature on this dataset and propose a set of modifications to better adapt it to our railway context and to subsequently ease its application to a wider range of use-cases.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128147144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D4FLY Multimodal Biometric Database: multimodal fusion evaluation envisaging on-the-move biometric-based border control
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663737
Lulu Chen, Jonathan N. Boyle, A. Danelakis, J. Ferryman, Simone Ferstl, Damjan Gicic, A. Grudzien, André Howe, M. Kowalski, Krzysztof Mierzejewski, T. Theoharis
This work presents a novel multimodal biometric dataset with emerging biometric traits including 3D face, thermal face, iris on-the-move, iris mobile, somatotype and smartphone sensors. This dataset was created to resemble on-the-move characteristics in applications such as border control. The five types of biometric traits were selected as they can be captured while on-the-move, are contactless, and show potential for use in a multimodal fusion verification system in a border control scenario. Innovative sensor hardware was used in the data capture. The data featuring these biometric traits will be a valuable contribution to advancing biometric fusion research in general. Baseline evaluation was performed on each unimodal dataset. Multimodal fusion was evaluated based on various scenarios for comparison. Real-time performance is presented based on an Automated Border Control (ABC) scenario.
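The abstract does not state which fusion scheme was evaluated; the sketch below shows a weighted score-level fusion baseline of the kind often used as a reference in multimodal biometric verification. The modality names, weights, and acceptance threshold are illustrative assumptions, not the D4FLY protocol.

```python
def score_fusion(scores, weights=None):
    """Weighted-sum score-level fusion across modalities (an assumed baseline).

    scores: dict mapping modality name -> match score in [0, 1] against one enrollee.
    """
    if weights is None:
        weights = {m: 1.0 for m in scores}            # equal weights by default
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * s for m, s in scores.items()) / total

# Hypothetical probe scores for one traveler against one enrolled identity.
probe = {"3d_face": 0.91, "thermal_face": 0.80, "iris_otm": 0.95, "somatotype": 0.55}
fused = score_fusion(probe, weights={"3d_face": 2.0, "thermal_face": 1.0,
                                     "iris_otm": 3.0, "somatotype": 0.5})
print("accept" if fused > 0.8 else "reject", round(fused, 3))   # threshold is illustrative
```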
{"title":"D4FLY Multimodal Biometric Database: multimodal fusion evaluation envisaging on-the-move biometric-based border control","authors":"Lulu Chen, Jonathan N. Boyle, A. Danelakis, J. Ferryman, Simone Ferstl, Damjan Gicic, A. Grudzien, André Howe, M. Kowalski, Krzysztof Mierzejewski, T. Theoharis","doi":"10.1109/AVSS52988.2021.9663737","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663737","url":null,"abstract":"This work presents a novel multimodal biometric dataset with emerging biometric traits including 3D face, thermal face, iris on-the-move, iris mobile, somatotype and smartphone sensors. This dataset was created to resemble on-the-move characteristics in applications such as border control. The five types of biometric traits were selected as they can be captured while on-the-move, are contactless, and show potential for use in a multimodal fusion verification system in a border control scenario. Innovative sensor hardware was used in the data capture. The data featuring these biometric traits will be a valuable contribution to advancing biometric fusion research in general. Baseline evaluation was performed on each unimodal dataset. Multimodal fusion was evaluated based on various scenarios for comparison. Real-time performance is presented based on an Automated Border Control (ABC) scenario.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128270894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Video Analytic System for Rail Crossing Point Protection
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663781
Guangliang Zhao, Ashok Pandey, Ming-Ching Chang, Siwei Lyu
With the rise of AI and deep learning, video surveillance based on deep neural networks can provide real-time detection and tracking of vehicles and pedestrians. We present a video analytic system for monitoring railway crossings and providing security protection for rail intersections. Our system automatically determines the rail-crossing gate status via visual detection and analyzes traffic by detecting and tracking passing vehicles, thereby overseeing a set of rail-transportation-related safety events. Assuming a fixed camera view, each gate RoI is manually annotated once per site during system setup, after which the gate status is detected automatically. Vehicles are detected using YOLOv4, and multi-target tracking is performed using DeepSORT. Safety-related events, including trespassing, are continuously monitored using rule-based triggering. Experimental evaluation is performed on a YouTube rail crossing dataset as well as a private dataset. On the private dataset of 76 total minutes from 38 videos, our system successfully detects 56 of the 58 annotated events. On the public dataset of 14.21 hours of video, it detects 58 out of 62 events.
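As an illustration of the rule-based triggering layer described above, the sketch below flags a trespass-style event when a tracked vehicle box (e.g. from DeepSORT) overlaps the annotated gate RoI while the gate is reported closed. The RoI format, overlap threshold, and function names are assumptions; the detector and tracker outputs are taken as given.

```python
def overlap_with_roi(box, roi):
    """Fraction of a tracked box (x1, y1, x2, y2) that falls inside the annotated crossing RoI."""
    x1, y1 = max(box[0], roi[0]), max(box[1], roi[1])
    x2, y2 = min(box[2], roi[2]), min(box[3], roi[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = max(1e-6, (box[2] - box[0]) * (box[3] - box[1]))
    return inter / area

def trespass_events(tracks, gate_roi, gate_closed, threshold=0.3):
    """Rule-based trigger: any vehicle track overlapping the crossing RoI while the gate is closed.

    tracks: dict track_id -> (x1, y1, x2, y2) box for the current frame.
    gate_closed: bool gate status for the current frame, from the gate-status classifier.
    """
    if not gate_closed:
        return []
    return [tid for tid, box in tracks.items() if overlap_with_roi(box, gate_roi) > threshold]

# Example frame: one vehicle inside the crossing zone while the gate is down.
roi = (100, 200, 400, 300)
tracks = {7: (150, 210, 260, 290), 9: (500, 220, 620, 300)}
print(trespass_events(tracks, roi, gate_closed=True))   # -> [7]
```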
{"title":"A Video Analytic System for Rail Crossing Point Protection","authors":"Guangliang Zhao, Ashok Pandey, Ming-Ching Chang, Siwei Lyu","doi":"10.1109/AVSS52988.2021.9663781","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663781","url":null,"abstract":"With the rise of AI deep learning, video surveillance based on deep neural networks can provide real-time detection and tracking of vehicles and pedestrians. We present a video analytic system for monitoring railway crossing and providing security protection for rail intersections. Our system can automatically determine the rail-crossing gate status via visual detection and analyze traffic by detecting and tracking passing vehicles, thus to oversee a set of rail-transportation related safety events. Assuming a fixed camera view, each gate RoI can be manually annotated once for each site during system setup, and then gate status can be automatically detected afterwards. Vehicles are detected using YOLOv4 and multi-target tracking is performed using DeepSORT. Safety-related events including trespassing are continuously monitored using rule-based triggering. Experimental evaluation is performed on a Youtube rail crossing dataset as well as a private dataset. On the private dataset of 76 total minutes from 38 videos, our system can successfully detect all 56 events out of 58 annotated events. On the public dataset of 14.21 hrs of videos, it detects 58 out of 62 events.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130107814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Action Recognition with Domain Invariant Features of Skeleton Image
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663824
Han Chen, Yifan Jiang, Hanseok Ko
Due to the fast processing speed and robustness it can achieve, skeleton-based action recognition has recently received the attention of the computer vision community. Recent Convolutional Neural Network (CNN)-based methods, which use skeleton images as input to a CNN, have shown commendable performance in learning spatio-temporal representations of skeleton sequences. However, since these methods simply encode time and skeleton joints as rows and columns, respectively, latent correlations among all joints may be lost under 2D convolution. To address this problem, we propose a novel CNN-based method with adversarial training for action recognition. We introduce two-level domain adversarial learning to align the features of skeleton images from different view angles or subjects, respectively, thus further improving generalization. We evaluated our proposed method on NTU RGB+D. It achieves competitive results compared with state-of-the-art methods, with accuracy gains of 2.4% and 1.9% over the baseline for the cross-subject and cross-view protocols, respectively.
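Domain adversarial learning of the kind described above is commonly built around a gradient reversal layer feeding a domain classifier (here, view or subject labels). The PyTorch sketch below shows that generic construction; the feature dimension, layer sizes, and lambda value are placeholder assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambda on the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainAdversarialHead(nn.Module):
    """Domain classifier (e.g. camera view or subject group) trained through gradient reversal."""
    def __init__(self, feat_dim=256, n_domains=3, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.classifier = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                        nn.Linear(128, n_domains))

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lamb))

# Placeholder backbone features from skeleton images (batch of 16, 256-d).
feats = torch.randn(16, 256, requires_grad=True)
action_logits = nn.Linear(256, 60)(feats)                    # action recognition head
domain_logits = DomainAdversarialHead()(feats)               # adversarial view/subject head
loss = nn.CrossEntropyLoss()(action_logits, torch.randint(0, 60, (16,))) \
     + nn.CrossEntropyLoss()(domain_logits, torch.randint(0, 3, (16,)))
loss.backward()   # the domain branch pushes the backbone toward domain-invariant features
```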
{"title":"Action Recognition with Domain Invariant Features of Skeleton Image","authors":"Han Chen, Yifan Jiang, Hanseok Ko","doi":"10.1109/AVSS52988.2021.9663824","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663824","url":null,"abstract":"Due to the fast processing-speed and robustness it can achieve, skeleton-based action recognition has recently received the attention of the computer vision community. The recent Convolutional Neural Network (CNN)-based methods have shown commendable performance in learning spatio-temporal representations for skeleton sequence, which use skeleton image as input to a CNN. Since the CNN-based methods mainly encoding the temporal and skeleton joints simply as rows and columns, respectively, the latent correlation related to all joints may be lost caused by the 2D convolution. To solve this problem, we propose a novel CNN-based method with adversarial training for action recognition. We introduce a two-level domain adversarial learning to align the features of skeleton images from different view angles or subjects, respectively, thus further improve the generalization. We evaluated our proposed method on NTU RGB+D. It achieves competitive results compared with state-of-the-art methods and 2.4%, 1.9%accuracy gain than the baseline for cross-subject and cross-view.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130277943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Position-aware Location Regression Network for Temporal Video Grounding
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663815
Sunoh Kim, Kimin Yun, J. Choi
The key to successful grounding for video surveillance is to understand the semantic phrase corresponding to important actors and objects. Conventional methods either ignore comprehensive contexts for the phrase or require heavy computation for multiple phrases. To understand comprehensive contexts with only one semantic phrase, we propose the Position-aware Location Regression Network (PLRN), which exploits position-aware features of a query and a video. Specifically, PLRN first encodes both the video and the query using positional information of words and video segments. Then, a semantic phrase feature is extracted from the encoded query with attention. The semantic phrase feature and the encoded video are merged and turned into a context-aware feature by reflecting local and global contexts. Finally, PLRN predicts the start, end, center, and width values of the grounding boundary. Our experiments show that PLRN achieves competitive performance over existing methods with less computation time and memory.
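To make the network's output concrete, the sketch below shows a minimal regression head that maps a merged context-aware feature to normalized start, end, center, and width values, as the abstract describes. The layer sizes, the sigmoid parameterization, and the consistency check are assumptions, not PLRN's actual head.

```python
import torch
import torch.nn as nn

class BoundaryRegressionHead(nn.Module):
    """Predicts normalized (start, end, center, width) of the grounded moment
    from a context-aware feature vector (sizes are illustrative assumptions)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 4))

    def forward(self, context_feat):
        out = torch.sigmoid(self.mlp(context_feat))      # all four values as fractions of video length
        start, end, center, width = out.unbind(dim=-1)
        return start, end, center, width

head = BoundaryRegressionHead()
feat = torch.randn(2, 512)                               # merged query/video context feature
start, end, center, width = head(feat)
# One consistency check such a head invites: center should roughly equal (start + end) / 2.
consistency = ((start + end) / 2 - center).abs().mean()
print(start.shape, consistency.item())
```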
{"title":"Position-aware Location Regression Network for Temporal Video Grounding","authors":"Sunoh Kim, Kimin Yun, J. Choi","doi":"10.1109/AVSS52988.2021.9663815","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663815","url":null,"abstract":"The key to successful grounding for video surveillance is to understand a semantic phrase corresponding to important actors and objects. Conventional methods ignore comprehensive contexts for the phrase or require heavy computation for multiple phrases. To understand comprehensive contexts with only one semantic phrase, we propose Position-aware Location Regression Network (PLRN) which exploits position-aware features of a query and a video. Specifically, PLRN first encodes both the video and query using positional information of words and video segments. Then, a semantic phrase feature is extracted from an encoded query with attention. The semantic phrase feature and encoded video are merged and made into a context-aware feature by reflecting local and global contexts. Finally, PLRN predicts start, end, center, and width values of a grounding boundary. Our experiments show that PLRN achieves competitive performance over existing methods with less computation time and memory.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133519614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Splittable DNN-Based Object Detector for Edge-Cloud Collaborative Real-Time Video Inference
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663806
Joochan Lee, Yongwoo Kim, Sungtae Moon, J. Ko
While recent advances in deep neural networks (DNNs) have enabled remarkable performance on various computer vision tasks, it is challenging for edge devices to perform real-time inference of complex DNN models due to their stringent resource constraints. To enhance inference throughput, recent studies have proposed collaborative intelligence (CI), which splits DNN computation across edge and cloud platforms, mostly for simple tasks such as image classification. However, for general DNN-based object detectors with a branching architecture, CI is highly restricted because of a significant feature transmission overhead. To solve this issue, this paper proposes a splittable object detector that enables edge-cloud collaborative real-time video inference. The proposed architecture includes a feature reconstruction network that can generate the multiple features required for detection from a small-sized feature produced by the edge-side extractor. Asymmetric scaling of the feature extractor and reconstructor further reduces the transmitted feature size and edge inference latency while maintaining detection accuracy. Performance evaluation using YOLOv5 shows that the proposed model achieves 28 fps (2.45x and 1.56x higher than edge-only and cloud-only inference, respectively) on the NVIDIA Jetson TX2 platform in a WiFi environment.
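The sketch below illustrates the general edge-cloud split the abstract describes: a small edge-side extractor produces one compact feature map, which is serialized and sent to the cloud, where a reconstruction network regenerates the multi-scale features a detection head would consume. All layer shapes, the 16-bit transfer, and the module names are assumptions, not the paper's architecture.

```python
import io
import torch
import torch.nn as nn

class EdgeExtractor(nn.Module):
    """Edge-side stem: one small feature map keeps the transmitted payload low."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 32, 3, stride=4, padding=1), nn.ReLU(),
                                  nn.Conv2d(32, 16, 3, stride=2, padding=1))   # 16 channels, 1/8 resolution

    def forward(self, x):
        return self.stem(x)

class CloudReconstructor(nn.Module):
    """Cloud-side network that regenerates multi-scale features for the detection head."""
    def __init__(self):
        super().__init__()
        self.p3 = nn.Conv2d(16, 128, 3, padding=1)             # fine scale
        self.p4 = nn.Conv2d(16, 256, 3, stride=2, padding=1)   # coarser scale

    def forward(self, f):
        return self.p3(f), self.p4(f)

frame = torch.rand(1, 3, 640, 640)
feat = EdgeExtractor()(frame)                  # (1, 16, 80, 80): what the edge transmits

buf = io.BytesIO()
torch.save(feat.half(), buf)                   # assumed 16-bit transfer to cut bandwidth
print(f"payload ~{buf.tell() / 1024:.0f} KiB vs raw frame ~{frame.numel() * 4 / 1024:.0f} KiB")

received = torch.load(io.BytesIO(buf.getvalue())).float()   # cloud side deserializes the feature
p3, p4 = CloudReconstructor()(received)
print(p3.shape, p4.shape)                      # multi-scale features for the detection branches
```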
{"title":"A Splittable DNN-Based Object Detector for Edge-Cloud Collaborative Real-Time Video Inference","authors":"Joochan Lee, Yongwoo Kim, Sungtae Moon, J. Ko","doi":"10.1109/AVSS52988.2021.9663806","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663806","url":null,"abstract":"While recent advances in deep neural networks (DNNs) enabled remarkable performance on various computer vision tasks, it is challenging for edge devices to perform real-time inference of complex DNN models due to their stringent resource constraint. To enhance the inference throughput, recent studies proposed collaborative intelligence (CI) that splits DNN computation into edge and cloud platforms, mostly for simple tasks such as image classification. However, for general DNN-based object detectors with a branching architecture, CI is highly restricted because of a significant feature transmission overhead. To solve this issue, this paper proposes a splittable object detector that enables edge-cloud collaborative real-time video inference. The proposed architecture includes a feature reconstruction network that can generate multiple features required for detection using a small-sized feature from the edge-side extractor. Asymmetric scaling on the feature extractor and reconstructor further reduces the transmitted feature size and edge inference latency, while maintaining detection accuracy. The performance evaluation using Yolov5 shows that the proposed model achieves 28 fps (2.45X and 1.56X higher than edge-only and cloud-only inference, respectively), on the NVIDIA Jetson TX2 platform in WiFi environment.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116454069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multi-Stream Approach for Seizure Classification with Knowledge Distillation
Pub Date: 2021-11-16 DOI: 10.1109/AVSS52988.2021.9663770
Jen-Cheng Hou, A. McGonigal, F. Bartolomei, M. Thonnat
In this work, we propose a multi-stream approach with knowledge distillation to classify epileptic seizures and psychogenic non-epileptic seizures. The proposed framework utilizes multi-stream information from keypoints and appearance of both the body and the face. We treat the detected keypoints over time as a spatio-temporal graph and train it with an adaptive graph convolutional network to model the spatio-temporal dynamics throughout the seizure event. In addition, we regularize the keypoint features with complementary information from the appearance stream by imposing a knowledge distillation mechanism. We demonstrate the effectiveness of our approach by conducting experiments on real-world seizure videos. The experiments are conducted with both seizure-wise cross-validation and leave-one-subject-out validation; with the proposed model, the F1-score/accuracy is 0.89/0.87 for seizure-wise cross-validation and 0.75/0.72 for leave-one-subject-out validation.
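A minimal sketch of the knowledge distillation regularizer described above, assuming the common softened-softmax formulation: the keypoint (graph) stream is trained on ground-truth labels plus a KL term that pulls its predictions toward those of the appearance stream. The temperature, mixing weight, and two-class setup are illustrative choices, not the paper's values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Cross-entropy on ground truth plus KL between softened teacher/student predictions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Example: 2-class problem (epileptic vs. psychogenic non-epileptic), batch of 8 clips.
keypoint_logits = torch.randn(8, 2, requires_grad=True)    # student: keypoint/graph stream
appearance_logits = torch.randn(8, 2)                      # teacher: appearance stream (detached)
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(keypoint_logits, appearance_logits, labels)
loss.backward()
print(loss.item())
```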
{"title":"A Multi-Stream Approach for Seizure Classification with Knowledge Distillation","authors":"Jen-Cheng Hou, A. McGonigal, F. Bartolomei, M. Thonnat","doi":"10.1109/AVSS52988.2021.9663770","DOIUrl":"https://doi.org/10.1109/AVSS52988.2021.9663770","url":null,"abstract":"In this work, we propose a multi-stream approach with knowledge distillation to classify epileptic seizures and psychogenic non-epileptic seizures. The proposed framework utilizes multi-stream information from keypoints and appearance from both body and face. We take the detected keypoints through time as spatio-temporal graph and train it with an adaptive graph convolutional networks to model the spatio-temporal dynamics throughout the seizure event. Besides, we regularize the keypoint features with complementary information from the appearance stream by imposing a knowledge distillation mechanism. We demonstrate the effectiveness of our approach by conducting experiments on real-world seizure videos. The experiments are conducted by both seizure-wise cross validation and leave-one-subject-out validation, and with the proposed model, the performances of the F1-scorelaccuracy are 0.89/0.87 for seizure-wise cross validation, and 0.75/0.72 for leave-one-subject-out validation.","PeriodicalId":246327,"journal":{"name":"2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122869562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}