Reducing Viral Transmission through AI-based Crowd Monitoring and Social Distancing Analysis
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913843
Benjamin Fraser, Brendan Copp, Gurpreet Singh, Orhan Keyvan, Tongfei Bian, Valentin Sonntag, Yang Xing, Weisi Guo, A. Tsourdos
This paper explores multi-person pose estimation for reducing the risk of airborne pathogen transmission. The recent COVID-19 pandemic highlights these risks in a globally connected world. We developed several techniques that analyse CCTV inputs for crowd analysis. The framework utilises automated homography from pose feature positions to determine interpersonal distance. It also incorporates mask detection by using pose features in an image classification pipeline. A further model predicts the behaviour of each person from their estimated pose features. We combine the models to assess transmission risk based on recent scientific literature. A custom dashboard displays a risk density heat-map in real time. This system could improve public space management and reduce transmission in future pandemics. The system is context agnostic and has many applications for other crowd monitoring problems.
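The homography step can be pictured with a short sketch, assuming four image-to-ground reference correspondences and ankle keypoints from a pose estimator are already available; the point values below are made up for illustration and this is not the paper's implementation.

```python
import numpy as np
import cv2

# Four image points (pixels) and their known ground-plane positions (metres);
# illustrative values, not calibration data from the paper.
img_pts = np.float32([[420, 980], [1490, 960], [1280, 540], [610, 550]])
world_pts = np.float32([[0.0, 0.0], [6.0, 0.0], [6.0, 10.0], [0.0, 10.0]])
H = cv2.getPerspectiveTransform(img_pts, world_pts)

def ground_positions(ankle_pixels):
    """Project detected ankle keypoints (N x 2, pixels) onto the ground plane."""
    pts = np.float32(ankle_pixels).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

def pairwise_distances(ankle_pixels):
    """Interpersonal distances (metres) between all detected people."""
    pos = ground_positions(ankle_pixels)
    diff = pos[:, None, :] - pos[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Example: three people detected by the pose estimator.
print(pairwise_distances([[700, 900], [820, 870], [1300, 700]]))
```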
{"title":"Reducing Viral Transmission through AI-based Crowd Monitoring and Social Distancing Analysis","authors":"Benjamin Fraser, Brendan Copp, Gurpreet Singh, Orhan Keyvan, Tongfei Bian, Valentin Sonntag, Yang Xing, Weisi Guo, A. Tsourdos","doi":"10.1109/MFI55806.2022.9913843","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913843","url":null,"abstract":"This paper explores multi-person pose estimation for reducing the risk of airborne pathogens. The recent COVID-19 pandemic highlights these risks in a globally connected world. We developed several techniques which analyse CCTV inputs for crowd analysis. The framework utilised automated homography from pose feature positions to determine interpersonal distance. It also incorporates mask detection by using pose features for an image classification pipeline. A further model predicts the behaviour of each person by using their estimated pose features. We combine the models to assess transmission risk based on recent scientific literature. A custom dashboard displays a risk density heat-map in real time. This system could improve public space management and reduce transmission in future pandemics. This context agnostic system and has many applications for other crowd monitoring problems.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115955834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Harmonic Functions for Three-Dimensional Shape Estimation in Cylindrical Coordinates
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913858
Tim Baur, J. Reuter, Antonio Zea, U. Hanebeck
With the high resolution of modern sensors such as multilayer LiDARs, estimating the 3D shape in an extended object tracking procedure is possible. In recent years, 3D shapes have been estimated in spherical coordinates using Gaussian processes, spherical double Fourier series or spherical harmonics. However, observations have shown that in many scenarios only a few measurements are obtained from top or bottom surfaces, leading to error-prone estimates in spherical coordinates. Therefore, in this paper we propose to estimate the shape in cylindrical coordinates instead, applying harmonic functions. Specifically, we derive an expansion for 3D shapes in cylindrical coordinates by solving a boundary value problem for the Laplace equation. This shape representation is then integrated into a plain greedy association model and compared to shape estimation procedures in spherical coordinates. Since the shape representation is only integrated into a basic estimator, the results are preliminary and a detailed discussion of future work is presented at the end of the paper.
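For orientation, the textbook starting point for such an expansion is Laplace's equation written in cylindrical coordinates; the paper's specific boundary conditions and resulting shape representation may differ from this sketch:

$$\nabla^2 \Phi = \frac{1}{\rho}\frac{\partial}{\partial \rho}\!\left(\rho\,\frac{\partial \Phi}{\partial \rho}\right) + \frac{1}{\rho^2}\frac{\partial^2 \Phi}{\partial \varphi^2} + \frac{\partial^2 \Phi}{\partial z^2} = 0.$$

Separation of variables $\Phi(\rho,\varphi,z) = R(\rho)\,e^{im\varphi}\,Z(z)$ with $Z'' = k^2 Z$ reduces the radial part to Bessel's equation, $\rho^2 R'' + \rho R' + (k^2\rho^2 - m^2)R = 0$, whose regular solutions $J_m(k\rho)$ supply harmonic basis functions from which a shape expansion in cylindrical coordinates can be assembled.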
{"title":"Harmonic Functions for Three-Dimensional Shape Estimation in Cylindrical Coordinates","authors":"Tim Baur, J. Reuter, Antonio Zea, U. Hanebeck","doi":"10.1109/MFI55806.2022.9913858","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913858","url":null,"abstract":"With the high resolution of modern sensors such as multilayer LiDARs, estimating the 3D shape in an extended object tracking procedure is possible. In recent years, 3D shapes have been estimated in spherical coordinates using Gaussian processes, spherical double Fourier series or spherical harmonics. However, observations have shown that in many scenarios only a few measurements are obtained from top or bottom surfaces, leading to error-prone estimates in spherical coordinates. Therefore, in this paper we propose to estimate the shape in cylindrical coordinates instead, applying harmonic functions. Specifically, we derive an expansion for 3D shapes in cylindrical coordinates by solving a boundary value problem for the Laplace equation. This shape representation is then integrated in a plain greedy association model and compared to shape estimation procedures in spherical coordinates. Since the shape representation is only integrated in a basic estimator, the results are preliminary and a detailed discussion for future work is presented at the end of the paper.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133995905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision-enhanced GNSS-based environmental context detection for autonomous vehicle navigation
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913867
Florent Feriol, Yoko Watanabe, Damien Vivet
Context-adaptive navigation is currently considered one of the potential solutions for achieving more precise and robust positioning. The goal is to adapt the sensor parameters and the navigation filter structure so that they take into account context-dependent sensor performance, notably GNSS signal degradations. For that, reliable context detection is essential. This paper proposes a GNSS-based environmental context detector which classifies the environment surrounding a vehicle into four classes: canyon, open-sky, trees and urban. A support-vector machine classifier is trained on our database collected around Toulouse. We first show the classification results of a model based on GNSS data only, revealing its limitation in distinguishing tree and urban contexts. To address this issue, this paper proposes a vision-enhanced model that adds satellite visibility information from sky segmentation on fisheye camera images. Compared to the GNSS-only model, the proposed vision-enhanced model significantly improves the classification performance, raising the average F1-score from 78% to 86%.
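A minimal sketch of the classification stage, assuming hand-crafted GNSS and sky-visibility features are already extracted; the feature dimensionality and the placeholder data below are assumptions, not the paper's feature set.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

CLASSES = ["canyon", "open-sky", "trees", "urban"]

# Placeholder features; real inputs would be quantities such as C/N0 statistics,
# number of tracked satellites, dilution of precision, and the sky-visibility
# fraction from fisheye-image segmentation.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = rng.integers(0, len(CLASSES), size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_tr, y_tr)
print("macro F1:", f1_score(y_te, clf.predict(X_te), average="macro"))
```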
{"title":"Vision-enhanced GNSS-based environmental context detection for autonomous vehicle navigation","authors":"Florent Feriol, Yoko Watanabe, Damien Vivet","doi":"10.1109/MFI55806.2022.9913867","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913867","url":null,"abstract":"Context-adaptive navigation is currently considered as one of the potential solutions to achieve a more precise and robust positioning. The goal would be to adapt the sensor parameters and the navigation filter structure so that it takes into account the context-dependant sensor performance, notably GNSS signal degradations. For that, a reliable context detection is essential. This paper proposes a GNSS-based environmental context detector which classifies the environment surrounding a vehicle into four classes: canyon, open-sky, trees and urban. A support-vector machine classifier is trained on our database collected around Toulouse. We first show the classification results of a model based on GNSS data only, revealing its limitation to distinguish trees and urban contexts. For addressing this issue, this paper proposes the vision-enhanced model by adding satellite visibility information from sky segmentation on fisheye camera images. Compared to the GNSS-only model, the proposed vision-enhanced model significantly improved the classification performance and raised an average F1-score from 78% to 86%.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122849670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Autonomous Vehicle Navigation Parameters via Image and Image-and-Point Cloud Fusion-based End-to-End Methods
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913844
Semih Beycimen, Dmitry I. Ignatyev, A. Zolotas
This paper presents a study of end-to-end methods for predicting autonomous vehicle navigation parameters. Image-based and image-and-LiDAR-point-based end-to-end models have been trained using the Nvidia learning architecture as well as Densenet-169, Resnet-152 and Inception-v4. Various learning parameters for autonomous vehicle navigation, input models and data pre-processing algorithms (i.e. image cropping, noise removal and semantic segmentation for image data) have been investigated and tested. The best-performing choices from this investigation are selected for the main framework of the study. Results reveal that the image-and-LiDAR-point-based method trained with the Nvidia architecture offers the best accuracy for steering angle and speed.
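As a rough illustration of an image-based end-to-end regressor of this kind, the sketch below follows the well-known Nvidia PilotNet layout in PyTorch; it is not the paper's architecture, and the image-and-LiDAR fusion variant is not shown.

```python
import torch
import torch.nn as nn

class PilotNetStyle(nn.Module):
    """PilotNet-like regressor mapping a camera crop to [steering angle, speed]."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ELU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ELU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ELU(),
            nn.Conv2d(48, 64, 3), nn.ELU(),
            nn.Conv2d(64, 64, 3), nn.ELU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 1 * 18, 100), nn.ELU(),  # flattened size for a 66x200 input crop
            nn.Linear(100, 50), nn.ELU(),
            nn.Linear(50, 2),                       # [steering angle, speed]
        )

    def forward(self, x):            # x: (batch, 3, 66, 200)
        return self.head(self.features(x))

model = PilotNetStyle()
out = model(torch.randn(1, 3, 66, 200))   # -> tensor of shape (1, 2)
```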
{"title":"Predicting Autonomous Vehicle Navigation Parameters via Image and Image-and-Point Cloud Fusion-based End-to-End Methods","authors":"Semih Beycimen, Dmitry I. Ignatyev, A. Zolotas","doi":"10.1109/MFI55806.2022.9913844","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913844","url":null,"abstract":"This paper presents a study of end-to-end methods for predicting autonomous vehicle navigation parameters. Image-based and Image & Lidar points-based end-to-end models have been trained under Nvidia learning architectures as well as Densenet-169, Resnet-152 and Inception-v4. Various learning parameters for autonomous vehicle navigation, input models and pre-processing data algorithms i.e. image cropping, noise removing, semantic segmentation for image data have been investigated and tested. The best ones, from the rigorous investigation, are selected for the main framework of the study. Results reveal that the Nvidia architecture trained Image & Lidar points-based method offers the better results accuracy rate-wise for steering angle and speed.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114716226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Statistically Motivated Likelihood for Track-Before-Detect
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913853
John Daniel Bossér, Gustaf Hendeby, M. L. Nordenvaad, I. Skog
A theoretically sound likelihood function for passive sonar surveillance using a hydrophone array is presented. The likelihood is derived from first principles along with the assumption that the source signal can be approximated as white Gaussian noise within the considered frequency band. The resulting likelihood is a nonlinear function of the delay-and-sum beamformer response and the signal-to-noise ratio (SNR). The proposed likelihood function is evaluated by using it in a Bernoulli filter based track-before-detect (TkBD) framework. As a reference, the same TkBD framework, but with another beamforming-response-based likelihood, is used. Results from Monte-Carlo simulations of two bearings-only tracking scenarios are presented. The results show that the TkBD framework with the proposed likelihood yields approximately 10 seconds faster target detection for a target at an SNR of -27 dB, and a lower bearing tracking error. Compared to a classical detect-and-track target tracker, the TkBD framework with the proposed likelihood yields a 4 dB to 5 dB detection gain.
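The delay-and-sum beamformer response that the likelihood builds on can be sketched as follows for a uniform linear array at a single frequency; the data are placeholders and the paper's likelihood itself is not reproduced here.

```python
import numpy as np

def delay_and_sum_power(snapshots, freq, spacing, c, angles):
    """Conventional beam power over candidate bearings.

    snapshots: (n_sensors, n_snapshots) complex narrowband samples
    freq: signal frequency [Hz], spacing: element spacing [m], c: sound speed [m/s]
    angles: candidate bearings [rad]
    """
    n = snapshots.shape[0]
    k = 2.0 * np.pi * freq / c
    # Steering matrix: one steering vector per candidate bearing.
    idx = np.arange(n)[:, None]
    A = np.exp(-1j * k * spacing * idx * np.sin(angles)[None, :])
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]   # sample covariance
    return np.real(np.einsum("na,nm,ma->a", A.conj(), R, A)) / n**2

angles = np.linspace(-np.pi / 2, np.pi / 2, 181)
x = np.random.randn(16, 200) + 1j * np.random.randn(16, 200)  # placeholder array data
power = delay_and_sum_power(x, freq=500.0, spacing=1.5, c=1500.0, angles=angles)
print("peak bearing [rad]:", angles[np.argmax(power)])
```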
{"title":"A Statistically Motivated Likelihood for Track-Before-Detect","authors":"John Daniel Bossér, Gustaf Hendeby, M. L. Nordenvaad, I. Skog","doi":"10.1109/MFI55806.2022.9913853","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913853","url":null,"abstract":"A theoretically sound likelihood function for passive sonar surveillance using a hydrophone array is presented. The likelihood is derived from first order principles along with the assumption that the source signal can be approximated as white Gaussian noise within the considered frequency band. The resulting likelihood is a nonlinear function of the delay-and-sum beamformer response and signal-to-noise ratio (SNR).Evaluation of the proposed likelihood function is done by using it in a Bernoulli filter based track-before-detect (TkBD) framework. As a reference, the same TkBD framework, but with another beamforming response based likelihood, is used. Results from Monte-Carlo simulations of two bearings-only tracking scenarios are presented. The results show that the TkBD framework with the proposed likelihood yields an approx. 10 seconds faster target detection for a target at an SNR of -27 dB, and a lower bearing tracking error. Compared to a classical detect-and-track target tracker, the TkBD framework with the proposed likelihood yields 4 dB to 5 dB detection gain.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122586815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global-local Feature Aggregation for Event-based Object Detection on EventKITTI
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913852
Zichen Liang, Hu Cao, Chu Yang, Zikai Zhang, G. Chen
Event sequences convey asynchronous pixel-wise visual information with low power consumption and high temporal resolution, which enables more robust perception under challenging conditions, e.g., fast motion. Two main factors limit the development of event-based object detection in traffic scenes: the lack of high-quality datasets and of effective event-based algorithms. To solve the first problem, we propose a simulated event-based detection dataset named EventKITTI, which incorporates the novel event modality into a mixed two-level (i.e. object-level and video-level) detection dataset under traffic scenarios. EventKITTI provides high-quality event streams and the largest number of categories, at microsecond temporal resolution and 1242×375 spatial resolution, exceeding existing datasets. As for the second problem, existing algorithms rely on CNN-based, spiking or graph architectures to capture local features of moving objects, leading to poor performance on objects with incomplete contours. Hence, we propose event-based object detectors named GFA-Net and CGFA-Net. To enhance the global-local learning ability in the spatial dimension, GFA-Net introduces a transformer with edge-based position encoding and multi-scale feature fusion to detect objects on static frames. Furthermore, CGFA-Net optimizes the edge-based position encoding with closed-loop learning based on the previously detected heatmap, which aggregates temporal global features across event frames. The proposed event-based object detectors achieve the best speed-accuracy trade-off on EventKITTI, reaching 81.3% mAP at 33.0 FPS on the object-level detection dataset and 64.5% mAP at 30.3 FPS on the video-level detection dataset.
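A common first step for frame-based detectors on event data is to accumulate the asynchronous events into a fixed-size event frame; a minimal sketch is shown below. The encoding actually used by EventKITTI and the GFA-Net/CGFA-Net models themselves are not reproduced.

```python
import numpy as np

def events_to_frame(x, y, polarity, height=375, width=1242):
    """Accumulate events into a 2-channel frame (positive / negative polarity)."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    pos = polarity > 0
    np.add.at(frame[0], (y[pos], x[pos]), 1.0)    # positive-polarity counts
    np.add.at(frame[1], (y[~pos], x[~pos]), 1.0)  # negative-polarity counts
    return frame

# Example: a handful of synthetic events (x, y, polarity).
x = np.array([10, 11, 12, 500])
y = np.array([20, 20, 21, 300])
p = np.array([1, -1, 1, 1])
frame = events_to_frame(x, y, p)
print(frame.shape)   # (2, 375, 1242)
```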
{"title":"Global-local Feature Aggregation for Event-based Object Detection on EventKITTI","authors":"Zichen Liang, Hu Cao, Chu Yang, Zikai Zhang, G. Chen","doi":"10.1109/MFI55806.2022.9913852","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913852","url":null,"abstract":"Event sequence conveys asynchronous pixel-wise visual information in a low power and high temporal resolution manner, which enables more robust perception under challenging conditions, e.g., fast motion. Two main factors limit the development of event-based object detection in traffic scenes: lack of high-quality datasets and effective event-based algorithms. To solve the first problem, we propose a simulated event-based detection dataset named EventKITTI, which incorporates the novel event modality information into a mixed two-level (i.e. object-level and video-level) detection dataset under traffic scenarios. EventKITTI possesses the high-quality event stream and the largest number of categories at microsecond temporal resolution and 1242×375 spatial resolution, exceeding existing datasets. As for the second problem, existing algorithms rely on CNN-based, spiking or graph architectures to capture local features of moving objects, leading to poor performance in objects with incomplete contours. Hence, we propose event-based object detectors named GFA-Net and CGFA-Net. To enhance the global-local learning ability in the spatial dimension, GFA-Net introduces transformer with edge-based position encoding and multi-scale feature fusion to detect objects on static frame. Furthermore, CGFA-Net optimizes edge-based position encoding with close-loop learning based on previous detected heatmap, which aggregates temporal global features across event frames. The proposed event-based object detectors achieve the best speed-accuracy trade-off on EventKITTI, approaching an 81.3% MAP at 33.0 FPS on object-level detection dataset and a 64.5% MAP at 30.3 FPS on video-level detection dataset.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114185703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Noninformative Prior Weights for Dirichlet PDFs
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913864
A. Jøsang, Jinny Cho, Feng Chen
The noninformative prior weight W of a Dirichlet PDF (Probability Density Function) determines the balance between the prior probability and the influence of new observations on the posterior probability distribution. In this work, we propose a method for dynamically converging the weight W in a way that satisfies two constraints. The first constraint is that the prior Dirichlet PDF (i.e. in the absence of evidence) must always be uniform, which dictates that W = k where k is the cardinality of the domain. The second constraint is that the prior weight of large domains must not be so heavy that it prevents new observation evidence from having the expected influence over the shape of the Dirichlet PDF, which dictates that W quickly converges to a low constant CW in the presence of observation evidence, where typically CW = 2. In the case of a binary domain, the noninformative prior weight is normally set to W = 2, irrespective of the amount of evidence. In the case of a multidimensional domain with arbitrarily large cardinality k, the noninformative prior weight is initially equal to the domain cardinality k, but rapidly decreases to the constant convergence factor CW as the amount of evidence increases.
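The role of W can be illustrated with the standard Dirichlet expected-probability formula E[p_i] = (r_i + a_i W)/(Σ_j r_j + W) with uniform base rates a_i = 1/k. The exponential convergence schedule in the sketch below is a hypothetical illustration only, not the convergence function defined in the paper.

```python
import numpy as np

def prior_weight(k, total_evidence, cw=2.0, rate=1.0):
    """W = k with no evidence, rapidly approaching the constant cw as evidence grows.
    The exponential schedule is an illustrative assumption."""
    return cw + (k - cw) * np.exp(-rate * total_evidence)

def expected_probabilities(r, cw=2.0):
    """E[p_i] = (r_i + a_i * W) / (sum(r) + W), with uniform base rates a_i = 1/k."""
    r = np.asarray(r, dtype=float)
    k = r.size
    W = prior_weight(k, r.sum(), cw)
    a = np.full(k, 1.0 / k)
    return (r + a * W) / (r.sum() + W)

print(expected_probabilities([0, 0, 0, 0, 0]))   # no evidence: uniform prior, W = k = 5
print(expected_probabilities([8, 1, 0, 0, 0]))   # with evidence: W has converged close to 2
```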
{"title":"Noninformative Prior Weights for Dirichlet PDFs*","authors":"A. Jøsang, Jinny Cho, Feng Chen","doi":"10.1109/MFI55806.2022.9913864","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913864","url":null,"abstract":"The noninformative prior weight W of a Dirichlet PDF (Probability Density Function) determines the balance between the prior probability and the influence of new observations on the posterior probability distribution. In this work, we propose a method for dynamically converging the weight W in a way that satisfies two constraints. The first constraint is that the prior Dirichlet PDF (i.e. in the absence of evidence) must always be uniform, which dictates that W = k where k is the cardinality of the domain. The second constraint is that the prior weight of large domains must not be so heavy that it prevents new observation evidence from having the expected influence over the shape of the Dirichlet PDF, which dictates that W quickly converges to a low constant CW in the presence of observation evidence, where typically CW = 2. In the case of a binary domain, the noninformative prior weight is normally set to W = 2, irrespective of the amount of evidence. In the case of a multidimensional domain with arbitrarily large cardinality k, the noninformative prior weight is initially equal to the domain cardinality k, but rapidly decreases to the constant convergence factor CW as the amount of evidence increases.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130835556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Context Extraction from GIS Data Using LiDAR and Camera Features
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913849
Juan D. González, Hans-Joachim Wünsche
We propose a method to extract the spatial context of unknown objects in a driving scenario by classifying the surfaces on which traffic participants transit. In order to classify these surfaces without the need for a large amount of labeled data, we resort to an unsupervised learning method that clusters patches of terrain using features extracted from LiDAR and image data. Using an iterative method, we are able to model the characteristics of map features from a geographical information system (GIS), such as streets and sidewalks, and extend their contextual meaning to the area around our test vehicle. We evaluate our results using a partially labeled 3D scan of our campus and find that our method correctly extracts and extends the spatial context of the GIS map features to the labeled surfaces on the campus.
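The unsupervised clustering step can be pictured with a small sketch, assuming per-patch LiDAR and image descriptors have already been computed; the feature description and placeholder data are illustrative assumptions, not the paper's descriptors.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical per-patch descriptors (e.g. mean height, height variance, LiDAR
# intensity, roughness, mean RGB), replaced here by placeholder data so the sketch runs.
rng = np.random.default_rng(0)
patch_features = rng.normal(size=(500, 8))

X = StandardScaler().fit_transform(patch_features)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Clusters could then be matched against GIS map features (e.g. street, sidewalk)
# observed near the test vehicle to propagate their semantic label to similar patches.
print(np.bincount(labels))
```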
{"title":"Context Extraction from GIS Data Using LiDAR and Camera Features","authors":"Juan D. González, Hans-Joachim Wünsche","doi":"10.1109/MFI55806.2022.9913849","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913849","url":null,"abstract":"We propose a method to extract spatial context of unknown objects in a driving scenario by classifying the surfaces in which the traffic participants transit. In order to classify these surfaces without the need for a big amount of labeled data, we resort to an unsupervised learning method that clusters patches of terrain using features extracted from LiDAR and image data. Using an iterative method, we are able to model the characteristics of map features from a geographical information system (GIS), such as streets and sidewalks, and extend their contextual meaning to the area around our test vehicle. We evaluate our results using a partially labeled 3D scan of our campus and find that our method is able to correctly extract and extend the spatial context of the map features from the GIS to the labeled surfaces on the campus.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128077425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Benchmark for Vision-based Multi-UAV Multi-object Tracking
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913874
Hao Shen, Xiwen Yang, D. Lin, Jianduo Chai, Jiakai Huo, Xiaofeng Xing, Shaoming He
Vision-based multi-sensor multi-object tracking is a fundamental task in applications of swarms of Unmanned Aerial Vehicles (UAVs). Benchmark datasets are critical to the development of computer vision research since they provide a fair and principled way to evaluate various approaches and promote the improvement of the corresponding algorithms. In recent years, many benchmarks have been created for single-camera single-object tracking, single-camera multi-object detection, and single-camera multi-object tracking scenarios. However, to the best of our knowledge, few benchmarks for multi-camera multi-object tracking have been provided. In this paper, we build a dataset for multi-UAV multi-object tracking tasks to fill this gap. Several cameras are placed in a VICON motion capture system to simulate the UAV team, and several toy cars are employed to represent ground targets. The first-person videos from the cameras, the motion states of the cameras, and the ground truth of the objects are recorded. We also propose a metric to evaluate the performance of the multi-UAV multi-object tracking task. The dataset and the code for algorithm evaluation are available at our GitHub (https://github.com/bitshenwenxiao/MUMO).
{"title":"A Benchmark for Vision-based Multi-UAV Multi-object Tracking","authors":"Hao Shen, Xiwen Yang, D. Lin, Jianduo Chai, Jiakai Huo, Xiaofeng Xing, Shaoming He","doi":"10.1109/MFI55806.2022.9913874","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913874","url":null,"abstract":"Vision-based multi-sensor multi-object tracking is a fundamental task in the applications of a swarm of Unmanned Aerial Vehicles (UAVs). The benchmark datasets are critical to the development of computer vision research since they can provide a fair and principled way to evaluate various approaches and promote the improvement of corresponding algorithms. In recent years, many benchmarks have been created for single-camera single-object tracking, single-camera multi-object detection, and single-camera multi-object tracking scenarios. However, up to the best of our knowledge, few benchmarks of multi-camera multi-object tracking have been provided. In this paper, we build a dataset for multi-UAV multi-object tracking tasks to fill the gap. Several cameras are placed in the VICON motion capture system to simulate the UAV team, and several toy cars are employed to represent ground targets. The first-perspective videos from the cameras, the motion states of the cameras, and the ground truth of the objects are recorded. We also propose a metric to evaluate the performance of the multi-UAV multi-object tracking task. The dataset and the code for algorithm evaluation are available at our GitHub (https://github.com/bitshenwenxiao/MUMO).","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131445921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision-based Fall Detection in Aircraft Maintenance Environment with Pose Estimation
Pub Date: 2022-09-20 | DOI: 10.1109/MFI55806.2022.9913877
Adeyemi Osigbesan, Solene Barrat, Harkeerat Singh, Dongzi Xia, Siddharth Singh, Yang Xing, Weisi Guo, A. Tsourdos
Fall-related injuries at the workplace account for a considerable percentage of global accident-at-work claims according to the Health and Safety Executive (HSE). With a significant share of these being fatal, industrial and maintenance workshops carry a high potential for injuries associated with slips, trips, and other types of falls, owing to their characteristically fast-paced workspaces. Typically, the short turnaround time expected for aircraft undergoing maintenance increases the risk of workers falling, and thus makes a good case for studying more contemporary methods for detecting work-related falls in the aircraft maintenance environment. Advances in human pose estimation using computer vision technology have made it possible to automate real-time detection and classification of human actions by analyzing body part motion and position over time. This paper attempts to combine the analysis of the body silhouette bounding box with body joint position estimation to detect and categorize, in real time, human motion captured in continuous video feeds as fall or non-fall events. We propose a standard wide-angle camera, installed at a diagonal ceiling position in an aircraft hangar, as our visual data input, and a three-dimensional convolutional neural network with Long Short-Term Memory (LSTM) layers using a technique we refer to as Region Key point (Reg-Key) repartitioning for visual pose estimation and fall detection.
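As a simplified illustration of classifying pose sequences into fall and non-fall events, the sketch below defines a small LSTM over per-frame keypoint and bounding-box features; it is not the paper's 3D-CNN/LSTM Reg-Key architecture, and the feature layout is an assumption.

```python
import torch
import torch.nn as nn

class FallClassifier(nn.Module):
    """Sequence classifier over per-frame pose and bounding-box features."""
    def __init__(self, feat_dim=38, hidden=64):
        super().__init__()
        # Assumed layout: 17 keypoints x (x, y) + bbox (cx, cy, aspect ratio, area) = 38.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # fall / non-fall

    def forward(self, seq):                # seq: (batch, n_frames, feat_dim)
        out, _ = self.lstm(seq)
        return self.head(out[:, -1])       # classify from the last time step

model = FallClassifier()
logits = model(torch.randn(4, 30, 38))     # 4 clips of 30 frames each -> (4, 2)
```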
{"title":"Vision-based Fall Detection in Aircraft Maintenance Environment with Pose Estimation","authors":"Adeyemi Osigbesan, Solene Barrat, Harkeerat Singh, Dongzi Xia, Siddharth Singh, Yang Xing, Weisi Guo, A. Tsourdos","doi":"10.1109/MFI55806.2022.9913877","DOIUrl":"https://doi.org/10.1109/MFI55806.2022.9913877","url":null,"abstract":"Fall-related injuries at the workplace account for a fair percentage of the global accident at work claims according to Health and Safety Executive (HSE). With a significant percentage of these being fatal, industrial and maintenance workshops have great potential for injuries that can be associated with slips, trips, and other types of falls, owing to their characteristic fast-paced workspaces. Typically, the short turnaround time expected for aircraft undergoing maintenance increases the risk of workers falling, and thus makes a good case for the study of more contemporary methods for the detection of work-related falls in the aircraft maintenance environment. Advanced development in human pose estimation using computer vision technology has made it possible to automate real-time detection and classification of human actions by analyzing body part motion and position relative to time. This paper attempts to combine the analysis of body silhouette bounding box with body joint position estimation to detect and categorize in real-time, human motion captured in continuous video feeds into a fall or a non-fall event. We proposed a standard wide-angle camera, installed at a diagonal ceiling position in an aircraft hangar for our visual data input, and a three-dimensional convolutional neural network with Long Short-Term Memory (LSTM) layers using a technique we referred to as Region Key point (Reg-Key) repartitioning for visual pose estimation and fall detection.","PeriodicalId":344737,"journal":{"name":"2022 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI)","volume":"178 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124423667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}