Feature Similarity and its Correlation with Accuracy in Knowledge Distillation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034621
Knowledge distillation (KD) has emerged as a popular model compression technique to transfer knowledge from a larger, more performant teacher network to a more compact student network to improve its accuracy. Depending on the type of knowledge being transferred, KD can be categorised as response-based, feature-based or similarity-based distillation [5], [26]. Inspired by Bucilua et al. [3], KD was originally proposed by Hinton et al. [8] as a response-based distillation technique for image classification, which transferred the so-called "dark knowledge" by "softening" the teacher's prediction vector before distilling it to the student.
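For readers unfamiliar with the mechanics, a minimal PyTorch sketch of Hinton-style response-based distillation follows: both prediction vectors are softened with a temperature T before being matched, and the soft term is blended with the usual cross-entropy. The temperature and blending weight shown are illustrative defaults, not values from this paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Response-based KD: soften both prediction vectors with temperature T,
    match them with KL divergence, and blend with the usual cross-entropy."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescales gradients so the soft term stays comparable across temperatures
    kd = F.kl_div(log_soft_student, soft_targets, reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce
```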
{"title":"Feature Similarity and its Correlation with Accuracy in Knowledge Distillation","authors":"","doi":"10.1109/DICTA56598.2022.10034621","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034621","url":null,"abstract":"Knowledge distillation (KD) has emerged as a popular model compression technique to transfer knowledge from a larger, more performante teacher network to a more compact student network to improve its accuracy. Depending on the type of knowledge being transferred, KD can be categorised as follows: response-based, feature-based or similarity-based distillation [5], [26]. Inspired by Bucilua et al. [3], KD was originally proposed by Hinton et al. [8] as a response-based distillation technique which transferred the so-called “dark knowledge” by “softening” the teacher's prediction vector before distilling it to the student in image classification.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131637885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Machine Vision Approach for Slipper Lobster Weight Estimation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034627
Computer vision techniques have been successfully applied across a large number of industries for a variety of purposes. In this work, we extend the capabilities of computer vision to slipper lobster weight estimation. Our proposed method combines machine learning and traditional computer vision techniques to first detect slipper lobsters and their eyes. An algorithm is then developed to determine which eyes belong to which slipper lobster and to estimate the weight from the distance between the eyes. The proposed method correctly identifies 86% of lobster eye pairs and estimates weight with a mean error of 4.78 g. Our weight estimation method achieves high accuracy and has the potential to be implemented within aquaculture operations in the future.
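The abstract does not state the regression model used to map inter-eye distance to weight; the sketch below assumes a power-law (allometric) fit, a common choice for length-weight relationships in crustaceans, with hypothetical calibration data.

```python
import numpy as np

def fit_weight_model(eye_distances_mm, weights_g):
    """Fit weight = a * distance^b by linear regression in log-log space.
    The power-law form is an assumption; the abstract does not state the model."""
    b, log_a = np.polyfit(np.log(eye_distances_mm), np.log(weights_g), 1)
    return np.exp(log_a), b

def predict_weight(distance_mm, a, b):
    return a * distance_mm ** b

# hypothetical calibration data (not from the paper)
d = np.array([18.0, 21.5, 25.0, 28.0, 31.5])
w = np.array([95.0, 160.0, 250.0, 340.0, 480.0])
a, b = fit_weight_model(d, w)
print(predict_weight(24.0, a, b))
```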
{"title":"Machine Vision Approach for Slipper Lobster Weight Estimation","authors":"","doi":"10.1109/DICTA56598.2022.10034627","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034627","url":null,"abstract":"Computer vision techniques have been successfully applied across a large number of industries for a variety of purposes. In this work we extend the capabilities of computer vision to slipper lobster weight estimation. Our proposed method combines machine learning and traditional computer vision techniques to first detect slipper lobsters and their eyes. An algorithm to determine which eyes belong to which slipper lobster and estimate the weight from the distance between the eyes is then developed. The proposed method correctly identifies 86% of lobster eye pairs and estimates weight with a mean error of 4.78g. Our weight estimation method achieves high accuracy and has the potential to be implemented within aquaculture operations in the future.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131692978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust Knowledge Adaptation for Federated Unsupervised Person ReID
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034631
Jianfeng Weng, Kun Hu, Tingting Yao, Jingya Wang, Zhiyong Wang
Person Re-identification (ReID) has been extensively studied in recent years due to the increasing demand for public security. However, collecting and handling sensitive personal data raises privacy concerns. Therefore, federated learning, which aims to share minimal sensitive data between different parties (clients), has been explored for Person ReID. However, existing federated learning-based person ReID methods generally rely on laborious and time-consuming data annotation, and it is difficult to guarantee cross-domain consistency. Thus, in this work, a federated unsupervised cluster-contrastive (FedUCC) learning method is proposed for Person ReID. FedUCC introduces a three-stage modelling strategy that proceeds in a coarse-to-fine manner: generic knowledge, specialized knowledge and patch knowledge are discovered using a deep neural network. This enables the sharing of mutual knowledge among clients while retaining local domain-specific knowledge, depending on the type of network layer and its parameters. Comprehensive experiments on 8 public benchmark datasets demonstrate the state-of-the-art performance of our proposed method.
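The abstract does not detail which layers FedUCC shares; the following is a loose FedAvg-style sketch of the general idea of aggregating only a "shared" subset of parameters across clients while leaving domain-specific ones local. The is_shared rule below is purely illustrative.

```python
import copy
import torch

def is_shared(name):
    # Illustrative rule only: e.g. share backbone weights, keep
    # normalisation statistics local to each client's domain.
    return not ("bn" in name or "norm" in name)

def federated_average(client_models):
    """FedAvg-style aggregation over the shared subset of parameters only."""
    global_state = copy.deepcopy(client_models[0].state_dict())
    for name in global_state:
        if is_shared(name) and global_state[name].is_floating_point():
            stacked = torch.stack([m.state_dict()[name].float() for m in client_models])
            global_state[name] = stacked.mean(dim=0)
    for m in client_models:
        local = m.state_dict()
        for name in global_state:
            if is_shared(name) and global_state[name].is_floating_point():
                local[name] = global_state[name].clone()
        m.load_state_dict(local)
```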
{"title":"Robust Knowledge Adaptation for Federated Unsupervised Person ReID","authors":"Jianfeng Weng, Kun Hu, Tingting Yao, Jingya Wang, Zhiyong Wang","doi":"10.1109/DICTA56598.2022.10034631","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034631","url":null,"abstract":"Person Re-identification (ReID) has been extensively studied in recent years due to the increasing demand in public security. However, collecting and dealing with sensitive personal data raises privacy concerns. Therefore, federated learning has been explored for Person ReID, which aims to share minimal sensitive data between different parties (clients). However, existing federated learning based person ReID methods generally rely on laborious and time-consuming data annotations and it is difficult to guarantee cross-domain consistency. Thus, in this work, a federated unsupervised cluster-contrastive (FedUCC) learning method is proposed for Person ReID. FedUCC introduces a three-stage modelling strategy following a coarse-to-fine manner. In detail, generic knowledge, specialized knowledge and patch knowledge are discovered using a deep neural network. This enables the sharing of mutual knowledge among clients while retaining local domain-specific knowledge based on the kinds of network layers and their parameters. Comprehensive experiments on 8 public benchmark datasets demonstrate the state-of-the-art performance of our proposed method.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"108 14","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120850709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
End-to-End Traffic Sign Damage Assessment
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034587
Traffic sign damage monitoring is a practical issue facing large operations all over the world. Despite the scale of traffic sign damage and its consequent impact on public safety, damage audits are still performed manually. By automating components of damage assessment, we can greatly improve the effectiveness and efficiency of the process and, in doing so, alleviate its negative impact on traffic safety. In this paper, traffic sign damage assessment is explored as a computer vision problem approached with deep learning. We specifically focus on occlusion-type damage that hinders sign legibility. This paper makes several contributions. Firstly, it provides a comprehensive survey of related work on this problem. Secondly, it extends the generation of synthetic images for such a study. Most importantly, it proposes an extension of the EfficientDet object detection framework to address the challenge. It is shown that synthetic images can be successfully used to train an object detector variant to assess the level of damage in traffic signs, measured on a scale from 0.0 to 1.0. The extended framework achieves a damage assessment root mean squared error (RMSE) of 0.087 on a synthetic test set while maintaining its object detection capabilities.
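As a schematic of how a detector can be extended for this task (not the authors' exact EfficientDet modification), a per-detection regression head with a sigmoid output maps pooled box features to a damage score in [0, 1], trained with an L2-style loss that matches the RMSE metric reported above.

```python
import torch
import torch.nn as nn

class DamageHead(nn.Module):
    """Per-detection damage regressor producing a score in [0, 1].
    Schematic only: the paper extends EfficientDet; this simply shows
    a sigmoid regression head on top of pooled box features."""
    def __init__(self, in_features=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_features, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 1),
        )

    def forward(self, box_features):          # (num_boxes, in_features)
        return torch.sigmoid(self.mlp(box_features)).squeeze(-1)

# training target: ground-truth damage fraction in [0, 1]
criterion = nn.MSELoss()
```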
{"title":"End-to-End Traffic Sign Damage Assessment","authors":"","doi":"10.1109/DICTA56598.2022.10034587","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034587","url":null,"abstract":"Traffic sign damage monitoring is a practical issue facing large operations all over the world. Despite the scale of traffic sign damage and its consequent impact on public safety, damage audits are performed manually. By automating components of damage assessment we can greatly improve the effectiveness and efficiency of the process and in doing so alleviate its negative impact on traffic safety. In this paper, traffic sign damage assessment is explored as a computer vision problem approached with deep learning. We specifically focus on occlusion-type damages that hinder sign legibility. This paper makes several contributions. Firstly, it provides a comprehensive survey of related work on this problem. Secondly, it provides an extension to the generation of synthetic images for such a study. Most importantly, it proposes an extension of the EfficientDet object detection framework to address the challenge. It is shown that synthetic images can be successfully used to train an object detector variant to assess the level of damage, as measured between 0.0 and 1.0, in traffic signs. The extended framework achieves a damage assessment root mean squared error (RMSE) of 0.087 on a synthetic test set while maintaining its object detection capabilities.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116274582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ISAR Ship Classification Using Metadata Features
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034611
Inverse synthetic aperture radar (ISAR) is a common radar imaging technique used to characterise and classify non-cooperative targets. Different approaches to classification have been proposed, including the traditional approach of using geometric features extracted from images of known targets and, more recently, deep learning approaches that utilise transfer learning to deal with the small training datasets typically available. However, the challenge in a real-world scenario arises when no target training data is available, and a different approach to classification is required. In this work, we develop a deep neural network-based approach that utilises metadata features to enhance the performance of ISAR ship classification, and we provide an alternative metadata-only solution for ISAR ship classification.
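The abstract leaves the network design unspecified; a plausible minimal form is an MLP over per-track metadata features, sketched below. The feature set, layer sizes and dropout rate are assumptions, not the paper's configuration.

```python
import torch.nn as nn

class MetadataClassifier(nn.Module):
    """Simple MLP over per-track metadata features (e.g. range, speed,
    aspect angle); sizes here are illustrative assumptions."""
    def __init__(self, num_features, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(hidden, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x):
        return self.net(x)
```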
{"title":"ISAR Ship Classification Using Metadata Features","authors":"","doi":"10.1109/DICTA56598.2022.10034611","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034611","url":null,"abstract":"Inverse synthetic aperture radar (ISAR) is a common radar imaging technique used to characterise and classify non-cooperative targets. Different approaches to classification have been proposed and include the traditional approach using geometric features extracted from images of known targets and more recently, deep learning approaches that utilise transfer learning to deal with the small training datasets typically available. However, the challenge in a real-world scenario will be when no target training data is available and a different approach to classification will be required. In this work, we develop a deep neural network-based approach by utilising metadata features to enhance the performance of ISAR ship classification and provide an alternative metadata-only solution for ISAR ship classification.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125750207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SC-CrackSeg: A Real-Time Shared Feature Pyramid Network for Crack Detection and Segmentation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034629
Detecting cracks is an important task in a number of civil engineering applications. Recent advances in computer vision have enabled automatic crack detection and fine-grained segmentation using deep learning. However, the models used in previous work are often large and are therefore mainly suitable for offline structure monitoring, where images taken from a site are analysed later by a powerful computer. In this work, we address the segmentation problem in an online setting, which permits the use of mobile inspection devices, such as drones with limited computing power, to monitor structures independently in real time. We propose SC-CrackSeg, which has a very small number of parameters yet provides very high segmentation accuracy. Our main contribution is a multi-branch information-sharing architecture that efficiently captures the global perspective while maintaining the fine, high-resolution details that are key to crack detection. SC-CrackSeg extends a previously proposed model but is optimized specifically for this application: a reduction to a single input, a more efficient context mining module, and a simpler feature fusion module. We evaluate SC-CrackSeg on large crack detection datasets, and the results show that our proposed model is competitive against existing methods.
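The following sketch illustrates only the general two-branch information-sharing idea described above (a full-resolution detail branch plus a downsampled context branch, fused before the segmentation head); it is not SC-CrackSeg itself, whose modules are not specified in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TwoBranchSeg(nn.Module):
    """Illustrative two-branch layout: a full-resolution detail branch and a
    downsampled context branch whose features are fused before the head."""
    def __init__(self, channels=32):
        super().__init__()
        self.detail = nn.Sequential(conv_block(3, channels), conv_block(channels, channels))
        self.context = nn.Sequential(conv_block(3, channels), conv_block(channels, channels))
        self.head = nn.Conv2d(2 * channels, 1, 1)

    def forward(self, x):
        d = self.detail(x)                         # fine, high-resolution features
        c = self.context(F.avg_pool2d(x, 4))       # cheap global context
        c = F.interpolate(c, size=d.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(torch.cat([d, c], dim=1)))
```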
{"title":"SC-CrackSeg: A Real-Time Shared Feature Pyramid Network for Crack Detection and Segmentation","authors":"","doi":"10.1109/DICTA56598.2022.10034629","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034629","url":null,"abstract":"Detecting cracks is an important in a number of civil engineering applications. Recent advances in computer vision has enabled automatic crack detection and fine-grained segmentation using deep learning. However, the models used in previous work are often large and are therefore mainly suitable for offline structure monitoring where images taken from a site are analysed later by a powerful computer. In this work, we address the segmentation problem in an online setting, which permits the use of mobile inspection devices such as drones with limited computing power to monitor structures independently in realtime. We propose SC-CrackSeg, which has a very small number of parameters and can provide very high segmentation accuracy. Our main contribution is a multi-branch information-sharing architecture that efficiently manages global perspective while maintaining the fine and high-resolution details key in crack detection. SC-CrackSegextends a previously proposed model but optimized specifically for this application: reduction to a singleinput, a more efficient context mining module, and a simpler feature fusion module. We evaluate SC-CrackSeg on large crack detection data sets and the results show that our proposed model is competitive against the existing methods.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"70 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130937726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Single image rain removal using cWGAN network
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034600
Atmospheric conditions such as rain degrade visibility, creating problems for computer vision applications. Early rain removal works were video-based; hence, they had temporal information available, which makes rain removal much easier. In single-image de-raining, the lack of temporal information creates challenges. Deep learning-based networks have recently become popular for single-image rain removal; such networks may or may not use image decomposition. This paper presents an end-to-end conditional Wasserstein Generative Adversarial Network (cWGAN) that restores a rain-free image from a rainy image without requiring image decomposition. The network is trained using a combination of the Wasserstein loss, the mean absolute error (L1) loss, and a VGG (perceptual) loss to improve the quality of the generated rain-free images. Two networks, U-Net and W-Net, are trained as generators to demonstrate the network's performance. The proposed cWGAN is an end-to-end network that does not require further enhancement. Extensive tests using natural and synthetic rainy images reveal that the proposed cWGAN network competes well against recent single-image de-raining techniques.
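A minimal sketch of the combined generator objective follows, assuming a standard Wasserstein adversarial term plus weighted L1 and VGG-feature losses; the VGG layer slice and the loss weights are illustrative, not the paper's values.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

class VGGPerceptual(nn.Module):
    """Perceptual loss from frozen VGG19 features; the chosen feature slice
    is an assumption for illustration."""
    def __init__(self):
        super().__init__()
        self.features = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()
        for p in self.features.parameters():
            p.requires_grad = False
        self.l1 = nn.L1Loss()

    def forward(self, pred, target):
        return self.l1(self.features(pred), self.features(target))

def generator_loss(critic_fake, pred, target, vgg_loss, lam_l1=100.0, lam_vgg=10.0):
    adv = -critic_fake.mean()  # Wasserstein generator term
    return adv + lam_l1 * F.l1_loss(pred, target) + lam_vgg * vgg_loss(pred, target)
```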
{"title":"Single image rain removal using cWGAN network","authors":"","doi":"10.1109/DICTA56598.2022.10034600","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034600","url":null,"abstract":"The atmospheric conditions like rain degrade visibility, creating problems for computer vision applications. The initial rain removal works are video-based; hence, they have temporal information, making the rain removal work much easier. In single image de-raining, the lack of temporal information creates challenges. Deep learning-based networks are recently popular for single-image rain removal. The networks may or may not use image decomposition for rain removal. This paper presents an end-to-end Wasserstein Generative Adversarial Network (WGAN) to restore a rain-free image from a rain image that does not require image decomposition. The network is trained using a combined Wasserstein, the mean absolute error or L1 loss, and VGG loss (perceptual loss) to improve the quality of generated rain-free images. Two networks, U-Net and W-Net, are trained as generators to show the network's performance. The proposed cWGAN is an end-to-end network that does not require further enhancement. An extensive test using natural and synthetic rainy images reveals that the proposed cWGAN network competes against the recent single image de-raining techniques.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131218384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eating Activity Monitoring in Home Environments Using Smartphone-Based Video Recordings
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034636
Food intake monitoring plays an important role in personal dietary systems. Existing video-based eating activity monitoring systems typically use recordings taken with an identical device in a single laboratory setting. In contrast, we explore videos recorded using smartphones for recognizing eating gestures in home environments. For this purpose, we collected 20 eating sessions from 14 participants using different smartphones. The data is labelled into eating and non-eating classes. To recognize eating activity from video, we employed three deep learning approaches, namely a 3D CNN, a SlowFast network, and a CNN-LSTM. Our approach achieved the best F1-score of 0.560 with the SlowFast network when evaluated using the Leave-One-Subject-Out (LOSO) scheme. Our preliminary results suggest that video-based food intake monitoring can be used in home environments. However, our models failed to recognize the eating activity when the user tended to bend to pick food from the plate. More videos with such eating styles need to be incorporated into the training data to enhance performance.
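LOSO evaluation holds out every session of one participant per fold, so a model is never tested on a subject it has seen during training. A minimal scikit-learn sketch (with a hypothetical make_model factory) is:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import f1_score

def loso_f1(X, y, subjects, make_model):
    """Leave-One-Subject-Out: hold out all sessions of one participant per
    fold; make_model is any factory returning a fit/predict estimator."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))
    return float(np.mean(scores))
```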
{"title":"Eating Activity Monitoring in Home Environments Using Smartphone-Based Video Recordings","authors":"","doi":"10.1109/DICTA56598.2022.10034636","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034636","url":null,"abstract":"Food intake monitoring plays an important role in personal dietary systems. Existing video based eating activity monitoring systems typically use recordings taken with an identical device in a single laboratory settings. In contrast, we explore videos recorded using smartphones for recognizing eating gestures in home environments. For this purpose, we collected 20 eating sessions from 14 participants using different smartphones. Specifically, the data is labelled into eating and no-eating classes. To recognize eating activity from video we have employed three deep learning approaches namely, 3D CNN, SlowFast network, and CNN-LSTM. Our approach has achieved the best F1-score of 0.560 with SlowFast network when evaluated using the Leave-One-Subject-Out (LOSO) scheme. Our preliminary results suggest that the video-based food intake monitoring can be used in home environments. However, our models failed to recognize the eating activity when the user tend to bend to pick food from the plate. More videos with such eating styles need to be incorporated in training data to enhance the performance.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130837914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A deep learning multi-capture segmentation modality for retinal OCT imaging
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034625
Advances in image processing and deep learning methods have enhanced the analysis of optical coherence tomography (OCT) scans, which provide high-quality cross-sectional images of the posterior part of the eye (retina and choroid). These automatic methods support the diagnosis and monitoring of ocular conditions by automatically segmenting the required tissues and quantifying the thickness of tissue layers. The performance of these automatic methods is often affected by image quality, such as the presence of speckle noise or variations in the focus used to capture the OCT images. Changes in image quality can negatively impact the segmentation performance of the methods. In this study, OCT images of different capture modalities (i.e. various focus and denoise settings) are used to analyze how image quality factors affect the segmentation performance of the U-Net and TransU-Net methods, comparing their segmentation results. To deal with the various modalities (i.e. different image quality aspects), an image-to-image translation process with CycleGAN is proposed to standardize image quality and facilitate the segmentation process of these methods. Results demonstrate that using this image-to-image process as a denoising technique for OCT images captured with an enhanced depth imaging focus modality gave the best performance with the TransU-Net method, achieving a Dice coefficient of 0.99 and improving on the segmentation performance of the U-Net method. The proposed technique provides a viable alternative for OCT instrument-agnostic segmentation.
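The Dice coefficient reported above measures the overlap between predicted and reference masks; a minimal PyTorch implementation for binary masks is:

```python
import torch

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|) on binary segmentation masks,
    the metric reported for the retinal layer segmentations here."""
    pred = pred_mask.float().flatten()
    true = true_mask.float().flatten()
    intersection = (pred * true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)
```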
{"title":"A deep learning multi-capture segmentation modality for retinal OCT imaging","authors":"","doi":"10.1109/DICTA56598.2022.10034625","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034625","url":null,"abstract":"Advances in image processing and deep learning methods have enhanced the analysis of optical coherence tomography (OCT) scans, which provide high-quality cross-sectional images of the posterior part of the eye (retina and choroid). These automatic methods support diagnosis and monitoring of ocular conditions by automatically segmenting the required tissues and quantifying the thickness of tissue layers. The performance of these automatic methods is often affected by the image quality, such as the presence of speckle noise or the variations of focus used to capture the OCT images. Changes in image quality can negatively impact the segmentation performance of the methods. In this study, OCT images of different capture modalities (i.e. various focus and denoise settings) are used to analyze how image quality factors can affect the segmentation performance of U-Net and TransU-Net methods, comparing segmentation results. To deal with the various modalities (i.e. different image quality aspects), an image-to-image translation process with CycleGAN is proposed to standardize the image quality and to facilitate the segmentation process of these methods. Results demonstrate that using this image-to-image process as a denoising technique for OCT images captured with an enhanced depth imaging focus modality gave the best performance when using the TransU-Net method, with a Dice coefficient of 0.99, improving the segmentation performance of the U-Net method. The proposed technique provides a viable alternative to OCT instrument-agnostic segmentation.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130205866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-stream Convolutional Neural Networks for Koala Detection and Tracking
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034583
Conservation of koalas is an urgent task for Australia given their rapidly declining numbers in the wild. To better estimate koala populations and analyse koala activity, a camera network was deployed to capture video of koalas in zoos and the wild. This led to the creation of the world's first koala video tracking dataset. Based on this dataset, a two-stream convolutional neural network model was constructed to detect and track koala activity in the video. The model has two branches: one uses semantic information for object detection in the original video frames, and the other uses optical flow for motion information tracking. Both branches use YOLOv5, which generates the positions of objects detected in colour or infrared video. Finally, the features generated by the two branches are fused to determine the final position of the koala in each frame. Experimental results show that the dual-stream network significantly improves tracking performance when compared with a baseline model that uses only semantic information for tracking.
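The fusion rule is not detailed in the abstract; the sketch below shows one simple late-fusion possibility, averaging boxes and scores where the appearance and optical-flow branches agree by IoU.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_detections(rgb_dets, flow_dets, iou_thresh=0.5):
    """Late-fusion sketch: average box coordinates and scores of detections
    that the two branches agree on; dets are (x1, y1, x2, y2, score)."""
    fused = []
    for r in rgb_dets:
        match = max(flow_dets, key=lambda f: iou(r[:4], f[:4]), default=None)
        if match is not None and iou(r[:4], match[:4]) >= iou_thresh:
            fused.append(tuple((rc + fc) / 2.0 for rc, fc in zip(r, match)))
        else:
            fused.append(r)  # keep unmatched appearance detections
    return fused
```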
{"title":"Dual-stream Convolutional Neural Networks for Koala Detection and Tracking","authors":"","doi":"10.1109/DICTA56598.2022.10034583","DOIUrl":"https://doi.org/10.1109/DICTA56598.2022.10034583","url":null,"abstract":"Conservation of koalas is an urgent task for Australia given their rapidly declining numbers in the wild. To better estimate koala populations and analyse koala activity, a camera network was deployed to capture video of koalas from zoos and the wild. This led to the creation of the world's first koala video tracking dataset. Based on this dataset, a two-stream convolutional neural network model was constructed to detect and track koala activity in the video. The model has two branches, one using semantic information for object detection in the original video frames, and the other using optical flow for motion information tracking. Both branches use Yolov5, which generates the positions of objects detected in colour or infrared video. Finally, the features generated by the two branches are fused to determine the final position of the koala in each frame. Experimental results show that the dual-stream network can significantly improve the tracking performance when compared with the baseline model that uses only semantic information for tracking.","PeriodicalId":159377,"journal":{"name":"2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124429554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}