A Three-Stage Self Supervised Deep Learning Network for Automatic Calcium Scoring of Cardiac Computed Tomography Images
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034572
Coronary artery calcium scoring (CACS) is a routine procedure for cardiovascular disease (CVD) risk categorisation. CACS involves quantification of calcification regions measured in computed tomography (CT) images; the non-contrast CT imaging variant, despite its low contrast, provides a distinguishable view of calcifications with a shorter acquisition time than high-contrast CT. In non-contrast CT, a key challenge in extracting information from CACS images is the low signal-to-noise ratio, and the calcification regions are small, making them difficult to differentiate from the surrounding structures. Manual annotation of calcifications requires expertise and is expensive, error-prone and time consuming. It is therefore highly advantageous if unlabelled data, which exists in large quantities, can be leveraged in the training process to minimise the need for labelled annotations. We propose a three-stage deep learning method to automatically perform calcification segmentation for CACS. Our method first employs a self-supervised representation learning (SSRL) network designed to extract contextual information about the cardiac structure from unlabelled contrast-enhanced coronary CT angiography (CCTA). This network captures enhanced and complementary views of semantic features from a large unlabelled dataset. The second network applies a convolutional neural network (CNN) to detect and select cardiac calcium scoring CT (CSCT) slices containing calcifications, thereby discarding image slices without any calcifications. Lastly, our method employs a customised U-Net-based network that identifies the calcifications in the detected slices and classifies them by anatomical location into one of the three coronary arteries: left anterior descending (LAD), left circumflex (LCX) or right coronary artery (RCA). Our method was trained and evaluated on the public Automatic Coronary Calcium Scoring (orCaScore) dataset and achieved an F1 score of 0.844. Ablation experiments demonstrated the effectiveness of each stage: the F1 score rose from 0.583 for the baseline U-Net to 0.771 for the customised U-Net, and improved further to 0.818 and 0.844 with the addition of the classification and SSRL networks.
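As a rough illustration of how the three stages could chain together at inference time, here is a minimal PyTorch sketch. The names `SliceClassifier`, `segment_volume`, `seg_net`, and the four-class output layout (background, LAD, LCX, RCA) are illustrative assumptions, not the authors' implementation; the SSRL pretraining of stage one is assumed to have already initialised the segmentation encoder.

```python
# Hypothetical sketch of the three-stage inference flow; not the paper's code.
import torch
import torch.nn as nn

class SliceClassifier(nn.Module):
    """Stage 2: a small CNN that flags CT slices containing calcifications."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(self.features(x).flatten(1)))

def segment_volume(volume, slice_clf, seg_net, threshold=0.5):
    """Stage 2 then stage 3, slice by slice. Stage 1 (SSRL on unlabelled
    CCTA) is assumed to have pretrained seg_net's encoder beforehand."""
    masks = []
    for s in volume:                       # s: (1, H, W) CSCT slice
        s = s.unsqueeze(0)                 # add batch dimension
        if slice_clf(s).item() < threshold:
            masks.append(torch.zeros_like(s, dtype=torch.long))
            continue                       # skip slices without calcification
        logits = seg_net(s)                # assumed (1, 4, H, W): bg/LAD/LCX/RCA
        masks.append(logits.argmax(1, keepdim=True))
    return torch.cat(masks)                # per-slice artery-labelled masks
```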
Online and Real-time Network for Video Pedestrian Intent Prediction
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034602
Pedestrian intention prediction is crucial to autonomous driving, especially in urban settings. Since current methods fail to accurately model the complex behavior of pedestrians in real time, we propose ORPI-Net, which fully and efficiently explores and utilizes pedestrian features. ORPI-Net comprises PMTC-Net, for online pedestrian video feature extraction, and a simple yet effective multi-modal fusion module. The Partial Channel Motion Enhancement (PME) and 1D Temporal Group Convolution (TGC) blocks in PMTC-Net can be easily embedded in a 2D convolutional backbone to capture pedestrian motion features and establish temporal relations. The multi-modal fusion module leverages various information sources at low computational cost, significantly improving performance. Our model achieves new state-of-the-art results on the PIE and JAAD datasets in high-performance mode and runs at over 20 FPS in real-time mode.
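The abstract does not spell out how PME and TGC are formulated, so the following is only a speculative PyTorch sketch of the general idea: enhancing a fraction of channels with temporal differences, and applying a grouped 1D convolution along the time axis so that both drop into a 2D backbone. All module names and the channel-difference formulation are assumptions.

```python
# Speculative sketches of PME-like and TGC-like blocks; not the ORPI-Net code.
import torch
import torch.nn as nn

class PartialMotionEnhance(nn.Module):
    """Enhance a fraction of channels with frame-to-frame differences."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.split = int(channels * ratio)

    def forward(self, x):                        # x: (B, T, C, H, W)
        m = x[:, :, :self.split]
        diff = torch.zeros_like(m)
        diff[:, 1:] = m[:, 1:] - m[:, :-1]       # temporal difference as motion cue
        return torch.cat([m + diff, x[:, :, self.split:]], dim=2)

class TemporalGroupConv(nn.Module):
    """Grouped 1D convolution along time; channels must divide by groups."""
    def __init__(self, channels, kernel=3, groups=8):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel,
                              padding=kernel // 2, groups=groups)

    def forward(self, x):                        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.conv(y)                         # convolve over the T axis
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

clip = torch.randn(2, 8, 64, 14, 14)             # (batch, frames, C, H, W)
out = TemporalGroupConv(64)(PartialMotionEnhance(64)(clip))  # shape preserved
```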
Hidden and Face-Like Object Detection Using Deep Learning Techniques – An Empirical Study
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034632
An essential aspect of artificial intelligence is how closely machines can mimic humans, and human vision is one of the motivations for developing intelligent systems. When recognising a class of images, it is as vital to distinguish that class from similar-looking objects and to identify instances in hidden places as it is to create bounding boxes and learn to localize the object's position. Traditionally, deep learning models have performed exceptionally well in image classification and object detection tasks. In this work, we perform four experiments to train machines to distinguish between real faces and face-like objects and to recognise them. Nine state-of-the-art deep learning-based classifiers were chosen for a comparative study on the designed experiments. Using these experiments, we establish that training models on real faces does not prepare them to identify face-like objects, while training on face-like objects enables the models to detect face-like images even when hidden amongst other images. Despite existing work in camouflage detection and optical illusion detection, to the best of our knowledge, no work has been done on training and testing machines to distinguish between faces and face-like objects with deep learning methods. This work could help researchers build better camouflage detection systems, perform context-sensitive studies, understand the biases that various models have towards certain classes of images, and support real-life applications such as military systems and self-driving cars.
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034626
Mathematical expression recognition is one of the important processes in scientific document analysis. Despite its importance, mathematical expression recognition remains very challenging. One reason it is harder than normal text recognition is that a mathematical formula usually has a 2-D spatial structure [1] rather than the 1-D structure of normal text. This spatial structure is expressed through many math symbols such as superscripts, subscripts and fraction symbols. The traditional approach usually solves this problem in two stages. First, a character segmentation stage segments each character in the formula and classifies it against a given vocabulary. Second, a structural analysis stage identifies the spatial relationships between all characters of the formula.
Feature Extractor Based on Class Specific Hidden Neuron Activations for Image Classification
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034614
Extracting good discriminative features plays a significant role in the predictive accuracy of any machine learning model. Engineering good features from raw data is a non-trivial and often time-consuming task. Models like Convolutional Neural Networks (CNNs) have been very popular in image classification tasks, owing to their excellent predictive capabilities and their ability to automatically learn good features from raw data. One inherent drawback of CNNs and other deep learning models is that they are black-box models: their predictions cannot be explained in terms of the features they have learned. In this paper, we put forth a novel feature extractor in which features are automatically extracted from hidden neurons and convolutional neurons. The model's predictions are explained using visualizations of the activations of the class-specific neurons in hidden layers. Thus, the model put forth in this paper has excellent predictive capabilities, and its predictions can be explained based on the activations of class-specific neurons.
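The paper's extractor is not reproduced here, but the underlying mechanics of reading hidden-neuron activations are standard. Below is a minimal sketch using a PyTorch forward hook; the toy network, layer choice, and per-class summary are assumptions for illustration only.

```python
# Minimal illustration of harvesting hidden-neuron activations as features.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),     # hidden layer used as feature source
    nn.Linear(64, 10),
)

captured = {}
def hook(module, inputs, output):
    captured["hidden"] = output.detach()

model[4].register_forward_hook(hook)   # the ReLU after the 64-unit layer

x = torch.randn(8, 1, 28, 28)          # dummy batch of images
preds = model(x).argmax(1)
features = captured["hidden"]          # (8, 64) hidden activations

# Mean activation per predicted class hints at which neurons are class specific.
for c in preds.unique():
    print(int(c), features[preds == c].mean(0).topk(3).indices.tolist())
```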
Deepfake Detection with Spatio-Temporal Consistency and Attention
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034609
Deepfake videos are causing growing concern in communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of research interest. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to manipulation-specific, subtle and well-localized pattern variations along both the spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of forged videos at both the individual-frame and frame-sequence levels. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We test our method on two popular large datasets, consistently outperforming related recent methods. Moreover, our technique provides memory and computational advantages over competing techniques.
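The abstract names the attention mechanisms without giving their exact design, so the spatial attention piece is sketched below in a common CBAM-like form; treat it as an assumed stand-in rather than the authors' module.

```python
# A CBAM-style spatial attention block, as one plausible stand-in.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                      # x: (B, C, H, W) shallow features
        avg = x.mean(dim=1, keepdim=True)      # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)     # channel-max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                        # reweight spatial positions

feats = torch.randn(2, 64, 56, 56)             # e.g. shallow ResNet features
print(SpatialAttention()(feats).shape)         # torch.Size([2, 64, 56, 56])
```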
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034584
Brandon Birmingham, A. Muscat
This paper presents a Keyword-driven and N-gram Graph based approach for Image Captioning (KENGIC). Most current state-of-the-art image caption generators are trained end-to-end on large-scale paired image-caption datasets, which are very laborious and expensive to collect. Such models are limited in terms of their explainability and their applicability across different domains. To address these limitations, we propose a simple model based on n-gram graphs that does not require any end-to-end training on paired image captions. Starting with a set of image keywords treated as nodes, the generator forms a directed graph by connecting these nodes through overlapping n-grams found in a given text corpus. The model then infers the caption by maximising the most probable n-gram sequences in the constructed graph. To analyse the use and choice of keywords in the context of this approach, this study examines caption generation based on (a) keywords extracted from gold-standard captions and (b) automatically detected keywords. Both quantitative and qualitative analyses demonstrate the effectiveness of KENGIC. The performance achieved is very close to that of current state-of-the-art image caption generators trained in the unpaired setting. The analysis of this approach could also shed light on the generation process behind current top-performing caption generators trained in the paired setting and, in addition, provide insights into the limitations of the most widely used evaluation metrics in automatic image captioning.
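To make the graph construction concrete, here is a toy Python sketch: keyword nodes are linked through bigrams observed in a tiny corpus, and a caption path is read off the graph. The real model maximises probabilities over overlapping n-grams; this sketch only finds a corpus-supported path, and the corpus and keywords are invented for illustration.

```python
# Toy bigram-graph caption path search; not the KENGIC implementation.
from collections import deque

corpus = ("the dog runs in the park . "
          "a dog plays in the park").split()

# directed graph: word -> successor words observed as bigrams in the corpus
graph = {}
for a, b in set(zip(corpus, corpus[1:])):
    graph.setdefault(a, []).append(b)

def connect(src, dst, max_len=6):
    """Breadth-first search for a corpus-supported word path src -> dst."""
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        if len(path) < max_len:
            for nxt in graph.get(path[-1], []):
                queue.append(path + [nxt])
    return None

print(connect("dog", "park"))   # e.g. ['dog', 'runs', 'in', 'the', 'park']
```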
Swin-ResUNet: A Swin-Topology Module for Road Extraction from Remote Sensing Images
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034582
Road extraction from remote sensing images plays a crucial role in navigation, traffic management, urban construction, and other fields. With the development of deep learning in computer vision, road extraction from remote sensing images using deep learning models has become a hot research topic. Convolution-based U-shaped road extraction models suffer from a high extraction error rate and poor continuity of road topology, while Transformer-based road extraction methods suffer from low extraction accuracy and large GPU memory usage. To address these issues, we propose the Swin-ResUNet structure, which uses the Swin Transformer paradigm to extract roads from remote sensing images. Specifically, we construct a Swin-Topology module by adding a Sobel layer with residual connections to the Swin Transformer block, and build the Swin-ResUNet network around this module to better capture the topology of roads. Experimental results show mIOU and mDC values of 64.1% and 76.6% on the Massachusetts dataset, and 66.69% and 75.86% on the DeepGlobe2018 dataset, respectively. With a batch size of 8, the GPU memory usage of Swin-ResUNet is about 9 GB, significantly smaller than that of other Transformer-based networks. Compared with convolution-based U-shaped structures, Swin-ResUNet better captures the topology of roads in remote sensing images and improves road extraction accuracy; compared with other Transformer-based road extraction methods, it improves extraction accuracy while reducing GPU memory usage.
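The paper does not detail the Sobel layer, so the following is a speculative sketch of one plausible reading: fixed depthwise Sobel kernels whose edge responses are fused back onto the input through a residual connection.

```python
# Speculative Sobel-with-residual layer; the actual Swin-Topology design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSobel(nn.Module):
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        kernel = torch.stack([gx, gx.t()]).unsqueeze(1)      # (2, 1, 3, 3)
        # one Gx and one Gy filter per input channel (depthwise)
        self.register_buffer("kernel", kernel.repeat(channels, 1, 1, 1))
        self.channels = channels
        self.fuse = nn.Conv2d(2 * channels, channels, 1)     # mix edge maps back

    def forward(self, x):                                    # x: (B, C, H, W)
        edges = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        return x + self.fuse(edges)                          # residual connection

print(ResidualSobel(32)(torch.randn(1, 32, 64, 64)).shape)   # (1, 32, 64, 64)
```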
Augmenting Ego-Vehicle for Traffic Near-Miss and Accident Classification Dataset Using Manipulating Conditional Style Translation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034630
Hilmil Pradana, Minh-Son Dao, K. Zettsu
In the last decade, advanced self-driving systems have brought significant technological improvements in efficiency, convenience, and transportation safety, with global societal impact. To support their development, many researchers focus on flagging all possible traffic risk cases from closed-circuit television (CCTV) and dashboard-mounted cameras. Most of these methods identify, frame by frame, where an anomaly occurs, but they cannot determine which road traffic participant could lead the ego-vehicle into a collision, because the available annotated datasets only support anomaly detection in traffic video. A near-miss is one type of incident and can be defined as a narrowly avoided accident. However, in the moments before an accident happens there is no difference between an accident and a near-miss, so we re-define accidents in the DADA-2000 dataset to include near-misses, and also extend the start and end times of each accident to precisely cover all ego-motions during the incident. Unlike previous works, the proposed system classifies all possible traffic risk incidents, including near-misses, to give more critical information to real-world driving assistance systems. Because annotated video is scarce, we augment the re-annotated DADA-2000 dataset by manipulating conditional style translation, both to increase the number of traffic risk accident videos and to generalize the video classification model across different conditions. In evaluation, the proposed method achieved a significant improvement of 10.25% in accuracy over the baseline model in cross-validation analysis. Quantitative evaluation based on our re-annotation shows that the proposed method is valuable for the computer vision community in training models for better traffic risk classification.
Automatic Malleefowl Mound Detection Using Robust LiDAR-based Features and Classification
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034606
The Malleefowl is listed as a vulnerable bird species in Australia. To track patterns of Malleefowl presence and abundance, surveying its egg incubators (a.k.a. nests or mounds) is an extensively used technique. However, on large conservation areas, detecting Malleefowl mounds by manual inspection on land or from the air is challenging for various environmental and technical reasons: mounds are built on the ground and are widely scattered over large areas. Hence, in recent years, airborne Light Detection and Ranging (LiDAR) techniques have been used for data acquisition and analysis. However, existing methods are still limited in detection accuracy and system automation. In this paper, we propose a novel method to address these limitations. We design robust features that effectively represent the key visual characteristics of candidate mounds captured in LiDAR point cloud data: (1) the elevation differences, along the z-axis, between the original ground points and their corresponding feet on a plane fitted to those ground points, and (2) a convex-hull measurement. Using these features, we then apply machine learning methods: clustering to differentiate the true mounds among the candidate mounds, and a bagged-tree classifier to learn a model that classifies whether a patch contains a mound. Our training and testing datasets contain LiDAR point cloud data captured from the Tarawi Nature Reserve, provided by the New South Wales Government Department of Planning and Environment of Australia. They comprise a total of 1,060 patches (each 20 m × 20 m), half of which contain mounds while the remaining half contain none. Our experimental results show that our proposed method achieves more than 84% accuracy in detecting patches with mounds.
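Since the two feature families are described explicitly (plane-fit elevation residuals and a convex-hull measure), they translate directly into a short NumPy/SciPy sketch; the residual summary statistics and the synthetic test patch below are our own illustrative choices, not the paper's exact feature vector.

```python
# Illustrative plane-fit residual and convex-hull features for a LiDAR patch.
import numpy as np
from scipy.spatial import ConvexHull

def mound_features(points):
    """points: (N, 3) array of ground-classified LiDAR returns (x, y, z)."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coef, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)  # fit z = ax + by + c
    residuals = points[:, 2] - A @ coef       # elevation above the fitted plane
    hull = ConvexHull(points[:, :2])          # planar footprint of the patch
    return np.array([residuals.max(), residuals.std(), hull.volume])
    # note: hull.volume is the hull *area* for 2-D inputs (SciPy convention)

# synthetic 20 m x 20 m patch: gently sloped ground plus a small central bump
rng = np.random.default_rng(0)
pts = rng.uniform(0, 20, size=(500, 2))
z = 0.02 * pts[:, 0] + rng.normal(0, 0.03, 500)
bump = np.exp(-((pts[:, 0] - 10) ** 2 + (pts[:, 1] - 10) ** 2) / 4)
print(mound_features(np.c_[pts, z + bump]))
```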