A Three-Stage Self Supervised Deep Learning Network for Automatic Calcium Scoring of Cardiac Computed Tomography Images
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034572
Coronary artery calcium scoring (CACS) is a routine procedure for cardiovascular disease (CVD) risk categorisation. CACS involves quantification of calcification regions measured in computed tomography (CT) images; the non-contrast CT imaging variant, despite its low contrast, provides a distinguishable view of calcifications with a shorter acquisition time than high-contrast CT. In non-contrast CT, a key challenge in extracting information from CACS images is the low signal-to-noise ratio, and the calcification regions are small, making them difficult to differentiate from the surrounding structures. Manual annotation of calcifications requires expertise and is expensive, error-prone and time consuming. It is therefore highly advantageous if unlabelled data, which exists in large quantities, can be leveraged in the training process to minimise the need for labelled annotations. We propose a three-stage deep learning method to automatically perform calcification segmentation for CACS. Our method first employs a self-supervised representation learning (SSRL) network designed to extract contextual information about the cardiac structure from unlabelled contrast-enhanced coronary CT angiography (CCTA). This network captures enhanced and complementary views of semantic features from a large unlabelled dataset. The second network applies a convolutional neural network (CNN) to detect and select cardiac calcium scoring CT (CSCT) slices containing calcifications, thereby discarding image slices without any calcifications. Lastly, our method employs a customised U-Net-based network that identifies the calcifications in the detected slices and classifies them by anatomical location into one of the three coronary arteries: left anterior descending (LAD), left circumflex (LCX) or right coronary artery (RCA). Our method was trained and evaluated on the public Automatic Coronary Calcium Scoring (orCaScore) dataset and achieved an F1 score of 0.844. Ablation experiments demonstrated the effectiveness of each stage: the F1 score rose from 0.583 for the baseline U-Net to 0.771 for the customised U-Net, and improved further to 0.818 and 0.844 with the addition of the classification and SSRL networks.
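As a rough illustration of how the three stages could chain together at inference time, here is a minimal PyTorch sketch. The names `SliceClassifier`, `segment_volume`, `seg_net`, and the four-class output layout (background, LAD, LCX, RCA) are illustrative assumptions, not the authors' implementation; the SSRL pretraining of stage one is assumed to have already initialised the segmentation encoder.

```python
# Hypothetical sketch of the three-stage inference flow; not the paper's code.
import torch
import torch.nn as nn

class SliceClassifier(nn.Module):
    """Stage 2: a small CNN that flags CT slices containing calcifications."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(self.features(x).flatten(1)))

def segment_volume(volume, slice_clf, seg_net, threshold=0.5):
    """Stage 2 then stage 3, slice by slice. Stage 1 (SSRL on unlabelled
    CCTA) is assumed to have pretrained seg_net's encoder beforehand."""
    masks = []
    for s in volume:                       # s: (1, H, W) CSCT slice
        s = s.unsqueeze(0)                 # add batch dimension
        if slice_clf(s).item() < threshold:
            masks.append(torch.zeros_like(s, dtype=torch.long))
            continue                       # skip slices without calcification
        logits = seg_net(s)                # assumed (1, 4, H, W): bg/LAD/LCX/RCA
        masks.append(logits.argmax(1, keepdim=True))
    return torch.cat(masks)                # per-slice artery-labelled masks
```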
Online and Real-time Network for Video Pedestrian Intent Prediction
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034602
Pedestrian intention prediction is crucial to autonomous driving, especially in urban settings. Since current methods fail to accurately model the complex behavior of pedestrians in real time, we propose ORPI-Net, which fully and efficiently explores and utilizes pedestrian features. ORPI-Net comprises PMTC-Net, for online pedestrian video feature extraction, and a simple yet effective multi-modal fusion module. The Partial Channel Motion Enhancement (PME) and 1D Temporal Group Convolution (TGC) blocks in PMTC-Net can be easily embedded in a 2D convolutional backbone to capture pedestrian motion features and establish temporal relations. The multi-modal fusion module leverages various information sources at low computational cost, significantly improving performance. Our model achieves new state-of-the-art results on the PIE and JAAD datasets in high-performance mode and runs at over 20 FPS in real-time mode.
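The abstract does not spell out how PME and TGC are formulated, so the following is only a speculative PyTorch sketch of the general idea: enhancing a fraction of channels with temporal differences, and applying a grouped 1D convolution along the time axis so that both drop into a 2D backbone. All module names and the channel-difference formulation are assumptions.

```python
# Speculative sketches of PME-like and TGC-like blocks; not the ORPI-Net code.
import torch
import torch.nn as nn

class PartialMotionEnhance(nn.Module):
    """Enhance a fraction of channels with frame-to-frame differences."""
    def __init__(self, channels, ratio=0.25):
        super().__init__()
        self.split = int(channels * ratio)

    def forward(self, x):                        # x: (B, T, C, H, W)
        m = x[:, :, :self.split]
        diff = torch.zeros_like(m)
        diff[:, 1:] = m[:, 1:] - m[:, :-1]       # temporal difference as motion cue
        return torch.cat([m + diff, x[:, :, self.split:]], dim=2)

class TemporalGroupConv(nn.Module):
    """Grouped 1D convolution along time; channels must divide by groups."""
    def __init__(self, channels, kernel=3, groups=8):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel,
                              padding=kernel // 2, groups=groups)

    def forward(self, x):                        # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.conv(y)                         # convolve over the T axis
        return y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)

clip = torch.randn(2, 8, 64, 14, 14)             # (batch, frames, C, H, W)
out = TemporalGroupConv(64)(PartialMotionEnhance(64)(clip))  # shape preserved
```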
Hidden and Face-Like Object Detection Using Deep Learning Techniques – An Empirical Study
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034632
An essential aspect of artificial intelligence is how closely machines can mimic humans, and human vision is one of the motivations for developing intelligent systems. When recognising a class of images, it is as vital to distinguish that class from similar-looking objects and to identify instances in hidden places as it is to create bounding boxes and learn to localize the object's position. Traditionally, deep learning models have performed exceptionally well in image classification and object detection tasks. In this work, we perform four experiments to train machines to distinguish between real faces and face-like objects and to recognise them. Nine state-of-the-art deep learning-based classifiers were chosen for a comparative study on the designed experiments. Using these experiments, we establish that training models on real faces does not prepare them to identify face-like objects, while training on face-like objects enables the models to detect face-like images even when hidden amongst other images. Despite existing work in camouflage detection and optical illusion detection, to the best of our knowledge, no work has been done on training and testing machines to distinguish between faces and face-like objects with deep learning methods. This work could help researchers build better camouflage detection systems, perform context-sensitive studies, understand the biases that various models have towards certain classes of images, and support real-life applications such as military systems and self-driving cars.
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034626
Mathematical expression recognition is one of the important processes in scientific document analysis. Despite its importance, mathematical expression recognition remains very challenging. One reason it is harder than normal text recognition is that a mathematical formula usually has a 2-D spatial structure [1] rather than the 1-D structure of normal text. This spatial structure is expressed through many math symbols such as superscripts, subscripts and fraction symbols. The traditional approach usually solves this problem in two stages. First, a character segmentation stage segments each character in the formula and classifies it against a given vocabulary. Second, a structural analysis stage identifies the spatial relationships between all characters of the formula.
Feature Extractor Based on Class Specific Hidden Neuron Activations for Image Classification
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034614
Extracting good discriminative features plays a significant role in the predictive accuracy of any machine learning model. Engineering good features from raw data is a non-trivial and often time-consuming task. Models like Convolutional Neural Networks (CNNs) have been very popular in image classification tasks, owing to their excellent predictive capabilities and their ability to automatically learn good features from raw data. One inherent drawback of CNNs and other deep learning models is that they are black-box models: their predictions cannot be explained in terms of the features they have learned. In this paper, we put forth a novel feature extractor in which features are automatically extracted from hidden neurons and convolutional neurons. The model's predictions are explained using visualizations of the activations of the class-specific neurons in hidden layers. Thus, the model put forth in this paper has excellent predictive capabilities, and its predictions can be explained based on the activations of class-specific neurons.
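The paper's extractor is not reproduced here, but the underlying mechanics of reading hidden-neuron activations are standard. Below is a minimal sketch using a PyTorch forward hook; the toy network, layer choice, and per-class summary are assumptions for illustration only.

```python
# Minimal illustration of harvesting hidden-neuron activations as features.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),     # hidden layer used as feature source
    nn.Linear(64, 10),
)

captured = {}
def hook(module, inputs, output):
    captured["hidden"] = output.detach()

model[4].register_forward_hook(hook)   # the ReLU after the 64-unit layer

x = torch.randn(8, 1, 28, 28)          # dummy batch of images
preds = model(x).argmax(1)
features = captured["hidden"]          # (8, 64) hidden activations

# Mean activation per predicted class hints at which neurons are class specific.
for c in preds.unique():
    print(int(c), features[preds == c].mean(0).topk(3).indices.tolist())
```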
Deepfake Detection with Spatio-Temporal Consistency and Attention
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034609
Deepfake videos are causing growing concern in communities due to their ever-increasing realism. Naturally, automated detection of forged Deepfake videos is attracting a proportional amount of research interest. Current methods for detecting forged videos mainly rely on global frame features and under-utilize the spatio-temporal inconsistencies found in manipulated videos. Moreover, they fail to attend to manipulation-specific, subtle and well-localized pattern variations along both the spatial and temporal dimensions. Addressing these gaps, we propose a neural Deepfake detector that focuses on the localized manipulative signatures of forged videos at both the individual-frame and frame-sequence levels. Using a ResNet backbone, it strengthens shallow frame-level feature learning with a spatial attention mechanism. The spatial stream of the model is further helped by fusing texture-enhanced shallow features with the deeper features. Simultaneously, the model processes frame sequences with a distance attention mechanism that allows fusion of temporal attention maps with the learned features at the deeper layers. The overall model is trained as a classifier to detect forged content. We test our method on two popular large datasets, consistently outperforming related recent methods. Moreover, our technique provides memory and computational advantages over competing techniques.
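The abstract names the attention mechanisms without giving their exact design, so the spatial attention piece is sketched below in a common CBAM-like form; treat it as an assumed stand-in rather than the authors' module.

```python
# A CBAM-style spatial attention block, as one plausible stand-in.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):                      # x: (B, C, H, W) shallow features
        avg = x.mean(dim=1, keepdim=True)      # channel-average map
        mx, _ = x.max(dim=1, keepdim=True)     # channel-max map
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                        # reweight spatial positions

feats = torch.randn(2, 64, 56, 56)             # e.g. shallow ResNet features
print(SpatialAttention()(feats).shape)         # torch.Size([2, 64, 56, 56])
```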
KENGIC: KEyword-driven and N-Gram Graph based Image Captioning
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034584
Brandon Birmingham, A. Muscat
This paper presents a Keyword-driven and N-gram Graph based approach for Image Captioning (KENGIC). Most current state-of-the-art image caption generators are trained end-to-end on large-scale paired image-caption datasets, which are very laborious and expensive to collect. Such models are limited in terms of their explainability and their applicability across different domains. To address these limitations, we propose a simple model based on n-gram graphs that does not require any end-to-end training on paired image captions. Starting with a set of image keywords treated as nodes, the generator forms a directed graph by connecting these nodes through overlapping n-grams found in a given text corpus. The model then infers the caption by maximising the most probable n-gram sequences in the constructed graph. To analyse the use and choice of keywords in the context of this approach, this study examines caption generation based on (a) keywords extracted from gold-standard captions and (b) automatically detected keywords. Both quantitative and qualitative analyses demonstrate the effectiveness of KENGIC. The performance achieved is very close to that of current state-of-the-art image caption generators trained in the unpaired setting. The analysis of this approach could also shed light on the generation process behind current top-performing caption generators trained in the paired setting and, in addition, provide insights into the limitations of the most widely used evaluation metrics in automatic image captioning.
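To make the graph construction concrete, here is a toy Python sketch: keyword nodes are linked through bigrams observed in a tiny corpus, and a caption path is read off the graph. The real model maximises probabilities over overlapping n-grams; this sketch only finds a corpus-supported path, and the corpus and keywords are invented for illustration.

```python
# Toy bigram-graph caption path search; not the KENGIC implementation.
from collections import deque

corpus = ("the dog runs in the park . "
          "a dog plays in the park").split()

# directed graph: word -> successor words observed as bigrams in the corpus
graph = {}
for a, b in set(zip(corpus, corpus[1:])):
    graph.setdefault(a, []).append(b)

def connect(src, dst, max_len=6):
    """Breadth-first search for a corpus-supported word path src -> dst."""
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        if len(path) < max_len:
            for nxt in graph.get(path[-1], []):
                queue.append(path + [nxt])
    return None

print(connect("dog", "park"))   # e.g. ['dog', 'runs', 'in', 'the', 'park']
```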
Swin-ResUNet: A Swin-Topology Module for Road Extraction from Remote Sensing Images
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034582
Road extraction from remote sensing images plays a crucial role in navigation, traffic management, urban construction, and other fields. With the development of deep learning in computer vision, road extraction from remote sensing images using deep learning models has become a hot research topic. Convolution-based U-shaped road extraction models suffer from a high extraction error rate and poor continuity of road topology, while Transformer-based road extraction methods suffer from low extraction accuracy and large GPU memory usage. To address these issues, we propose the Swin-ResUNet structure, which uses the Swin Transformer paradigm to extract roads from remote sensing images. Specifically, we construct a Swin-Topology module by adding a Sobel layer with residual connections to the Swin Transformer block, and build the Swin-ResUNet network around this module to better capture the topology of roads. Experimental results show mIOU and mDC values of 64.1% and 76.6% on the Massachusetts dataset, and 66.69% and 75.86% on the DeepGlobe2018 dataset, respectively. With a batch size of 8, the GPU memory usage of Swin-ResUNet is about 9 GB, significantly smaller than that of other Transformer-based networks. Compared with convolution-based U-shaped structures, Swin-ResUNet better captures the topology of roads in remote sensing images and improves road extraction accuracy; compared with other Transformer-based road extraction methods, it improves extraction accuracy while reducing GPU memory usage.
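The paper does not detail the Sobel layer, so the following is a speculative sketch of one plausible reading: fixed depthwise Sobel kernels whose edge responses are fused back onto the input through a residual connection.

```python
# Speculative Sobel-with-residual layer; the actual Swin-Topology design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualSobel(nn.Module):
    def __init__(self, channels):
        super().__init__()
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        kernel = torch.stack([gx, gx.t()]).unsqueeze(1)      # (2, 1, 3, 3)
        # one Gx and one Gy filter per input channel (depthwise)
        self.register_buffer("kernel", kernel.repeat(channels, 1, 1, 1))
        self.channels = channels
        self.fuse = nn.Conv2d(2 * channels, channels, 1)     # mix edge maps back

    def forward(self, x):                                    # x: (B, C, H, W)
        edges = F.conv2d(x, self.kernel, padding=1, groups=self.channels)
        return x + self.fuse(edges)                          # residual connection

print(ResidualSobel(32)(torch.randn(1, 32, 64, 64)).shape)   # (1, 32, 64, 64)
```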
Augmenting Ego-Vehicle for Traffic Near-Miss and Accident Classification Dataset Using Manipulating Conditional Style Translation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034630
Hilmil Pradana, Minh-Son Dao, K. Zettsu
In the last decade, advanced self-driving systems have brought significant technological improvements in efficiency, convenience, and transportation safety, with global societal impact. To support their development, many researchers focus on flagging all possible traffic risk cases from closed-circuit television (CCTV) and dashboard-mounted cameras. Most of these methods identify, frame by frame, where an anomaly occurs, but they cannot determine which road traffic participant could lead the ego-vehicle into a collision, because the available annotated datasets only support anomaly detection in traffic video. A near-miss is one type of incident and can be defined as a narrowly avoided accident. However, in the moments before an accident happens there is no difference between an accident and a near-miss, so we re-define accidents in the DADA-2000 dataset to include near-misses, and also extend the start and end times of each accident to precisely cover all ego-motions during the incident. Unlike previous works, the proposed system classifies all possible traffic risk incidents, including near-misses, to give more critical information to real-world driving assistance systems. Because annotated video is scarce, we augment the re-annotated DADA-2000 dataset by manipulating conditional style translation, both to increase the number of traffic risk accident videos and to generalize the video classification model across different conditions. In evaluation, the proposed method achieved a significant improvement of 10.25% in accuracy over the baseline model in cross-validation analysis. Quantitative evaluation based on our re-annotation shows that the proposed method is valuable for the computer vision community in training models for better traffic risk classification.
Automatic Malleefowl Mound Detection Using Robust LiDAR-based Features and Classification
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034606
The Malleefowl is listed as a vulnerable bird species in Australia. To track patterns of Malleefowl presence and abundance, surveying its egg incubators (a.k.a. nests or mounds) is an extensively used technique. However, on large conservation areas, detecting Malleefowl mounds by manual inspection on land or from the air is challenging for various environmental and technical reasons: mounds are built on the ground and are widely scattered over large areas. Hence, in recent years, airborne Light Detection and Ranging (LiDAR) techniques have been used for data acquisition and analysis. However, existing methods are still limited in detection accuracy and system automation. In this paper, we propose a novel method to address these limitations. We design robust features that effectively represent the key visual characteristics of candidate mounds captured in LiDAR point cloud data: (1) the elevation differences, along the z-axis, between the original ground points and their corresponding feet on a plane fitted to those ground points, and (2) a convex-hull measurement. Using these features, we then apply machine learning methods: clustering to differentiate the true mounds among the candidate mounds, and a bagged-tree classifier to learn a model that classifies whether a patch contains a mound. Our training and testing datasets contain LiDAR point cloud data captured from the Tarawi Nature Reserve, provided by the New South Wales Government Department of Planning and Environment of Australia. They comprise a total of 1,060 patches (each 20 m × 20 m), half of which contain mounds while the remaining half contain none. Our experimental results show that our proposed method achieves more than 84% accuracy in detecting patches with mounds.
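Since the two feature families are described explicitly (plane-fit elevation residuals and a convex-hull measure), they translate directly into a short NumPy/SciPy sketch; the residual summary statistics and the synthetic test patch below are our own illustrative choices, not the paper's exact feature vector.

```python
# Illustrative plane-fit residual and convex-hull features for a LiDAR patch.
import numpy as np
from scipy.spatial import ConvexHull

def mound_features(points):
    """points: (N, 3) array of ground-classified LiDAR returns (x, y, z)."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    coef, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)  # fit z = ax + by + c
    residuals = points[:, 2] - A @ coef       # elevation above the fitted plane
    hull = ConvexHull(points[:, :2])          # planar footprint of the patch
    return np.array([residuals.max(), residuals.std(), hull.volume])
    # note: hull.volume is the hull *area* for 2-D inputs (SciPy convention)

# synthetic 20 m x 20 m patch: gently sloped ground plus a small central bump
rng = np.random.default_rng(0)
pts = rng.uniform(0, 20, size=(500, 2))
z = 0.02 * pts[:, 0] + rng.normal(0, 0.03, 500)
bump = np.exp(-((pts[:, 0] - 10) ** 2 + (pts[:, 1] - 10) ** 2) / 4)
print(mound_features(np.c_[pts, z + bump]))
```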