Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034612
Color-to-gray conversion, known as decolorization, is widely used not only in image processing but also in machine learning and deep learning applications because of its lower complexity. The conventional decolorization process generally applies fixed conversion coefficients to the whole image. Using fixed coefficients (weighting parameters) for the whole image may degrade the quality of the original color image, whereas content-dependent coefficients yield better decolorization performance than fixed ones. The critical requirements in decolorization are to preserve spatial information, such as contrast, and to keep computational complexity low. In this study, a very fast decolorization method is proposed that does not sacrifice contrast preservation. The proposed method exploits a gradient-based correlation similarity approach, in which global and local contrast information is considered for a total of 66 distinct weighting coefficients. Using the CADIK dataset, the COLOR250 dataset, and high-resolution images, this set of weighting coefficients is reduced according to the frequency with which each coefficient performs best. Experimental results show that the proposed method can be used in real time without compromising decolorization performance.
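The abstract does not spell out how the 66 weighting coefficients are formed, but one common construction that yields exactly 66 candidates is to enumerate all non-negative (w_r, w_g, w_b) triples in steps of 0.1 that sum to 1, each applied as a linear color-to-gray conversion. A minimal sketch under that assumption:

```python
import numpy as np

# Assumed construction: all (w_r, w_g, w_b) triples in steps of 0.1
# summing to 1 -- there are exactly 66 of them, matching the abstract's
# "66 distinct weighting coefficients".
weights = [
    (r / 10, g / 10, (10 - r - g) / 10)
    for r in range(11)
    for g in range(11 - r)
]

def decolorize(rgb, w):
    """Linear color-to-gray conversion: gray = w_r*R + w_g*G + w_b*B."""
    return np.asarray(rgb, dtype=float) @ np.asarray(w, dtype=float)
```

Note that (0.3, 0.6, 0.1), which approximates the standard fixed luma weights, is one of the 66 triples; a content-dependent method would instead pick the triple that best preserves contrast for each image.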
Title: A Lightweight Image Decolorization Approach based on Contrast Preservation
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034588
Naevi are benign skin lesions that appear on the skin surface as small brown, tan, or pink spots. They are important to monitor, as a high naevus count is the strongest known phenotypic risk factor for melanoma, and about a third of melanomas are thought to derive directly from a precursor naevus. The main aim of this research is to model dermoscopic naevus appearance using machine learning and to characterise naevi by classifying them as suspicious or non-suspicious. To extract prominent appearance features, principal component analysis (PCA) and a convolutional autoencoder (CAE) were implemented for automated feature extraction. These features were then used to classify naevi with random forest (RF) and artificial neural network (ANN) classifiers. Using the features extracted by the CAE, the ANN achieved high average accuracy, specificity, sensitivity, precision, and AUC of 95.62%, 91.24%, 100%, 91.95%, and 95.6%, respectively. In addition, the RF achieved an overall accuracy of 88.46% with both the PCA- and CAE-based features. The RF was also used to rank the features, which helped in selecting those most useful for naevus classification. If validated clinically, machine learning (ML) approaches might be an efficient guide for early detection of melanoma by identifying the suspicious naevi that clinicians need to assess carefully.
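As an illustration of the PCA half of the feature-extraction step (the CAE architecture is specific to the paper and not reproduced here), projecting flattened lesion images onto the top-k principal components might look like this sketch:

```python
import numpy as np

def pca_features(X, k):
    """Project rows of X (n_samples x n_pixels) onto the top-k
    principal components, via SVD of the mean-centred data matrix."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Rows of Vt are principal directions, ordered by singular value.
    return Xc @ Vt[:k].T
```

The resulting k-dimensional feature vectors would then be fed to the RF or ANN classifier.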
Title: Identification of suspicious naevi in dermoscopy images by learning their appearance
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034590
Knowledge distillation for object detection has seen far less activity in the literature than for classification tasks, primarily because of the complexity of object detection, which involves localisation as well as classification. In this paper, we propose a three-pronged distillation framework that includes: (a) homogenised logit-matching; (b) hint learning; and (c) soft masking. Prediction matching is a commonly used response-based distillation technique that suffers from knowledge loss because detected objects are filtered through non-maximal suppression. To circumvent this problem, we draw inspiration from hint learning and propose output logit-matching, in which the teacher and student output feature maps are matched directly, without filtering any of the teacher's detections as in prediction matching. We demonstrate that transferring the raw knowledge of the high-performing teacher reduces the knowledge loss and thereby improves the student's performance. We then perform an ablation study to determine whether early-, middle-, or late-stage hint learning is most beneficial. Finally, we propose an alternative imitation-masking technique called “soft” masking that uses a 2D Gaussian to mask regions of interest on a feature map. In contrast to vanilla “hard” imitation masking, we show that this method satisfies the philosophy of using softened labels for effective knowledge distillation.
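The “soft” mask can be illustrated with a small sketch: a 2D Gaussian centred on an object that weights the feature map smoothly rather than with a hard 0/1 cut-off. The centre would in practice come from the ground-truth box; the spread `sigma` here is an assumed parameter:

```python
import numpy as np

def soft_mask(h, w, cy, cx, sigma):
    """2D Gaussian imitation mask over an h x w feature map:
    1.0 at the object centre (cy, cx), decaying smoothly outward."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
```

Multiplying teacher and student feature maps by such a mask emphasises the object region in the imitation loss while still letting context contribute with reduced weight.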
Title: Unified Framework for Effective Knowledge Distillation in Single-stage Object Detectors
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034624
Many theories have been developed over the years in attempts to better explain how colours work and how best to calculate and describe colour differences. In 1913, Albert Munsell introduced the Atlas of the Munsell Color System, arranging colour along the three dimensions of hue, value, and chroma. The Munsell colour system has enabled professionals to bridge the disciplines of art and science and is the basis of many professional applications today, such as food science [1], dentistry [2], printing [3], painting [4], and soil science [5].
Title: Understanding the Effect of Smartphone Cameras on Estimating Munsell Soil Color from Imagery
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034589
A key issue with palm vein images is that slight movements of the fingers and thumb, or changes in hand pose, can stretch the skin in different areas and alter the vein patterns. This can produce palm vein images with a practically unlimited number of variations for a given subject. This paper presents a novel filtering method for SIFT-based feature matching, referred to as the Median Distance (MMD) Filter, which computes the differences between keypoint coordinates, calculates the mean and the median in each direction, and applies a set of rules to determine the correct matches. Our experiments on the 850 nm subset of the CASIA dataset show that the MMD filter can detect and remove false positives missed by other filtering methods. Comparison against existing SIFT-based palm vein recognition systems demonstrates that the proposed MMD filter produces excellent performance with lower EER values.
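The paper's exact rule set is not given in the abstract; a hedged sketch of the general idea — rejecting SIFT matches whose keypoint displacement deviates too far from the per-direction median — could look like the following (the tolerance `tol` is an assumed parameter, not from the paper):

```python
import numpy as np

def median_distance_filter(src_pts, dst_pts, tol=3.0):
    """Keep matches whose displacement (dst - src) lies within `tol`
    pixels of the median displacement in each direction; genuine
    matches on the same palm tend to share a consistent displacement."""
    d = np.asarray(dst_pts, float) - np.asarray(src_pts, float)
    med = np.median(d, axis=0)
    return np.all(np.abs(d - med) <= tol, axis=1)
```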
Title: A Filtering Method for SIFT based Palm Vein Recognition
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034639
Explainability is important in the design and deployment of neural networks. It allows engineers to design better models and can give end users an improved understanding of the outputs. However, many explainability methods are unsuited to the domain of medical imaging. Saliency mapping methods only describe which regions of an input image contributed to the output, but do not explain the important visual features within those regions. Feature visualisation methods have not yet proved useful in medical imaging because the visual complexity of the images generally results in uninterpretable features. In this work, we propose a novel explainability technique called “Class Specific Semantic Dictionaries”, which extends saliency mapping and feature visualisation methods to enable the analysis of neural network decision-making in medical image diagnosis. By utilising gradient information from the fully connected layers, our approach gives insight into the channels the network deems important for the diagnosis of each disease. The important channels for a class are contextualised by showing highly activating examples from the training data, providing an understanding of the learned features through example. The explainability techniques are combined into a single user interface (UI) to streamline the evaluation of neural networks. To demonstrate how our method overcomes the explainability challenges of medical imaging models, we analyse COVID-Net, an open-source convolutional neural network for diagnosing COVID-19 from chest X-rays. We present evidence that, despite achieving 96.3% accuracy on the test data, COVID-Net uses confounding variables not indicative of underlying disease to discriminate between COVID-positive and COVID-negative patients, and may not generalise well to new data.
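The gradient-based channel ranking alluded to above can be sketched in a Grad-CAM-like form. The authors' exact weighting is not specified in the abstract; this sketch assumes per-channel mean gradients of the class logit are combined with per-channel mean activations:

```python
import numpy as np

def rank_channels(activations, gradients):
    """Score each channel by (mean gradient x mean activation) and
    return channel indices ordered from most to least important.
    activations, gradients: arrays of shape (C, H, W) for one class."""
    score = gradients.mean(axis=(1, 2)) * activations.mean(axis=(1, 2))
    return np.argsort(score)[::-1], score
```

The top-ranked channels per class would then be paired with their most highly activating training examples to build the class-specific dictionary.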
Title: Explainable Deep Learning for Medical Imaging Models Through Class Specific Semantic Dictionaries
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034598
Wei Luo
Despite recent progress in Vision-Language models for accurate visual question answering (VQA), the robustness of these models remains limited in the presence of out-of-distribution datasets that include unanswerable questions. In our work, we first construct a randomized VQA dataset with unanswerable questions to test the robustness of a state-of-the-art VQA model. The dataset pairs the visual inputs with randomized questions from the VQA v2 dataset to test the sensitivity of the model's predictions. We establish that, even on unanswerable questions that are not relevant to the visual clues, a state-of-the-art VQA model either fails to predict the “unknown” answer or gives an inaccurate answer with a high softmax score. To alleviate this issue without retraining the large backbone models, we propose Cross Modal Augmentation (CMA), a multi-modal semantic augmentation applied only at test time, which reprojects the visual and textual inputs into multiple copies while maintaining semantic information. These multiple instances, with similar semantics, are then fed to the same model and the predictions are combined to achieve a more robust output. We demonstrate that this model-agnostic technique enables the VQA model to provide more robust answers in scenarios that may include unanswerable questions.
Title: Semantic multi-modal reprojection for robust visual question answering
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034613
Children with autism spectrum disorder (ASD) have difficulties expressing their emotions and interacting socially with their environment. In recent studies, assistive robots have been used to support children's social skills, and emotion recognition is performed to improve the quality of the interaction between the robot and the child. In this study, remote photoplethysmography (rPPG) signals were extracted from face images captured by a camera during interactions between a robot and children with ASD. These signals were then compared with physiological data obtained from an Empatica E4 wristband. The results were evaluated and the factors that might affect them were discussed. Although the correlation between the two data sources was low, the rPPG approach showed some advantages over both the E4 wristband and facial emotion recognition: unlike the E4 signals, rPPG signals can still be obtained when the child moves, and they remain available even when no emotion can be detected from the child's face. The aim of the study is to use rPPG signals as an alternative method for emotion recognition, since other emotion recognition modalities face challenges during robot-child interaction with children with ASD.
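As a hedged illustration of the extraction step (the paper's actual rPPG pipeline is not detailed in the abstract), most rPPG methods start from a mean colour trace over a face region; a minimal green-channel version might look like:

```python
import numpy as np

def rppg_green_trace(frames, box):
    """Mean green-channel intensity inside a face box (y0, y1, x0, x1)
    for each frame -- the raw per-frame signal that rPPG pipelines
    typically band-pass filter to recover the pulse waveform."""
    y0, y1, x0, x1 = box
    trace = np.array([f[y0:y1, x0:x1, 1].mean() for f in frames])
    return trace - trace.mean()  # remove the DC component
```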
Title: RPPG Detection in Children with Autism Spectrum Disorder during Robot-child Interaction Studies
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034604
RGB-D salient object detection models aim to localize the salient objects in an RGB image, with extra depth information about the scene provided to guide the detection process. The conventional practice is to use depth explicitly as input to achieve multi-modal learning. In this paper, we observe two main issues within existing RGB-D saliency detection frameworks. First, depth is better defined as extra prior information than as part of the input: saliency detection can be performed directly from the appearance information in the RGB image alone, but it cannot be performed from the depth data alone. Second, there is a large domain gap in the source of depth between different benchmark test datasets, e.g. depth from Kinect versus stereo cameras. In this paper, we focus on the stereo image pair variant of saliency detection, where the depth is “implicitly” encoded in the stereo image pair for effective RGB-D saliency detection. Experimental results illustrate the effectiveness of our solution.
Title: Stereo Saliency Detection by Modeling Concatenation Cost Volume Feature
Pub Date: 2022-11-30 | DOI: 10.1109/DICTA56598.2022.10034580
Due to the numerous potential applications in visual surveillance and nighttime driving, recognizing human actions in low-light conditions remains a difficult problem in computer vision. Existing methods separate action recognition and dark enhancement into two distinct steps. However, isolating recognition from enhancement impedes end-to-end learning of the space-time representation for video action classification. This paper presents a domain adaptation-based approach that uses adversarial learning in cross-domain settings to learn cross-domain action recognition. The model can be trained with supervised learning on a large amount of labelled data from the source domain (daytime action sequences), while it uses deep domain-invariant features to perform unsupervised learning on a large amount of unlabelled data from the target domain (nighttime action sequences). The resulting augmented model, named 3D-DiNet, can be trained using standard backpropagation with an additional layer. It achieves state-of-the-art performance on the InFAR and XD145 action datasets.
Title: Learning Deeply Domain-invariant Features for Action Recognition Around the Clock