CG-based dataset generation and adversarial image conversion for deep cucumber recognition
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215910
Hiroaki Masuzawa, Chuo Nakano, Jun Miura
This paper deals with deep cucumber recognition using CG (Computer Graphics)-based dataset generation. The variety and size of a dataset are crucial in deep learning. Although there are many public datasets for common situations such as traffic scenes, a dedicated dataset has to be built for a particular scene such as a cucumber farm. Since manually annotating a large amount of data is costly and time-consuming, we propose generating images by CG and converting them into realistic ones using adversarial learning approaches. We compare several image conversion methods using real cucumber plant images.
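The abstract does not name the specific conversion methods compared; one widely used adversarial scheme for unpaired CG-to-real translation is a CycleGAN-style objective combining an adversarial loss with a cycle-consistency loss. Below is a minimal PyTorch sketch of those generator-side loss terms; the modules `G_cg2real`, `G_real2cg`, and `D_real` are hypothetical placeholders, not the authors' networks.

```python
import torch
import torch.nn.functional as F

def cg2real_generator_loss(G_cg2real, G_real2cg, D_real, cg_batch, lambda_cyc=10.0):
    """Illustrative CycleGAN-style generator objective for converting CG images
    into realistic ones. G_cg2real / G_real2cg are generators and D_real is a
    discriminator on the real-image domain -- all assumed placeholders."""
    fake_real = G_cg2real(cg_batch)                      # CG -> realistic
    pred = D_real(fake_real)
    # Least-squares adversarial loss: the generator tries to make D_real output 1.
    adv_loss = F.mse_loss(pred, torch.ones_like(pred))
    # Cycle consistency: CG -> realistic -> CG should reconstruct the input.
    cyc_loss = F.l1_loss(G_real2cg(fake_real), cg_batch)
    return adv_loss + lambda_cyc * cyc_loss
```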
{"title":"CG-based dataset generation and adversarial image conversion for deep cucumber recognition","authors":"Hiroaki Masuzawa, Chuo Nakano, Jun Miura","doi":"10.23919/MVA57639.2023.10215910","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215910","url":null,"abstract":"This paper deals with deep cucumber recognition using CG (Computer Graphics)-based dataset generation. The variety and the size of the dataset are crucial in deep learning. Although there are many public datasets for common situations like traffic scenes, we need to make a dataset for a particular scene like cucumber farms. As it is costly and time-consuming to annotate much data manually, we proposed generating images by CG and converting them to realistic ones using adversarial learning approaches. We compare several image conversion methods using real cucumber plant images.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114181023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Can you read lips with a masked face?
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215925
Taiki Arakane, Chihiro Kai, T. Saitoh
We have been working on lip-reading, which estimates the content of utterances using only visual information. When most people started wearing masks due to the coronavirus pandemic, several people asked us whether machine-based lip-reading is still possible when a mask is worn. Taking this as a research question, we worked on what is, to our knowledge, the first word-level lip-reading for masked face images. Since no public dataset of utterance scenes with masks is available, we built our own dataset and developed face detection for masked face images, facial landmark detection, and deep-learning-based lip-reading. We collected speech scenes of 20 people for 15 Japanese words and obtained a recognition accuracy of 88.3% in the recognition experiment. This paper reports that lip-reading is possible for masked face images.
{"title":"Can you read lips with a masked face?","authors":"Taiki Arakane, Chihiro Kai, T. Saitoh","doi":"10.23919/MVA57639.2023.10215925","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215925","url":null,"abstract":"We have been working on lip-reading that estimates the content of utterances using only visual information. When most people started wearing masks due to the coronavirus pandemic, several people asked us if using machine-based lip-reading even when wearing a mask was possible. Taking this as a research question, we worked on word-level lip-reading for the world’s first masked face image. The utterance scene dataset when wearing a mask is not open to the public, so we developed our dataset, face detection for masked face images, facial landmarks detection, and lip-reading based on deep learning. We collected speech scenes of 20 people for 15 Japanese words and obtained a recognition accuracy of 88.3% as a result of the recognition experiment. This paper reports that lip-reading is possible for masked face images.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126235519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ensemble Fusion for Small Object Detection
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215748
Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee
Detecting small objects is often impeded by blurriness and low resolution, which poses substantial challenges for accurately detecting and localizing such objects. In addition, conventional feature extraction methods usually face difficulties in capturing effective representations for these objects, as down-sampling and convolutional operations contribute to the blurring of small object details. To tackle these challenges, this study introduces an approach for detecting tiny objects through ensemble fusion, which leverages the advantages of multiple diverse model variants and combines their predictions. Experimental results reveal that the proposed method effectively harnesses the strengths of each model via ensemble fusion, leading to enhanced accuracy and robustness in small object detection. Our model achieves the highest score of 0.776 in terms of average precision (AP) at an IoU threshold of 0.5 in the MVA Challenge on Small Object Detection for Birds.
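The abstract does not spell out the fusion rule, so the sketch below shows one common box-level ensemble in the spirit of weighted boxes fusion: detections from all model variants are pooled, clustered by IoU, and each cluster is replaced by a confidence-weighted average box. It is an illustration, not necessarily the authors' exact method.

```python
import numpy as np

def iou(a, b):
    """IoU between one box a and an array of boxes b, boxes as [x1, y1, x2, y2]."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_detections(per_model_boxes, per_model_scores, iou_thr=0.55):
    """Pool detections from several models, cluster them by IoU, and replace
    each cluster with a confidence-weighted average box."""
    boxes = np.concatenate(per_model_boxes)    # (N, 4)
    scores = np.concatenate(per_model_scores)  # (N,)
    order = np.argsort(-scores)
    boxes, scores = boxes[order], scores[order]
    fused_boxes, fused_scores = [], []
    used = np.zeros(len(boxes), dtype=bool)
    for i in range(len(boxes)):
        if used[i]:
            continue
        cluster = (iou(boxes[i], boxes) >= iou_thr) & ~used
        used |= cluster
        w = scores[cluster][:, None]
        fused_boxes.append((boxes[cluster] * w).sum(0) / w.sum())
        fused_scores.append(scores[cluster].mean())
    return np.array(fused_boxes), np.array(fused_scores)
```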
{"title":"Ensemble Fusion for Small Object Detection","authors":"Hao-Yu Hou, Mu-Yi Shen, Chia-Chi Hsu, En-Ming Huang, Yu-Chen Huang, Yu-Cheng Xia, Chien-Yao Wang, Chun-Yi Lee","doi":"10.23919/MVA57639.2023.10215748","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215748","url":null,"abstract":"Detecting small objects is often impeded by blurriness and low resolution, which poses substantial challenges for accurately detecting and localizing such objects. In addition, conventional feature extraction methods usually face difficulties in capturing effective representations for these entities, as down-sampling and convolutional operations contribute to the blurring of small object details. To tackle these challenges, this study introduces an approach for detecting tiny objects through ensemble fusion, which leverages the advantages of multiple diverse model variants and combines their predictions. Experimental results reveal that the proposed method effectively harnesses the strengths of each model via ensemble fusion, leading to enhanced accuracy and robustness in small object detection. Our model achieves the highest score of 0.776 in terms of average precision (AP) at an IoU threshold of 0.5 in the MVA Challenge on Small Object Detection for Birds.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132454074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Combining Knowledge Distillation and Transfer Learning for Sensor Fusion in Visible and Thermal Camera-based Person Classification
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215818
Vijay John, Yasutomo Kawanishi
Visible and thermal camera-based sensor fusion has been shown to address the limitations of visible camera-based person classification and to enhance its robustness. In this paper, we propose to further enhance the accuracy of visible-thermal person classification using transfer learning, knowledge distillation, and the vision transformer. In our work, the visible-thermal person classifier is implemented as a vision transformer and trained with transfer learning and knowledge distillation. To this end, visible and thermal teacher models are implemented using vision transformers, and the multimodal classifier learns from the two teachers through a novel loss function that incorporates knowledge distillation. The proposed method is validated on the public Speaking Faces dataset. A comparative analysis with baseline algorithms and an ablation study are performed. The results show that the proposed framework achieves enhanced classification accuracy.
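As an illustration of distilling from two modality-specific teachers, the sketch below combines a cross-entropy term with temperature-scaled KL terms toward the visible and thermal teachers. The weighting and temperature are assumptions, not the paper's actual loss function.

```python
import torch
import torch.nn.functional as F

def two_teacher_kd_loss(student_logits, visible_teacher_logits, thermal_teacher_logits,
                        labels, T=4.0, alpha=0.5, beta=0.25):
    """Illustrative distillation objective for a multimodal student learning from
    separate visible and thermal teachers; weights alpha/beta and temperature T
    are assumed, not taken from the paper."""
    ce = F.cross_entropy(student_logits, labels)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    # Temperature-scaled KL divergence toward each teacher's softened predictions.
    kd_vis = F.kl_div(log_p_s, F.softmax(visible_teacher_logits / T, dim=1),
                      reduction="batchmean") * T * T
    kd_thr = F.kl_div(log_p_s, F.softmax(thermal_teacher_logits / T, dim=1),
                      reduction="batchmean") * T * T
    return alpha * ce + beta * kd_vis + beta * kd_thr
```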
{"title":"Combining Knowledge Distillation and Transfer Learning for Sensor Fusion in Visible and Thermal Camera-based Person Classification","authors":"Vijay John, Yasutomo Kawanishi","doi":"10.23919/MVA57639.2023.10215818","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215818","url":null,"abstract":"Visible and thermal camera-based sensor fusion has shown to address the limitations and enhance the robustness of visible camera-based person classification. In this paper, we propose to further enhance the classification accuracy of visible-thermal person classification using transfer learning, knowledge distillation, and the vision transformer. In our work, the visible-thermal person classifier is implemented using the vision transformer. The proposed classifier is trained using the transfer learning and knowledge distillation techniques. To train the proposed classifier, visible and thermal teacher models are implemented using the vision transformers. The multimodal classifier learns from the two teachers using a novel loss function which incorporates the knowledge distillation. The proposed method is validated on the public Speaking Faces dataset. A comparative analysis with baseline algorithms and an ablation study is performed. The results show that the proposed framework reports an enhanced classification accuracy.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129002684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Black-box Adversarial Attack against Visual Interpreters for Deep Neural Networks
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215758
Yudai Hirose, Satoshi Ono
With the rapid development of deep neural networks (DNNs), eXplainable AI, which provides the rationale behind predictions on given inputs, has become increasingly important. At the same time, DNNs are vulnerable to Adversarial Examples (AEs), in which specially crafted perturbations applied to the input cause incorrect outputs. Similar vulnerabilities can also exist in image interpreters such as Grad-CAM and need to be investigated, since they could, for instance, lead to misdiagnosis in medical imaging. This study therefore proposes a black-box adversarial attack method that misleads the image interpreter using Sep-CMA-ES. The proposed method deceptively shifts the focus area of the image interpreter to a location different from that in the original image while keeping the predicted label unchanged.
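A hedged sketch of how such an attack could be set up with the `cma` package (its diagonal mode corresponds to Sep-CMA-ES): the perturbation is evolved so that the saliency peak moves away from its original location while the predicted label stays fixed. The `model` and `gradcam` interfaces are assumed placeholders, and the fitness design is illustrative rather than the paper's exact formulation.

```python
import numpy as np
import cma  # pip install cma

def attack_interpreter(image, model, gradcam, label, orig_peak,
                       eps=8 / 255, sigma0=0.3, iters=100):
    """Sketch of a black-box attack on a visual interpreter: the perturbation is
    evolved with diagonal (separable) CMA-ES so that the saliency peak moves away
    from `orig_peak` while the prediction stays `label`. `model(x)` -> class
    probabilities and `gradcam(model, x)` -> 2-D saliency map are assumed."""
    es = cma.CMAEvolutionStrategy(np.zeros(image.size), sigma0,
                                  {"CMA_diagonal": True, "verbose": -9})
    for _ in range(iters):
        candidates = es.ask()
        fitnesses = []
        for z in candidates:
            delta = eps * np.tanh(z).reshape(image.shape)   # bounded perturbation
            x_adv = np.clip(image + delta, 0.0, 1.0)
            if np.argmax(model(x_adv)) != label:
                fitnesses.append(1e6)                       # reject label changes
                continue
            cam = gradcam(model, x_adv)
            peak = np.unravel_index(np.argmax(cam), cam.shape)
            # Minimizing the negative distance pushes the focus away from orig_peak.
            fitnesses.append(-np.linalg.norm(np.array(peak) - np.array(orig_peak)))
        es.tell(candidates, fitnesses)
    best = es.result.xbest
    return np.clip(image + eps * np.tanh(best).reshape(image.shape), 0.0, 1.0)
```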
{"title":"Black-box Adversarial Attack against Visual Interpreters for Deep Neural Networks","authors":"Yudai Hirose, Satoshi Ono","doi":"10.23919/MVA57639.2023.10215758","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215758","url":null,"abstract":"With the rapid development of deep neural networks (DNNs), eXplainable AI, which provides a basis for prediction on inputs, has become increasingly important. In addition, DNNs have a vulnerability called an Adversarial Example (AE), which can cause incorrect output by applying special perturbations to inputs. Potential vulnerabilities can also exist in image interpreters such as GradCAM, necessitating their investigation, as these vulnerabilities could potentially result in misdiagnosis within medical imaging. Therefore, this study proposes a black-box adversarial attack method that misleads the image interpreter using Sep-CMA-ES. The proposed method deceptively shifts the focus area of the image interpreter to a different location from that of the original image while maintaining the same predictive labels.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"161 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133911197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Malware detection using Kernel Constrained Subspace Method
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215631
Djafer Yahia M Benchadi, Messaoud Benchadi, Bojan Batalo, K. Fukui
This paper proposes a novel approach based on subspace representation for malware detection, the important task of distinguishing malicious from safe files. Our solution is to use a byte-level visualization (image pattern) of the target software and to represent each of the two classes by a low-dimensional subspace in a high-dimensional vector space. We use the kernel constrained subspace method (KCSM) as a classifier, which has shown excellent results in various pattern recognition tasks. However, its computational cost can be high due to the use of the kernel trick, which makes real-time detection difficult. To address this issue, we introduce Random Fourier Features (RFF), which can be handled directly like standard vectors, bypassing the kernel trick. This approach reduces execution time by around 99% while retaining a high recognition rate. We conduct extensive experiments on several public malware datasets and demonstrate superior results against several baselines and previous approaches.
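To illustrate the Random Fourier Feature step, the sketch below maps samples with scikit-learn's RBFSampler and then models each class by a PCA subspace of the mapped features, classifying by squared projection length. This is a plain subspace classifier for illustration only, not the full kernel constrained subspace method of the paper.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.decomposition import PCA

def fit_rff_subspaces(X_safe, X_malware, gamma=1e-3, n_components=2048, n_dims=50, seed=0):
    """Simplified sketch: map byte-level image features with Random Fourier
    Features, then model each class by a low-dimensional PCA subspace."""
    rff = RBFSampler(gamma=gamma, n_components=n_components, random_state=seed)
    rff.fit(np.vstack([X_safe, X_malware]))
    subspaces = {}
    for name, X in {"safe": X_safe, "malware": X_malware}.items():
        Z = rff.transform(X)
        pca = PCA(n_components=n_dims).fit(Z)
        subspaces[name] = pca.components_        # (n_dims, n_components) basis
    return rff, subspaces

def classify(x, rff, subspaces):
    """Assign the class whose subspace best explains the RFF-mapped sample
    (largest squared projection length of the normalized feature)."""
    z = rff.transform(x.reshape(1, -1)).ravel()
    z = z / (np.linalg.norm(z) + 1e-12)
    scores = {c: float(np.sum((B @ z) ** 2)) for c, B in subspaces.items()}
    return max(scores, key=scores.get)
```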
{"title":"Malware detection using Kernel Constrained Subspace Method","authors":"DJAFER YAHIA M BENCHADI, Messaoud Benchadi, Bojan Batalo, K. Fukui","doi":"10.23919/MVA57639.2023.10215631","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215631","url":null,"abstract":"This paper proposes a novel approach based on subspace representation for malware detection, an important task of distinguishing between safe and malware (malicious) file classes. Our solution is to utilize a target software’s byte-level visualization (image pattern) and represent the two classes by low-dimensional subspaces respectively, in high-dimensional vector space. We use the kernel constrained subspace method (KCSM) as a classifier, which has shown excellent results in various pattern recognition tasks. However, its computational cost may be high due to the use of kernel trick, which makes it difficult to achieve real-time detection. To address this issue, we introduce Random Fourier Features (RFF), which we can handle directly like standard vectors, bypassing the kernel trick. This approach reduces execution time by around 99%, while retaining a high recognition rate. We conduct extensive experiments on several public malware datasets, and demonstrate superior results against several baselines and previous approaches.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128915386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware-Aware Zero-Shot Neural Architecture Search
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216205
Yutaka Yoshihama, Kenichi Yadani, Shota Isobe
Designing a convolutional neural network architecture that achieves both low latency and high accuracy on edge devices with constrained computational resources is a difficult challenge. Neural architecture search (NAS) is used to optimize the architecture in a large design space, but at huge computational cost. As a countermeasure, we use a zero-shot NAS method. A drawback of the previous method was the discrepancy between the evaluation score assigned to a neural architecture and its actual accuracy. To address this problem, we refined the search space of the previous zero-shot NAS. The neural architecture obtained with the proposed method achieves an ImageNet top-1 accuracy of 75.3% at a latency equivalent to MobileNetV2 (ImageNet top-1 accuracy 71.8%) on the Qualcomm SA8155 platform.
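The abstract does not state which zero-shot metric is used; as a representative zero-cost proxy, the sketch below computes a SynFlow-style score that ranks untrained architectures from a single forward/backward pass on an all-ones input, with no training data. Candidate architectures can then be ranked by such a score under a latency budget and only the top-ranked network actually trained.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def _linearize(model: nn.Module):
    """Replace every weight by its absolute value so the all-ones forward pass
    keeps all paths positive (note: modifies the model in place)."""
    for p in model.parameters():
        p.abs_()

def synflow_score(model: nn.Module, input_shape=(3, 224, 224)) -> float:
    """SynFlow-style zero-cost proxy: score an untrained architecture without
    any training data. A representative proxy, not necessarily the paper's metric."""
    model = model.eval()
    _linearize(model)
    x = torch.ones(1, *input_shape)
    model.zero_grad()
    model(x).sum().backward()
    return sum(float((p * p.grad).abs().sum())
               for p in model.parameters() if p.grad is not None)
```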
{"title":"Hardware-Aware Zero-Shot Neural Architecture Search","authors":"Yutaka Yoshihama, Kenichi Yadani, Shota Isobe","doi":"10.23919/MVA57639.2023.10216205","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10216205","url":null,"abstract":"Designing a convolutional neural network architecture that achieves low-latency and high accuracy on edge devices with constrained computational resources is a difficult challenge. Neural architecture search (NAS) is used to optimize the architecture in a large design space, but at huge computational cost. As a countermeasure, we use here the zero-shot NAS method. A drawback to the previous method was that a discrepancy of correction occurred between the evaluation score of the neural architecture and its accuracy. To address this problem, we refined the neural architecture search space from previous zero-shot NAS. The neural architecture obtained using the proposed method achieves ImageNet top-1 accuracy of 75.3% under conditions of latency equivalent to MobileNetV2 (ImageNet top-1 accuracy is 71.8%) on the Qualcomm SA8155 platform.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129345815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Monocular Blind Spot Estimation with Occupancy Grid Mapping
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215609
Kazuya Odagiri, K. Onoguchi
We present a low-cost method for detecting blind spots in front of the ego vehicle. In low visibility conditions, blind spot estimation is crucial to avoid the risk of pedestrians or vehicles appearing suddenly. However, most blind spot estimation methods require expensive range sensors or neural networks trained with data measured by them. Our method only uses a monocular camera throughout all phases from training to inference, since it is cheaper and more versatile. We assume that a blind spot is a depth discontinuity region. Occupancy probabilities of these regions are integrated using the occupancy grid mapping algorithm. Instead of using range sensors, we leverage the self-supervised monocular depth estimation method for the occupancy grid mapping. 2D blind spot labels are created from occupancy grids and a blind spot estimation network is trained using these labels. Our experiments show quantitative and qualitative performance and demonstrate an ability to learn with arbitrary videos.
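The occupancy grid mapping step can be illustrated with the standard log-odds update: cells traversed by a ray derived from the estimated depth are marked free and the endpoint cell occupied. The conversion from a monocular depth map to the virtual 2-D rays is omitted here, and the increment values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

L_OCC, L_FREE = 0.85, -0.4          # log-odds increments (illustrative values)

def update_grid(log_odds, ranges, angles, pose, resolution=0.1):
    """Hedged sketch of the standard occupancy grid mapping update: for each
    virtual ray (range r at bearing a from camera pose), traversed cells are
    marked free and the endpoint cell occupied."""
    x0, y0, theta = pose
    for r, a in zip(ranges, angles):
        xe = x0 + r * np.cos(theta + a)            # ray endpoint in world frame
        ye = y0 + r * np.sin(theta + a)
        n = int(r / resolution)
        for i in range(n + 1):
            t = i / max(n, 1)
            cx = int((x0 + t * (xe - x0)) / resolution)
            cy = int((y0 + t * (ye - y0)) / resolution)
            if 0 <= cx < log_odds.shape[0] and 0 <= cy < log_odds.shape[1]:
                log_odds[cx, cy] += L_OCC if i == n else L_FREE
    return log_odds

def occupancy_probability(log_odds):
    """Convert accumulated log-odds back to occupancy probabilities."""
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))
```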
{"title":"Monocular Blind Spot Estimation with Occupancy Grid Mapping","authors":"Kazuya Odagiri, K. Onoguchi","doi":"10.23919/MVA57639.2023.10215609","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215609","url":null,"abstract":"We present a low-cost method for detecting blind spots in front of the ego vehicle. In low visibility conditions, blind spot estimation is crucial to avoid the risk of pedestrians or vehicles appearing suddenly. However, most blind spot estimation methods require expensive range sensors or neural networks trained with data measured by them. Our method only uses a monocular camera throughout all phases from training to inference, since it is cheaper and more versatile. We assume that a blind spot is a depth discontinuity region. Occupancy probabilities of these regions are integrated using the occupancy grid mapping algorithm. Instead of using range sensors, we leverage the self-supervised monocular depth estimation method for the occupancy grid mapping. 2D blind spot labels are created from occupancy grids and a blind spot estimation network is trained using these labels. Our experiments show quantitative and qualitative performance and demonstrate an ability to learn with arbitrary videos.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115453721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Interpreting Art by Leveraging Pre-Trained Models
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216010
Niklas Penzel, J. Denzler
In many domains, so-called foundation models have recently been proposed. These models are trained on immense amounts of data, resulting in impressive performance on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining such models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language-vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce the compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
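One plausible way to couple a frozen CLIP image encoder with GPT-2, in the spirit of prefix-style captioning, is to project the CLIP image embedding into a short sequence of GPT-2 input embeddings. The paper's exact architecture and fine-tuning recipe are not given in the abstract, so the sketch below is an assumption-laden illustration of that coupling.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, GPT2LMHeadModel

class ClipPrefixInterpreter(nn.Module):
    """Hedged sketch: a frozen CLIP image embedding is mapped to a learned
    prefix of GPT-2 input embeddings, and GPT-2 generates interpretation text
    conditioned on that prefix."""
    def __init__(self, prefix_len=10):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        self.prefix_len = prefix_len
        # Map one CLIP embedding to `prefix_len` GPT-2 token embeddings.
        self.project = nn.Linear(self.clip.config.projection_dim,
                                 prefix_len * self.gpt2.config.n_embd)

    def forward(self, pixel_values, input_ids):
        with torch.no_grad():                                 # keep CLIP frozen
            img = self.clip.get_image_features(pixel_values=pixel_values)
        prefix = self.project(img).view(-1, self.prefix_len, self.gpt2.config.n_embd)
        tokens = self.gpt2.transformer.wte(input_ids)         # interpretation tokens
        embeds = torch.cat([prefix, tokens], dim=1)
        return self.gpt2(inputs_embeds=embeds).logits
```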
{"title":"Interpreting Art by Leveraging Pre-Trained Models","authors":"Niklas Penzel, J. Denzler","doi":"10.23919/MVA57639.2023.10216010","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10216010","url":null,"abstract":"In many domains, so-called foundation models were recently proposed. These models are trained on immense amounts of data resulting in impressive performances on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models1. Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114555505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bottleneck Transformer model with Channel Self-Attention for skin lesion classification
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215720
Masato Tada, X. Han
Early diagnosis of skin diseases is an important and challenging task for proper treatment; even malignant melanoma, the deadliest skin cancer with a life expectancy of less than five years, can be cured when detected early, increasing the survival rate. Manual diagnosis of skin lesions by specialists is not only time-consuming but also often leads to large variation in diagnostic results. Recently, deep learning networks built mainly on convolution operations have been widely employed for visual recognition, including medical image analysis and classification, and have demonstrated great effectiveness. However, the convolution operation extracts features within a limited receptive field and cannot capture the long-range dependencies needed to model global context. Therefore, the transformer, with its self-attention modules for global feature modeling, has become a prevalent architecture for lifting performance in various vision tasks. This study aims to construct a hybrid skin lesion recognition model that incorporates both convolution operations and self-attention structures. Specifically, we first employ a backbone CNN to extract high-level feature maps and then leverage a transformer block to capture global correlations. Because the high-level features carry diverse contexts in the channel domain but reduced information in the spatial domain, we apply self-attention along the channel direction instead of the spatial self-attention of the conventional transformer block, and then model spatial relations with a depth-wise convolution block in the feed-forward module. To demonstrate the effectiveness of the proposed method, we conduct experiments on the HAM10000 and ISIC2019 skin lesion datasets and verify its superior performance over the baseline model and state-of-the-art methods.
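The channel self-attention described above can be sketched as attention computed between channels (a C×C affinity matrix) rather than between spatial positions. The exact block design is not specified in the abstract, so details such as the 1×1 projections and the residual connection below are assumptions.

```python
import torch
import torch.nn as nn

class ChannelSelfAttention(nn.Module):
    """Illustrative channel self-attention block: the attention matrix is C x C
    (between channels) instead of (H*W) x (H*W) (between spatial positions)."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                              # (B, C, H*W)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax((q @ k.transpose(1, 2)) * (h * w) ** -0.5, dim=-1)  # (B, C, C)
        out = attn @ v                                        # (B, C, H*W)
        return x + out.view(b, c, h, w)                       # residual connection
```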
{"title":"Bottleneck Transformer model with Channel Self-Attention for skin lesion classification","authors":"Masato Tada, X. Han","doi":"10.23919/MVA57639.2023.10215720","DOIUrl":"https://doi.org/10.23919/MVA57639.2023.10215720","url":null,"abstract":"Early diagnosis of skin diseases is an important and challenge task for proper treatment, and even the deadliest skin cancer: the malignant melanoma can be cured for increasing the survival rate with less than 5-year life expectancy. The manual diagnosis of skin lesions by specialists not only is time-consuming but also usually causes great variation of the diagnosis results. Recently, deep learning networks with the main convolution operations have been widely employed for vision recognition including medical image analysis and classification, and demonstrated the great effectiveness. However, the convolution operation extracts the feature in the limited receptive field, and cannot capture long-range dependence for modeling global contexts. Therefore, transformer as an alternative for global feature modeling with self-attention module has become the prevalent network architecture for lifting performance in various vision tasks. This study aims to construct a hybrid skin lesion recognition model by incorporating the convolution operations and self-attention structures. Specifically, we firstly employ a backbone CNN to extract the high-level feature maps, and then leverage a transformer block to capture the global correlation. Due to the diverse contexts in channel domain and the reduced information in spatial domain of the high-level features, we alternatively incorporate a self-attention to model long-range dependencies in the channel direction instead of spatial self-attention in the conventional transformer block, and then follow spatial relation modeling with the depth-wise convolution block in the feature feed-forward module. To demonstrate the effectiveness of the proposed method, we conduct experiments on the HAM10000 and ISIC2019 skin lesion datasets, and verify the superior performance over the baseline model and the state-of-the-art methods.","PeriodicalId":338734,"journal":{"name":"2023 18th International Conference on Machine Vision and Applications (MVA)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125364532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}