Padding Investigations for CNNs in Scene Parsing Tasks
Yu-Hui Huang, M. Proesmans, L. Gool
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216084
2023 18th International Conference on Machine Vision and Applications (MVA)
Zero padding is widely used in convolutional neural networks (CNNs) to prevent feature maps from shrinking too quickly. However, it has been claimed to disturb the statistics at the border [9]. In this work, we compare various padding methods for the scene parsing task and propose an alternative padding method (CApadding) that extends the image to alleviate the border issue. Experiments on Cityscapes [2] and DeepGlobe [3] show that models with the proposed padding method achieve higher mean Intersection-over-Union (IoU) than zero-padding-based models.
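The border effect the abstract refers to can be seen directly by comparing standard padding modes on a toy feature-map row; a minimal NumPy sketch (CApadding itself is the paper's contribution and is not reproduced here):

```python
import numpy as np

# Border statistics under different padding modes, on a toy feature-map row.
row = np.array([5.0, 6.0, 7.0, 8.0])

padded = {
    "zeros":     np.pad(row, 1, mode="constant"),  # [0 5 6 7 8 0]
    "reflect":   np.pad(row, 1, mode="reflect"),   # [6 5 6 7 8 7]
    "replicate": np.pad(row, 1, mode="edge"),      # [5 5 6 7 8 8]
}

# Zero padding drags border values toward zero -- the statistical
# disturbance the abstract mentions -- while the other modes stay
# within the signal's own value range.
for name, p in padded.items():
    print(f"{name:9s} {p} border mean: {(p[0] + p[-1]) / 2}")
```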
Using Unconditional Diffusion Models in Level Generation for Super Mario Bros
Hyeon Joon Lee, E. Simo-Serra
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215856
This study introduces a novel methodology for generating levels in the iconic video game Super Mario Bros. using a diffusion model based on a UNet architecture. The model is trained on existing levels, represented as a categorical distribution, to accurately capture the game's fundamental mechanics and design principles. The proposed approach demonstrates notable success in producing high-quality and diverse levels, with a significant proportion being playable by an artificial agent. This research emphasizes the potential of diffusion models as an efficient tool for procedural content generation and highlights their potential impact on the development of new video games and the enhancement of existing games through generated content.
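The "categorical distribution" representation of levels can be illustrated with a toy one-hot encoding of a tile grid, the form a UNet-based diffusion model would consume; the tiny tile set below is invented for illustration and is not from the paper:

```python
import numpy as np

# Hypothetical three-symbol tile set: air, ground, question block.
TILES = {"-": 0, "X": 1, "?": 2}
level = ["--?--",
         "-----",
         "XXXXX"]

# Integer tile grid -> one-hot categorical tensor of shape (H, W, num_tiles).
grid = np.array([[TILES[t] for t in row] for row in level])
onehot = np.eye(len(TILES))[grid]
print(onehot.shape)             # (3, 5, 3)

# Decoding a (denoised) per-tile distribution back to tiles is a per-cell argmax.
decoded = onehot.argmax(axis=-1)
print(bool((decoded == grid).all()))  # True
```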
Enhancing Retail Product Recognition: Fine-Grained Bottle Size Classification
Katarina Tolja, M. Subašić, Z. Kalafatić, S. Lončarić
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215699
In this paper, we propose two novel approaches to tackle the key challenges in product size classification, with a specific focus on bottles. We leverage the bottle cap as a reference object, which makes bottle size classification robust to the distance between the capturing device and the retail shelf, the viewing angle, and the arrangement of bottles on the shelves. We showcase the use of the reference object in explicit and implicit approaches and discuss the benefits and limitations of the proposed methods.
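A minimal sketch of how a fixed-size reference object cancels out camera distance: the cap's pixel width yields a millimetres-per-pixel scale that is independent of how close the shelf was photographed. The cap diameter and the size thresholds below are assumptions for illustration, not figures from the paper:

```python
# Assumed real-world cap diameter (typical for PET bottles; illustrative only).
CAP_DIAMETER_MM = 30.0

def estimate_bottle_height_mm(bottle_height_px: float, cap_width_px: float) -> float:
    """Scale pixel height to millimetres using the cap as a reference."""
    mm_per_px = CAP_DIAMETER_MM / cap_width_px
    return bottle_height_px * mm_per_px

def classify_size(height_mm: float) -> str:
    """Assumed, illustrative size thresholds."""
    if height_mm < 200:
        return "small"
    if height_mm < 280:
        return "medium"
    return "large"

# The same bottle photographed twice as close (twice the pixels) gives the
# same estimate, because the cap scales proportionally with it.
print(classify_size(estimate_bottle_height_mm(600, 80)))    # 600*30/80 = 225 mm -> medium
print(classify_size(estimate_bottle_height_mm(1200, 160)))  # same ratio -> medium
```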
Image Impression Estimation by Clustering People with Similar Tastes
Banri Kojima, Takahiro Komamizu, Yasutomo Kawanishi, Keisuke Doman, I. Ide
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216055
This paper proposes a method for estimating impressions of images according to the personal attributes of users, so that they can find desired images based on their tastes. Our previous work, which considered gender and age as personal attributes, showed promising results, but it also showed that users sharing these attributes do not necessarily share similar tastes. Therefore, other attributes should be considered to capture each user's personal tastes well. However, taking more attributes into account leads to a problem in which insufficient training data is available per classifier, due to the combinatorial explosion of attribute combinations. To tackle this problem, we propose an aggregation-based method that condenses the training data for impression estimation while considering personal attribute information. For evaluation, a dataset of 4,000 carpet images annotated with 24 impression words was prepared. Experimental results showed that using combinations of personal attributes improved the accuracy of impression estimation, which indicates the effectiveness of the proposed approach.
Low-Level Feature Aggregation Networks for Disease Severity Estimation of Coffee Leaves
Takuhiro Okada, Yuantian Huang, Guoqing Hao, S. Iizuka, K. Fukui
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215626
This paper presents a deep learning-based approach for severity classification of coffee leaf diseases. Coffee leaf diseases are a significant problem in the coffee industry, where estimating the health status of coffee leaves from their appearance is crucial in the production process. However, there have been few studies on this task, and cases of misclassification have been reported due to the inability to detect slight color differences when classifying disease severity. In this work, we propose a low-level feature aggregation technique for neural network-based classifiers that captures the distribution of discoloration over the entire coffee leaf, which effectively supports discrimination of the severity. This feature aggregation is achieved by incorporating attention mechanisms into the shallow layers of the network, which extract low-level features such as color. The attention mechanism in the shallow layers provides the network with information on global dependencies among the color features of the leaves, allowing the network to identify the disease severity more easily. We use an efficient computational technique for the attention modules to reduce memory and computational cost, which enables us to introduce attention mechanisms on the large feature maps in the shallow layers. We conduct in-depth validation experiments on coffee leaf disease datasets and demonstrate the effectiveness of our proposed model, compared to state-of-the-art image classification models, in accurately classifying the severity of coffee leaf diseases.
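One common way to make attention affordable on large shallow-layer feature maps is the "efficient attention" factorization, which aggregates a global (d x d) context instead of building an (N x N) attention map; the paper's exact module may differ, so this is only a sketch of the general idea:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Q, K, V: (N, d) with N = H*W flattened spatial positions.

    Cost is O(N * d^2) rather than O(N^2 * d), because the keys are
    summarized into a (d, d) global context before the queries read it.
    """
    context = softmax(K, axis=0).T @ V   # (d, d) global summary over all positions
    return softmax(Q, axis=1) @ context  # (N, d) output, one row per position

N, d = 64 * 64, 8                        # e.g. a 64x64 shallow feature map
rng = np.random.default_rng(0)
out = efficient_attention(rng.normal(size=(N, d)),
                          rng.normal(size=(N, d)),
                          rng.normal(size=(N, d)))
print(out.shape)  # (4096, 8)
```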
Small Object Detection for Birds with Swin Transformer
Da Huo, Marc A. Kastner, Tingwei Liu, Yasutomo Kawanishi, Takatsugu Hirayama, Takahiro Komamizu, I. Ide
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216093
Object detection is the task of detecting objects in an image. In this task, the detection of small objects is particularly difficult. Beyond their small size, they are also accompanied by difficulties due to blur, occlusion, and so on. Current small object detection methods are tailored to small and dense situations, such as pedestrians in a crowd or distant objects in remote sensing scenarios. However, when the target object is small and sparse, there is a lack of objects available for training, making it more difficult to learn effective features. In this paper, we propose a specialized method for detecting a specific category of small objects: birds. In particular, we improve the features learned by the neck (the sub-network between the backbone and the prediction head) to learn more effective features with a hierarchical design. We employ Swin Transformer to upsample the image features. Moreover, we change the shifted window size to adapt to small objects. Experiments show that the proposed Swin Transformer-based neck combined with CenterNet can achieve good performance by changing the window sizes. We further find that smaller window sizes (default 2) benefit mAP for small object detection.
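Window partitioning, the mechanism behind the window-size experiments, can be sketched in a few lines of NumPy. Self-attention in Swin is computed within each window, so a smaller window size restricts attention to more, finer local regions (shapes only; no attention weights are computed here):

```python
import numpy as np

def window_partition(x: np.ndarray, ws: int) -> np.ndarray:
    """Split an (H, W, C) feature map into (num_windows, ws*ws, C) token groups."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)
    return x

feat = np.zeros((8, 8, 96))             # toy 8x8 feature map, 96 channels
print(window_partition(feat, 4).shape)  # (4, 16, 96): 4 coarse windows
print(window_partition(feat, 2).shape)  # (16, 4, 96): more, finer windows
```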
Grid Sample Based Temporal Iteration and Compactness-coefficient Distance for High Frame and Ultra-low Delay SLIC Segmentation System
Yuan Li, Tingting Hu, Ryuji Fuchikami, T. Ikenaga
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215797
High frame rate and ultra-low delay vision systems, which process 1000 FPS video within a 1 ms/frame delay, play an increasingly important role in fields such as robotics and factory automation. Within such systems, image segmentation is necessary, as segmentation is a crucial pre-processing step for various applications. Recently, much research has focused on superpixel segmentation, but few works attempt to reach high processing speeds. To achieve this target, this paper proposes: (A) Grid sample based temporal iteration, which leverages the high frame rate video property to distribute iterations over the temporal domain, ensuring the entire system stays within one frame of delay. Additionally, grid sampling is proposed to add initialization information to the temporal iteration for the stability of superpixels. (B) A compactness-coefficient distance, which incorporates information from the entire superpixel instead of using only the information of the center point. Evaluation results demonstrate that the proposed superpixel segmentation system achieves boundary recall and under-segmentation error comparable to the original SLIC superpixel segmentation system. For label consistency, the proposed system scores more than 0.02 higher than the original system.
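The compactness-coefficient distance builds on the standard SLIC distance, which combines CIELAB color distance and spatial distance via the compactness parameter m and grid interval S; only the standard baseline formulation is sketched here:

```python
import numpy as np

def slic_distance(lab1, xy1, lab2, xy2, m=10.0, S=16.0):
    """Standard SLIC distance: D = sqrt(d_color^2 + (d_space / S)^2 * m^2)."""
    d_color = np.linalg.norm(np.asarray(lab1, float) - np.asarray(lab2, float))
    d_space = np.linalg.norm(np.asarray(xy1, float) - np.asarray(xy2, float))
    return float(np.sqrt(d_color ** 2 + (d_space / S) ** 2 * m ** 2))

# A pixel exactly one grid interval S away in space, with identical color,
# costs exactly m -- this is how m trades color similarity against compactness.
print(slic_distance((50, 0, 0), (0, 0), (50, 0, 0), (16, 0)))  # 10.0
```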
Unsupervised Fall Detection on Edge Devices
Takuya Nakabayashi, H. Saito
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215993
Automatic fall detection is a crucial task in healthcare, as falls pose a significant risk to the health of elderly individuals. This paper presents a lightweight acceleration-based fall detection method that can be implemented on edge devices. The proposed method uses Autoencoders, a type of unsupervised learning model, within an anomaly detection framework, allowing the network to be trained without extensive labeled fall data. This matters because one of the main challenges in fall detection is the difficulty of collecting fall data; the anomaly detection framework lets our method sidestep this limitation by training on non-fall data only. Additionally, the method employs an extremely lightweight Autoencoder that can run independently on an edge device, eliminating the need to transmit data to a server and minimizing privacy concerns. We conducted experiments comparing the performance of our proposed method with that of a baseline method on a unique fall detection dataset. Our results confirm that our method outperforms the baseline in detecting falls with higher accuracy.
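The anomaly-detection framing can be sketched as follows: fit a reconstruction model on normal (non-fall) acceleration windows only, then flag windows whose reconstruction error exceeds a threshold. A linear autoencoder is equivalent to PCA, so an SVD-based reconstruction stands in for the paper's (presumably nonlinear) Autoencoder; all data and the threshold below are synthetic:

```python
import numpy as np

# Synthetic "normal" acceleration windows: a periodic gait-like signal plus noise.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(500, 32)) + np.sin(np.linspace(0, 2 * np.pi, 32))

# Linear "autoencoder" = mean + top principal components (via SVD).
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
basis = Vt[:4]                              # 4-dimensional "latent code"

def recon_error(window: np.ndarray) -> float:
    code = (window - mean) @ basis.T        # encode
    recon = code @ basis + mean             # decode
    return float(np.mean((window - recon) ** 2))

# Threshold from the tail of the normal data's own reconstruction errors.
threshold = np.quantile([recon_error(w) for w in normal], 0.99)

# A high-amplitude spike unlike anything seen in training is flagged.
fall_like = rng.normal(0.0, 2.0, size=32)
print(recon_error(fall_like) > threshold)   # True
```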
Weakly-Supervised Deep Image Hashing based on Cross-Modal Transformer
Ching-Ching Yang, W. Chu, S. Dubey
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10216160
Weakly-supervised image hashing has emerged recently because web images associated with contextual text or tags are abundant. Text information weakly related to images can be utilized to guide the learning of a deep hashing network. In this paper, we propose Weakly-supervised deep Hashing based on Cross-Modal Transformer (WHCMT). First, cross-scale attention between image patches is computed to form more effective visual representations. A baseline transformer is also adopted to find self-attention of tags and form tag representations. Second, the cross-modal attention between images and tags is discovered by the proposed cross-modal transformer. Effective hash codes are then generated by embedding layers. WHCMT is tested on semantic image retrieval, and we show new state-of-the-art results on the MIRFLICKR-25K and NUS-WIDE datasets.
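Deep-hashing pipelines like the one described typically end by binarizing a real-valued embedding into a hash code and ranking retrieval candidates by Hamming distance; a generic sketch of that final step (toy embeddings, not WHCMT outputs):

```python
import numpy as np

def to_hash(embedding: np.ndarray) -> np.ndarray:
    """Binarize a real-valued embedding by sign."""
    return (embedding > 0).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits; the retrieval ranking criterion."""
    return int(np.count_nonzero(a != b))

query   = to_hash(np.array([ 0.7, -0.2,  1.3, -0.9]))
match   = to_hash(np.array([ 0.4, -0.6,  0.8, -0.1]))  # same sign pattern
nomatch = to_hash(np.array([-0.5,  0.9, -1.1,  0.2]))  # opposite signs
print(hamming(query, match), hamming(query, nomatch))  # 0 4
```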
ViTVO: Vision Transformer based Visual Odometry with Attention Supervision
Chu-Chi Chiu, Hsuan-Kung Yang, Hao-Wei Chen, Yu-Wen Chen, Chun-Yi Lee
Pub Date: 2023-07-23 | DOI: 10.23919/MVA57639.2023.10215538
In this paper, we develop a Vision Transformer based visual odometry (VO) framework, called ViTVO. ViTVO introduces an attention mechanism to perform visual odometry. Due to the nature of VO, Transformer based VO models tend to over-concentrate on a few points, which may degrade accuracy. In addition, noise from dynamic objects usually causes difficulties in performing VO tasks. To overcome these issues, we propose an attention loss during training, which utilizes ground truth masks or self-supervision to guide the attention maps to focus more on static regions of an image. In our experiments, we demonstrate the superior performance of ViTVO on the Sintel validation set, and validate the effectiveness of our attention supervision mechanism in performing VO tasks.
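An attention-supervision loss in the spirit of the abstract can be sketched as the fraction of attention mass that falls on dynamic-object pixels; this exact form is an assumption for illustration, not the paper's loss:

```python
import numpy as np

def attention_loss(attn: np.ndarray, static_mask: np.ndarray) -> float:
    """attn: (H, W) attention map summing to 1; static_mask: (H, W), 1 = static.

    Penalizes attention spent on dynamic regions; zero when all attention
    lies on static pixels.
    """
    return float((attn * (1.0 - static_mask)).sum())

attn = np.full((4, 4), 1 / 16)   # uniform attention over a 4x4 map
mask = np.ones((4, 4))
mask[:, 3] = 0                   # rightmost column marked dynamic
print(attention_loss(attn, mask))  # 0.25: a quarter of the attention is dynamic
```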