{"title":"基于视觉转换器的零射击学习","authors":"Ruisheng Ran, Qianwei Hu, Tianyu Gao, Shuhong Dong","doi":"10.1109/prmvia58252.2023.00010","DOIUrl":null,"url":null,"abstract":"Zero-Shot Learning (ZSL) simulates human’s transfer learning mechanism, which can recognize samples or categories that have not appeared during the training phase. However, the current ZSL still has a domain shift issue. To solved the domain shift issue, we propose a new ZSL method that combines Vision Transformer (ViT) and the encoder-decoder mechanism. This method refers to ViT’s Multi-Head Self-Attention (MSA) to extract more detailed visual features. The encoder-decoder mechanism can make the semantic information extracted from the image features accurately express its visual features and enhance recognition accuracy. We implemented it on three data sets of CUB, SUN and AWA2, and the experimental results proved that the method suggested in this study performs better than the current available methods. It shows that our new method is an effective ZSL method.","PeriodicalId":221346,"journal":{"name":"2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Zero-Shot Learning based on Vision Transformer\",\"authors\":\"Ruisheng Ran, Qianwei Hu, Tianyu Gao, Shuhong Dong\",\"doi\":\"10.1109/prmvia58252.2023.00010\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Zero-Shot Learning (ZSL) simulates human’s transfer learning mechanism, which can recognize samples or categories that have not appeared during the training phase. However, the current ZSL still has a domain shift issue. To solved the domain shift issue, we propose a new ZSL method that combines Vision Transformer (ViT) and the encoder-decoder mechanism. This method refers to ViT’s Multi-Head Self-Attention (MSA) to extract more detailed visual features. The encoder-decoder mechanism can make the semantic information extracted from the image features accurately express its visual features and enhance recognition accuracy. We implemented it on three data sets of CUB, SUN and AWA2, and the experimental results proved that the method suggested in this study performs better than the current available methods. 
It shows that our new method is an effective ZSL method.\",\"PeriodicalId\":221346,\"journal\":{\"name\":\"2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/prmvia58252.2023.00010\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/prmvia58252.2023.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Zero-Shot Learning (ZSL) simulates the human transfer-learning mechanism: it recognizes samples or categories that do not appear during the training phase. However, current ZSL methods still suffer from the domain shift problem. To address it, we propose a new ZSL method that combines the Vision Transformer (ViT) with an encoder-decoder mechanism. The method uses ViT's Multi-Head Self-Attention (MSA) to extract more detailed visual features, while the encoder-decoder mechanism ensures that the semantic information extracted from the image features accurately expresses the corresponding visual features, improving recognition accuracy. We evaluated the method on three datasets, CUB, SUN, and AWA2, and the experimental results show that it outperforms currently available methods, demonstrating that the proposed approach is an effective ZSL method.
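The abstract does not include implementation details, so the following is only a minimal sketch, assuming a PyTorch setup, of the kind of pipeline it describes: pre-extracted ViT features are encoded into the semantic (attribute) space, a decoder reconstructs the visual features, and unseen classes are predicted by the nearest class attribute vector. All dimensions, layer sizes, loss weights, and function names are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a ViT-feature + encoder-decoder ZSL pipeline (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderDecoderZSL(nn.Module):
    def __init__(self, visual_dim=768, semantic_dim=312, hidden_dim=1024):
        super().__init__()
        # Encoder: visual features (e.g. ViT [CLS] embeddings) -> semantic/attribute space
        self.encoder = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, semantic_dim),
        )
        # Decoder: semantic space -> reconstructed visual features
        self.decoder = nn.Sequential(
            nn.Linear(semantic_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, visual_feats):
        semantic_pred = self.encoder(visual_feats)
        visual_recon = self.decoder(semantic_pred)
        return semantic_pred, visual_recon

def zsl_loss(semantic_pred, semantic_true, visual_recon, visual_feats, recon_weight=1.0):
    """Attribute-regression loss plus reconstruction loss (weighting is an assumption)."""
    return (F.mse_loss(semantic_pred, semantic_true)
            + recon_weight * F.mse_loss(visual_recon, visual_feats))

@torch.no_grad()
def predict_unseen(model, visual_feats, class_attributes):
    """Classify by cosine similarity to each unseen class's attribute vector."""
    semantic_pred, _ = model(visual_feats)                                    # (B, semantic_dim)
    sims = F.normalize(semantic_pred, dim=1) @ F.normalize(class_attributes, dim=1).T
    return sims.argmax(dim=1)                                                 # predicted class indices

if __name__ == "__main__":
    # Toy example: 4 samples with 768-d ViT features, CUB-style 312-d attributes, 10 unseen classes.
    model = EncoderDecoderZSL()
    feats, attrs = torch.randn(4, 768), torch.randn(4, 312)
    sem, rec = model(feats)
    zsl_loss(sem, attrs, rec, feats).backward()
    print(predict_unseen(model, feats, torch.randn(10, 312)))
```

In this sketch the reconstruction term plays the role the abstract attributes to the encoder-decoder mechanism: forcing the predicted semantic vector to retain enough information to reproduce the visual features, which is intended to reduce domain shift on unseen classes.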