Author: Y. Huang
DOI: 10.1109/ICCECE58074.2023.10135253
Published in: 2023 3rd International Conference on Consumer Electronics and Computer Engineering (ICCECE), 2023-01-06
Citation count: 1
ViT-R50 GAN: Vision Transformers Hybrid Model based Generative Adversarial Networks for Image Generation
In recent years, GANs have demonstrated tremendous potential in image generation. The Transformer, which originated in NLP, is increasingly being applied in computer vision, and the Vision Transformer (ViT) performs well on image classification. In this paper, we design a ViT-based GAN architecture for image generation. We found that a Transformer-based generator performs poorly because it applies the same attention matrix to every channel. To overcome this, we increase the number of heads so that more attention matrices are generated; we name this component enhanced multi-head attention, and it replaces the standard multi-head attention in the Transformer. Second, our discriminator is a hybrid of ResNet50 and ViT, where ResNet50 handles feature extraction, improving the discriminator's performance. Experiments show that our architecture performs well on image generation tasks.
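The core idea behind the generator fix is that each attention head computes its own attention matrix, so raising the head count yields more distinct attention maps instead of one map shared across channels. The abstract does not give the exact formulation of "enhanced multi-head attention", so the following is only a minimal NumPy sketch of standard multi-head self-attention illustrating that principle; all projection shapes and initializations are my assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Toy multi-head self-attention.

    x: (seq_len, d_model) token embeddings.
    Each head has its own Q/K/V projections and therefore its own
    attention matrix -- increasing num_heads increases the number of
    distinct attention maps, the idea the paper's "enhanced multi-head
    attention" builds on (exact design is not given in the abstract).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    outputs, attn_maps = [], []
    for _ in range(num_heads):
        # Random projections stand in for learned weights in this sketch.
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
                      for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_head))  # (seq_len, seq_len) per head
        outputs.append(A @ V)
        attn_maps.append(A)
    # Concatenate head outputs back to d_model, as in the standard Transformer.
    return np.concatenate(outputs, axis=-1), attn_maps

rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))          # 4 tokens, d_model = 8
out, maps = multi_head_attention(tokens, num_heads=4, rng=rng)
```

With 4 heads, the call produces 4 separate attention matrices (`maps`), whereas a single-head layer would force one attention pattern on all channels.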