Swin Transformer with Local Aggregation
Lu Chen, Yang Bai, Q. Cheng, Mei Wu
2022 3rd International Conference on Information Science, Parallel and Distributed Systems (ISPDS), 2022-07-22
DOI: 10.1109/ISPDS56360.2022.9874052
Despite the many advantages of Convolutional Neural Networks (CNNs), their receptive fields are usually small, which is not conducive to capturing global features. In contrast, the Transformer can capture long-range dependencies and obtain global information about an image through self-attention. To combine the advantages of CNNs and Transformers, we propose integrating a Local Aggregation module into the Swin Transformer architecture. The Local Aggregation module consists of lightweight depthwise and pointwise convolutions, and it captures local feature-map information at each stage of the Swin Transformer. Our experiments demonstrate that this integrated model improves accuracy. On the CIFAR-10 dataset, Top-1 accuracy reaches 87.74%, which is 3.32% higher than Swin, and Top-5 accuracy reaches 99.54%; on the Mini-ImageNet dataset, Top-1 accuracy reaches 79.1%, which is 7.68% higher than Swin, and Top-5 accuracy reaches 94.02%, which is 3.25% higher than Swin.
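The depthwise-then-pointwise pattern underlying the Local Aggregation module can be illustrated with a minimal NumPy sketch. This is not the authors' code; the function names, tensor layout (channels-first, single image), and "same" padding are illustrative assumptions. A depthwise convolution filters each channel independently with its own small kernel (capturing local spatial context cheaply), and a pointwise (1×1) convolution then mixes information across channels:

```python
import numpy as np

def depthwise_conv(x, k):
    """Per-channel 'same'-padded convolution.
    x: (C, H, W) feature map; k: (C, kh, kw), one kernel per channel."""
    C, H, W = x.shape
    kh, kw = k.shape[1], k.shape[2]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))
    out = np.zeros((C, H, W))
    for c in range(C):              # each channel uses only its own kernel
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def pointwise_conv(x, w):
    """1x1 convolution: mix channels at every spatial location.
    x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def local_aggregation(x, dw_k, pw_w):
    """Depthwise followed by pointwise, the depthwise-separable pattern
    the Local Aggregation module is built from (sketch, not the paper's exact block)."""
    return pointwise_conv(depthwise_conv(x, dw_k), pw_w)
```

The efficiency argument for calling this "lightweight": a standard convolution mapping C_in to C_out channels with a kh×kw kernel needs C_out·C_in·kh·kw weights, while the separable version needs only C_in·kh·kw + C_out·C_in, which is far fewer for typical kernel sizes.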