Redesign Visual Transformer For Small Datasets

Scalable Computing-Practice and Experience (IF 0.9; Q4; COMPUTER SCIENCE, SOFTWARE ENGINEERING); Pub Date: 2022-12-01; DOI: 10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00077

Jingjie Wang, Xiang Wei, Siyang Lu, Mingquan Wang, Xiaoyu Liu, Wei Lu

Volume: 28 1; Pages: 401-408; Publication type: Journal Article

Citations: 0

Abstract

Self-attention has recently become a focus of visual feature extraction alongside convolution. Transformer networks built on self-attention have developed rapidly and achieved remarkable results on visual tasks, and self-attention shows the potential to replace convolution as the primary method of visual feature extraction in ubiquitous intelligence. Nevertheless, the Visual Transformer still suffers from two problems: a) the self-attention mechanism has a low inductive bias, which leads to a large data demand and a high training cost; b) the Transformer backbone network cannot adapt well to low visual information density and performs unsatisfactorily on low-resolution and small-scale datasets. To tackle these two problems, this paper proposes a novel algorithm based on the mature Visual Transformer architecture, dedicated to exploring the performance potential of the Transformer network and its core self-attention mechanism on small-scale datasets. Specifically, we first propose a network architecture equipped with a multi-coordination strategy to solve the self-attention degradation problem inherent in the existing Transformer architecture. Second, we introduce consistency regularization into the Transformer so that the self-attention mechanism acquires a more reliable feature representation ability when visual features are insufficient. In the experiments, CSwin Transformer, a mainstream visual model, is selected to verify the effectiveness of the proposed method on prevalent small datasets, and superior results are achieved. In particular, without pre-training, our accuracy on the CIFAR-100 dataset improves by 1.24% compared to CSwin.
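The abstract does not say which form of consistency regularization the authors use, so the following is only a generic sketch of the idea, not the paper's method. It assumes two augmented views of the same image produce two logit vectors, and it penalizes disagreement between the resulting class distributions with a symmetric KL divergence. The function names, the choice of KL, and the `weight` parameter are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(max(probs[label], 1e-12))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log(pi / max(qi, 1e-12))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def consistency_loss(logits_a: Sequence[float], logits_b: Sequence[float]) -> float:
    """Symmetric KL between the predictions for two views of one image.

    Assumption: symmetric KL is one common choice; the paper may use another.
    """
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def total_loss(
    logits_a: Sequence[float],
    logits_b: Sequence[float],
    label: int,
    weight: float = 1.0,  # hypothetical trade-off coefficient
) -> float:
    """Supervised loss on view A plus the weighted consistency term."""
    supervised = cross_entropy(softmax(logits_a), label)
    return supervised + weight * consistency_loss(logits_a, logits_b)


if __name__ == "__main__":
    view_a = [2.0, 0.5, -1.0]  # made-up logits for one augmented view
    view_b = [1.5, 0.7, -0.8]  # made-up logits for a second view
    print(consistency_loss(view_a, view_a))  # identical predictions give 0.0
    print(total_loss(view_a, view_b, label=0, weight=0.5))
```

The consistency term needs no label, which is why this kind of regularizer is often used when labeled data is scarce: it gives the model an extra training signal from every image regardless of how few annotations are available.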

Source journal

Scalable Computing-Practice and Experience (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore

2.00

Self-citation rate

0.00%

Articles published

Journal description: The area of scalable computing has matured and reached a point where new issues and trends require a professional forum. SCPE will provide this avenue by publishing original refereed papers that address the present as well as the future of parallel and distributed computing. The journal will focus on algorithm development, implementation and execution on real-world parallel architectures, and application of parallel and distributed computing to the solution of real-life problems.