Redesign Visual Transformer For Small Datasets

Jingjie Wang, Xiang Wei, Siyang Lu, Mingquan Wang, Xiaoyu Liu, Wei Lu
{"title":"Redesign Visual Transformer For Small Datasets","authors":"Jingjie Wang, Xiang Wei, Siyang Lu, Mingquan Wang, Xiaoyu Liu, Wei Lu","doi":"10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00077","DOIUrl":null,"url":null,"abstract":"Nowadays, the self-attention mechanism has become a resound of visual feature extraction along with convolution. The transformer network composed of self-attention has developed rapidly and made remarkable achievements in visual tasks. The self-attention shows the potential to replace convolution as the primary method of visual feature extraction in ubiquitous intelligence. Nevertheless, the development of the Visual Transformer still suffer from the following problems: a) The self-attention mechanism has a low inductive bias, which leads to large data demand and a high training cost. b) The Transformer backbone network cannot adapt well to the low visual information density and performs unsatisfactorily under low resolution and small-scale datasets. To tackle the abovementioned two problems, this paper proposes a novel algorithm based on the mature Visual Transformer architecture, which is dedicated to exploring the performance potential of the Transformer network and its kernel self-attention mechanism on small-scale datasets. Specifically, we first propose a network architecture equipped with multi-coordination strategy to solve the self-attention degradation problem inherent in the existing Transformer architecture. Secondly, we introduce consistent regularization into the Transformer to make the self-attention mechanism acquire more reliable feature representation ability in the case of insufficient visual features. In the experiments, CSwin Transformer, the mainstream visual model, is selected to verify the effectiveness of the proposed method on the prevalent small datasets, and superior results are achieved. In particular, without pre-training, our accuracy on the CIFAR-100 dataset is improved by 1.24% compared to CSwin.","PeriodicalId":43791,"journal":{"name":"Scalable Computing-Practice and Experience","volume":"28 1","pages":"401-408"},"PeriodicalIF":0.9000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Scalable Computing-Practice and Experience","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SmartWorld-UIC-ATC-ScalCom-DigitalTwin-PriComp-Metaverse56740.2022.00077","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
引用次数: 0

Abstract

Nowadays, the self-attention mechanism has become, alongside convolution, a prominent approach to visual feature extraction. Transformer networks built on self-attention have developed rapidly and achieved remarkable results on visual tasks, and self-attention shows the potential to replace convolution as the primary method of visual feature extraction in ubiquitous intelligence. Nevertheless, the development of the Visual Transformer still suffers from the following problems: a) the self-attention mechanism has a low inductive bias, which leads to high data demand and training cost; b) the Transformer backbone adapts poorly to low visual information density and performs unsatisfactorily on low-resolution, small-scale datasets. To tackle these two problems, this paper proposes a novel algorithm based on the mature Visual Transformer architecture, dedicated to exploring the performance potential of the Transformer network, and of its core self-attention mechanism, on small-scale datasets. Specifically, we first propose a network architecture equipped with a multi-coordination strategy to solve the self-attention degradation problem inherent in the existing Transformer architecture. Secondly, we introduce consistency regularization into the Transformer so that the self-attention mechanism acquires a more reliable feature representation when visual features are insufficient. In the experiments, CSwin Transformer, a mainstream visual model, is selected to verify the effectiveness of the proposed method on prevalent small datasets, and superior results are achieved. In particular, without pre-training, our accuracy on the CIFAR-100 dataset improves by 1.24% over CSwin.
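The abstract does not spell out the consistency-regularization objective. As a minimal, generic sketch of what such a term can look like for image classification (not the authors' exact formulation: the model, the two-view augmentation pipeline, and the weight `lambda_cons` are all assumptions introduced here for illustration), a symmetric KL term between predictions on two augmented views of the same image can be added to the supervised loss:

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_view1, x_view2, labels, lambda_cons=1.0):
    """Supervised cross-entropy on one augmented view plus a consistency
    term that pulls the predictions for the two views together.
    Generic sketch of consistency regularization, not the paper's method."""
    logits1 = model(x_view1)  # predictions for the first augmented view
    logits2 = model(x_view2)  # predictions for the second augmented view

    # Standard cross-entropy on the first view.
    ce = F.cross_entropy(logits1, labels)

    # Symmetric KL divergence between the two predictive distributions;
    # the target side is detached in each direction so each view is
    # regularized toward the other without double backpropagation.
    log_p1 = F.log_softmax(logits1, dim=-1)
    log_p2 = F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(log_p1, log_p2.detach().exp(), reduction="batchmean")
        + F.kl_div(log_p2, log_p1.detach().exp(), reduction="batchmean")
    )
    return ce + lambda_cons * kl
```

In the small-data regime the paper targets, a term of this kind supplies extra training signal by penalizing predictions that change under label-preserving augmentations, which is one way to steady self-attention when visual features are scarce.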
Source Journal

Scalable Computing-Practice and Experience (COMPUTER SCIENCE, SOFTWARE ENGINEERING)
CiteScore: 2.00
Self-citation rate: 0.00%
Articles per year: 10

About the journal: The area of scalable computing has matured and reached a point where new issues and trends require a professional forum. SCPE will provide this avenue by publishing original refereed papers that address the present as well as the future of parallel and distributed computing. The journal will focus on algorithm development, implementation and execution on real-world parallel architectures, and application of parallel and distributed computing to the solution of real-life problems.