S-ViT: Sparse Vision Transformer for Accurate Face Recognition

Applied Computing Review (COMPUTER SCIENCE, INFORMATION SYSTEMS; IF 0.4, Q4) · Pub Date: 2023-03-27 · DOI: 10.1145/3555776.3577640
Geunsu Kim, Gyudo Park, Soohyeok Kang, Simon S. Woo
{"title":"S-ViT: Sparse Vision Transformer for Accurate Face Recognition","authors":"Geunsu Kim, Gyudo Park, Soohyeok Kang, Simon S. Woo","doi":"10.1145/3555776.3577640","DOIUrl":null,"url":null,"abstract":"Most of the existing face recognition applications using deep learning models have leveraged CNN-based architectures as the feature extractor. However, recent studies have shown that in computer vision tasks, vision transformer-based models often outperform CNN-based models. Therefore, in this work, we propose a Sparse Vision Transformer (S-ViT) based on the Vision Transformer (ViT) architecture to improve the face recognition tasks. After the model is trained, S-ViT tends to have a sparse distribution of weights compared to ViT, so we named it according to these characteristics. Unlike the conventional ViT, our proposed S-ViT adopts image Relative Positional Encoding (iRPE) method for positional encoding. Also, S-ViT has been modified so that all token embeddings, not just class token, participate in the decoding process. Through extensive experiment, we showed that S-ViT achieves better performance in closed-set than the other baseline models, and showed better performance than the baseline ViT-based models. For example, when using ArcFace as the loss function in the identification protocol, S-ViT achieved up to 3.27% higher accuracy than ResNet50. We also show that the use of ArcFace loss functions yields greater performance gains in S-ViT than in baseline models. In addition, S-ViT has an advantage in cost-performance trade-off because it tends to be more robust to the pruning technique than the underlying model, ViT. Therefore, S-ViT offers the additional advantage, which can be applied more flexibly in the target devices with limited resources.","PeriodicalId":42971,"journal":{"name":"Applied Computing Review","volume":null,"pages":null},"PeriodicalIF":0.4000,"publicationDate":"2023-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Computing Review","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555776.3577640","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Most existing face recognition applications that use deep learning rely on CNN-based architectures as the feature extractor. However, recent studies have shown that vision transformer-based models often outperform CNN-based models on computer vision tasks. Therefore, in this work, we propose a Sparse Vision Transformer (S-ViT), built on the Vision Transformer (ViT) architecture, to improve face recognition. After training, S-ViT tends to have a sparser weight distribution than ViT, and it is named after this characteristic. Unlike the conventional ViT, our proposed S-ViT adopts the image Relative Positional Encoding (iRPE) method for positional encoding. In addition, S-ViT is modified so that all token embeddings, not just the class token, participate in the decoding process. Through extensive experiments, we show that S-ViT achieves better closed-set performance than the other baseline models, including the baseline ViT-based models. For example, when ArcFace is used as the loss function in the identification protocol, S-ViT achieves up to 3.27% higher accuracy than ResNet50. We also show that the ArcFace loss yields larger performance gains for S-ViT than for the baseline models. Finally, S-ViT offers a better cost-performance trade-off because it is more robust to pruning than the underlying ViT, so it can be deployed more flexibly on target devices with limited resources.
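To make two of the ideas in the abstract concrete, the PyTorch sketch below illustrates (1) a recognition head that pools all token embeddings rather than reading only the class token, and (2) the standard ArcFace (additive angular margin) loss. This is a minimal illustration under stated assumptions: the mean-pooling head, layer sizes, and hyperparameters (`embed_dim`, `scale`, `margin`) are hypothetical choices for exposition, and the paper's actual decoding scheme and iRPE attention are not reproduced here.

```python
# Minimal sketch (not the authors' implementation):
# (1) AllTokenHead approximates "all token embeddings participate in decoding"
#     with mean pooling over tokens, an assumption made for illustration.
# (2) ArcFaceLoss is the standard additive angular margin softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AllTokenHead(nn.Module):
    """Pools every encoder token into one face descriptor instead of
    reading only the class token."""

    def __init__(self, dim: int, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, embed_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) -- class token plus patch tokens
        pooled = tokens.mean(dim=1)          # use all tokens, not tokens[:, 0]
        return F.normalize(self.proj(pooled), dim=-1)


class ArcFaceLoss(nn.Module):
    """Additive angular margin softmax: scale * cos(theta_y + m) for the
    target class, scale * cos(theta_j) for all other classes."""

    def __init__(self, embed_dim: int, num_classes: int,
                 scale: float = 64.0, margin: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale, self.margin = scale, margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class centres.
        cosine = F.linear(embeddings, F.normalize(self.weight, dim=-1))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the ground-truth class logits.
        one_hot = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(one_hot, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)
```

For the pruning robustness mentioned at the end of the abstract, a generic magnitude-pruning utility such as `torch.nn.utils.prune.l1_unstructured` could be applied to the trained weights to test how accuracy degrades at different sparsity levels; the abstract does not specify which pruning technique the authors used.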