Transient Fault Detection in Tensor Cores for Modern GPUs

ACS Applied Bio Materials (IF 4.7, Q2, Materials Science, Biomaterials) Pub Date: 2024-08-10 DOI: 10.1145/3687483
M. Hafezan, E. Atoofian
Citation count: 0

Abstract

Deep neural networks (DNNs) have emerged as an effective solution for many machine learning applications, but their success comes at the cost of heavy computation. NVIDIA's Volta graphics processing unit (GPU) introduced a specialized hardware unit, the tensor core (TC), to meet the growing computational demand of DNNs. Most previous studies on TCs have focused on improving performance by exploiting the TCs' high degree of parallelism. However, as DNNs are deployed in security-sensitive applications such as autonomous driving, the reliability of TCs is as important as their performance. In this work, we exploit the unique architectural characteristics of TCs and propose a simple, implementation-efficient hardware technique called fault detection in tensor core (FDTC) to detect transient faults in TCs. In particular, FDTC exploits the zero-valued weights that stem from network pruning, as well as the sparse activations produced by the common ReLU operator, to verify tensor operations. A high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, leading to zero performance penalty. For applications with a low sparsity rate, FDTC relies on temporal redundancy to re-execute effectual products. FDTC schedules the execution of verifying products only when multipliers are idle. Our experimental results show that FDTC offers 100% fault coverage with no performance penalty and a small energy overhead in TCs.
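The idea of verifying products on idle multipliers can be illustrated with a toy software model. The sketch below is not the authors' hardware design: the function names (`checked_dot`, `make_transient_fault`), the single-bit fault model, and the one-lane-per-operand-pair layout are all assumptions made for illustration. It treats a product as effectual when both operands are nonzero, counts lanes with a zero operand as idle, recomputes each effectual product as a check, and reports how many checks would not fit on idle lanes in the same step and would therefore fall back to temporal redundancy.

```python
from typing import Callable, Sequence, Tuple

Multiplier = Callable[[int, int], int]


def exact_multiply(a: int, b: int) -> int:
    """Fault-free multiplier used as the default for both lanes."""
    return a * b


def make_transient_fault(fail_on_call: int) -> Multiplier:
    """Hypothetical fault model: flip the low bit of one product, once."""
    calls = 0

    def multiply(a: int, b: int) -> int:
        nonlocal calls
        result = a * b
        if calls == fail_on_call:
            result ^= 1  # single-bit upset on this call only
        calls += 1
        return result

    return multiply


def checked_dot(
    weights: Sequence[int],
    activations: Sequence[int],
    multiply: Multiplier = exact_multiply,
    verify_multiply: Multiplier = exact_multiply,
) -> Tuple[int, bool, int]:
    """Return (dot product, fault_detected, deferred_checks).

    deferred_checks counts effectual products that exceed the number of
    idle lanes, i.e. checks that cannot run alongside the original products.
    """
    if len(weights) != len(activations):
        raise ValueError("operand vectors must have the same length")

    # A lane is effectual only if both of its operands are nonzero.
    effectual = [
        i for i, (w, x) in enumerate(zip(weights, activations))
        if w != 0 and x != 0
    ]
    idle_lanes = len(weights) - len(effectual)

    total = 0
    fault_detected = False
    for i in effectual:
        product = multiply(weights[i], activations[i])
        check = verify_multiply(weights[i], activations[i])  # redundant copy
        if product != check:
            fault_detected = True
        total += product

    deferred_checks = max(0, len(effectual) - idle_lanes)
    return total, fault_detected, deferred_checks


if __name__ == "__main__":
    # Sparse operands: one effectual product, three idle lanes.
    print(checked_dot([0, 3, 0, 0], [5, 2, 0, 7]))  # (6, False, 0)
    # Dense operands: no idle lanes, so all four checks are deferred.
    print(checked_dot([1, 2, 3, 4], [1, 1, 1, 1]))  # (10, False, 4)
    # Inject a one-off fault into the original multiplier.
    flaky = make_transient_fault(fail_on_call=0)
    print(checked_dot([0, 3, 0, 0], [5, 2, 0, 7], multiply=flaky))  # (7, True, 0)
```

In this model a mismatch between the original and verifying product flags the fault, and the deferred count stands in for the work that the abstract assigns to temporal redundancy. How the real hardware pairs products with idle multipliers is not specified in the abstract, so the lane accounting here should be read as a simplification.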