Evaluating GPU's Instruction-Level Error Characteristics Under Low Supply Voltages

Impact Factor: 3.8 | CAS Zone 2 (Computer Science) | JCR Q2 (COMPUTER SCIENCE, HARDWARE & ARCHITECTURE) | IEEE Transactions on Computers | Pub Date: 2024-11-18 | DOI: 10.1109/TC.2024.3500366
Jingweijia Tan;Jiashuo Wang;Kaige Yan;Xiaohui Wei;Xin Fu
IEEE Transactions on Computers, vol. 74, no. 2, pp. 555-568. Journal article, not open access. https://ieeexplore.ieee.org/document/10756742/
Citations: 0

Abstract

Supply voltage underscaling is an effective approach to improving the energy efficiency of modern high-performance processors such as GPUs. However, energy efficiency and reliability trade off against each other. Undervolting inevitably undermines reliability, since it cuts into the voltage guardbands that chip manufacturers design to ensure correct operation under worst-case scenarios. To achieve optimal energy efficiency while maintaining sufficient reliability, it is necessary to understand in depth the error characteristics caused by undervolting. Unlike previous work, which focuses mostly on the program level, we perform the first comprehensive instruction-level evaluation of voltage margins and error characteristics for GPU architectures. We systematically measure the error probability and error patterns of GPU instructions during undervolting. We then analyze the impact of location (SMs, threads, and bits) and operand data values on the error characteristics. Based on our observations, we reduce the voltage to the minimum safe limit for different instructions, which achieves an 18.37% energy saving, and we further propose an error detection strategy that reduces the performance and energy overhead by 14.8% with a negligible 0.01% degradation in error detection rate.
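To make the measurements named in the abstract concrete, the following is a minimal sketch of how an error probability and a per-bit error pattern could be computed for one instruction. It is not the authors' tooling: the function name `error_stats`, the 32-bit result width, and the sample values are all assumptions made for illustration, and the "observed" outputs are synthetic rather than measured on hardware.

```python
from collections import Counter


def error_stats(golden, observed, width=32):
    """Compare undervolted outputs against golden (nominal-voltage) outputs.

    Hypothetical helper: returns the fraction of faulty results and a
    mapping from bit position to the number of flips seen at that position.
    """
    if len(golden) != len(observed):
        raise ValueError("golden and observed must have the same length")
    mask = (1 << width) - 1
    bit_flips = Counter()
    faulty = 0
    for g, o in zip(golden, observed):
        diff = (g ^ o) & mask  # set bits mark positions that differ
        if diff:
            faulty += 1
            for bit in range(width):
                if (diff >> bit) & 1:
                    bit_flips[bit] += 1
    probability = faulty / len(golden) if golden else 0.0
    return probability, dict(bit_flips)


# Synthetic example data (assumed, not measured).
golden = [0x0000000F, 0x000000FF, 0x12345678, 0xFFFFFFFF]
observed = [0x0000000F, 0x800000FF, 0x12345679, 0x7FFFFFFF]
prob, flips = error_stats(golden, observed)
print(prob, flips)
```

The same comparison could be grouped by SM or thread index to study location effects, by keeping one counter per location instead of a single global one.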
Source journal: IEEE Transactions on Computers (category: Engineering, Electrical & Electronic)
CiteScore: 6.60
Self-citation rate: 5.40%
Articles published: 199
Review time: 6.0 months
About the journal: The IEEE Transactions on Computers is a monthly publication with a wide distribution to researchers, developers, technical managers, and educators in the computer field. It publishes papers on research in areas of current interest to the readers. These areas include, but are not limited to, the following: a) computer organizations and architectures; b) operating systems, software systems, and communication protocols; c) real-time systems and embedded systems; d) digital devices, computer components, and interconnection networks; e) specification, design, prototyping, and testing methods and tools; f) performance, fault tolerance, reliability, security, and testability; g) case studies and experimental and theoretical evaluations; and h) new and important applications and trends.