Enhancing Fault Tolerance in High-Performance Computing: A Real Hardware Case Study on a RISC-V Vector Processing Unit

Marcello Barbirotta;Francesco Minervini;Carlos Rojas Morales;Adrian Cristal;Osman Unsal;Mauro Olivieri
IEEE Open Journal of the Computer Society, vol. 5, pp. 553-565, published 2024-09-25
DOI: 10.1109/OJCS.2024.3468895
Full text: https://ieeexplore.ieee.org/document/10694791/

Abstract

High-Performance Computing (HPC) systems are designed for large-scale processing and complex dataset analysis leveraging scalability, efficiency, and parallelism, often integrating specialized hardware structures such as Vector Processing Units (VPUs). As these systems have grown in complexity and scale, their vulnerability to errors and failures has become an important and complex issue in the HPC world. Our research addresses this challenge by exploring and implementing advanced fault tolerance techniques inside the Vitruvius+ architecture, a partial out-of-order Vector Processing Unit. To the best of our knowledge, this is the first full RTL-level implementation of instruction replication in an HPC-class vector processor for reliability. Specifically, we investigate the integration and interaction of redundancy mechanisms inside the most sensitive architectural units, obtaining a reduction of 75% in non-silent faults causing system failure, proven by an extensive fault injection simulation campaign, with a hardware overhead of only 7.5% and a negligible variation in clock frequency.
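
The abstract names two ideas, instruction replication and a fault injection campaign, that a toy software model can illustrate. The sketch below is not the Vitruvius+ RTL and does not reproduce the paper's figures. It assumes a hypothetical 8-lane, 32-bit vector add, a single transient bit flip per trial, and a fault that corrupts only one of the two replicated executions. All names (`vadd`, `replicated_vadd`, `campaign`, `LANES`, `WIDTH`) are illustrative assumptions.

```python
import random

LANES = 8                  # hypothetical number of vector elements
WIDTH = 32                 # hypothetical element width in bits
MASK = (1 << WIDTH) - 1


def vadd(a, b, flip=None):
    """Element-wise wrapping add. If flip=(lane, bit) is given, one bit of
    one result element is inverted to model a transient fault."""
    out = [(x + y) & MASK for x, y in zip(a, b)]
    if flip is not None:
        lane, bit = flip
        out[lane] ^= 1 << bit
    return out


def replicated_vadd(a, b, flip=None):
    """Execute the same operation twice and compare the results.
    Assumption: the fault affects only the first execution."""
    first = vadd(a, b, flip)
    second = vadd(a, b)
    return first, first != second   # (result, mismatch_detected)


def campaign(trials, protected, seed=0):
    """Inject one random bit flip per trial and count corrupted results
    that go unnoticed."""
    rng = random.Random(seed)
    undetected = 0
    for _ in range(trials):
        a = [rng.randrange(MASK + 1) for _ in range(LANES)]
        b = [rng.randrange(MASK + 1) for _ in range(LANES)]
        flip = (rng.randrange(LANES), rng.randrange(WIDTH))
        golden = vadd(a, b)          # fault-free reference
        if protected:
            result, detected = replicated_vadd(a, b, flip)
        else:
            result, detected = vadd(a, b, flip), False
        if result != golden and not detected:
            undetected += 1
    return undetected


if __name__ == "__main__":
    print("unprotected, undetected corruptions:", campaign(1000, protected=False))
    print("replicated,  undetected corruptions:", campaign(1000, protected=True))
```

Under this simplified fault model every injected flip is caught by the comparison. A real design will not reach that, since faults can also strike logic outside the replicated path, which is consistent with the abstract reporting a 75% reduction rather than full coverage.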