Failure modes and failure mitigation in GPGPUs: a reference model and its application
Francesco Terrosi, A. Ceccarelli, A. Bondavalli
2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), 2022-06-01
DOI: https://doi.org/10.1109/COMPSAC54236.2022.00018
Abstract
General Purpose GPUs (GPGPUs) are highly susceptible to both transient and permanent faults. This is a serious concern for their safe and reliable use in many domains, from autonomous driving to High Performance Computing. The research and industrial communities have responded vigorously to this issue by analyzing the impact of failures and devising failure mitigation strategies. This has led to the definition of several failure modes and mitigation approaches. Unfortunately, these often rest on different foundations, and it is not easy to position them in a consistent view. This work elaborates a GPGPU failure model that identifies the relations between GPGPU failure modes and components, and then analyzes the mitigations proposed in the literature. By proposing a unified view of failures and mitigations, the resulting model i) positions each research contribution on the subject, ii) makes current gaps easy to identify, and iii) sets the basis for further research on GPGPU failures.