Nathan Debardeleben, S. Blanchard, D. Kaeli, P. Rech
Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design
2015 IEEE 33rd VLSI Test Symposium (VTS), published 2015-04-27
DOI: 10.1109/VTS.2015.7116295
Citations: 2
Abstract
Reliability is a concern for the designers, producers, and users of today's large-scale computing systems. As we approach exascale, the resilience challenge will become critical because of the increase in system scale. It is therefore fundamental to understand the nature of errors, evaluate their probability of occurrence, and improve designs to reduce their impact on the overall system. In this paper we present experimental, field, and analytical data to characterize and quantify errors on accelerators, providing a thorough understanding of the impact of errors on today's and future large-scale systems.