Evaluating the Resilience of Parallel Applications
Mark Wilkening, Fritz G. Previlon, D. Kaeli, S. Gurumurthi, Steven E. Raasch, Vilas Sridharan
2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), published 2018-10-01
DOI: 10.1109/DFT.2018.8602987 (https://doi.org/10.1109/DFT.2018.8602987)
Citations: 4
Abstract
Reliability is a significant design constraint for supercomputers and large-scale data centers. Modeling the effects of faults on applications that target such systems allows system architects and software designers to provision resilience features that improve the fidelity of results and reduce runtimes. In this paper, we propose mechanisms that improve existing techniques for modeling the effect of transient faults on realistic applications. First, we extend the existing Program Vulnerability Factor (PVF) metric to model multi-threaded applications. Then we demonstrate how to measure the multi-threaded PVF of an application in simulation, and we introduce the ability to account for software detection of hardware faults, differentiating faults that cause detected, uncorrected errors (DUE) from faults that cause silent data corruption (SDC).