G. Georgakoudis, I. Laguna, H. Vandierendonck, Dimitrios S. Nikolopoulos, M. Schulz
SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), published 2019-05-20
DOI: 10.1109/IPDPS.2019.00097 (https://doi.org/10.1109/IPDPS.2019.00097)
Citations: 7
Abstract
Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique for understanding the impact of faults on scientific applications. However, injecting faults into parallel applications has been prohibitively slow, inaccurate, and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can differ significantly depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is multiple factors faster than, and equally accurate as, state-of-the-art dynamic binary instrumentation tools for fault injection.
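For readers unfamiliar with the technique, the following is a minimal sketch of the general idea behind fault injection. It is not SAFIRE's compiler-based implementation, which the abstract does not detail. It assumes a single-bit-flip fault model applied to one intermediate value of a small computation, and compares the faulty result against a fault-free run. The names `flip_bit` and `dot`, and the parameters `fault_at` and `bit`, are hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of value's IEEE 754 double encoding."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 64)")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-bit mask, then reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def dot(xs, ys, fault_at=None, bit=0):
    """Dot product that optionally corrupts the running sum.

    If fault_at is an iteration index, one bit of the accumulator is
    flipped right after that iteration (a hypothetical injection point).
    """
    acc = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        acc += x * y
        if i == fault_at:
            acc = flip_bit(acc, bit)
    return acc


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
    golden = dot(xs, ys)                      # fault-free reference run
    faulty = dot(xs, ys, fault_at=1, bit=52)  # flip an exponent bit mid-run
    print("golden:", golden)
    print("faulty:", faulty)
    print("output corrupted:", faulty != golden)
```

A real campaign repeats such runs many times with randomly chosen injection points and bits, which is why the speed of the injection mechanism matters for the scale of study the abstract describes.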