Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645942
Evan Coleman, Erik J. Jensen, M. Sosonkina
This study seeks to understand the soft-error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. The implementations use hybrid parallelism: the computational work is distributed over multiple nodes with MPI and parallelized on each node with OpenMP. A series of experiments measures the impact of an undetected soft fault on an asynchronous iterative method, and compares and contrasts several techniques for simulating the occurrence of a fault and then recovering from its effects. The data shows that the two numerical soft-fault models tested here produce sufficiently adverse behavior more consistently than a "bit-flip" model does, making them better suited to exercising a variety of recovery strategies, such as those based on partial checkpointing.
Title: "Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods" | Venue: 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
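As an illustration of the fault models this abstract contrasts, the sketch below injects either a random bit flip or a simple numerical perturbation into one component of a sequential Jacobi iteration. This is a minimal, hypothetical example, not the paper's MPI/OpenMP implementation, and the `1e3` scaling is a stand-in for whatever numerical soft-fault models the authors actually evaluated.

```python
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of an IEEE-754 double: the classic 'bit-flip' soft-fault model."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def jacobi_step(A, b, x):
    """One Jacobi sweep: x_i <- (b_i - sum_{j != i} A_ij * x_j) / A_ii."""
    n = len(x)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def jacobi_with_fault(A, b, x0, iters, fault_iter, model="bitflip"):
    """Run Jacobi, injecting one undetected soft fault at iteration `fault_iter`
    (pass a negative fault_iter for a fault-free run)."""
    x = list(x0)
    for k in range(iters):
        x = jacobi_step(A, b, x)
        if k == fault_iter:
            i = random.randrange(len(x))
            if model == "bitflip":
                x[i] = flip_bit(x[i], random.randrange(64))
            else:
                # Stand-in numerical soft-fault model: scale one entry by 1e3.
                x[i] *= 1e3
    return x
```

A bit flip in a low mantissa bit is nearly harmless, while one in an exponent or sign bit can derail the iteration, which is the inconsistency the abstract attributes to the bit-flip model.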
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645891
M. Radulovic, Kazi Asifuzzaman, D. Zivanovic, Nikola Rajovic, G. C. D. Verdière, D. Pleiter, M. Marazakis, Nikolaos D. Kallimanis, P. Carpenter, Petar Radojkovic, E. Ayguadé
Various servers with different characteristics and architectures are hitting the market, and evaluating and comparing them in terms of HPC features is complex and multidimensional. In this paper, we share our experience of evaluating a diverse set of HPC systems, consisting of three mainstream and five emerging architectures. We evaluate performance and power efficiency using the prominent HPC benchmarks High-Performance Linpack (HPL) and High Performance Conjugate Gradients (HPCG), and expand our analysis using publicly available specialized kernel benchmarks that target specific system components. In addition to a large body of quantitative results, we emphasize six usually overlooked aspects of HPC platform evaluation and share our conclusions and lessons learned. Overall, we believe that this paper will improve the evaluation and comparison of HPC platforms, taking a first step towards a more reliable and uniform methodology.
Title: "Mainstream vs. Emerging HPC: Metrics, Trade-Offs and Lessons Learned"
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645869
J. Dokulil, S. Benkner
Task-based runtime systems are considered one of the options for dealing with the challenges of upcoming parallel architectures. The greater flexibility of these runtime systems can also be used to dynamically adjust the resources allocated to applications, adapting to the current load of the system and the progress of the applications. In our work, we have extended our implementation of the Open Community Runtime to support dynamic adjustment of the number of execution threads. The runtimes communicate with an agent process, which collects performance data, computes the thread allocation, and instructs the runtimes to make the required adjustments. We have tested our solution under different scenarios, focusing on producer-consumer applications, where dynamic resource management was used to keep the applications in sync, improving overall performance in some cases.
Title: "Adaptive Scheduling of Collocated Applications Using a Task-Based Runtime System"
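The agent's thread-allocation step could, in spirit, look like the proportional heuristic below: give more of a fixed thread budget to the collocated application that is falling behind, so a producer and a consumer stay in sync. The inverse-progress weighting is an assumption for illustration, not the paper's actual policy.

```python
def allocate_threads(total_threads, progress_rates, min_threads=1):
    """Split a fixed thread budget among collocated applications,
    weighting each one by the inverse of its measured progress rate
    (slower apps get more threads)."""
    weights = [1.0 / max(r, 1e-9) for r in progress_rates]
    total_w = sum(weights)
    alloc = [max(min_threads, round(total_threads * w / total_w)) for w in weights]
    # Repair rounding drift so the allocations sum exactly to the budget.
    while sum(alloc) > total_threads:
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < total_threads:
        alloc[alloc.index(min(alloc))] += 1
    return alloc
```

An agent would recompute this periodically from the collected performance data and push the new per-runtime thread counts.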
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645951
Pedro Caldeira, J. Penha, L. Bragança, Ricardo Ferreira, J. Nacif, R. Ferreira, Fernando Magno Quintão Pereira
Recent years have seen a surge in the popularity of Field-Programmable Gate Arrays (FPGAs). Programmers can use them to develop high-performance systems that are efficient not only in time but also in energy. Yet programming FPGAs remains a difficult task. Even though OpenCL interfaces to synthesize such hardware exist today, higher-level programming languages such as Java, C#, or Python remain distant from them. In this paper, we describe a compiler, and its supporting runtime environment, that reduces this distance by translating functional code written in Java to the Intel HARP platform. We bring two contributions. First, the insight that a functional-style library is a good starting point to bridge the gap between high-level programming idioms and FPGAs. Second, the implementation of the system itself, including the compiler, its intermediate representation, and all the runtime support necessary to shield developers from the task of transferring data back and forth between the host CPU and the accelerator. To demonstrate the effectiveness of our system, we have used it to implement different benchmarks drawn from image processing and data mining. For large inputs, we observe consistent 20x speedups over the Java Virtual Machine across all our benchmarks. Depending on the target function that we compile, this speedup can reach 280x.
Title: "From Java to FPGA: An Experience with the Intel HARP System"
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645945
Georgios Kornaros, M. Coppola
In addition to GPUs, which see increasingly widespread use for general-purpose computing, special-purpose accelerators are widely used for their high efficiency and low power consumption; attached to general-purpose CPUs, they form Heterogeneous System Architectures (HSAs). This paper presents a new communication model for heterogeneous computing that uses a unified memory space for CPUs and accelerators and removes the need for virtual-to-physical address translation through an I/O Memory Management Unit (IOMMU), thus strengthening the adoption of HSAs in SoCs that lack an IOMMU but still represent a large share of real products. By exploiting user-level queuing, workload dispatching to specialized hardware accelerators avoids the drawbacks of copying objects through operating-system calls. Additionally, dispatching is structured around fixed-size packet management that specialized hardware logic accelerates. To also eliminate IOMMU performance loss and IOMMU management complexity, we propose direct placement of accelerator data in contiguous system memory, where the dispatcher provides transparent access to the accelerators while offering the application a simple abstraction at the programming layer. We demonstrate dispatching rates that exceed ten thousand jobs per second, implementing the architectural support on a low-cost embedded System-on-Chip and bounded only by the computing capacity of the hardware accelerators.
Title: "Enabling Efficient Job Dispatching in Accelerator-Extended Heterogeneous Systems with Unified Address Space"
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645860
Kazumasa Sakivama, S. Kato, Y. Ishikawa, A. Hori, Abraham Monrroy
Convolutional neural networks (CNNs) have achieved outstanding accuracy compared with conventional machine learning algorithms. Recent works have shown that large, complicated models, which are costly to train, are needed to reach higher accuracy. To train these models efficiently on high-performance computing (HPC) systems, many parallelization techniques for CNNs have been developed. However, most techniques mainly target GPUs, and parallelization for CPUs has not been fully investigated. This paper explores CNN training performance on large-scale multicore clusters by optimizing intra-node processing and applying inter-node parallelization techniques designed for multiple GPUs. Detailed experiments conducted on state-of-the-art multi-core processors using the OpenMP API and the MPI framework demonstrate that Caffe-based CNNs can be accelerated by well-designed multithreaded programs. We achieve at most a 1.64x speedup in convolution operations with our devised lowering strategy compared to conventional lowering, and a 772x speedup with 864 nodes compared to one node.
Title: "Deep Learning on Large-Scale Muticore Clusters"
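The "conventional lowering" this abstract compares against commonly denotes im2col-style unfolding, which turns convolution into a single matrix multiply (GEMM). The paper's devised strategy differs, but the conventional baseline can be sketched as follows (a 2-D, single-channel toy, not the paper's code):

```python
import numpy as np

def im2col(x, kh, kw):
    """Conventional lowering: unfold every kh x kw patch of a 2-D input
    into one column, so convolution becomes a matrix multiply."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_lowered(x, k):
    """Convolution (cross-correlation, as in CNN frameworks) via lowering:
    flatten the kernel and multiply it against the lowered matrix."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    return (k.ravel() @ im2col(x, kh, kw)).reshape(oh, ow)
```

The duplication of overlapping patches is what makes lowering memory-hungry, which is why alternative lowering strategies matter for cache-constrained CPU training.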
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645923
K. Peou, A. Kelly, J. Falcou, Cécile Germain
In this work, we study the numerical performance of various common algorithms used to calculate the average of an array of half-precision (FP16) floating point values. While the current generation of CPUs does not support native FP16 arithmetic, it is a planned feature in a number of next-generation CPUs; FP16 arithmetic was therefore emulated via the "half" software library. Due to the limitations of the FP16 data type, some algorithms proved insufficiently accurate for arrays as small as 100 elements. We propose an algorithm that allows numerically stable FP16 computation of the average and compare it to the naive single-precision (FP32) algorithm in terms of both numerical precision and runtime performance. We find that our algorithm offers robustness, numerical precision, and SIMD performance comparable to the higher-precision computation.
Title: "A Case Study on Optimizing Accurate Half Precision Average"
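The failure mode that makes naive FP16 averaging break down even for small arrays is accumulator overflow: the running sum exceeds the FP16 maximum (about 65504) long before the array ends. An incremental (running) mean keeps the accumulator near the data range and avoids this. A sketch using NumPy's float16 as a stand-in for the "half" library (illustrating the problem, not reproducing the paper's algorithm):

```python
import numpy as np

def naive_mean_fp16(xs):
    """Accumulate a plain sum in FP16: overflows to inf once the running
    sum passes ~65504, even when every element is small."""
    s = np.float16(0.0)
    for x in xs:
        s = np.float16(s + np.float16(x))
    return s / np.float16(len(xs))

def running_mean_fp16(xs):
    """Incremental mean: m_k = m_{k-1} + (x_k - m_{k-1}) / k.
    The accumulator never strays far from the data range, so it cannot overflow."""
    m = np.float16(0.0)
    for k, x in enumerate(xs, start=1):
        m = np.float16(m + (np.float16(x) - m) / np.float16(k))
    return m
```

One hundred elements of value 1000 are enough to push the naive sum past the FP16 range, matching the abstract's observation about arrays as small as 100 elements.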
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645857
Shirin Tavara, Alexander Schliep
The Alternating Direction Method of Multipliers (ADMM) is one of the promising frameworks for training Support Vector Machines (SVMs) on large-scale data in a distributed manner. In a consensus-based ADMM, nodes may only communicate with one-hop neighbors, which can cause slow convergence. In this paper, we investigate the impact of network topology on the convergence speed of ADMM-based SVMs using expander graphs. In particular, we investigate how much the expansion property of the network influences convergence and which topologies are preferable. In addition, we supply an implementation that makes these theoretical advances practically available. The experimental results show that graphs with large spectral gaps and higher degrees exhibit accelerated convergence.
Title: "Effect of Network Topology on the Performance of ADMM-Based SVMs"
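The spectral gap this abstract correlates with convergence speed can be computed directly from a graph's adjacency spectrum. A small sketch, assuming d-regular graphs and using the gap between the degree and the second-largest adjacency eigenvalue, contrasts a poor expander (a ring) with the best-connected case (a complete graph):

```python
import numpy as np

def spectral_gap(adj):
    """Spectral gap of a d-regular graph: d minus the second-largest
    eigenvalue of the adjacency matrix. Larger gap = better expansion."""
    A = np.asarray(adj, dtype=float)
    d = A.sum(axis=1)[0]                 # degree (assumes a regular graph)
    eigs = np.linalg.eigvalsh(A)         # ascending order
    return d - eigs[-2]

def ring(n):
    """2-regular ring: a poor expander (gap shrinks as n grows)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0
    return A

def complete(n):
    """Complete graph: maximal expansion at degree n - 1."""
    return np.ones((n, n)) - np.eye(n)
```

In a consensus ADMM, each iteration only mixes information across one-hop edges, so a topology with a small spectral gap (like the ring) propagates consensus slowly.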
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645894
Alberto Miranda, Ramon Nou, Toni Cortes
The growth in data-intensive scientific applications poses strong demands on the HPC storage subsystem, as data needs to be copied from compute nodes to I/O nodes and vice versa for jobs to run. The emerging trend of adding denser, NVM-based burst buffers to compute nodes, however, offers the possibility of using these resources to build temporary file systems with specific I/O optimizations for a batch job. In this work, we present echofs, a temporary filesystem that coordinates with the job scheduler to preload a job's input files into node-local burst buffers. We present the results measured with NVM emulation, and different FS backends with DAX/FUSE on a local node, to show the benefits of our proposal and such coordination.
Title: "ECHOFS: A Scheduler-Guided Temporary Filesystem to Leverage Node-Local NVMS"
Pub Date: 2018-09-01 | DOI: 10.1109/CAHPC.2018.8645856
Daniel Oliveira, Francis B. Moreira, P. Rech, P. Navaux
The error rate of current High Performance Computing (HPC) systems is already on the order of one error every few dozen hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers: knowing this behavior, one can select efficient mitigation techniques for an application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the application of a machine learning model to predict the reliability behavior of HPC applications. We inject faults into more than 30 HPC applications executing on the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVMs). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7% for certain classes of algorithms, such as linear algebra and sorting; the average relative error across all algorithm classes is 22%. Such a fast and straightforward prediction model can be effective as a filter that selects the most unreliable applications for in-depth analysis.
Title: "Predicting the Reliability Behavior of HPC Applications"
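The Program Vulnerability Factor the model predicts is, in essence, the fraction of injected faults that corrupt a program's output. A toy Monte-Carlo estimator, corrupting one input value per run rather than performing the paper's application-level fault injection, can be sketched as:

```python
import random
import struct

def flip_random_bit(x: float) -> float:
    """Flip one random bit of an IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << random.randrange(64))))
    return y

def estimate_pvf(workload, inputs, trials=200, tol=1e-6, seed=0):
    """Estimate the PVF of `workload` as the fraction of injected faults
    whose output deviates from the fault-free ('golden') run beyond `tol`.
    NaN/inf outputs count as corrupted, since the comparison fails."""
    random.seed(seed)
    golden = workload(list(inputs))
    corrupted = 0
    for _ in range(trials):
        faulty = list(inputs)
        faulty[random.randrange(len(faulty))] = flip_random_bit(
            faulty[random.randrange(len(faulty))])
        if not (abs(workload(faulty) - golden) <= tol):
            corrupted += 1
    return corrupted / trials
```

Most low-mantissa flips leave the output within tolerance while exponent and sign flips do not, so the estimate typically lands strictly between 0 and 1; a trained model that predicts this fraction from cheap profiling features can replace most of the expensive injection campaigns.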