Pub Date: 2020-11-01 DOI: 10.1109/sc41405.2020.00002
{"title":"[Copyright notice]","doi":"10.1109/sc41405.2020.00002","DOIUrl":"https://doi.org/10.1109/sc41405.2020.00002","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129515017","RegionNum":0,"Total":0}
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00091
Süreyya Emre Kurt, Aravind Sukumaran-Rajam, F. Rastello, P. Sadayappan
Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix-Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state of the art.
{"title":"Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures","authors":"Süreyya Emre Kurt, Aravind Sukumaran-Rajam, F. Rastello, P. Sadayappan","doi":"10.1109/SC41405.2020.00091","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00091","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129704610","RegionNum":0,"Total":0}
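For readers unfamiliar with the two kernels named in the abstract above, the following is a minimal, untiled reference sketch of SpMM and SDDMM over a CSR-format sparse matrix. It only illustrates what the kernels compute; it is not the paper's tiled implementation or its signature-based cost model. The function names and the CSR argument names (`indptr`, `indices`, `data`) are assumptions chosen for this illustration.

```python
# Reference (untiled) definitions of the two kernels; names are hypothetical.

def spmm_csr(indptr, indices, data, B):
    """Y = S @ B, with S sparse in CSR form and B a dense list of rows."""
    n_out = len(B[0])
    Y = [[0.0] * n_out for _ in range(len(indptr) - 1)]
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            k, s = indices[p], data[p]
            for j in range(n_out):
                Y[i][j] += s * B[k][j]   # row k of B is reused per nonzero in column k
    return Y


def sddmm_csr(indptr, indices, data, A, B):
    """out[p] = S[i, k] * dot(A[i], B[k]) for every stored nonzero (i, k) of S."""
    out = [0.0] * len(data)
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]
            out[p] = data[p] * sum(a * b for a, b in zip(A[i], B[k]))
    return out


# S = [[1, 0], [0, 2]] in CSR form
indptr, indices, data = [0, 1, 2], [0, 1], [1.0, 2.0]
dense = [[1.0, 2.0], [3.0, 4.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
spmm_result = spmm_csr(indptr, indices, data, dense)
sddmm_result = sddmm_csr(indptr, indices, data, ones, dense)
```

In both loops the dense row `B[k]` is fetched once per nonzero in column `k`, which is the reuse that tiling tries to capture; how much reuse is available depends on the sparsity pattern.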
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00093
K. Zhou, Yueming Hao, J. Mellor-Crummey, Xiaozhu Meng, Xu Liu
GPGPUs are widely used in high-performance computing systems to accelerate scientific and machine learning workloads. Developing efficient GPU kernels is critically important to obtain “bare-metal” performance on GPU-based clusters. In this paper, we describe the design and implementation of GVPROF, the first value profiler that pinpoints value-related inefficiencies in applications running on NVIDIA GPU-based clusters. The novelty of GVPROF resides in its ability to detect temporal and spatial value redundancies, which provides useful information to guide code optimization. GVPROF can monitor production multi-node multi-GPU executions in clusters. Our experiments with well-known GPU benchmarks and HPC applications show that GVPROF incurs acceptable overhead and scales to large executions. Using GVPROF, we optimized several HPC and machine learning workloads on one NVIDIA V100 GPU. In one case study of LAMMPS, optimizations based on information from GVPROF led to whole-program speedups ranging from 1.37x on a single GPU to 1.08x on 64 GPUs.
{"title":"GVPROF: A Value Profiler for GPU-Based Clusters","authors":"K. Zhou, Yueming Hao, J. Mellor-Crummey, Xiaozhu Meng, Xu Liu","doi":"10.1109/SC41405.2020.00093","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00093","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"129736694","RegionNum":0,"Total":0}
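The abstract above does not define temporal value redundancy precisely. As a rough illustration, and assuming it covers a store that writes the value a location already holds, the sketch below counts such stores in a recorded trace. The trace format and the function name are hypothetical, and this is not how the profiler itself is implemented.

```python
# Toy illustration of one kind of value redundancy; all names are hypothetical.

def redundant_store_fraction(trace):
    """Fraction of stores that rewrite the value already held at that address.

    `trace` is a list of (address, value) pairs in program order.
    """
    last_value = {}
    redundant = 0
    for address, value in trace:
        if address in last_value and last_value[address] == value:
            redundant += 1            # the store did not change memory contents
        last_value[address] = value
    return redundant / len(trace) if trace else 0.0


trace = [(0, 1), (0, 1), (4, 2), (0, 3), (4, 2)]
fraction = redundant_store_fraction(trace)
```

A high fraction would suggest that some stores could be skipped or hoisted, which is the kind of hint a value profiler can give an optimizer.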
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00056
Tianxi Li, D. Shankar, Shashank Gugnani, Xiaoyi Lu
Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies that combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store individual key-value objects on PMEM-equipped servers. RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.
{"title":"RDMP-KV: Designing Remote Direct Memory Persistence based Key-Value Stores with PMEM","authors":"Tianxi Li, D. Shankar, Shashank Gugnani, Xiaoyi Lu","doi":"10.1109/SC41405.2020.00056","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00056","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"200 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"133156757","RegionNum":0,"Total":0}
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00094
Shaoqi Wang, O. J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, Thomas Y. C. Woo
Efficient GPU scheduling is the key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
{"title":"An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems","authors":"Shaoqi Wang, O. J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, Thomas Y. C. Woo","doi":"10.1109/SC41405.2020.00094","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00094","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"116682632","RegionNum":0,"Total":0}
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00014
Qiao Kang, R. Ross, R. Latham, Sunwoo Lee, Ankit Agrawal, A. Choudhary, W. Liao
As modern parallel computers enter the exascale era, the communication cost for redistributing requests becomes a significant bottleneck in MPI-IO routines. The communication kernel for request redistribution, which has an all-to-many personalized communication pattern for application programs with a large number of noncontiguous requests, plays an essential role in the overall performance. This paper explores the available communication kernels for two-phase I/O communication. We generalize the spread-out algorithm to adapt to the all-to-many communication pattern of two-phase I/O by reducing the communication straggler effect. Communication throttling methods that reduce communication contention for asynchronous MPI implementations are adopted to improve communication performance further. Experimental results are presented using different communication kernels running on the Cray XC40 Cori and IBM AC922 Summit supercomputers with different I/O patterns. Our study shows that adjusting communication kernel algorithms for different I/O patterns can improve the end-to-end performance by up to 10 times compared with default MPI-IO implementations.
{"title":"Improving All-to-Many Personalized Communication in Two-Phase I/O","authors":"Qiao Kang, R. Ross, R. Latham, Sunwoo Lee, Ankit Agrawal, A. Choudhary, W. Liao","doi":"10.1109/SC41405.2020.00014","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00014","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"134436358","RegionNum":0,"Total":0}
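The abstract above does not spell out how the spread-out algorithm is generalized. The sketch below only illustrates the underlying rotation idea under a simple assumption: each sender visits the receivers in a rotated order, so that in any one round the senders are spread across the receivers rather than all targeting the same one. The function names are hypothetical, and this should not be read as the authors' algorithm.

```python
# Rotated send order for an all-to-many pattern; a sketch, not the paper's method.
from collections import Counter


def rotated_schedule(num_senders, receivers):
    """rounds[r] lists the (sender, receiver) pairs issued in round r."""
    m = len(receivers)
    return [
        [(s, receivers[(s + r) % m]) for s in range(num_senders)]
        for r in range(m)
    ]


def max_receiver_load(round_pairs):
    """Largest number of senders that target one receiver within a round."""
    return max(Counter(dst for _, dst in round_pairs).values())


# Four sender ranks, two hypothetical aggregator ranks (0 and 2).
rounds = rotated_schedule(num_senders=4, receivers=[0, 2])
```

With four senders and two receivers, each receiver handles two senders per round; a naive order in which every sender starts with the same receiver would put all four on one receiver at once.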
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00088
Tirthak Patel, Zhengchun Liu, R. Kettimuthu, P. Rich, W. Allcock, Devesh Tiwari
HPC workload analysis and resource consumption characteristics are the key to driving better operation practices, informing system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community does not have easy access to long-term introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. This study is one of the largest of its kind, covering trends and characteristics for over three billion compute hours and 750 thousand jobs, and spanning a decade. We confirm several pieces of long-held conventional wisdom and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirements of future jobs with high accuracy, using features available prior to job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
{"title":"Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications","authors":"Tirthak Patel, Zhengchun Liu, R. Kettimuthu, P. Rich, W. Allcock, Devesh Tiwari","doi":"10.1109/SC41405.2020.00088","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00088","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124906358","RegionNum":0,"Total":0}
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00031
Maciej Besta, Marcel Schneider, Marek Konieczny, Karolina Cynk, Erik Henriksson, S. D. Girolamo, Ankit Singla, T. Hoefler
We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in HPC supercomputers as well as in cloud data centers and clusters. FatPaths exposes and exploits the rich (“fat”) diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned “purified” transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2× lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list, and it may become a standard routing scheme for modern topologies. Extended paper version: https://arxiv.org/abs/1906.10885
{"title":"FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short","authors":"Maciej Besta, Marcel Schneider, Marek Konieczny, Karolina Cynk, Erik Henriksson, S. D. Girolamo, Ankit Singla, T. Hoefler","doi":"10.1109/SC41405.2020.00031","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00031","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"124195517","RegionNum":0,"Total":0}
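Flowlet switching, mentioned in the abstract above, is commonly described as splitting a flow into bursts separated by idle gaps and letting each burst take a different path; if the gap is long enough, packets of consecutive bursts cannot overtake each other. The sketch below illustrates that general idea only. The gap threshold, the round-robin path choice, and the function name are assumptions for this example and are not taken from FatPaths.

```python
# Generic flowlet-style path assignment; parameters and names are hypothetical.

def assign_paths(arrival_times, num_paths, gap):
    """Return one path index per packet of a single flow.

    A new flowlet starts whenever the idle time since the previous packet
    exceeds `gap`; each new flowlet moves to the next path (round robin).
    """
    paths = []
    current = 0
    previous = None
    for t in arrival_times:
        if previous is not None and t - previous > gap:
            current = (current + 1) % num_paths   # idle gap seen: switch path
        paths.append(current)
        previous = t
    return paths


packet_paths = assign_paths([0.0, 0.1, 0.2, 1.5, 1.6, 3.0], num_paths=2, gap=0.5)
```

Packets inside one burst stay on the same path, so they arrive in order; only the burst boundaries are used as switching points.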
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00036
Quan Chen, Shuai Xue, Shang Zhao, Shanpei Chen, Yihao Wu, Yu Xu, Zhuo Song, Tao Ma, Yong Yang, M. Guo
The tenants of public cloud platforms share hardware resources on the same node, resulting in the potential for performance interference (or malicious attacks). A tenant is able to degrade the performance of its neighbors on the same node significantly through overuse of the shared memory bus, last level cache (LLC)/memory bandwidth, and power. To eliminate such unfairness, we propose Alita, a runtime system consisting of an online interference identifier and an adaptive interference eliminator. The interference identifier monitors hardware and system-level event statistics to identify resource polluters. The eliminator improves the performance of normal applications by throttling only the resource usage of polluters. Specifically, Alita adopts bus lock sparsification, bias LLC/bandwidth isolation, and selective power throttling to throttle the resource usage of polluters. Results for an experimental platform and an in-production cloud platform with 30,000 nodes demonstrate that Alita significantly improves the performance of co-located virtual machines in the presence of resource polluters based on system-level knowledge.
{"title":"Alita: Comprehensive Performance Isolation through Bias Resource Management for Public Clouds","authors":"Quan Chen, Shuai Xue, Shang Zhao, Shanpei Chen, Yihao Wu, Yu Xu, Zhuo Song, Tao Ma, Yong Yang, M. Guo","doi":"10.1109/SC41405.2020.00036","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00036","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"132457760","RegionNum":0,"Total":0}
Pub Date: 2020-11-01 DOI: 10.1109/SC41405.2020.00065
Isaac Boixaderas, D. Zivanovic, Sergi Moré, Javier Bartolome, David Vicente, Marc Casas, P. Carpenter, Petar Radojkovic, E. Ayguadé
This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost–benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost–benefit calculation.
{"title":"Cost-Aware Prediction of Uncorrected DRAM Errors in the Field","authors":"Isaac Boixaderas, D. Zivanovic, Sergi Moré, Javier Bartolome, David Vicente, Marc Casas, P. Carpenter, Petar Radojkovic, E. Ayguadé","doi":"10.1109/SC41405.2020.00065","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00065","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"46 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","isOpenAccess":false,"platform":"Semanticscholar","paperid":"130797225","RegionNum":0,"Total":0}
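To illustrate why the abstract above argues that precision and recall alone are insufficient, the sketch below scores a predictor by net node-hours saved. The cost model is an assumption made for this example (a fixed saving per correctly predicted failure and a fixed cost per mitigation action), and the numbers are made up; none of it comes from the paper.

```python
# Toy cost-benefit score for a failure predictor; model and numbers are hypothetical.

def net_saving(true_pos, false_pos, saved_per_hit, cost_per_action):
    """Node-hours saved by correct predictions minus the cost of every mitigation."""
    return true_pos * saved_per_hit - (true_pos + false_pos) * cost_per_action


def precision(true_pos, false_pos):
    return true_pos / (true_pos + false_pos)


# Predictor A raises many alarms; predictor B raises few but is always right.
saving_a = net_saving(true_pos=10, false_pos=40, saved_per_hit=100, cost_per_action=2)
saving_b = net_saving(true_pos=5, false_pos=0, saved_per_hit=100, cost_per_action=2)
```

Under these made-up costs, predictor A has a precision of 0.2 yet saves more node-hours than predictor B with a precision of 1.0, because a mitigation is cheap relative to a missed failure. With different costs the ranking could reverse, which is why the cost model has to be stated explicitly.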