Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00044
TOSS-2020: A Commodity Software Stack for HPC
E. León, T. D'Hooge, Nathan Hanford, I. Karlin, R. Pankajakshan, Jim Foraker, C. Chambreau, M. Leininger
The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users, including platform-specific software that tends to stagnate over the lifetime of the system. In this paper, we present the Tri-Laboratory Operating System Stack (TOSS), a production simulation environment based on Linux and open source software, with proprietary software components integrated as needed. TOSS, focused on mid-to-large scale commodity HPC systems, provides a common simulation environment across system architectures, reduces the learning curve on new systems, and benefits from a lineage of past experience and bug fixes. To further the scope and applicability of TOSS, we demonstrate its feasibility and effectiveness on a leadership-class supercomputer architecture. Our evaluation, relative to the vendor stack, includes an analysis of resource manager complexity, system noise, networking, and application performance.
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00033
ZeroSpy: Exploring Software Inefficiency with Redundant Zeros
Xin You, Hailong Yang, Zhongzhi Luan, D. Qian, Xu Liu
Redundant zeros cause inefficiencies in which the zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computations that waste memory bandwidth and CPU resources. Optimizing compilers have difficulty eliminating these zero-related inefficiencies due to limitations in static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy, a fine-grained profiler that identifies redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the source lines and calling contexts where the redundant zeros occur. The experimental results demonstrate that ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.
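The inefficiency ZeroSpy targets can be pictured in a dot product: a multiply with a zero operand still costs a memory load and an arithmetic operation while contributing nothing to the result. A minimal Python sketch of that accounting (illustration only; the real tool instruments running binaries and attributes the waste to source lines and calling contexts):

```python
def profile_redundant_zeros(a, b):
    """Count multiplies in a dot product where an operand is zero.

    Each such multiply is an identity computation: it contributes
    nothing to the result, yet the zero value was still loaded from
    memory and fed through the ALU.
    """
    total = len(a)
    wasted = sum(1 for x, y in zip(a, b) if x == 0 or y == 0)
    return wasted, total

# Half-sparse inputs: 3 of the 4 multiplies are wasted on zeros.
wasted, total = profile_redundant_zeros([0, 2, 0, 4], [1, 1, 1, 0])
```

A high wasted/total ratio on a hot loop is the kind of signal that points at a data-structure or algorithm fix.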
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00102
CCAMP: An Integrated Translation and Optimization Framework for OpenACC and OpenMP
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP's device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.
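The translation facility can be pictured as a directive-level mapping between the two standards. The sketch below uses a tiny hand-written table over source text; this is a hypothetical simplification, since CCAMP itself translates inside a compiler framework and must also handle clauses (e.g., OpenACC copyin/copyout versus OpenMP map(to:)/map(from:)), not just the directive head:

```python
# Hypothetical directive table: OpenACC compute directives and their
# closest OpenMP offloading counterparts (directive head only).
ACC_TO_OMP = {
    "acc parallel loop": "omp target teams distribute parallel for",
    "acc kernels": "omp target",
}

def translate_line(line):
    """Rewrite an OpenACC pragma into its OpenMP counterpart."""
    for acc, omp in ACC_TO_OMP.items():
        if "#pragma " + acc in line:
            return line.replace("#pragma " + acc, "#pragma " + omp)
    return line  # non-directive lines pass through untouched

translated = translate_line("#pragma acc parallel loop")
```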
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00072
Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, Bo Li
Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance, as placement constraints only provide qualitative scheduling guidelines and minimizing constraint violations does not necessarily result in optimal performance. In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action spaces. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. A large-scale EC2 deployment shows that, compared with traditional constraint-based schedulers, Metis improves throughput by up to 61%, optimizes various performance metrics, and easily scales to a large cluster where 3K containers run on over 700 machines.
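The hierarchical decomposition can be illustrated with a toy two-level placement: first choose a machine group, then choose a machine within the group, so each decision faces a small action space instead of one over the whole cluster. In this sketch a least-loaded greedy rule stands in for the trained RL agents, and all names are hypothetical:

```python
def hierarchical_place(containers, machines, group_size=4):
    """Two-level placement: pick a machine group, then a machine in it.

    Each level's decision space is small, which is the point of the
    hierarchical decomposition; a least-loaded heuristic stands in
    for the trained RL agents at both levels.
    """
    groups = [machines[i:i + group_size]
              for i in range(0, len(machines), group_size)]
    load = {m: 0 for m in machines}
    placement = {}
    for c in containers:
        # Level 1: choose the least-loaded group.
        grp = min(groups, key=lambda g: sum(load[m] for m in g))
        # Level 2: choose the least-loaded machine within that group.
        m = min(grp, key=lambda mm: load[mm])
        placement[c] = m
        load[m] += 1
    return placement

machines = ["m%d" % i for i in range(8)]
placement = hierarchical_place(["c%d" % i for i in range(8)], machines)
```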
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00046
Iris: Allocation Banking and Identity and Access Management for the Exascale Era
Gabor Torok, Mark R. Day, R. Hartman-Baker, C. Snavely
Without a reliable and scalable system for managing authorized users and ensuring they receive their allocated share of computational and storage resources, modern HPC centers would not be able to function. Exascale will amplify these demands with greater machine scale, more users, higher job throughput, and ever-increasing need for management insight and automation throughout the HPC environment. When our legacy system reached retirement age, NERSC took the opportunity to design and build Iris not only to meet our current needs, with 8,000 users and tens of thousands of jobs per day, but also to scale well into the exascale era. In this paper, we describe how we have designed Iris to meet these needs and discuss its key features as well as our implementation experience.
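At its core, allocation banking is bookkeeping: a project holds a balance of node-hours and completed jobs debit it. A toy sketch of that idea, with an entirely hypothetical interface (Iris's actual data model, accounting rules, and enforcement are far richer):

```python
class AllocationBank:
    """Toy allocation bank: projects hold node-hour balances that
    completed jobs debit. Hypothetical interface for illustration."""

    def __init__(self):
        self.balances = {}

    def grant(self, project, node_hours):
        self.balances[project] = self.balances.get(project, 0.0) + node_hours

    def charge(self, project, nodes, hours):
        cost = nodes * hours
        if cost > self.balances.get(project, 0.0):
            raise ValueError("insufficient allocation for " + project)
        self.balances[project] -= cost
        return self.balances[project]

bank = AllocationBank()
bank.grant("m0001", 1000.0)                  # hypothetical project id
remaining = bank.charge("m0001", 64, 2.0)    # a 64-node, 2-hour job
```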
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00011
A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery
Ankit Srivastava, Sriram P. Chockalingam, S. Aluru
Bayesian networks (BNs) are a widely used graphical model in machine learning. As learning the structure of BNs is NP-hard, high-performance computing methods are necessary for constructing large-scale networks. In this paper, we present a parallel framework to scale BN structure learning algorithms to tens of thousands of variables. Our framework is applicable to learning algorithms that rely on the discovery of Markov blankets (MBs) as an intermediate step. We demonstrate the applicability of our framework by parallelizing three different algorithms: Grow-Shrink (GS), Incremental Association MB (IAMB), and Interleaved IAMB (Inter-IAMB). Our implementations are able to construct BNs from real data sets with tens of thousands of variables and thousands of observations in less than a minute on 1024 cores, with a speedup of up to 845X and 82.5% efficiency. Furthermore, we demonstrate using simulated data sets that our proposed parallel framework can scale to BNs of even higher dimensionality.
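The Grow-Shrink algorithm named above is short enough to sketch: grow a candidate Markov blanket by adding variables that test dependent on the target given the current blanket, then shrink it by removing variables that turn out conditionally independent. A serial Python sketch with a pluggable independence test (the paper's contribution is parallelizing this pattern at scale, not this loop; in practice `indep` is a statistical test on data, here replaced by an exact oracle for a toy chain):

```python
def grow_shrink(target, variables, indep):
    """Grow-Shrink Markov blanket discovery (serial sketch).

    `indep(x, y, cond)` returns True when x and y are conditionally
    independent given the set `cond`.
    """
    mb = []
    # Grow: add any variable still dependent on target given the MB.
    changed = True
    while changed:
        changed = False
        for v in variables:
            if v != target and v not in mb and not indep(target, v, set(mb)):
                mb.append(v)
                changed = True
    # Shrink: drop variables independent of target given the rest.
    for v in list(mb):
        if indep(target, v, set(mb) - {v}):
            mb.remove(v)
    return set(mb)

def oracle(x, y, cond):
    # Exact independence oracle for the chain A -> B -> C:
    # A and C are independent only once B is conditioned on.
    return {x, y} == {"A", "C"} and "B" in cond

mb_of_b = grow_shrink("B", ["A", "B", "C"], oracle)
mb_of_a = grow_shrink("A", ["A", "B", "C"], oracle)
```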
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00064
Massive Parallelization for Finding Shortest Lattice Vectors Based on Ubiquity Generator Framework
Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa
Lattice-based cryptography has received attention as a next-generation encryption technique, because it is believed to be secure against attacks by classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In cryptography, to determine security levels, it is becoming significantly more important to estimate the hardness of the SVP by high-performance computing. In this study, we develop the world's first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It can parallelize algorithms for solving the SVP by applying the Ubiquity Generator framework, a generic framework for branch-and-bound algorithms. MAP-SVP is suitable for massive-scale parallelization owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate the performance and scalability of MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.
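To make the SVP concrete: given a lattice basis, the task is to find the shortest nonzero integer combination of the basis vectors. A brute-force sketch over a small coefficient box shows the shape of the problem, and why serious solvers need enormous, carefully parallelized search trees (exhaustive enumeration is exponential in the dimension):

```python
import itertools

def shortest_vector_bruteforce(basis, bound=3):
    """Shortest nonzero lattice vector by exhaustive search over the
    coefficient box [-bound, bound]^dim. The search space has
    (2*bound + 1)^dim points, so this is only feasible for toy
    dimensions."""
    dim, ncols = len(basis), len(basis[0])
    best_v, best_norm2 = None, None
    for coeffs in itertools.product(range(-bound, bound + 1), repeat=dim):
        if not any(coeffs):
            continue  # skip the zero vector
        v = [sum(c * basis[i][j] for i, c in enumerate(coeffs))
             for j in range(ncols)]
        n2 = sum(x * x for x in v)
        if best_norm2 is None or n2 < best_norm2:
            best_v, best_norm2 = v, n2
    return best_v, best_norm2

# Toy 2-D basis; (0, 1) = 2*(1, 2) - (2, 3) is a shortest vector.
best_v, best_norm2 = shortest_vector_bruteforce([[1, 2], [2, 3]])
```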
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00069
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
Saurabh Jha, Shengkun Cui, Subho Sankar Banerjee, Tianyin Xu, J. Enos, M. Showerman, T. Kalbarczyk, R. Iyer
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing them. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause of the failure in near real time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% of 843 real-world production issues and pinpointed the root causes of 95.8% of them, with less than 0.01% runtime overhead.
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00017
Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters
Ang Li, Omer Subasi, Xiu Yang, S. Krishnamoorthy
As quantum computers evolve, simulations of quantum programs on classical computers will be essential in validating quantum algorithms, understanding the effect of system noise, and designing applications for future quantum computers. In this paper, we first propose a new multi-GPU programming methodology called MG-BSP, which constructs a virtual BSP machine on top of modern multi-GPU platforms, and apply this methodology to build a multi-GPU density matrix quantum simulator called DM-Sim. We propose a new formulation that can significantly reduce communication overhead, and show that this formula transformation conserves the semantics despite the noise being introduced. We build the tool-chain for the simulator to run open standard quantum assembly code, execute synthesized quantum circuits, and perform ultra-deep and large-scale simulations. We evaluated DM-Sim on several state-of-the-art multi-GPU platforms, including NVIDIA's Pascal/Volta DGX-1, DGX-2, and ORNL's Summit supercomputer. In particular, we have demonstrated the simulation of one million general gates in 94 minutes on DGX-2, far deeper circuits than have been demonstrated in prior work. Our simulator is more than 10x faster than the corresponding state-vector quantum simulators on GPUs and other platforms. The DM-Sim simulator is released at https://github.com/pnnl/DM-Sim.
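The density-matrix formalism that DM-Sim scales up evolves rho -> U rho U† for each gate, and, unlike state-vector simulation, it can also represent the mixed states produced by noise channels. A single-qubit sketch of that math in plain Python (illustrating the formalism only, not DM-Sim's GPU formulation):

```python
import math

def matmul(a, b):
    """2x2 complex matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

# Density matrix of the pure state |0>.
rho = [[1 + 0j, 0j], [0j, 0j]]

# Apply a Hadamard gate: rho -> H rho H†, giving |+><+|.
h = 1 / math.sqrt(2)
H = [[h + 0j, h + 0j], [h + 0j, -h + 0j]]
rho = matmul(matmul(H, rho), dagger(H))

# A depolarizing channel mixes rho toward the maximally mixed state
# I/2 -- the kind of noise a state vector alone cannot represent.
p = 0.1
identity = [[1 + 0j, 0j], [0j, 1 + 0j]]
rho = [[(1 - p) * rho[i][j] + p * identity[i][j] / 2 for j in range(2)]
       for i in range(2)]
```

The trace stays 1 throughout, while the off-diagonal coherences shrink from 0.5 to 0.45 under the channel.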
Pub Date: 2020-11-01  DOI: 10.1109/SC41405.2020.00100
Term Quantization: Furthering Quantization at Run Time
H. T. Kung, Bradley McDanel, S. Zhang
We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time to improve the computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on the power-of-two terms in expressions of values. In a dot-product computation, TQ dynamically selects a fixed number of the largest terms to use from the values of the two vectors. By exploiting the weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between 3x and 10x) compared to conventional uniform quantization at the same level of model performance.
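The term-selection idea can be sketched per value: write an already-quantized value as its power-of-two terms and keep only the largest few. (A simplification: per the abstract, TQ budgets terms across the values of the two vectors in a dot product, not per individual value as here.)

```python
def power_of_two_terms(x):
    """Power-of-two terms of a non-negative integer, largest first:
    13 = 0b1101 -> [8, 4, 1]."""
    return [1 << i for i in range(x.bit_length() - 1, -1, -1) if (x >> i) & 1]

def term_quantize(x, budget):
    """Keep only the `budget` largest power-of-two terms of x,
    dropping the low-order terms that contribute least to the value."""
    return sum(power_of_two_terms(x)[:budget])

truncated = term_quantize(13, 2)   # keeps 8 + 4; the trailing 1 is dropped
```

Because DNN weights and activations are concentrated near zero, most values have few significant terms, so a small per-group term budget loses little while bounding the work each processing element must do.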