
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

TOSS-2020: A Commodity Software Stack for HPC
E. León, T. D'Hooge, Nathan Hanford, I. Karlin, R. Pankajakshan, Jim Foraker, C. Chambreau, M. Leininger
The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users, including platform-specific software that tends to stagnate over the lifetime of the system. In this paper, we present the Tri-Laboratory Operating System Stack (TOSS), a production simulation environment based on Linux and open source software, with proprietary software components integrated as needed. TOSS, focused on mid-to-large scale commodity HPC systems, provides a common simulation environment across system architectures, reduces the learning curve on new systems, and benefits from a lineage of past experience and bug fixes. To further the scope and applicability of TOSS, we demonstrate its feasibility and effectiveness on a leadership-class supercomputer architecture. Our evaluation, relative to the vendor stack, includes an analysis of resource manager complexity, system noise, networking, and application performance.
DOI: https://doi.org/10.1109/SC41405.2020.00044 · Published: 2020-11-01
Citations: 1
ZeroSpy: Exploring Software Inefficiency with Redundant Zeros
Xin You, Hailong Yang, Zhongzhi Luan, D. Qian, Xu Liu
Redundant zeros cause inefficiencies in which the zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computation that waste memory bandwidth and CPU resources. Optimizing compilers have difficulty eliminating these zero-related inefficiencies due to limitations in static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy, a fine-grained profiler to identify redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the locations where the redundant zeros happen in source lines and calling contexts. The experimental results demonstrate ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.
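The kind of waste ZeroSpy targets can be illustrated with a toy counter (a hypothetical sketch, not the authors' tool, which instruments binaries at run time): tally how often a loaded operand is zero in a dense matrix-vector product, where every `0 * x` is an identity computation.

```python
# Toy illustration of redundant-zero profiling (not the ZeroSpy tool):
# count how many operand loads in a dense mat-vec product are zeros.
def profile_zero_loads(matrix, vector):
    """Return (result, total_loads, zero_loads) for a dense mat-vec product."""
    total = zero = 0
    result = []
    for row in matrix:
        acc = 0.0
        for a, x in zip(row, vector):
            total += 1
            if a == 0.0:
                zero += 1  # identity computation: acc += 0 * x wastes work
            acc += a * x
        result.append(acc)
    return result, total, zero

m = [[0.0, 2.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 3.0]]
v = [1.0, 1.0, 1.0, 1.0]
res, total, zeros = profile_zero_loads(m, v)
print(res, total, zeros)  # [2.0, 4.0] 8 5 -> 5/8 of loads are redundant zeros
```

A high zero fraction, attributed to a source line and calling context, is the signal that a sparse data structure or a guarded computation would pay off.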
DOI: https://doi.org/10.1109/SC41405.2020.00033 · Published: 2020-11-01
Citations: 7
CCAMP: An Integrated Translation and Optimization Framework for OpenACC and OpenMP
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP’s device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.
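The directive correspondence CCAMP automates can be sketched at the string level (an illustrative toy only; CCAMP performs real source-to-source translation and device-specific optimization inside a compiler framework, and the mapping table below is an assumption based on commonly cited OpenACC/OpenMP offloading analogues, not CCAMP's actual rules):

```python
# Toy OpenACC -> OpenMP offloading directive rewriter (illustrative;
# the mapping entries are assumed analogues, not CCAMP's rule set).
ACC_TO_OMP = {
    "#pragma acc parallel loop":
        "#pragma omp target teams distribute parallel for",
    "#pragma acc loop vector":
        "#pragma omp simd",
}

def translate_line(line):
    """Rewrite a recognized OpenACC directive as its OpenMP analogue."""
    stripped = line.strip()
    for acc, omp in ACC_TO_OMP.items():
        if stripped.startswith(acc):
            return line.replace(acc, omp)
    return line  # non-directive lines pass through unchanged

print(translate_line("#pragma acc parallel loop"))
# -> #pragma omp target teams distribute parallel for
```

Real translation must also handle clauses (data mappings, gang/worker/vector sizing) whose best values differ per device, which is where CCAMP's device-specific directive optimization comes in.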
DOI: https://doi.org/10.1109/SC41405.2020.00102 · Published: 2020-11-01
Citations: 12
Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, Bo Li
Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance as placement constraints only provide qualitative scheduling guidelines, and minimizing constraint violations does not necessarily result in the optimal performance. In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action space. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. Large-scale EC2 deployment shows that compared with the traditional constraint-based schedulers, Metis improves the throughput by up to 61%, optimizes various performance metrics, and easily scales to a large cluster where 3K containers run on over 700 machines.
DOI: https://doi.org/10.1109/SC41405.2020.00072 · Published: 2020-11-01
Citations: 26
Iris: Allocation Banking and Identity and Access Management for the Exascale Era
Gabor Torok, Mark R. Day, R. Hartman-Baker, C. Snavely
Without a reliable and scalable system for managing authorized users and ensuring they receive their allocated share of computational and storage resources, modern HPC centers would not be able to function. Exascale will amplify these demands with greater machine scale, more users, higher job throughput, and ever-increasing need for management insight and automation throughout the HPC environment. When our legacy system reached retirement age, NERSC took the opportunity to design and build Iris not only to meet our current needs, with 8,000 users and tens of thousands of jobs per day, but also to scale well into the exascale era. In this paper, we describe how we have designed Iris to meet these needs and discuss its key features as well as our implementation experience.
DOI: https://doi.org/10.1109/SC41405.2020.00046 · Published: 2020-11-01
Citations: 3
A Parallel Framework for Constraint-Based Bayesian Network Learning via Markov Blanket Discovery
Ankit Srivastava, Sriram P. Chockalingam, S. Aluru
Bayesian networks (BNs) are a widely used graphical model in machine learning. As learning the structure of BNs is NP-hard, high-performance computing methods are necessary for constructing large-scale networks. In this paper, we present a parallel framework to scale BN structure learning algorithms to tens of thousands of variables. Our framework is applicable to learning algorithms that rely on the discovery of Markov blankets (MBs) as an intermediate step. We demonstrate the applicability of our framework by parallelizing three different algorithms: Grow-Shrink (GS), Incremental Association MB (IAMB), and Interleaved IAMB (Inter-IAMB). Our implementations are able to construct BNs from real data sets with tens of thousands of variables and thousands of observations in less than a minute on 1024 cores, with a speedup of up to 845X and 82.5% efficiency. Furthermore, we demonstrate using simulated data sets that our proposed parallel framework can scale to BNs of even higher dimensionality.
DOI: https://doi.org/10.1109/SC41405.2020.00011 · Published: 2020-11-01
Citations: 8
Massive Parallelization for Finding Shortest Lattice Vectors Based on Ubiquity Generator Framework
Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa
Lattice-based cryptography has received attention as a next-generation encryption technique, because it is believed to be secure against attacks by classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In cryptography, estimating the hardness of the SVP through high-performance computing is becoming significantly more important for determining security levels. In this study, we develop the world’s first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It can parallelize algorithms for solving the SVP by applying the Ubiquity Generator framework, which is a generic framework for branch-and-bound algorithms. The MAP-SVP is suitable for massive-scale parallelization, owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate the performance and scalability of the MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.
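The underlying problem can be stated concretely with a brute-force toy (illustration only; this exponential search is exactly what makes large SVP instances hard, and MAP-SVP parallelizes far more sophisticated branch-and-bound enumeration, not this): find the shortest nonzero integer combination of the basis vectors.

```python
# Brute-force SVP on a tiny 2D lattice: search small integer
# combinations c1*b1 + c2*b2 for the shortest nonzero vector.
# (Illustrative only; real solvers use enumeration/sieving.)
from itertools import product

def shortest_vector(basis, bound=5):
    """Search coefficients |ci| <= bound; return (vector, squared_norm)."""
    dim = len(basis[0])
    best, best_norm2 = None, float("inf")
    for coeffs in product(range(-bound, bound + 1), repeat=len(basis)):
        if all(c == 0 for c in coeffs):
            continue  # the zero vector is excluded by definition
        v = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(dim)]
        n2 = sum(x * x for x in v)
        if n2 < best_norm2:
            best, best_norm2 = v, n2
    return best, best_norm2

v, n2 = shortest_vector([[1, 1], [0, 2]])
print(v, n2)  # a vector of squared norm 2, e.g. (±1, ±1)
```

The search space grows exponentially with the lattice dimension, which is why hardness estimates for cryptographic parameter sizes require supercomputer-scale parallelism.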
DOI: https://doi.org/10.1109/SC41405.2020.00064 · Published: 2020-11-01
Citations: 5
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
Saurabh Jha, Shengkun Cui, Subho Sankar Banerjee, Tianyin Xu, J. Enos, M. Showerman, T. Kalbarczyk, R. Iyer
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
DOI: https://doi.org/10.1109/SC41405.2020.00069 · Published: 2020-11-01
Citations: 6
Density Matrix Quantum Circuit Simulation via the BSP Machine on Modern GPU Clusters
Ang Li, Omer Subasi, Xiu Yang, S. Krishnamoorthy
As quantum computers evolve, simulations of quantum programs on classical computers will be essential in validating quantum algorithms, understanding the effect of system noise, and designing applications for future quantum computers. In this paper, we first propose a new multi-GPU programming methodology called MG-BSP which constructs a virtual BSP machine on top of modern multi-GPU platforms, and apply this methodology to build a multi-GPU density matrix quantum simulator called DM-Sim. We propose a new formulation that can significantly reduce communication overhead, and show that this formula transformation can conserve the semantics despite noise being introduced. We build the tool-chain for the simulator to run open standard quantum assembly code, execute synthesized quantum circuits, and perform ultra-deep and large-scale simulations. We evaluated DM-Sim on several state-of-the-art multi-GPU platforms including NVIDIA’s Pascal/Volta DGX-1, DGX-2, and ORNL’s Summit supercomputer. In particular, we have demonstrated the simulation of one million general gates in 94 minutes on DGX-2, far deeper circuits than has been demonstrated in prior works. Our simulator is more than 10x faster with respect to the corresponding state-vector quantum simulators on GPUs and other platforms. The DM-Sim simulator is released at: https://github.com/pnnl/DM-Sim.
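The core update a density-matrix simulator performs per gate is ρ' = U ρ U†. A single-qubit sketch in pure Python shows the computation (illustrative only; DM-Sim distributes many-qubit versions of this update across GPUs as BSP supersteps):

```python
# Single-qubit density-matrix gate application: rho' = U rho U†.
# (Toy sketch; DM-Sim scales this to many qubits across GPUs.)
import math

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def dagger(U):
    """Conjugate transpose of a square matrix."""
    return [[U[j][i].conjugate() for j in range(len(U))]
            for i in range(len(U))]

def apply_gate(rho, U):
    return matmul(matmul(U, rho), dagger(U))

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]            # Hadamard gate
rho0 = [[1.0, 0.0], [0.0, 0.0]]  # pure state |0><0|
rho1 = apply_gate(rho0, H)
print(rho1)  # ~[[0.5, 0.5], [0.5, 0.5]]: equal superposition
```

Unlike a state vector, the density matrix also represents mixed states, which is what lets such a simulator model system noise directly.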
DOI: https://doi.org/10.1109/SC41405.2020.00017 · Published: 2020-11-01
Cited by: 20
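The density-matrix formalism the abstract refers to evolves a state ρ under a gate U as ρ → UρU†, and, unlike a pure state vector, can also represent the mixed states produced by noise channels. Below is a minimal single-qubit sketch in NumPy; it is not DM-Sim's multi-GPU implementation, and the Hadamard gate and depolarizing channel are illustrative choices:

```python
import numpy as np

# Single-qubit density matrix for the pure state |0><0|
rho = np.array([[1, 0], [0, 0]], dtype=complex)

# Hadamard gate
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

def apply_gate(rho, U):
    """Unitary evolution of a density matrix: rho -> U rho U^dagger."""
    return U @ rho @ U.conj().T

def depolarize(rho, p):
    """Single-qubit depolarizing channel: mix rho with the maximally mixed state I/2."""
    I = np.eye(2, dtype=complex)
    return (1 - p) * rho + p * I / 2

rho = apply_gate(rho, H)    # now |+><+|
rho = depolarize(rho, 0.1)  # inject 10% depolarizing noise

# Probability of measuring |0> is the diagonal entry rho[0, 0]
p0 = rho[0, 0].real
```

The noise step leaves the diagonal (and hence measurement probabilities) intact here but damps the off-diagonal coherences, which is exactly the kind of effect a state-vector simulator cannot capture directly.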
Term Quantization: Furthering Quantization at Run Time 项量化:在运行时进一步量化
H. T. Kung, Bradley McDanel, S. Zhang
We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time for improved computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on power-of-two terms in expressions of values. When computing a dot product, TQ dynamically selects a fixed number of largest terms to use from values of the two vectors. By exploiting weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between $3-10\times$) compared to conventional uniform quantization for the same level of model performance.
我们提出了一种新的技术,称为项量化(TQ),用于在运行时进一步量化,以提高已经用传统量化方法量化的深度神经网络(DNN)的计算效率。TQ作用于值表达式中的2次幂项。在计算点积时,TQ动态地从两个向量的值中选择一个固定数量的最大项来使用。通过利用DNN中通常存在的权重和数据分布,TQ对DNN模型性能的影响最小(例如,准确性或困惑度)。我们使用TQ来促进紧密同步的处理器阵列,例如收缩阵列,以实现高效的并行处理。我们在MNIST的MLP、ImageNet的多个CNN和Wikitext-2的LSTM上评估TQ。我们证明了在相同水平的模型性能下,与传统的均匀量化相比,推理计算成本显著降低(在3-10倍之间)。
{"title":"Term Quantization: Furthering Quantization at Run Time","authors":"H. T. Kung, Bradley McDanel, S. Zhang","doi":"10.1109/SC41405.2020.00100","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00100","url":null,"abstract":"We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time for improved computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on power-of-two terms in expressions of values. When computing a dot product, TQ dynamically selects a fixed number of largest terms to use from values of the two vectors. By exploiting weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between $3-10\times$) compared to conventional uniform quantization for the same level of model performance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115546342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 6
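Term Quantization operates on the power-of-two terms of an already-quantized value, i.e. its set bits: keeping only the k largest terms shortens the multiply-accumulate work at a small accuracy cost. The sketch below truncates each scalar operand independently; this is a simplification of the paper's scheme, which selects a fixed term budget dynamically across groups of values:

```python
def power_of_two_terms(x):
    """Decompose a non-negative integer into its power-of-two terms (set bits)."""
    return [1 << i for i in range(x.bit_length()) if (x >> i) & 1]

def truncate_terms(x, k):
    """Keep only the k largest power-of-two terms of x (per-value term quantization)."""
    terms = sorted(power_of_two_terms(x), reverse=True)
    return sum(terms[:k])

def tq_dot(a, b, k):
    """Dot product where each operand is first reduced to its k largest terms."""
    return sum(truncate_terms(x, k) * truncate_terms(y, k) for x, y in zip(a, b))

# 182 = 0b10110110 has terms [2, 4, 16, 32, 128]; its two largest sum to 160
tq_dot([182, 7], [3, 5], 2)  # 160*3 + 6*5 = 510, versus the exact dot product 581
```

Because DNN weight and activation magnitudes are heavily concentrated, the discarded low-order terms usually contribute little to the dot product, which is why the paper reports minimal accuracy loss.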