
Latest publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

ZeroSpy: Exploring Software Inefficiency with Redundant Zeros
Xin You, Hailong Yang, Zhongzhi Luan, D. Qian, Xu Liu
Redundant zeros cause inefficiencies in which zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computations that waste memory bandwidth and CPU resources. Optimizing compilers have difficulty eliminating these zero-related inefficiencies due to the limitations of static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy, a fine-grained profiler that identifies redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the source lines and calling contexts where the redundant zeros occur. The experimental results demonstrate that ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.
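ZeroSpy's core idea, a runtime profile that attributes the fraction of zero-valued loads to source locations, can be sketched in a few lines (the function name, input shape, and threshold below are illustrative, not ZeroSpy's actual interface):

```python
from collections import Counter

def zero_load_report(loads, threshold=0.5):
    """loads: observed (source_location, loaded_value) pairs.
    Returns locations where the fraction of zero loads exceeds threshold."""
    total, zeros = Counter(), Counter()
    for loc, value in loads:
        total[loc] += 1
        if value == 0:
            zeros[loc] += 1
    return {loc: zeros[loc] / total[loc]
            for loc in total if zeros[loc] / total[loc] > threshold}

report = zero_load_report([
    ("kernel.c:12", 0.0), ("kernel.c:12", 0.0), ("kernel.c:12", 3.0),
    ("kernel.c:40", 1.0), ("kernel.c:40", 2.0),
])  # flags kernel.c:12, where two of three loaded values were zero
```

A real profiler gathers these load observations via binary instrumentation rather than an explicit list, and additionally records calling contexts.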
DOI: 10.1109/SC41405.2020.00033 · Published 2020-11-01
Citations: 7
TOSS-2020: A Commodity Software Stack for HPC
E. León, T. D'Hooge, Nathan Hanford, I. Karlin, R. Pankajakshan, Jim Foraker, C. Chambreau, M. Leininger
The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users, including platform-specific software that tends to stagnate over the lifetime of the system. In this paper, we present the Tri-Laboratory Operating System Stack (TOSS), a production simulation environment based on Linux and open-source software, with proprietary software components integrated as needed. TOSS, focused on mid-to-large-scale commodity HPC systems, provides a common simulation environment across system architectures, reduces the learning curve on new systems, and benefits from a lineage of past experience and bug fixes. To further the scope and applicability of TOSS, we demonstrate its feasibility and effectiveness on a leadership-class supercomputer architecture. Our evaluation, relative to the vendor stack, includes an analysis of resource manager complexity, system noise, networking, and application performance.
DOI: 10.1109/SC41405.2020.00044 · Published 2020-11-01
Citations: 1
CCAMP: An Integrated Translation and Optimization Framework for OpenACC and OpenMP
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP’s device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.
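The directive-translation half of CCAMP can be illustrated with a toy source-to-source mapping; CCAMP itself operates inside a compiler framework, and the table below covers just one common OpenACC-to-OpenMP correspondence:

```python
# Toy textual translation of one OpenACC directive to its OpenMP
# offloading counterpart (illustrative; not CCAMP's implementation).
ACC_TO_OMP = {
    "#pragma acc parallel loop": "#pragma omp target teams distribute parallel for",
}

def translate_line(line):
    stripped = line.lstrip()
    for acc, omp in ACC_TO_OMP.items():
        if stripped.startswith(acc):
            return line.replace(acc, omp, 1)
    return line  # non-directive lines pass through unchanged

translated = translate_line("#pragma acc parallel loop")
```

Real translation must also map clauses (e.g., data-movement and reduction clauses), which is where directive-level semantics diverge between the two standards.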
DOI: 10.1109/SC41405.2020.00102 · Published 2020-11-01
Citations: 12
Iris: Allocation Banking and Identity and Access Management for the Exascale Era
Gabor Torok, Mark R. Day, R. Hartman-Baker, C. Snavely
Without a reliable and scalable system for managing authorized users and ensuring they receive their allocated share of computational and storage resources, modern HPC centers would not be able to function. Exascale will amplify these demands with greater machine scale, more users, higher job throughput, and ever-increasing need for management insight and automation throughout the HPC environment. When our legacy system reached retirement age, NERSC took the opportunity to design and build Iris not only to meet our current needs, with 8,000 users and tens of thousands of jobs per day, but also to scale well into the exascale era. In this paper, we describe how we have designed Iris to meet these needs and discuss its key features as well as our implementation experience.
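The allocation-banking idea, projects receive awards of compute time and jobs debit against them, can be sketched as a small ledger (illustrative only; Iris's actual data model and API are not described in this abstract, and the project name is hypothetical):

```python
class AllocationBank:
    """Minimal allocation-banking ledger: grants add node-hours to a
    project's balance, charges debit them and reject overdrafts."""

    def __init__(self):
        self.balance = {}  # project -> remaining node-hours

    def grant(self, project, node_hours):
        self.balance[project] = self.balance.get(project, 0.0) + node_hours

    def charge(self, project, node_hours):
        remaining = self.balance.get(project, 0.0)
        if node_hours > remaining:
            raise RuntimeError(f"{project}: insufficient allocation")
        self.balance[project] = remaining - node_hours

bank = AllocationBank()
bank.grant("m1234", 1000.0)   # project receives an award
bank.charge("m1234", 250.0)   # a completed job debits its usage
```

At exascale job rates, the hard part is making such charging transactional and queryable for thousands of concurrent users, which is what a production system must add on top of this sketch.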
DOI: 10.1109/SC41405.2020.00046 · Published 2020-11-01
Citations: 3
Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale
Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, Bo Li
Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance, as placement constraints only provide qualitative scheduling guidelines and minimizing constraint violations does not necessarily result in optimal performance. In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action space. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. Large-scale EC2 deployment shows that, compared with traditional constraint-based schedulers, Metis improves throughput by up to 61%, optimizes various performance metrics, and easily scales to a large cluster where 3K containers run on over 700 machines.
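The hierarchical decomposition can be sketched as a two-level choice, rack first and then a machine inside it, which shrinks the decision space from racks × machines to racks + machines per placement. The greedy scoring below is only a stand-in for the paper's trained RL policies:

```python
def place_container(load, racks):
    """Two-level placement sketch. racks is a list of
    {'name': str, 'machines': [{'name': str, 'load': float}, ...]}."""
    # Level 1: pick the rack whose busiest machine is least loaded.
    rack = min(racks, key=lambda r: max(m["load"] for m in r["machines"]))
    # Level 2: pick the least-loaded machine within that rack.
    machine = min(rack["machines"], key=lambda m: m["load"])
    machine["load"] += load
    return machine["name"]

racks = [
    {"name": "r0", "machines": [{"name": "m0", "load": 0.9},
                                {"name": "m1", "load": 0.2}]},
    {"name": "r1", "machines": [{"name": "m2", "load": 0.4},
                                {"name": "m3", "load": 0.3}]},
]
chosen = place_container(0.1, racks)  # avoids r0, whose m0 is hot
```

In Metis, each level's choice is made by a learned policy that also accounts for interference and dependencies, not just load.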
DOI: 10.1109/SC41405.2020.00072 · Published 2020-11-01
Citations: 26
Cell-List based Molecular Dynamics on Many-Core Processors: A Case Study on Sunway TaihuLight Supercomputer
Xiaohui Duan, Ping Gao, Meng Zhang, Tingjian Zhang, Hongsong Meng, Yuxuan Li, B. Schmidt, H. Fu, L. Gan, W. Xue, Weiguo Liu, Guangwen Yang
Molecular dynamics (MD) simulations are playing an increasingly important role in several research areas. The most frequently used potentials in MD simulations are pair-wise potentials. Due to the memory wall, computing pair-wise potentials on many-core processors is usually memory bound. In this paper, we take the SW26010 processor as an exemplary platform to explore the possibility of breaking the memory bottleneck by improving data reuse via cell-list-based methods. We use cell lists instead of neighbor lists in the potential computation and apply a number of novel optimization methods. These methods include an adaptive replica arrangement strategy, a parameter profile data structure, and a particle-cell cutoff checking filter. An incremental cell-list building method is also realized to accelerate the construction of cell lists. Furthermore, we have established an open-source standalone framework, ESMD, featuring the techniques above. Experiments show that ESMD is 50–170% faster than previous ports on a single node, and can scale to 1,024 nodes with a weak-scaling efficiency of 95%.
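A cell list in its simplest form bins particles into cells whose side is at least the interaction cutoff, so every interaction partner of a particle lies in its own or a neighboring cell. A minimal cubic-box sketch (not ESMD's optimized layout):

```python
def build_cell_list(positions, box, cutoff):
    """Map each particle index into a 3-D cell of side >= cutoff,
    assuming a cubic periodic box of length `box`."""
    ncell = max(1, int(box // cutoff))   # cells per dimension
    side = box / ncell
    cells = {}
    for i, (x, y, z) in enumerate(positions):
        key = (int(x // side) % ncell,
               int(y // side) % ncell,
               int(z // side) % ncell)
        cells.setdefault(key, []).append(i)
    return cells

cells = build_cell_list(
    [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (4.5, 4.5, 4.5)],
    box=5.0, cutoff=1.0)
# particles 0 and 1 share cell (0, 0, 0); particle 2 is alone in (4, 4, 4)
```

Compared with neighbor lists, this trades some extra distance checks within the 27-cell neighborhood for far better data reuse, which is exactly the memory-bandwidth trade-off the paper exploits.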
DOI: 10.1109/SC41405.2020.00026 · Published 2020-11-01
Citations: 1
Term Quantization: Furthering Quantization at Run Time
H. T. Kung, Bradley McDanel, S. Zhang
We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time to improve the computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on the power-of-two terms in the expressions of values. In computing a dot product, TQ dynamically selects a fixed number of the largest terms to use from the values of the two vectors. By exploiting the weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between 3× and 10×) compared to conventional uniform quantization at the same level of model performance.
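The per-value term budget can be illustrated on non-negative integers: keep only the k largest power-of-two terms of each operand. This is a toy version; TQ applies the budget across the operands of a dot product at run time:

```python
def keep_top_terms(value, k):
    """Approximate a non-negative integer by its k largest
    power-of-two terms (its k most significant set bits)."""
    terms = [1 << b for b in range(value.bit_length()) if (value >> b) & 1]
    return sum(sorted(terms, reverse=True)[:k])

approx = keep_top_terms(0b10110110, 2)  # 182 = 128+32+16+4+2 -> 128+32 = 160
```

Because DNN value distributions concentrate magnitude in the leading terms, dropping the low-order terms changes results little while cutting the number of term-pair operations a systolic array must perform.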
DOI: 10.1109/SC41405.2020.00100 · Published 2020-11-01
Citations: 6
Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
Saurabh Jha, Shengkun Cui, Subho Sankar Banerjee, Tianyin Xu, J. Enos, M. Showerman, T. Kalbarczyk, R. Iyer
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource-overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause indicative of the failure in near real time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
DOI: 10.1109/SC41405.2020.00069 · Published 2020-11-01
Citations: 6
Scalable Knowledge Graph Analytics at 136 Petaflop/s
R. Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar, R. Patton, R. Vuduc, T. Potok
We are motivated by newly proposed methods for mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we present a new high-performance algorithm and implementation of the Floyd-Warshall algorithm for distributed-memory parallel computers accelerated by GPUs, which we call DSNAPSHOT (Distributed Accelerated Semiring All-Pairs Shortest Path). For our largest experiments, we ran DSNAPSHOT on a connected input graph with millions of vertices using 4,096 nodes (24,576 GPUs) of the Oak Ridge National Laboratory's Summit supercomputer system. We find that DSNAPSHOT achieves a sustained performance of 136×10^15 floating-point operations per second (136 petaflop/s) at a parallel efficiency of 90% under weak scaling and, in absolute speed, 70% of the best possible performance for our computation (in the single-precision tropical semiring, or "min-plus" algebra). Looking forward, we believe this novel capability will enable the mining of scholarly knowledge corpora when embedded and integrated into artificial intelligence-driven natural language processing workflows at scale.
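The underlying kernel is Floyd-Warshall over the min-plus ("tropical") semiring, in which multiplication becomes addition and addition becomes minimum. A serial reference version, which DSNAPSHOT distributes and GPU-accelerates in blocked form:

```python
INF = float("inf")

def apsp_min_plus(dist):
    """Serial Floyd-Warshall in the min-plus semiring. dist is an
    n-by-n adjacency matrix, INF for missing edges; updated in place."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            for j in range(n):
                if dik + dist[k][j] < dist[i][j]:
                    dist[i][j] = dik + dist[k][j]
    return dist

g = apsp_min_plus([
    [0.0, 3.0, INF],
    [INF, 0.0, 1.0],
    [2.0, INF, 0.0],
])  # g[0][2] becomes 4.0 via the path 0 -> 1 -> 2
```

In concept-graph mining, the resulting distances rank how strongly two annotated concepts are connected across the corpus.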
DOI: 10.1109/SC41405.2020.00010 · Published 2020-11-01
Citations: 5
Massive Parallelization for Finding Shortest Lattice Vectors Based on Ubiquity Generator Framework
Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa
Lattice-based cryptography has received attention as a next-generation encryption technique because it is believed to be secure against attacks by both classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In cryptography, estimating the hardness of the SVP through high-performance computing is becoming significantly more important for determining security levels. In this study, we develop the world's first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It parallelizes algorithms for solving the SVP by applying the Ubiquity Generator framework, a generic framework for branch-and-bound algorithms. The MAP-SVP is suitable for massive-scale parallelization owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate the performance and scalability of the MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.
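To make the problem concrete, here is a toy illustration of the SVP (not the MAP-SVP solver, which uses parallel branch-and-bound): exhaustively enumerate small integer combinations of the basis vectors and keep the shortest nonzero lattice vector. The coefficient bound is a simplifying assumption for illustration; real solvers use enumeration with pruning or sieving to cope with the exponential search space.

```python
# Brute-force shortest vector problem (SVP) on a small lattice basis:
# search all coefficient vectors in [-bound, bound]^n and return the
# shortest nonzero integer combination of the basis vectors.
from itertools import product

def shortest_vector(basis, bound=3):
    n = len(basis)          # number of basis vectors
    dim = len(basis[0])     # ambient dimension
    best, best_norm2 = None, None
    for coeffs in product(range(-bound, bound + 1), repeat=n):
        if all(c == 0 for c in coeffs):
            continue        # the zero vector is excluded by definition
        v = [sum(c * basis[i][j] for i, c in enumerate(coeffs))
             for j in range(dim)]
        norm2 = sum(x * x for x in v)   # squared Euclidean length
        if best_norm2 is None or norm2 < best_norm2:
            best, best_norm2 = v, norm2
    return best, best_norm2
```

For instance, on the basis {(1, 2), (2, 3)} the search finds a lattice vector of squared length 1 (e.g. -3·(1, 2) + 2·(2, 3) = (1, 0)), even though both basis vectors are longer. The cost grows exponentially with the lattice dimension, which is why estimating SVP hardness at cryptographic sizes requires high-performance computing.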
{"title":"Massive Parallelization for Finding Shortest Lattice Vectors Based on Ubiquity Generator Framework","authors":"Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa","doi":"10.1109/SC41405.2020.00064","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00064","url":null,"abstract":"Lattice-based cryptography has received attention as a next-generation encryption technique, because it is believed to be secure against attacks by classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In the cryptography, to determine security levels, it is becoming significantly more important to estimate the hardness of the SVP by high-performance computing. In this study, we develop the world’s first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It can parallelize algorithms for solving the SVP by applying the Ubiquity Generator framework, which is a generic framework for branch-and-bound algorithms. The MAP-SVP is suitable for massive-scale parallelization, owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate its performance and scalability of the MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128735703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Journal
SC20: International Conference for High Performance Computing, Networking, Storage and Analysis