
Latest publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

A 1024-Member Ensemble Data Assimilation with 3.5-Km Mesh Global Weather Simulations
H. Yashiro, K. Terasaki, Yuta Kawai, Shuhei Kudo, T. Miyoshi, Toshiyuki Imamura, K. Minami, Hikaru Inoue, T. Nishiki, Takayuki Saji, M. Satoh, H. Tomita
Numerical weather prediction (NWP) supports our daily lives. Weather models require higher spatiotemporal resolutions to prepare for extreme weather disasters and reduce the uncertainty of predictions. The accuracy of the initial state of the weather simulation is also critical; thus, we need more advanced data assimilation (DA) technology. By combining resolution and ensemble size, we have achieved the world’s largest weather DA experiment using a global cloud-resolving model and an ensemble Kalman filter method. The number of grid points was $\sim$4.4 trillion, and 1.3 PiB of data was passed from the model simulation part to the DA part. We adopted a data-centric application design and approximate computing to speed up the overall system of DA. Our DA system, named NICAM-LETKF, scales to 131,072 nodes (6,291,456 cores) of the supercomputer Fugaku with a sustained performance of 29 PFLOPS and 79 PFLOPS for the simulation and DA parts, respectively.
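For readers unfamiliar with the core algorithm, the sketch below shows a minimal stochastic ensemble Kalman filter analysis step in NumPy. It is a toy illustration of the filter family LETKF belongs to, not the NICAM-LETKF implementation: the state and observation sizes are invented, and real systems add localization, inflation, and the transform-based (ETKF) formulation.

```python
import numpy as np

def enkf_update(X, y, H, R, rng):
    """Stochastic ensemble Kalman filter analysis step.

    X : (n, m) ensemble of m forecast states of dimension n
    y : (p,)   observation vector
    H : (p, n) linear observation operator
    R : (p, p) observation-error covariance
    Returns the analysis ensemble, shape (n, m).
    """
    n, m = X.shape
    Xm = X.mean(axis=1, keepdims=True)            # ensemble mean
    A = X - Xm                                    # state anomalies
    HX = H @ X
    HA = HX - HX.mean(axis=1, keepdims=True)      # observed anomalies
    # Ensemble-estimated covariances (no localization in this sketch)
    P_HT = A @ HA.T / (m - 1)                     # P H^T
    S = HA @ HA.T / (m - 1) + R                   # H P H^T + R
    K = P_HT @ np.linalg.solve(S, np.eye(len(y)))  # Kalman gain
    # Perturbed observations: one noisy draw per ensemble member
    Y = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=m).T
    return X + K @ (Y - HX)

rng = np.random.default_rng(0)
m, n, p = 1024, 40, 10        # 1024 members as in the paper; toy state/obs sizes
X = rng.normal(size=(n, m))
H = np.eye(p, n)
R = 0.1 * np.eye(p)
y = rng.normal(size=p)
print(enkf_update(X, y, H, R, rng).shape)
```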
Citations: 17
GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability
G. Ostrouchov, Don E. Maxwell, R. Ashraf, C. Engelmann, M. Shankar, James H. Rogers
The Cray XK7 Titan was the top supercomputer system in the world for a long time and remained critically important throughout its nearly seven-year life. It was an interesting machine from a reliability viewpoint, as most of its power came from 18,688 GPUs that went through three rework cycles: two on the GPU mechanical assembly and one on the GPU circuit boards. We describe the last rework cycle and a reliability analysis of over 100,000 years of aggregate GPU lifetime data from Titan’s six-year productive period. Using time-between-failures analysis and statistical survival analysis techniques, we find that GPU reliability depends on heat dissipation to an extent that strongly correlates with detailed nuances of the cooling architecture and job scheduling. We describe the history, data collection, cleaning, and analysis, and give recommendations for future supercomputing systems. We make the data and our analysis codes publicly available.
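A minimal sketch of the kind of estimate such a study rests on: the Kaplan-Meier product-limit survival estimator, which handles censored lifetimes (GPUs still healthy when decommissioned). The data here are invented; the paper's analysis is far richer.

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival curve.

    durations : array of lifetimes (e.g., GPU service hours)
    observed  : 1 if the unit failed at `duration`, 0 if censored
                (e.g., still healthy when decommissioned)
    Returns (event_times, survival_probabilities).
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    times = np.unique(durations[observed == 1])
    S, surv = 1.0, []
    for t in times:
        at_risk = np.sum(durations >= t)               # units still under observation
        deaths = np.sum((durations == t) & (observed == 1))
        S *= 1.0 - deaths / at_risk                    # product-limit step
        surv.append(S)
    return times, np.array(surv)

# Toy data: five GPUs, two censored (still alive at end of study)
t, s = kaplan_meier([100, 250, 250, 400, 500], [1, 1, 0, 1, 0])
print(dict(zip(t, s.round(3))))
```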
Citations: 10
Experimental Evaluation of NISQ Quantum Computers: Error Measurement, Characterization, and Implications
Tirthak Patel, Abhay Potharaju, Baolin Li, Rohan Basu Roy, Devesh Tiwari
Noisy Intermediate-Scale Quantum (NISQ) computers are being increasingly used for executing early-stage quantum programs to establish the practical realizability of existing quantum algorithms. These quantum programs have use cases in the realm of high-performance computing, ranging from molecular chemistry and physics simulations to addressing NP-complete optimization problems. However, NISQ devices are prone to multiple types of errors, which affect the fidelity and reproducibility of program execution. As the technology is still primitive, our understanding of these quantum machines and their error characteristics is limited. To bridge that understanding gap, this is the first work to provide a systematic and rich experimental evaluation of IBM Quantum Experience (QX) quantum computers of different scales and topologies. Our experimental evaluation uncovers multiple important and interesting aspects of benchmarking and evaluating quantum programs on NISQ machines. We have open-sourced our experimental framework and dataset to help accelerate the evaluation of quantum computing systems.
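As a minimal illustration of error measurement on a NISQ device, the sketch below estimates a program's error rate from measurement counts, with a binomial confidence interval to capture run-to-run spread. The counts, shot total, and circuit are hypothetical, and the method is generic rather than the paper's specific protocol.

```python
import math

def error_rate(counts, ideal="000"):
    """Estimate a program's error rate from measurement counts.

    counts : dict mapping measured bitstrings to shot counts,
             as a NISQ backend typically returns
    ideal  : the bitstring a noise-free execution would produce
    Returns (error_rate, 95%-confidence half-width).
    """
    shots = sum(counts.values())
    p_err = 1.0 - counts.get(ideal, 0) / shots
    # Normal-approximation binomial confidence interval
    half = 1.96 * math.sqrt(p_err * (1.0 - p_err) / shots)
    return p_err, half

# Hypothetical counts from 8192 shots of a 3-qubit GHZ-style circuit;
# for a GHZ state both 000 and 111 are ideal, so fold them together.
counts = {"000": 3900, "111": 3700, "010": 300, "101": 292}
p_err, ci = error_rate({"000": counts["000"] + counts["111"],
                        "010": counts["010"], "101": counts["101"]}, "000")
print(f"error rate = {p_err:.3f} +/- {ci:.3f}")
```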
Citations: 19
Taming I/O Variation on QoS-Less HPC Storage: What Can Applications Do?
Zhenbo Qiao, Qing Liu, N. Podhorszki, S. Klasky, Jieyang Chen
As high-performance computing (HPC) is scaled up to exascale to accommodate new modeling and simulation needs, I/O has continued to be a major bottleneck in end-to-end scientific processes. Nevertheless, prior work in this area mostly aimed to maximize average performance, and there has been a lack of studies and solutions that can manage I/O performance variation on HPC systems. This work aims to take advantage of storage characteristics and explore application-level solutions that are interference-aware. In particular, we monitor the performance of data analytics and estimate the state of shared storage resources using the discrete Fourier transform (DFT). If heavy I/O interference is predicted to occur at a given timestep, data analytics can dynamically adapt to the environment by lowering accuracy and performing partial or no augmentation from the shared storage, dictated by an augmentation-bandwidth plot. We evaluate three data analytics codes, XGC, GenASiS, and Jet, on Chameleon, and quantitatively demonstrate that both the average and the variation of I/O performance can be vastly improved using our dynamic augmentation, with the mean and variance improved by as much as 67% and 96%, respectively, while maintaining acceptable data analysis outcomes.
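To make the DFT-based estimation concrete, here is a sketch that forecasts a shared-storage bandwidth trace by keeping its dominant frequency components and extrapolating them forward. The trace, component count, and sizes are invented; the paper's predictor and its coupling to augmentation decisions are more involved.

```python
import numpy as np

def predict_bandwidth(history, horizon=1, keep=4):
    """Forecast bandwidth by extrapolating the strongest DFT components.

    history : 1-D array of observed bandwidth samples, one per timestep
    horizon : how many future timesteps to predict
    keep    : number of largest-magnitude frequency components to retain
    """
    n = len(history)
    spectrum = np.fft.rfft(history)
    # Zero out everything but the `keep` largest-magnitude components
    spectrum[np.argsort(np.abs(spectrum))[:-keep]] = 0.0
    freqs = np.fft.rfftfreq(n)
    t = np.arange(n, n + horizon)
    pred = np.zeros(horizon)
    for k in np.nonzero(spectrum)[0]:
        amp, phase = np.abs(spectrum[k]) / n, np.angle(spectrum[k])
        # DC and Nyquist bins appear once; all others twice in a real signal
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        pred += scale * amp * np.cos(2 * np.pi * freqs[k] * t + phase)
    return pred

# Toy trace: interference burst every 8 steps on top of a 2 GB/s baseline
trace = (2.0 + 1.5 * (np.arange(64) % 8 == 0)
         + 0.05 * np.random.default_rng(1).normal(size=64))
print(predict_bandwidth(trace, horizon=8).round(2))
```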
Citations: 5
CAB-MPI: Exploring Interprocess Work-Stealing towards Balanced MPI Communication
Kaiming Ouyang, Min Si, A. Hori, Zizhong Chen, P. Balaji
Load balance is essential for high-performance applications. Unbalanced communication can cause severe performance degradation, even in computation-balanced BSP applications. Designing communication-balanced applications is challenging, however, because of the diverse communication implementations in the underlying runtime system. In this paper, we address this challenge through an interprocess work-stealing scheme based on process-memory-sharing techniques. We present CAB-MPI, an MPI implementation that can identify idle processes inside MPI and use these idle resources to dynamically balance the communication workload on a node. We design throughput-optimized strategies to ensure efficient stealing of data movement tasks. We demonstrate the benefit of work stealing through several internal processes in MPI, including intranode data transfer, pack/unpack for noncontiguous communication, and computation in one-sided accumulates. The implementation is evaluated through a set of microbenchmarks and proxy applications on Intel Xeon and Xeon Phi platforms.
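The sketch below illustrates the work-stealing idea itself, using Python threads with per-worker deques: each worker pops its own tasks LIFO and, when idle, steals FIFO from a random victim. CAB-MPI does this between MPI processes over shared process memory with throughput-optimized stealing; everything here is a simplified stand-in.

```python
import collections, random, threading, time

class Worker(threading.Thread):
    """Worker with a private deque; steals from peers when idle."""
    def __init__(self, wid, workers, results):
        super().__init__()
        self.wid, self.workers, self.results = wid, workers, results
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)

    def _pop_own(self):
        with self.lock:
            return self.deque.pop() if self.deque else None       # LIFO locally

    def _steal(self):
        victim = random.choice([w for w in self.workers if w is not self])
        with victim.lock:
            return victim.deque.popleft() if victim.deque else None  # FIFO steal

    def run(self):
        idle_tries = 0
        while idle_tries < 50:                # give up after ~50 ms of idleness
            task = self._pop_own() or self._steal()
            if task is None:
                idle_tries += 1
                time.sleep(0.001)
                continue
            idle_tries = 0
            self.results.append((self.wid, task()))

results, workers = [], []
for i in range(4):
    workers.append(Worker(i, workers, results))
for n in range(40):                           # deliberately unbalanced load:
    workers[0].push(lambda n=n: n * n)        # all tasks start on worker 0
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results), "tasks done; per-worker counts:",
      collections.Counter(w for w, _ in results))
```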
Citations: 5
Runtime-Guided ECC Protection using Online Estimation of Memory Vulnerability
Luc Jaulmes, Miquel Moretó, M. Valero, M. Erez, Marc Casas
Diminishing reliability of semiconductor technologies and decreasing power budgets per component hinder the design of next-generation high performance computing (HPC) systems. Both constraints strongly impact memory subsystems, as DRAM main memory accounts for up to 30 to 50 percent of a node’s overall power consumption and is the subsystem most subject to faults. Improving reliability requires stronger error correcting codes (ECCs), which incur additional power and storage costs. It is critical to develop strategies to uphold memory reliability while minimising these costs, with the goal of improving the power efficiency of computing machines. We introduce a methodology to dynamically estimate the vulnerability of data and adjust ECC protection accordingly. Our methodology relies on information readily available to runtime systems in task-based dataflow programming models, and on existing Virtualized Error Correcting Code (VECC) schemes, to provide adaptable protection. Guiding VECC using vulnerability estimates offers a wide range of reliability-redundancy trade-offs: as reliable as using expensive offline profiling for guidance, and up to 25% safer than VECC without guidance. Runtime-guided VECC is also more efficient than a stronger uniform ECC, reducing DIMM lifetime failure from 1.84% down to 1.26% while increasing DRAM energy consumption by only 1.03×.
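A toy version of runtime-guided protection might score each data region by its exposure (bytes × time resident between production and last consumption) and assign stronger ECC only above a threshold. The metric, region sizes, and two-level ECC choice below are illustrative assumptions, not the paper's actual VECC policy.

```python
# Toy vulnerability metric in the spirit of runtime-guided protection:
# exposure = bytes x time the data sits in memory before being consumed.
# All numbers and the two-level ECC choice are illustrative assumptions.

def vulnerability(bytes_resident, t_written, t_consumed):
    """Byte-seconds of exposure between production and last consumption."""
    return bytes_resident * (t_consumed - t_written)

regions = {                      # name: (bytes, write time, last read time)
    "matrix_A":   (512 << 20, 0.0, 30.0),
    "halo_buf":   (16 << 20, 5.0, 5.5),
    "checkpoint": (2 << 30, 10.0, 600.0),
}

THRESHOLD = 1e10                 # byte-seconds; a tuning knob of the runtime
for name, (size, tw, tr) in regions.items():
    v = vulnerability(size, tw, tr)
    ecc = "strong (e.g., chipkill-level)" if v > THRESHOLD else "baseline SEC-DED"
    print(f"{name:10s}  exposure={v:.2e} B*s  ->  {ecc}")
```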
Citations: 1
Accelerating Large-Scale Excited-State GW Calculations on Leadership HPC Systems
M. D. Ben, Charlene Yang, Zhenglu Li, F. Jornada, S. Louie, J. Deslippe
Large-scale GW calculations are the state-of-the-art approach to accurately describe many-body excited-state phenomena in complex materials. This is critical for novel device design, but due to their extremely high computational cost, these calculations often run at a limited scale. In this paper, we present algorithm and implementation advancements made in the materials science code BerkeleyGW to scale calculations to the order of 10,000 electrons and beyond, utilizing the entire Summit system at OLCF. Excellent strong and weak scaling is observed, and a 105.9 PFLOP/s double-precision performance is achieved on 27,648 V100 GPUs, reaching 52.7% of peak. This work demonstrates for the first time the possibility of performing GW calculations at such scale within minutes on current HPC systems, and leads the way for future efficient HPC software development in the materials, physical, chemical, and engineering sciences.
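As a quick sanity check on the headline numbers, the quoted sustained rate and fraction of peak imply an aggregate peak and a per-GPU double-precision peak; the short calculation below backs them out. The per-GPU figure is derived from the quoted numbers, not an independently stated spec.

```python
# Back out the aggregate peak implied by the quoted figures:
# 105.9 PFLOP/s sustained at 52.7% of peak on 27,648 V100 GPUs.
sustained_pflops = 105.9
fraction_of_peak = 0.527
n_gpus = 27_648

peak_pflops = sustained_pflops / fraction_of_peak
per_gpu_tflops = peak_pflops * 1000 / n_gpus
print(f"implied aggregate peak: {peak_pflops:.1f} PFLOP/s")
print(f"implied per-GPU double-precision peak: {per_gpu_tflops:.2f} TFLOP/s")
```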
Citations: 18
Tuning Floating-Point Precision Using Dynamic Program Information and Temporal Locality
Hugo Brunie, Costin Iancu, K. Ibrahim, P. Brisk, B. Cook
We present a methodology for precision tuning of full applications. Such techniques must select a search space composed of either variables or instructions and provide a scalable search strategy. In full-application settings one cannot, for practical reasons, assume compiler support; thus, an additional important challenge is enabling code refactoring. We argue for an instruction-based search space, and we show: 1) how to exploit dynamic program information based on call stacks; and 2) how to exploit the iterative nature of scientific codes, combined with temporal locality. We applied the methodology to tune the implementation of scientific codes written in a combination of Python, CUDA, C++, and Fortran, tuning calls to exp math library functions. The iterative search refinement always reduces the search complexity and the number of steps to a solution, and dynamic program information increases search efficacy. Using this approach, we obtain application runtime performance improvements of up to 27%.
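A minimal sketch of the search idea: greedily try demoting each exp call site to single precision and keep a demotion only if an end-to-end accuracy check passes. The kernel, call-site enumeration, and tolerance are invented; the paper's search additionally exploits call-stack context and temporal locality.

```python
import math
import numpy as np

def kernel(x, single_sites):
    """Toy kernel with two exp call sites; each may run in float32."""
    def exp_at(site, v):
        if site in single_sites:                  # demoted call site
            return float(np.float32(math.exp(np.float32(v))))
        return math.exp(v)                        # full double precision
    a = exp_at(0, x)                              # call site 0
    b = exp_at(1, -x * x)                         # call site 1
    return a * b + a

def tune(x, tol=1e-6):
    """Greedy instruction-level precision tuning against a tolerance."""
    reference = kernel(x, set())                  # all-double baseline
    demoted = set()
    for site in (0, 1):                           # search space: call sites
        trial = demoted | {site}
        err = abs(kernel(x, trial) - reference) / abs(reference)
        if err <= tol:                            # accept if accuracy holds
            demoted = trial
    return demoted

print("call sites safely demoted to single precision:", tune(1.7))
```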
Citations: 3
SEFEE: Lightweight Storage Error Forecasting in Large-Scale Enterprise Storage Systems
Amirhessam Yazdi, Xing Lin, Lei Yang, Feng Yan
With the rapid growth in scale and complexity, today’s enterprise storage systems need to deal with significant numbers of errors. Existing proactive methods mainly focus on machine learning techniques trained on SMART measurements. However, such methods are usually expensive to use in practice and can only be applied to a limited set of error types at limited scale. We collected more than 23 million storage events from 87 deployed NetApp-ONTAP systems managing 14,371 disks over two years, and propose a lightweight, training-free storage error forecasting method, SEFEE. SEFEE employs tensor decomposition to directly analyze storage error-event logs and perform online error prediction for all error types in all storage nodes. SEFEE exploits hidden spatio-temporal information that is deeply embedded at the global scale of storage systems to achieve record-breaking error forecasting accuracy with minimal prediction overhead.
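To illustrate the tensor-decomposition machinery, the sketch below computes a rank-R CP decomposition of a 3-way tensor by alternating least squares. Treating the tensor as (node × error-type × time) is an assumption made for illustration; SEFEE's actual formulation and prediction step may differ.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product of U (I x R) and V (J x R)."""
    return np.einsum('ir,jr->ijr', U, V).reshape(U.shape[0] * V.shape[0], -1)

def cp_als(X, rank, iters=100, seed=0):
    """Rank-R CP decomposition of a 3-way tensor via alternating least
    squares. X could be a (node x error-type x time) event-count tensor;
    the low-rank factors expose spatio-temporal structure.
    """
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.normal(size=(d, rank)) for d in (I, J, K))
    X1 = X.transpose(0, 2, 1).reshape(I, K * J)   # mode-1 unfolding
    X2 = X.transpose(1, 2, 0).reshape(J, K * I)   # mode-2 unfolding
    X3 = X.transpose(2, 1, 0).reshape(K, J * I)   # mode-3 unfolding
    for _ in range(iters):
        A = X1 @ khatri_rao(C, B) @ np.linalg.pinv((C.T @ C) * (B.T @ B))
        B = X2 @ khatri_rao(C, A) @ np.linalg.pinv((C.T @ C) * (A.T @ A))
        C = X3 @ khatri_rao(B, A) @ np.linalg.pinv((B.T @ B) * (A.T @ A))
    return A, B, C

# Toy check: recover a random rank-3 tensor
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.normal(size=(d, 3)) for d in (20, 6, 30))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=3)
Xhat = np.einsum('ir,jr,kr->ijk', A, B, C)
print("relative error:", np.linalg.norm(X - Xhat) / np.linalg.norm(X))
```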
Citations: 5
PREEMPT: Scalable Epidemic Interventions Using Submodular Optimization on Multi-GPU Systems
Marco Minutoli, Prathyush Sambaturu, M. Halappanavar, Antonino Tumeo, A. Kalyanaraman, A. Vullikanti
Preventing and slowing the spread of epidemics is achieved through techniques such as vaccination and social distancing. Given practical limits on the number of vaccines and the cost of administration, optimization becomes a necessity. Previous approaches using mathematical programming methods have been shown to be effective but are limited by computational costs. In this work, we present PREEMPT, a new approach for intervention via maximizing the influence of vaccinated nodes on the network. We prove submodular properties of our method’s objective function, which aids in the construction of an efficient greedy approximation strategy. Consequently, we present a new parallel algorithm based on greedy hill climbing for PREEMPT, along with an efficient parallel implementation for distributed CPU-GPU heterogeneous platforms. Our results demonstrate that PREEMPT is able to achieve a significant reduction (up to 6.75×) in the percentage of people infected and up to a 98% reduction in the peak of the infection on a city-scale network. We also show strong scaling results for PREEMPT on up to 128 nodes of the Summit supercomputer. Our parallel implementation significantly reduces time to solution, from hours to minutes on large networks. This work represents a first-of-its-kind effort in parallelizing greedy hill climbing and applying it to devising effective interventions for epidemics.
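The sketch below shows lazy (CELF-style) greedy maximization of a monotone submodular coverage function, the standard sequential skeleton behind greedy hill climbing for influence-style objectives. The toy contact network is invented, and PREEMPT's parallel multi-GPU algorithm goes well beyond this.

```python
import heapq

def lazy_greedy_cover(neigh, k):
    """CELF-style lazy greedy for monotone submodular coverage:
    marginal gains only shrink as the solution grows, so a stale
    bound that still tops the max-heap is guaranteed to be the argmax.

    neigh : dict node -> set of nodes it covers (itself + contacts)
    """
    covered, chosen = set(), []
    heap = [(-len(neigh[v]), v) for v in neigh]   # stale upper bounds
    heapq.heapify(heap)
    while len(chosen) < k and heap:
        _, v = heapq.heappop(heap)
        fresh = len(neigh[v] - covered)           # recompute lazily
        if heap and fresh < -heap[0][0]:
            heapq.heappush(heap, (-fresh, v))     # bound fell; reinsert
            continue
        if fresh == 0:
            continue
        chosen.append(v)
        covered |= neigh[v]
    return chosen, covered

# Toy instance: vaccinating a node protects it and its contacts
neigh = {0: {0, 1, 2, 3}, 1: {1, 4}, 2: {2, 5, 6},
         3: {3, 6, 7, 8}, 4: {0, 4}, 5: {5, 8, 9}}
chosen, covered = lazy_greedy_cover(neigh, k=3)
print("chosen:", chosen, "covered:", sorted(covered))
```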
Citations: 9