
Latest publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

TAGO: Rethinking Routing Design in High Performance Reconfigurable Networks
Min Yee Teh, Y. Hung, George Michelogiannakis, Shijia Yan, M. Glick, J. Shalf, K. Bergman
Many reconfigurable network topologies have been proposed in the past. However, efficient routing on top of these flexible interconnects still presents a challenge. In this work, we reevaluate key principles that have guided the designs of many routing protocols on static networks, and see how well those principles apply on reconfigurable network topologies. Based on a theoretical analysis of key properties that routing in a reconfigurable network should satisfy to maximize performance, we propose a topology-aware, globally-direct oblivious (TAGO) routing protocol for reconfigurable topologies. Our proposed routing protocol is simple in design and yet, when deployed in conjunction with a reconfigurable network topology, improves throughput by up to $2.2\times$ compared to established routing protocols and even comes within 10% of the throughput of impractical adaptive routing that has instant global congestion information.
Citations: 6
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
Tirthak Patel, Devesh Tiwari
Noisy Intermediate-Scale Quantum (NISQ) machines are being increasingly used to develop quantum algorithms and establish use cases for quantum computing. However, these devices are highly error-prone and produce output that can be far from the correct output of the quantum algorithm. In this paper, we propose VERITAS, an end-to-end approach toward designing quantum experiments, executing experiments, and correcting the outputs produced by quantum circuits after their execution such that the correct output of the quantum algorithm can be accurately estimated.
Citations: 23
Scaling the Hartree-Fock Matrix Build on Summit
Giuseppe M. J. Barca, David L. Poole, J. Vallejo, Melisa Alkan, C. Bertoni, Alistair P. Rendell, M. Gordon
Usage of Graphics Processing Units (GPU) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed for scaling on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balancing, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit computer, achieving good strong scaling performance up to 4096 nodes, and linear weak scaling up to 612 nodes.
Citations: 9
DRCCTPROF: A Fine-Grained Call Path Profiler for ARM-Based Clusters
Qidong Zhao, Xu Liu, Milind Chabbi
ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entry into the HPC paradigm, ARM lags in its software stack, especially in the performance tooling aspect. Notably, there is a lack of fine-grained measurement tools to analyze fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF — a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is to obtain full calling context at any and every machine instruction that executes, which provides more detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates any instruction with source code along the call path, but also associates memory access instructions back to the constituent data object. Finally, DRCCTPROF incurs moderate overhead and provides a compact view to visualize the profiles collected from parallel executions.
Citations: 2
Co-Design for A64FX Manycore Processor and “Fugaku”
M. Sato, Y. Ishikawa, H. Tomita, Yuetsu Kodama, Tetsuya Odajima, Miwako Tsuji, H. Yashiro, Masaki Aoki, Naoyuki Shida, Ikuo Miyoshi, Kouichi Hirai, Atsushi Furuya, A. Asato, K. Morita, T. Shimizu
We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named “Fugaku”. Together with our industry partner, Fujitsu, we designed an original manycore processor, the A64FX, based on the Armv8 instruction set with the Scalable Vector Extension (SVE), as well as a system including the interconnect and a storage subsystem. The “co-design” of the system and applications is the key to making it power-efficient and high-performance. We determined many architectural parameters based on an analysis of a set of target applications provided by the applications teams. In this paper, we present the pragmatic practice of our co-design effort for “Fugaku”. As a result, the system has proven to be very power-efficient, and the performance of some target applications using the whole system is confirmed to be more than 100 times that of the K computer.
Citations: 4
Herring: Rethinking the Parameter Server at Scale for the Cloud
Indu Thangakrishnan, D. Çavdar, C. Karakuş, Piyush Ghai, Yauheni Selivonchyk, Cory Pruce
Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a re-thinking of the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, which is designed to alleviate the performance bottlenecks in parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like $\mathrm{BERT}_{\mathrm{large}}$ using Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without accuracy drop.
Citations: 11
GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs
Abdul Rehman Anwer, Guanpeng Li, K. Pattabiraman, Michael B. Sullivan, Timothy Tsai, S. Hari
Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and memory architectures that differ from those of CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is 2 orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
Citations: 10
Processing Full-Scale Square Kilometre Array Data on the Summit Supercomputer
Ruonan Wang, R. Tobar, M. Dolensky, Tao An, A. Wicenec, Chen Wu, F. Dulwich, N. Podhorszki, V. Anantharaj, E. Suchyta, B. Lao, S. Klasky
This work presents a workflow for simulating and processing the full-scale low-frequency telescope data of the Square Kilometre Array (SKA) Phase 1. The SKA project will enter the construction phase soon, and once completed, it will be the world’s largest radio telescope and one of the world’s largest data generators. The authors used Summit to mimic an end-to-end SKA workflow, simulating a dataset of a typical 6-hour observation and then processing that dataset with an imaging pipeline. This workflow was deployed and run on 4,560 compute nodes, and used 27,360 GPUs to generate 2.6 PB of data. This was the first time that radio astronomical data were processed at this scale. Results show that the workflow has the capability to process one of the key SKA science cases, an Epoch of Reionization observation. This analysis also helps reveal critical design factors for the next-generation radio telescopes and the required dedicated processing facilities.
Citations: 8
Architecture and Performance Studies of 3D-Hyper-FleX-LION for Reconfigurable All-to-All HPC Networks
Gengchen Liu, R. Proietti, Marjan Fariborz, P. Fotouhi, Xian Xiao, S. Yoo
While the Fat-Tree network topology represents the dominant state-of-the-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes and with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, $3\times$ lower power consumption, $10\times$ fewer cable connections, and a $4\times$ reduction in the number of transceivers. When the bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to a $4\times$ improvement in energy efficiency for synthetic traffic workloads with high locality compared to Fat-Tree.
Citations: 13
Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance
Hansol Suh, T. Isaac
The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance algorithm is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with. We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.
Citations: 0