TAGO: Rethinking Routing Design in High Performance Reconfigurable Networks
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00029
Min Yee Teh, Y. Hung, George Michelogiannakis, Shijia Yan, M. Glick, J. Shalf, K. Bergman
Many reconfigurable network topologies have been proposed in the past. However, efficient routing on top of these flexible interconnects still presents a challenge. In this work, we reevaluate key principles that have guided the design of many routing protocols on static networks, and examine how well those principles apply to reconfigurable network topologies. Based on a theoretical analysis of key properties that routing in a reconfigurable network should satisfy to maximize performance, we propose a topology-aware, globally-direct oblivious (TAGO) routing protocol for reconfigurable topologies. Our proposed routing protocol is simple in design and yet, when deployed in conjunction with a reconfigurable network topology, improves throughput by up to $2.2\times$ compared to established routing protocols, and even comes within 10% of the throughput of impractical adaptive routing with instant global congestion information.
{"title":"TAGO: Rethinking Routing Design in High Performance Reconfigurable Networks","authors":"Min Yee Teh, Y. Hung, George Michelogiannakis, Shijia Yan, M. Glick, J. Shalf, K. Bergman","doi":"10.1109/SC41405.2020.00029","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00029","url":null,"abstract":"Many reconfigurable network topologies have been proposed in the past. However, efficient routing on top of these flexible interconnects still presents a challenge. In this work, we reevaluate key principles that have guided the designs of many routing protocols on static networks, and see how well those principles apply on reconfigurable network topologies. Based on a theoretical analysis of key properties that routing in a reconfigurable network should satisfy to maximize performance, we propose a topology-aware, globally-direct oblivious (TAGO) routing protocol for reconfigurable topologies. Our proposed routing protocol is simple in design and yet, when deployed in conjunction with a reconfigurable network topology, improves throughput by up to $2.2 times$ compared to established routing protocols and even comes within 10% of the throughput of impractical adaptive routing that has instant global congestion information.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117163283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00019
Tirthak Patel, Devesh Tiwari
Noisy Intermediate-Scale Quantum (NISQ) machines are increasingly used to develop quantum algorithms and establish use cases for quantum computing. However, these devices are highly error-prone and produce output that can be far from the correct output of the quantum algorithm. In this paper, we propose VERITAS, an end-to-end approach to designing quantum experiments, executing them, and correcting the outputs produced by quantum circuits after execution, so that the correct output of the quantum algorithm can be accurately estimated.
{"title":"VERITAS: Accurately Estimating the Correct Output on Noisy Intermediate-Scale Quantum Computers","authors":"Tirthak Patel, Devesh Tiwari","doi":"10.1109/SC41405.2020.00019","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00019","url":null,"abstract":"Noisy Intermediate-Scale Quantum (NISQ) machines are being increasingly used to develop quantum algorithms and establish use cases for quantum computing. However, these devices are highly error-prone and produce output, which can be far from the correct output of the quantum algorithm. In this paper, we propose VERITAS, an end-to-end approach toward designing quantum experiments, executing experiments, and correcting outputs produced by quantum circuits post their execution such that the correct output of the quantum algorithm can be accurately estimated.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126134600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scaling the Hartree-Fock Matrix Build on Summit
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00085
Giuseppe M. J. Barca, David L. Poole, J. Vallejo, Melisa Alkan, C. Bertoni, Alistair P. Rendell, M. Gordon
The use of Graphics Processing Units (GPUs) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers relying on GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed to scale on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balance, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit supercomputer, achieving good strong scaling up to 4,096 nodes and linear weak scaling up to 612 nodes.
{"title":"Scaling the Hartree-Fock Matrix Build on Summit","authors":"Giuseppe M. J. Barca, David L. Poole, J. Vallejo, Melisa Alkan, C. Bertoni, Alistair P. Rendell, M. Gordon","doi":"10.1109/SC41405.2020.00085","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00085","url":null,"abstract":"Usage of Graphics Processing Units (GPU) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed for scaling on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput and load balancing, and eliminates GPU thread divergence due to integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit computer, achieving good strong scaling performance up to 4096 nodes, and linear weak scaling up to 612 nodes.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130510659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DRCCTPROF: A Fine-Grained Call Path Profiler for ARM-Based Clusters
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00034
Qidong Zhao, Xu Liu, Milind Chabbi
ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entrant into HPC, however, ARM lags in its software stack, especially in performance tooling. Notably, there is a lack of fine-grained measurement tools for analyzing fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF, a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is to obtain the full calling context at any and every machine instruction that executes, which provides more detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates any instruction with source code along the call path, but also associates memory access instructions with the constituent data object. Finally, DRCCTPROF incurs moderate overhead and provides a compact view for visualizing the profiles collected from parallel executions.
{"title":"DRCCTPROF: A Fine-Grained Call Path Profiler for ARM-Based Clusters","authors":"Qidong Zhao, Xu Liu, Milind Chabbi","doi":"10.1109/SC41405.2020.00034","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00034","url":null,"abstract":"ARM is an attractive CPU architecture for exascale systems because of its energy efficiency. As a recent entry into the HPC paradigm, ARM lags in its software stack, especially in the performance tooling aspect. Notably, there is a lack of fine-grained measurement tools to analyze fully optimized HPC binary executables on ARM processors. In this paper, we introduce DRCCTPROF — a fine-grained call path profiling framework for binaries running on ARM architectures. The unique ability of DRCCTPROF is to obtain full calling context at any and every machine instruction that executes, which provides more detailed diagnostic feedback for performance optimization and correctness tools. Furthermore, DRCCTPROF not only associates any instruction with source code along the call path, but also associates memory access instructions back to the constituent data object. Finally, DRCCTPROF incurs moderate overhead and provides a compact view to visualize the profiles collected from parallel executions.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130514122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-Design for A64FX Manycore Processor and “Fugaku”
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00051
M. Sato, Y. Ishikawa, H. Tomita, Yuetsu Kodama, Tetsuya Odajima, Miwako Tsuji, H. Yashiro, Masaki Aoki, Naoyuki Shida, Ikuo Miyoshi, Kouichi Hirai, Atsushi Furuya, A. Asato, K. Morita, T. Shimizu
We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named “Fugaku”. Together with our industry partner, Fujitsu, we designed an original manycore processor based on the Armv8 instruction set with the Scalable Vector Extension (SVE), the A64FX processor, as well as a system including the interconnect and a storage subsystem. The “co-design” of the system and applications is key to making it power-efficient and high-performance. We determined many architectural parameters based on an analysis of a set of target applications provided by the application teams. In this paper, we present the pragmatic practice of our co-design effort for “Fugaku”. As a result, the system has proven to be very power-efficient, and we have confirmed that the performance of some target applications using the whole system is more than 100 times that of the K computer.
{"title":"Co-Design for A64FX Manycore Processor and ”Fugaku”","authors":"M. Sato, Y. Ishikawa, H. Tomita, Yuetsu Kodama, Tetsuya Odajima, Miwako Tsuji, H. Yashiro, Masaki Aoki, Naoyuki Shida, Ikuo Miyoshi, Kouichi Hirai, Atsushi Furuya, A. Asato, K. Morita, T. Shimizu","doi":"10.1109/SC41405.2020.00051","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00051","url":null,"abstract":"We have been carrying out the FLAGSHIP 2020 Project to develop the Japanese next-generation flagship supercomputer, the Post-K, recently named “Fugaku”. We have designed an original many core processor based on Armv8 instruction sets with the Scalable Vector Extension (SVE), an A64FX processor, as well as a system including interconnect and a storage subsystem with the industry partner, Fujitsu. The “co-design” of the system and applications is a key to making it power efficient and high performance. We determined many architectural parameters by reflecting an analysis of a set of target applications provided by applications teams. In this paper, we present the pragmatic practice of our co-design effort for “Fugaku”. As a result, the system has been proven to be a very power-efficient system, and it is confirmed that the performance of some target applications using the whole system is more than 100 times the performance of the K computer.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"315 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131963192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Herring: Rethinking the Parameter Server at Scale for the Cloud
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00048
Indu Thangakrishnan, D. Çavdar, C. Karakuş, Piyush Ghai, Yauheni Selivonchyk, Cory Pruce
Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a rethinking of the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, designed to alleviate the performance bottlenecks of parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like $\mathrm{BERT}_{\mathrm{large}}$ with Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without an accuracy drop.
{"title":"Herring: Rethinking the Parameter Server at Scale for the Cloud","authors":"Indu Thangakrishnan, D. Çavdar, C. Karakuş, Piyush Ghai, Yauheni Selivonchyk, Cory Pruce","doi":"10.1109/SC41405.2020.00048","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00048","url":null,"abstract":"Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. Recent developments in cloud networking technologies, however, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate a re-thinking of the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, which is designed to alleviate the performance bottlenecks in parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models like $mathrm{B}mathrm{E}mathrm{R}mathrm{T}_{mathrm{l}mathrm{a}mathrm{r}mathrm{g}mathrm{e}}$ using Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without accuracy drop.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132181981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00092
Abdul Rehman Anwer, Guanpeng Li, K. Pattabiraman, Michael B. Sullivan, Timothy Tsai, S. Hari
Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT, to analytically predict the Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FI. Unfortunately, TRIDENT is incompatible with GPU programs due to their high degree of parallelism and memory architectures that differ from those of CPUs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications. In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is two orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.
{"title":"GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs","authors":"Abdul Rehman Anwer, Guanpeng Li, K. Pattabiraman, Michael B. Sullivan, Timothy Tsai, S. Hari","doi":"10.1109/SC41405.2020.00092","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00092","url":null,"abstract":"Fault injection (FI) techniques are typically used to determine the reliability profiles of programs under soft errors. However, these techniques are highly resource- and time-intensive. Prior research developed a model, TRIDENT to analytically predict Silent Data Corruption (SDC, i.e., incorrect output without any indication) probabilities of single-threaded CPU applications without requiring FIs. Unfortunately, TRIDENT is incompatible with GPU programs, due to their high degree of parallelism and different memory architectures than CPU programs. The main challenge is that modeling error propagation across thousands of threads in a GPU kernel requires enormous amounts of data to be profiled and analyzed, posing a major scalability bottleneck for HPC applications.In this paper, we propose GPU-TRIDENT, an accurate and scalable technique for modeling error propagation in GPU programs. We find that GPU-TRIDENT is 2 orders of magnitude faster than FI-based approaches, and nearly as accurate in determining the SDC rate of GPU programs.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1756 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Processing Full-Scale Square Kilometre Array Data on the Summit Supercomputer
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00006
Ruonan Wang, R. Tobar, M. Dolensky, Tao An, A. Wicenec, Chen Wu, F. Dulwich, N. Podhorszki, V. Anantharaj, E. Suchyta, B. Lao, S. Klasky
This work presents a workflow for simulating and processing the full-scale low-frequency telescope data of the Square Kilometre Array (SKA) Phase 1. The SKA project will enter the construction phase soon, and once completed, it will be the world’s largest radio telescope and one of the world’s largest data generators. The authors used Summit to mimic an end-to-end SKA workflow, simulating a dataset for a typical 6-hour observation and then processing that dataset with an imaging pipeline. The workflow was deployed on 4,560 compute nodes and used 27,360 GPUs to generate 2.6 PB of data. This was the first time that radio astronomical data were processed at this scale. Results show that the workflow can process one of the key SKA science cases, an Epoch of Reionization observation. The analysis also reveals critical design factors for next-generation radio telescopes and the required dedicated processing facilities.
{"title":"Processing Full-Scale Square Kilometre Array Data on the Summit Supercomputer","authors":"Ruonan Wang, R. Tobar, M. Dolensky, Tao An, A. Wicenec, Chen Wu, F. Dulwich, N. Podhorszki, V. Anantharaj, E. Suchyta, B. Lao, S. Klasky","doi":"10.1109/SC41405.2020.00006","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00006","url":null,"abstract":"This work presents a workflow for simulating and processing the full-scale low-frequency telescope data of the Square Kilometre Array (SKA) Phase 1. The SKA project will enter the construction phase soon, and once completed, it will be the world’s largest radio telescope and one of the world’s largest data generators. The authors used Summit to mimic an endto-end SKA workflow, simulating a dataset of a typical 6 hour observation and then processing that dataset with an imaging pipeline. This workflow was deployed and run on 4,560 compute nodes, and used 27,360 GPUs to generate 2.6 PB of data. This was the first time that radio astronomical data were processed at this scale. Results show that the workflow has the capability to process one of the key SKA science cases, an Epoch of Reionization observation. This analysis also helps reveal critical design factors for the next-generation radio telescopes and the required dedicated processing facilities.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127208082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecture and Performance Studies of 3D-Hyper-FleX-LION for Reconfigurable All-to-All HPC Networks
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00030
Gengchen Liu, R. Proietti, Marjan Fariborz, P. Fotouhi, Xian Xiao, S. Yoo
While the Fat-Tree network topology represents the dominant state-of-the-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, $3\times$ lower power consumption, $10\times$ fewer cable connections, and a $4\times$ reduction in the number of transceivers. When the bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to a $4\times$ improvement in energy efficiency for synthetic traffic workloads with high locality compared to Fat-Tree.
{"title":"Architecture and Performance Studies of 3D-Hyper-FleX-LION for Reconfigurable All-to-All HPC Networks","authors":"Gengchen Liu, R. Proietti, Marjan Fariborz, P. Fotouhi, Xian Xiao, S. Yoo","doi":"10.1109/SC41405.2020.00030","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00030","url":null,"abstract":"While the Fat-Tree network topology represents the dominant state-of-art solution for large-scale HPC networks, its scalability in terms of power, latency, complexity, and cost is significantly challenged by the ever-increasing communication bandwidth among tens of thousands of heterogeneous computing nodes. We propose 3D-Hyper-FleX-LION, a flat hybrid electronic-photonic interconnect network that leverages the multichannel nature of modern multi-terabit switch ASICs (with 100 Gb/s granularity) and a reconfigurable all-to-all photonic fabric called Flex-LIONS. Compared to a Fat-Tree network interconnecting the same number of nodes and with the same oversubscription ratio, the proposed 3D-Hyper-FleX-LION offers a 20% smaller diameter, $3times$ lower power consumption, $10 times$ fewer cable connections, and $4 times$ reduction in the number of transceivers. When bandwidth reconfiguration capabilities of Flex-LIONS are exploited for non-uniform traffic workloads, simulation results indicate that 3D-Hyper-FleX-LION can achieve up to $4 times$ improvement in energy efficiency for synthetic traffic workloads with high locality compared to Fat-Tree.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124883151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance
Pub Date: 2020-11-01 | DOI: 10.1109/SC41405.2020.00027
Hansol Suh, T. Isaac
The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance algorithm is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with. We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning that the asymptotic advantage of the minimally-synchronous algorithm does not translate into improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.
{"title":"Evaluation of a Minimally Synchronous Algorithm for 2:1 Octree Balance","authors":"Hansol Suh, T. Isaac","doi":"10.1109/SC41405.2020.00027","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00027","url":null,"abstract":"The p4est library implements octree-based adaptive mesh refinement (AMR) and has demonstrated parallel scalability beyond 100,000 MPI processes in previous weak scaling studies. This work focuses on the strong scalability of mesh adaptivity in p4est, where the communication pattern of the existing 2:1-balance is a latency bottleneck. The sorting-based algorithm of Malhotra and Biros has balanced communication, but synchronizes all processes. We propose an algorithm that combines sorting and neighbor-to-neighbor exchange to minimize the number of processes each process synchronizes with.We measure the performance of these algorithms on several test problems on Stampede2 at TACC. Both the parallel-sorting and minimally-synchronous algorithms significantly outperform the existing algorithm and have nearly identical performance out to 1,024 Xeon Phi KNL nodes, meaning the asymptotic advantage of the minimally-synchronous algorithm does not translate to improved performance at this scale. We conclude by showing that global metadata communication will limit future strong scaling.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122521682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}