Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00033
Xin You, Hailong Yang, Zhongzhi Luan, D. Qian, Xu Liu
Redundant zeros cause inefficiencies in which the zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computation that waste memory bandwidth and CPU resources. optimizing compilers is difficult in eliminating these zero-related inefficiencies due to limitations in static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy - a fine-grained profiler to identify redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the locations where the redundant zeros happen in source lines and calling contexts. The experimental results demonstrate ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.
{"title":"ZeroSpy: Exploring Software Inefficiency with Redundant Zeros","authors":"Xin You, Hailong Yang, Zhongzhi Luan, D. Qian, Xu Liu","doi":"10.1109/SC41405.2020.00033","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00033","url":null,"abstract":"Redundant zeros cause inefficiencies in which the zero values are loaded and computed repeatedly, resulting in unnecessary memory traffic and identity computation that waste memory bandwidth and CPU resources. optimizing compilers is difficult in eliminating these zero-related inefficiencies due to limitations in static analysis. Hardware approaches, in contrast, optimize inefficiencies without code modification, but are not widely adopted in commodity processors. In this paper, we propose ZeroSpy - a fine-grained profiler to identify redundant zeros caused by both inappropriate use of data structures and useless computation. ZeroSpy also provides intuitive optimization guidance by revealing the locations where the redundant zeros happen in source lines and calling contexts. The experimental results demonstrate ZeroSpy is capable of identifying redundant zeros in programs that have been highly optimized for years. Based on the optimization guidance revealed by ZeroSpy, we can achieve significant speedups after eliminating redundant zeros.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128355932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00044
E. León, T. D'Hooge, Nathan Hanford, I. Karlin, R. Pankajakshan, Jim Foraker, C. Chambreau, M. Leininger
The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users including platform-specific software that tend to stagnate over the lifetime of the system. In this paper, we present the Tri-Laboratory Operating System Stack (TOSS), a production simulation environment based on Linux and open source software, with proprietary software components integrated as needed. TOSS, focused on mid-to-large scale commodity HPC systems, provides a common simulation environment across system architectures, reduces the learning curve on new systems, and benefits from a lineage of past experience and bug fixes. To further the scope and applicability of TOSS, we demonstrate its feasibility and effectiveness on a leadership-class supercomputer architecture. Our evaluation, relative to the vendor stack, includes an analysis of resource manager complexity, system noise, networking, and application performance.
{"title":"TOSS-2020: A Commodity Software Stack for HPC","authors":"E. León, T. D'Hooge, Nathan Hanford, I. Karlin, R. Pankajakshan, Jim Foraker, C. Chambreau, M. Leininger","doi":"10.1109/SC41405.2020.00044","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00044","url":null,"abstract":"The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users including platform-specific software that tend to stagnate over the lifetime of the system. In this paper, we present the Tri-Laboratory Operating System Stack (TOSS), a production simulation environment based on Linux and open source software, with proprietary software components integrated as needed. TOSS, focused on mid-to-large scale commodity HPC systems, provides a common simulation environment across system architectures, reduces the learning curve on new systems, and benefits from a lineage of past experience and bug fixes. To further the scope and applicability of TOSS, we demonstrate its feasibility and effectiveness on a leadership-class supercomputer architecture. Our evaluation, relative to the vendor stack, includes an analysis of resource manager complexity, system noise, networking, and application performance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116879190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00102
Jacob Lambert, Seyong Lee, J. Vetter, A. Malony
Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP’s device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.
{"title":"CCAMP: An Integrated Translation and optimization Framework for OpenACC and OpenMP","authors":"Jacob Lambert, Seyong Lee, J. Vetter, A. Malony","doi":"10.1109/SC41405.2020.00102","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00102","url":null,"abstract":"Heterogeneous computing and exploration into specialized accelerators are inevitable in current and future supercomputers. Although this diversity of devices is promising for performance, the array of architectures presents programming challenges. High-level programming strategies have emerged to face these challenges, such as the OpenMP offloading model and OpenACC. However, the varying levels of support for these standards within vendor-specific and open-source tools, as well as the lack of performance portability across devices, have prevented the standards from achieving their goals. To address these shortcomings, we present CCAMP, an OpenMP and OpenACC interoperable framework. CCAMP provides two primary facilities: language translation between the two standards and device-specific directive optimization within each standard. We show that by using the CCAMP framework, programmers can easily transplant non-portable code into new ecosystems for new architectures. Additionally, by using CCAMP’s device-specific directive optimizations, users can achieve optimized performance across architectures using a single source code.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129570943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00046
Gabor Torok, Mark R. Day, R. Hartman-Baker, C. Snavely
Without a reliable and scalable system for managing authorized users and ensuring they receive their allocated share of computational and storage resources, modern HPC centers would not be able to function. Exascale will amplify these demands with greater machine scale, more users, higher job throughput, and ever-increasing need for management insight and automation throughout the HPC environment. When our legacy system reached retirement age, NERSC took the opportunity to design and build Iris not only to meet our current needs, with 8,000 users and tens of thousands of jobs per day, but also to scale well into the exascale era. In this paper, we describe how we have designed Iris to meet these needs and discuss its key features as well as our implementation experience.
{"title":"Iris: Allocation Banking and Identity and Access Management for the Exascale Era","authors":"Gabor Torok, Mark R. Day, R. Hartman-Baker, C. Snavely","doi":"10.1109/SC41405.2020.00046","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00046","url":null,"abstract":"Without a reliable and scalable system for managing authorized users and ensuring they receive their allocated share of computational and storage resources, modern HPC centers would not be able to function. Exascale will amplify these demands with greater machine scale, more users, higher job throughput, and ever-increasing need for management insight and automation throughout the HPC environment. When our legacy system reached retirement age, NERSC took the opportunity to design and build Iris not only to meet our current needs, with 8,000 users and tens of thousands of jobs per day, but also to scale well into the exascale era. In this paper, we describe how we have designed Iris to meet these needs and discuss its key features as well as our implementation experience.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129505028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00072
Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, Bo Li
Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance as placement constraints only provide qualitative scheduling guidelines and minimizing constraint violations does not necessarily result in the optimal performance.In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action space. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. Large-scale EC2 deployment shows that compared with the traditional constraint-based schedulers, Metis improves the throughput by up to 61%, optimizes various performance metrics, and easily scales to a large cluster where 3K containers run on over 700 machines.
{"title":"Metis: Learning to Schedule Long-Running Applications in Shared Container Clusters at Scale","authors":"Luping Wang, Qizhen Weng, Wei Wang, Chen Chen, Bo Li","doi":"10.1109/SC41405.2020.00072","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00072","url":null,"abstract":"Online cloud services are increasingly deployed as long-running applications (LRAs) in containers. Placing LRA containers is known to be difficult as they often have sophisticated resource interferences and I/O dependencies. Existing schedulers rely on operators to manually express the container scheduling requirements as placement constraints and strive to satisfy as many constraints as possible. Such schedulers, however, fall short in performance as placement constraints only provide qualitative scheduling guidelines and minimizing constraint violations does not necessarily result in the optimal performance.In this work, we present Metis, a general-purpose scheduler that learns to optimally place LRA containers using deep reinforcement learning (RL) techniques. This eliminates the complex manual specification of placement constraints and offers, for the first time, concrete quantitative scheduling criteria. As directly training an RL agent does not scale, we develop a novel hierarchical learning technique that decomposes a complex container placement problem into a hierarchy of subproblems with significantly reduced state and action space. We show that many subproblems have similar structures and can hence be solved by training a unified RL agent offline. Large-scale EC2 deployment shows that compared with the traditional constraint-based schedulers, Metis improves the throughput by up to 61%, optimizes various performance metrics, and easily scales to a large cluster where 3K containers run on over 700 machines.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134067379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00026
Xiaohui Duan, Ping Gao, Meng Zhang, Tingjian Zhang, Hongsong Meng, Yuxuan Li, B. Schmidt, H. Fu, L. Gan, W. Xue, Weiguo Liu, Guangwen Yang
Molecular dynamics (MD) simulations are playing an increasingly important role in several research areas. The most frequently used potentials in MD simulations are pair-wise potentials. Due to the memory wall, computing pair-wise potentials on many-core processors are usually memory bounded. In this paper, we take the SW26010 processor as an exemplary platform to explore the possibility to break the memory bottleneck by improving data reusage via cell-list-based methods. We use cell-lists instead of neighbor-lists in the potential computation, and apply a number of novel optimization methods. Theses methods include: an adaptive replica arrangement strategy, a parameter profile data structure, and a particle-cell cutoff checking filter. An incremental cell-list building method is also realized to accelerate the construction of cell-lists. Furthermore, we have established an open source standalone framework, ESMD, featuring the techniques above. Experiments show that ESMD is 50$sim$170% faster than previous ports on a single node, and can scale to 1,024 nodes with a weak scalibility of 95%.
{"title":"Cell-List based Molecular Dynamics on Many-Core Processors: A Case Study on Sunway TaihuLight Supercomputer","authors":"Xiaohui Duan, Ping Gao, Meng Zhang, Tingjian Zhang, Hongsong Meng, Yuxuan Li, B. Schmidt, H. Fu, L. Gan, W. Xue, Weiguo Liu, Guangwen Yang","doi":"10.1109/SC41405.2020.00026","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00026","url":null,"abstract":"Molecular dynamics (MD) simulations are playing an increasingly important role in several research areas. The most frequently used potentials in MD simulations are pair-wise potentials. Due to the memory wall, computing pair-wise potentials on many-core processors are usually memory bounded. In this paper, we take the SW26010 processor as an exemplary platform to explore the possibility to break the memory bottleneck by improving data reusage via cell-list-based methods. We use cell-lists instead of neighbor-lists in the potential computation, and apply a number of novel optimization methods. Theses methods include: an adaptive replica arrangement strategy, a parameter profile data structure, and a particle-cell cutoff checking filter. An incremental cell-list building method is also realized to accelerate the construction of cell-lists. Furthermore, we have established an open source standalone framework, ESMD, featuring the techniques above. Experiments show that ESMD is 50$sim$170% faster than previous ports on a single node, and can scale to 1,024 nodes with a weak scalibility of 95%.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127470716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00100
H. T. Kung, Bradley McDanel, S. Zhang
We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time for improved computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on power-of-two terms in expressions of values. In computing a dot-product computation, TQ dynamically selects a fixed number of largest terms to use from values of the two vectors. By exploiting weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between $3-10times$) compared to conventional uniform quantization for the same level of model performance.
{"title":"Term Quantization: Furthering Quantization at Run Time","authors":"H. T. Kung, Bradley McDanel, S. Zhang","doi":"10.1109/SC41405.2020.00100","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00100","url":null,"abstract":"We present a novel technique, called Term Quantization (TQ), for furthering quantization at run time for improved computational efficiency of deep neural networks (DNNs) already quantized with conventional quantization methods. TQ operates on power-of-two terms in expressions of values. In computing a dot-product computation, TQ dynamically selects a fixed number of largest terms to use from values of the two vectors. By exploiting weight and data distributions typically present in DNNs, TQ has a minimal impact on DNN model performance (e.g., accuracy or perplexity). We use TQ to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We evaluate TQ on an MLP for MNIST, multiple CNNs for ImageNet and an LSTM for Wikitext-2. We demonstrate significant reductions in inference computation costs (between $3-10times$) compared to conventional uniform quantization for the same level of model performance.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115546342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00069
Saurabh Jha, Shengkun Cui, Subho Sankar Banerjee, Tianyin Xu, J. Enos, M. Showerman, T. Kalbarczyk, R. Iyer
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
{"title":"Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems","authors":"Saurabh Jha, Shengkun Cui, Subho Sankar Banerjee, Tianyin Xu, J. Enos, M. Showerman, T. Kalbarczyk, R. Iyer","doi":"10.1109/SC41405.2020.00069","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00069","url":null,"abstract":"Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash), and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework, consisting of of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116767520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00010
R. Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar, R. Patton, R. Vuduc, T. Potok
We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we present a new high-performance algorithm and implementation of the Floyd-Warshall algorithm for distributed-memory parallel computers accelerated by GPUs, which we call DSNAPSHOT (Distributed Accelerated Semiring All-Pairs Shortest Path). For our largest experiments, we ran DSNAPSHOT on a connected input graph with millions of vertices using 4, 096nodes (24,576GPUs) of the Oak Ridge National Laboratory’s Summit supercomputer system. We find DSNAPSHOT achieves a sustained performance of $136times 10^{15}$ floating-point operations per second (136petaflop/s) at a parallel efficiency of 90% under weak scaling and, in absolute speed, 70% of the best possible performance given our computation (in the single-precision tropical semiring or “min-plus” algebra). Looking forward, we believe this novel capability will enable the mining of scholarly knowledge corpora when embedded and integrated into artificial intelligence-driven natural language processing workflows at scale.
{"title":"Scalable Knowledge Graph Analytics at 136 Petaflop/s","authors":"R. Kannan, Piyush Sao, Hao Lu, Drahomira Herrmannova, Vijay Thakkar, R. Patton, R. Vuduc, T. Potok","doi":"10.1109/SC41405.2020.00010","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00010","url":null,"abstract":"We are motivated by newly proposed methods for data mining large-scale corpora of scholarly publications, such as the full biomedical literature, which may consist of tens of millions of papers spanning decades of research. In this setting, analysts seek to discover how concepts relate to one another. They construct graph representations from annotated text databases and then formulate the relationship-mining problem as one of computing all-pairs shortest paths (APSP), which becomes a significant bottleneck. In this context, we present a new high-performance algorithm and implementation of the Floyd-Warshall algorithm for distributed-memory parallel computers accelerated by GPUs, which we call DSNAPSHOT (Distributed Accelerated Semiring All-Pairs Shortest Path). For our largest experiments, we ran DSNAPSHOT on a connected input graph with millions of vertices using 4, 096nodes (24,576GPUs) of the Oak Ridge National Laboratory’s Summit supercomputer system. We find DSNAPSHOT achieves a sustained performance of $136times 10^{15}$ floating-point operations per second (136petaflop/s) at a parallel efficiency of 90% under weak scaling and, in absolute speed, 70% of the best possible performance given our computation (in the single-precision tropical semiring or “min-plus” algebra). Looking forward, we believe this novel capability will enable the mining of scholarly knowledge corpora when embedded and integrated into artificial intelligence-driven natural language processing workflows at scale.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115073068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2020-11-01DOI: 10.1109/SC41405.2020.00064
Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa
Lattice-based cryptography has received attention as a next-generation encryption technique, because it is believed to be secure against attacks by classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In the cryptography, to determine security levels, it is becoming significantly more important to estimate the hardness of the SVP by high-performance computing. In this study, we develop the world’s first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It can parallelize algorithms for solving the SVP by applying the Ubiquity Generator framework, which is a generic framework for branch-and-bound algorithms. The MAP-SVP is suitable for massive-scale parallelization, owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate its performance and scalability of the MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.
{"title":"Massive Parallelization for Finding Shortest Lattice Vectors Based on Ubiquity Generator Framework","authors":"Nariaki Tateiwa, Y. Shinano, Satoshi Nakamura, Akihiro Yoshida, S. Kaji, Masaya Yasuda, K. Fujisawa","doi":"10.1109/SC41405.2020.00064","DOIUrl":"https://doi.org/10.1109/SC41405.2020.00064","url":null,"abstract":"Lattice-based cryptography has received attention as a next-generation encryption technique, because it is believed to be secure against attacks by classical and quantum computers. Its essential security depends on the hardness of solving the shortest vector problem (SVP). In the cryptography, to determine security levels, it is becoming significantly more important to estimate the hardness of the SVP by high-performance computing. In this study, we develop the world’s first distributed and asynchronous parallel SVP solver, the MAssively Parallel solver for SVP (MAP-SVP). It can parallelize algorithms for solving the SVP by applying the Ubiquity Generator framework, which is a generic framework for branch-and-bound algorithms. The MAP-SVP is suitable for massive-scale parallelization, owing to its small memory footprint, low communication overhead, and rapid checkpoint and restart mechanisms. We demonstrate its performance and scalability of the MAP-SVP by using up to 100,032 cores to solve instances of the Darmstadt SVP Challenge.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128735703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}