No More Leaky PageRank
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00011
Scott Sallinen, M. Ripeanu
We have surveyed multiple PageRank implementations available with popular graph processing frameworks, and discovered that they treat sink vertices (i.e., vertices without outgoing edges) incorrectly. This leads to two issues: (i) incorrect PageRank scores, and (ii) flawed performance evaluations (as costly scatter operations are avoided). For synchronous PageRank implementations, a strategy to fix these issues exists (accumulating all values from sinks during an algorithmic superstep of a PageRank iteration), albeit with sizeable overhead. This solution, however, is not applicable in the context of asynchronous frameworks. We present and evaluate a novel, low-cost algorithmic solution to address this issue. For asynchronous PageRank, our key target, our solution simply requires an inexpensive O(Vertex) computation performed alongside the final normalization step. We also show that this strategy has advantages over prior work for synchronous PageRank, as it both avoids graph restructuring and reduces inline computation costs by performing a final score reassignment to vertices once at the end of processing.
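To make the leak concrete, the sketch below (a minimal synchronous power iteration in Python, not the paper's implementation) shows the conventional per-iteration fix the abstract refers to: gather the rank mass held by sink vertices and redistribute it uniformly. Dropping the sink-mass term is exactly the leak; the paper's asynchronous solution instead defers the correction to a single O(Vertex) pass at the end.

```python
# Minimal synchronous PageRank sketch (not the paper's code) showing the
# conventional sink handling: collect the rank mass of sink vertices each
# iteration and redistribute it uniformly. Omitting `sink_mass` makes the
# total score "leak" away, which is the error the paper identifies.
def pagerank(out_edges, n, d=0.85, iters=50):
    # out_edges maps vertex -> list of out-neighbours; vertices are 0..n-1.
    rank = [1.0 / n] * n
    sinks = [v for v in range(n) if not out_edges.get(v)]
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        sink_mass = sum(rank[v] for v in sinks)       # rank stuck on sinks
        for u, nbrs in out_edges.items():
            if not nbrs:                              # sinks handled separately
                continue
            share = d * rank[u] / len(nbrs)
            for v in nbrs:
                new[v] += share                       # scatter along out-edges
        for v in range(n):
            new[v] += d * sink_mass / n               # re-inject the sink mass
        rank = new
    return rank

# Toy graph: vertex 2 is a sink; scores still sum to ~1.0 thanks to sink_mass.
print(sum(pagerank({0: [1, 2], 1: [2]}, 3)))
```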
{"title":"No More Leaky PageRank","authors":"Scott Sallinen, M. Ripeanu","doi":"10.1109/IA354616.2021.00011","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00011","url":null,"abstract":"We have surveyed multiple PageRank implementations available with popular graph processing frameworks, and discovered that they treat sink vertices (i.e., vertices without outgoing edges) incorrectly. This leads to two issues: (i) incorrect PageRank scores, and (ii) flawed performance evaluations (as costly scatter operations are avoided). For synchronous PageRank implementations, a strategy to fix these issues exists (accumu-lating all values from sinks during an algorithmic superstep of a PageRank iteration), albeit with sizeable overhead. This solution, however, is not applicable in the context of asynchronous frameworks. We present and evaluate a novel, low-cost algorithmic solution to address this issue. For asynchronous PageRank, our key target, our solution simply requires an inexpensive O(Vertex) computation performed alongside the final normalization step. We also show that this strategy has advantages over prior work for synchronous PageRank, as it both avoids graph restructuring and reduces inline computation costs by performing a final score reassignment to vertices once at the end of processing.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of IA3 2021: Workshop on Irregular Applications: Architectures and Algorithms [Title page]
Pub Date: 2021-11-01  DOI: 10.1109/ia354616.2021.00001
{"title":"Proceedings of IA3 2021: Workshop on Irregular Applications: Architectures and Algorithms [Title page]","authors":"","doi":"10.1109/ia354616.2021.00001","DOIUrl":"https://doi.org/10.1109/ia354616.2021.00001","url":null,"abstract":"","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00010
C. Stone, Aaron C. Walden, M. Zubair, E. Nielsen
Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of the floating-point atomic updates needed to support the irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100 and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results amongst collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of 2.5 to 6.0 over atomically updating global memory directly. The performance impact on the NVIDIA GPUs was mixed: the V100 was often degraded by the register-based aggregation/transposition techniques, while the A100 generally benefited from them, though to a lesser extent than measured on the MI100 GPU. Overall, both V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here reduced the performance gap and improved the MI100 performance.
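The register-shuffle and shared-memory tricks are GPU-specific, but the underlying idea generalizes: have collaborating workers combine their contributions privately so that far fewer synchronized updates reach shared memory. A purely illustrative CPU-side sketch in Python (a lock standing in for a hardware atomic, and nothing here taken from the paper's kernels):

```python
# Generic illustration (not the paper's GPU code) of aggregating contributions
# locally before touching shared accumulators, reducing synchronized updates
# from one per item to one per worker.
import threading
from collections import defaultdict

total = defaultdict(float)        # shared accumulators, e.g. per grid node
lock = threading.Lock()           # stand-in for a hardware atomic update


def worker(items):
    local = defaultdict(float)
    for node, value in items:     # aggregate privately first ...
        local[node] += value
    with lock:                    # ... then update shared state once per worker
        for node, value in local.items():
            total[node] += value


threads = [threading.Thread(target=worker,
                            args=([(i % 4, 1.0) for i in range(1000)],))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(dict(total))                # each of the 4 nodes accumulates 2000.0
```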
{"title":"Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs","authors":"C. Stone, Aaron C. Walden, M. Zubair, E. Nielsen","doi":"10.1109/IA354616.2021.00010","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00010","url":null,"abstract":"Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of floating-point atomic updates needed to support the irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results amongst collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of between 2.5 and 6.0 over atomically updating global memory directly. Performance impact on the NVIDIA GPUs was mixed with the performance of the V100 often degraded when using register-based aggregation/transposition techniques while the A100 generally benefited from these methods, though to a lesser extent than measured on the MI100 GPU. Overall, both V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here reduced the performance gap and improved the MI100 performance.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129996719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00009
Brian A. Page, P. Kogge
Applications where continuous streams of data are passed through large data structures are becoming increasingly important. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on prior work on the Firehose streaming benchmark, where an emerging architecture using threads that can migrate through memory has been shown to be much more efficient at such problems. This paper extends that work to a second-generation system to show not only the same improved efficiency (10X) at larger core counts, but also significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into which resources represent the bottlenecks to even more performance, and supports a reasonable projection that an implementation of such an architecture with current technology would lead to a 10X performance gain on an apples-to-apples basis with conventional systems.
{"title":"Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture","authors":"Brian A. Page, P. Kogge","doi":"10.1109/IA354616.2021.00009","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00009","url":null,"abstract":"Applications where continuous streams of data are passed through large data structures are becoming of increasing importance. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often with the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on some prior work on the Firehose streaming benchmark where an emerging architecture using threads that can migrate through memory has shown to be much more efficient at such problems. This paper extends that work to use a second generation system to not only show that same improved efficiency (10X) for larger core counts, but even significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into what resources represent the bottlenecks to even more performance, and make a reasonable projection that implementation of such an architecture with current technology would lead to 10X performance gain on an apples-to-apples basis with conventional systems.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125511991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse Exact Factorization Update
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00012
Jinhao Chen, T. Davis, Christopher Lourenco, Erick Moreno-Centeno
To meet the growing need for extended- or exact-precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has recently been developed, which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our ongoing work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-1 update/downdate. We first present some basic background for the exact factorization framework based on IPGE. We then give our proposed algorithms along with some implementation and data-structure details. Finally, we provide experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations is typically 10x to 100x faster than (re-)factorizing the matrices from scratch.
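The abstract does not spell out the recurrence, but integer-preserving (fraction-free) Gaussian elimination is conventionally the Bareiss scheme; as background, and as an assumption about the framework's underlying update rule rather than a detail taken from the paper, one elimination step is

\[
a^{(k)}_{ij} \;=\; \frac{a^{(k-1)}_{kk}\, a^{(k-1)}_{ij} \;-\; a^{(k-1)}_{ik}\, a^{(k-1)}_{kj}}{a^{(k-2)}_{k-1,\,k-1}},
\qquad a^{(-1)}_{0,0} \equiv 1,
\]

where every division is exact, so all intermediate entries remain integers and no roundoff error is introduced.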
{"title":"Sparse Exact Factorization Update","authors":"Jinhao Chen, T. Davis, Christopher Lourenco, Erick Moreno-Centeno","doi":"10.1109/IA354616.2021.00012","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00012","url":null,"abstract":"To meet the growing need for extended or exact precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has been recently developed which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our on-going work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-l update/downdate. We first present some basic background for the exact factorization framework based on IPGE. Then we give our proposed algorithms along with some implementation and data-structure details. Finally, we provide some experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations can be typically 10x to 100x faster than (re-)factorizing the matrices from scratch.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132368924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Scalable Data Processing in Python with CLIPPy
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00013
P. Pirkelbauer, Seth Bromberger, Keita Iwabuchi, R. Pearce
The Python programming language has become a popular choice for data scientists. While easy to use, the Python language is not well suited to drive data science on large-scale systems. This paper presents a first prototype of CLIPPy (Command line interface plus Python), a user-side class in Python that connects to high-performance computing environments with non-volatile memory (NVM). CLIPPy queries available executable files and prepares a Python API on the fly. The executables can connect to a backend that executes on a large-scale system, and can be implemented in any language, for example C++. CLIPPy and the executables are loosely coupled and communicate through a JSON-based interface. By storing data in NVM, executables can attach to and detach from data structures without expensive format conversions. The underlying philosophy, design challenges, and a prototype implementation that accesses data stored in non-volatile memory are discussed.
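The abstract only describes the design at a high level; the sketch below is a minimal illustration (all names, the stdin/stdout convention, and the directory layout are hypothetical, not CLIPPy's actual API) of how a Python class can discover backend executables and expose each one as a method that exchanges JSON.

```python
# Illustrative sketch only: a Python class that exposes backend executables as
# methods and talks to them via JSON on stdin/stdout. BackendProxy and its
# conventions are hypothetical and are not CLIPPy's actual interface.
import json
import os
import subprocess


class BackendProxy:
    def __init__(self, bin_dir):
        self.bin_dir = bin_dir
        # Treat every executable file in bin_dir as a callable backend operation.
        self.ops = [f for f in os.listdir(bin_dir)
                    if os.path.isfile(os.path.join(bin_dir, f))
                    and os.access(os.path.join(bin_dir, f), os.X_OK)]

    def __getattr__(self, name):
        if name not in self.ops:
            raise AttributeError(name)

        def call(**kwargs):
            # Send keyword arguments as JSON, read the JSON reply back.
            proc = subprocess.run([os.path.join(self.bin_dir, name)],
                                  input=json.dumps(kwargs),
                                  capture_output=True, text=True, check=True)
            return json.loads(proc.stdout)
        return call


# Hypothetical usage: backend = BackendProxy("/opt/backend/bin")
#                     result = backend.count_vertices(path="/pmem/graph_store")
```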
{"title":"Towards Scalable Data Processing in Python with CLIPPy","authors":"P. Pirkelbauer, Seth Bromberger, Keita Iwabuchi, R. Pearce","doi":"10.1109/IA354616.2021.00013","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00013","url":null,"abstract":"The Python programming language has become a popular choice for data scientists. While easy to use, the Python language is not well suited to drive data science on large-scale systems. This paper presents a first prototype of CLIPPy (Command line interface plus Python), a user-side class in Python that connects to high-performance computing environments with nonvolatile memory (NVM). CLIPPy queries available executable files and prepares a Python API on the fly. The executables can connect to a backend that executes on a large-scale system. The executables can be implemented in any language, for example in C++. CLIPPy and the executables are loosely coupled and communicate through a JSON based interface. By storing data in NVM, executables can attach and detach to data structures without expensive format conversions. The Underlying Philosophy, Design Challenges, and a Prototype Implementation that Accesses Data Stored in Non-Volatile Memory Will Be Discussed.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131017149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping Irregular Computations for Molecular Docking to the SX-Aurora TSUBASA Vector Engine
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00008
Leonardo Solis-Vasquez, E. Focht, Andreas Koch
Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely used molecular docking program with an irregular structure characterized by divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in the two local-search methods employed by AutoDock, Solis-Wets and ADADELTA, which take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).
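As background on why the control flow diverges, here is a generic Solis-Wets sketch (not AutoDock's implementation; the constants and update weights are illustrative): each step proposes a random perturbation, falls back to the opposite direction if that fails, and grows or shrinks the step size after runs of successes or failures, so different work items take different branches.

```python
# Generic Solis-Wets local-search sketch (not AutoDock's code): the
# accept / reverse / reject branching and the adaptive step size rho are what
# make control flow diverge between otherwise identical parallel work items.
import random


def solis_wets(score, x, rho=1.0, iters=300):
    bias = [0.0] * len(x)
    best = score(x)
    successes = failures = 0
    for _ in range(iters):
        dx = [random.gauss(b, rho) for b in bias]
        forward = [xi + di for xi, di in zip(x, dx)]
        f_val = score(forward)
        if f_val < best:                              # forward move improved
            x, best = forward, f_val
            bias = [0.6 * b + 0.4 * d for b, d in zip(bias, dx)]
            successes, failures = successes + 1, 0
        else:
            reverse = [xi - di for xi, di in zip(x, dx)]
            r_val = score(reverse)
            if r_val < best:                          # reverse move improved
                x, best = reverse, r_val
                bias = [b - 0.4 * d for b, d in zip(bias, dx)]
                successes, failures = successes + 1, 0
            else:                                     # both directions failed
                bias = [0.5 * b for b in bias]
                successes, failures = 0, failures + 1
        if successes > 5:                             # expand the search radius
            rho, successes = rho * 2.0, 0
        elif failures > 5:                            # contract the search radius
            rho, failures = rho * 0.5, 0
    return x, best


# Toy usage: minimize a simple quadratic "scoring function".
print(solis_wets(lambda v: sum(vi * vi for vi in v), [3.0, -2.0]))
```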
{"title":"Mapping Irregular Computations for Molecular Docking to the SX-Aurora TSUBASA Vector Engine","authors":"Leonardo Solis-Vasquez, E. Focht, Andreas Koch","doi":"10.1109/IA354616.2021.00008","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00008","url":null,"abstract":"Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely-used molecular docking program, having an irregular structure characterized by a divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in both local-search methods employed by AutoDock. These Solis-Wets and ADADELTA methods take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3 x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131346018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}