No More Leaky PageRank
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00011
Scott Sallinen, M. Ripeanu
We have surveyed multiple PageRank implementations available with popular graph processing frameworks, and discovered that they treat sink vertices (i.e., vertices without outgoing edges) incorrectly. This leads to two issues: (i) incorrect PageRank scores, and (ii) flawed performance evaluations (as costly scatter operations are avoided). For synchronous PageRank implementations, a strategy to fix these issues exists (accumulating all values from sinks during an algorithmic superstep of a PageRank iteration), albeit with sizeable overhead. This solution, however, is not applicable in the context of asynchronous frameworks. We present and evaluate a novel, low-cost algorithmic solution to address this issue. For asynchronous PageRank, our key target, our solution simply requires an inexpensive O(Vertex) computation performed alongside the final normalization step. We also show that this strategy has advantages over prior work for synchronous PageRank, as it both avoids graph restructuring and reduces inline computation costs by performing a final score reassignment to vertices once at the end of processing.
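To make the leak concrete, the sketch below (a minimal synchronous power iteration in Python, not the paper's implementation) shows the conventional per-iteration fix the abstract refers to: gather the rank mass held by sink vertices and redistribute it uniformly. Dropping the sink-mass term is exactly the leak; the paper's asynchronous solution instead defers the correction to a single O(Vertex) pass at the end.

```python
# Minimal synchronous PageRank sketch (not the paper's code) showing the
# conventional sink handling: collect the rank mass of sink vertices each
# iteration and redistribute it uniformly. Omitting `sink_mass` makes the
# total score "leak" away, which is the error the paper identifies.
def pagerank(out_edges, n, d=0.85, iters=50):
    # out_edges maps vertex -> list of out-neighbours; vertices are 0..n-1.
    rank = [1.0 / n] * n
    sinks = [v for v in range(n) if not out_edges.get(v)]
    for _ in range(iters):
        new = [(1.0 - d) / n] * n
        sink_mass = sum(rank[v] for v in sinks)       # rank stuck on sinks
        for u, nbrs in out_edges.items():
            if not nbrs:                              # sinks handled separately
                continue
            share = d * rank[u] / len(nbrs)
            for v in nbrs:
                new[v] += share                       # scatter along out-edges
        for v in range(n):
            new[v] += d * sink_mass / n               # re-inject the sink mass
        rank = new
    return rank

# Toy graph: vertex 2 is a sink; scores still sum to ~1.0 thanks to sink_mass.
print(sum(pagerank({0: [1, 2], 1: [2]}, 3)))
```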
{"title":"No More Leaky PageRank","authors":"Scott Sallinen, M. Ripeanu","doi":"10.1109/IA354616.2021.00011","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00011","url":null,"abstract":"We have surveyed multiple PageRank implementations available with popular graph processing frameworks, and discovered that they treat sink vertices (i.e., vertices without outgoing edges) incorrectly. This leads to two issues: (i) incorrect PageRank scores, and (ii) flawed performance evaluations (as costly scatter operations are avoided). For synchronous PageRank implementations, a strategy to fix these issues exists (accumu-lating all values from sinks during an algorithmic superstep of a PageRank iteration), albeit with sizeable overhead. This solution, however, is not applicable in the context of asynchronous frameworks. We present and evaluate a novel, low-cost algorithmic solution to address this issue. For asynchronous PageRank, our key target, our solution simply requires an inexpensive O(Vertex) computation performed alongside the final normalization step. We also show that this strategy has advantages over prior work for synchronous PageRank, as it both avoids graph restructuring and reduces inline computation costs by performing a final score reassignment to vertices once at the end of processing.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Proceedings of IA3 2021: Workshop on Irregular Applications: Architectures and Algorithms [Title page]
Pub Date: 2021-11-01  DOI: 10.1109/ia354616.2021.00001
{"title":"Proceedings of IA3 2021: Workshop on Irregular Applications: Architectures and Algorithms [Title page]","authors":"","doi":"10.1109/ia354616.2021.00001","DOIUrl":"https://doi.org/10.1109/ia354616.2021.00001","url":null,"abstract":"","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00010
C. Stone, Aaron C. Walden, M. Zubair, E. Nielsen
Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of the floating-point atomic updates needed to support the irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100 and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results amongst collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of 2.5 to 6.0 over atomically updating global memory directly. The performance impact on the NVIDIA GPUs was mixed: the V100 was often degraded by the register-based aggregation/transposition techniques, while the A100 generally benefited from them, though to a lesser extent than measured on the MI100 GPU. Overall, both V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here reduced the performance gap and improved the MI100 performance.
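The register-shuffle and shared-memory tricks are GPU-specific, but the underlying idea generalizes: have collaborating workers combine their contributions privately so that far fewer synchronized updates reach shared memory. A purely illustrative CPU-side sketch in Python (a lock standing in for a hardware atomic, and nothing here taken from the paper's kernels):

```python
# Generic illustration (not the paper's GPU code) of aggregating contributions
# locally before touching shared accumulators, reducing synchronized updates
# from one per item to one per worker.
import threading
from collections import defaultdict

total = defaultdict(float)        # shared accumulators, e.g. per grid node
lock = threading.Lock()           # stand-in for a hardware atomic update


def worker(items):
    local = defaultdict(float)
    for node, value in items:     # aggregate privately first ...
        local[node] += value
    with lock:                    # ... then update shared state once per worker
        for node, value in local.items():
            total[node] += value


threads = [threading.Thread(target=worker,
                            args=([(i % 4, 1.0) for i in range(1000)],))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(dict(total))                # each of the 4 nodes accumulates 2000.0
```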
{"title":"Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs","authors":"C. Stone, Aaron C. Walden, M. Zubair, E. Nielsen","doi":"10.1109/IA354616.2021.00010","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00010","url":null,"abstract":"Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of floating-point atomic updates needed to support the irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results amongst collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of between 2.5 and 6.0 over atomically updating global memory directly. Performance impact on the NVIDIA GPUs was mixed with the performance of the V100 often degraded when using register-based aggregation/transposition techniques while the A100 generally benefited from these methods, though to a lesser extent than measured on the MI100 GPU. Overall, both V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here reduced the performance gap and improved the MI100 performance.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129996719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00009
Brian A. Page, P. Kogge
Applications where continuous streams of data are passed through large data structures are becoming increasingly important. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on prior work on the Firehose streaming benchmark, where an emerging architecture using threads that can migrate through memory has been shown to be much more efficient at such problems. This paper extends that work to a second-generation system to show not only the same improved efficiency (10X) at larger core counts, but also significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into which resources represent the bottlenecks to even more performance, and supports a reasonable projection that an implementation of such an architecture with current technology would lead to a 10X performance gain on an apples-to-apples basis with conventional systems.
{"title":"Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture","authors":"Brian A. Page, P. Kogge","doi":"10.1109/IA354616.2021.00009","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00009","url":null,"abstract":"Applications where continuous streams of data are passed through large data structures are becoming of increasing importance. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often with the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on some prior work on the Firehose streaming benchmark where an emerging architecture using threads that can migrate through memory has shown to be much more efficient at such problems. This paper extends that work to use a second generation system to not only show that same improved efficiency (10X) for larger core counts, but even significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into what resources represent the bottlenecks to even more performance, and make a reasonable projection that implementation of such an architecture with current technology would lead to 10X performance gain on an apples-to-apples basis with conventional systems.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125511991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sparse Exact Factorization Update
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00012
Jinhao Chen, T. Davis, Christopher Lourenco, Erick Moreno-Centeno
To meet the growing need for extended- or exact-precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has recently been developed, which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our ongoing work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-1 update/downdate. We first present some basic background for the exact factorization framework based on IPGE. We then give our proposed algorithms along with some implementation and data-structure details. Finally, we provide experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations is typically 10x to 100x faster than (re-)factorizing the matrices from scratch.
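The abstract does not spell out the recurrence, but integer-preserving (fraction-free) Gaussian elimination is conventionally the Bareiss scheme; as background, and as an assumption about the framework's underlying update rule rather than a detail taken from the paper, one elimination step is

\[
a^{(k)}_{ij} \;=\; \frac{a^{(k-1)}_{kk}\, a^{(k-1)}_{ij} \;-\; a^{(k-1)}_{ik}\, a^{(k-1)}_{kj}}{a^{(k-2)}_{k-1,\,k-1}},
\qquad a^{(-1)}_{0,0} \equiv 1,
\]

where every division is exact, so all intermediate entries remain integers and no roundoff error is introduced.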
{"title":"Sparse Exact Factorization Update","authors":"Jinhao Chen, T. Davis, Christopher Lourenco, Erick Moreno-Centeno","doi":"10.1109/IA354616.2021.00012","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00012","url":null,"abstract":"To meet the growing need for extended or exact precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has been recently developed which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our on-going work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-l update/downdate. We first present some basic background for the exact factorization framework based on IPGE. Then we give our proposed algorithms along with some implementation and data-structure details. Finally, we provide some experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations can be typically 10x to 100x faster than (re-)factorizing the matrices from scratch.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132368924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Scalable Data Processing in Python with CLIPPy
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00013
P. Pirkelbauer, Seth Bromberger, Keita Iwabuchi, R. Pearce
The Python programming language has become a popular choice for data scientists. While easy to use, the Python language is not well suited to drive data science on large-scale systems. This paper presents a first prototype of CLIPPy (Command line interface plus Python), a user-side class in Python that connects to high-performance computing environments with non-volatile memory (NVM). CLIPPy queries available executable files and prepares a Python API on the fly. The executables can connect to a backend that executes on a large-scale system, and can be implemented in any language, for example C++. CLIPPy and the executables are loosely coupled and communicate through a JSON-based interface. By storing data in NVM, executables can attach to and detach from data structures without expensive format conversions. The underlying philosophy, design challenges, and a prototype implementation that accesses data stored in non-volatile memory are discussed.
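The abstract only describes the design at a high level; the sketch below is a minimal illustration (all names, the stdin/stdout convention, and the directory layout are hypothetical, not CLIPPy's actual API) of how a Python class can discover backend executables and expose each one as a method that exchanges JSON.

```python
# Illustrative sketch only: a Python class that exposes backend executables as
# methods and talks to them via JSON on stdin/stdout. BackendProxy and its
# conventions are hypothetical and are not CLIPPy's actual interface.
import json
import os
import subprocess


class BackendProxy:
    def __init__(self, bin_dir):
        self.bin_dir = bin_dir
        # Treat every executable file in bin_dir as a callable backend operation.
        self.ops = [f for f in os.listdir(bin_dir)
                    if os.path.isfile(os.path.join(bin_dir, f))
                    and os.access(os.path.join(bin_dir, f), os.X_OK)]

    def __getattr__(self, name):
        if name not in self.ops:
            raise AttributeError(name)

        def call(**kwargs):
            # Send keyword arguments as JSON, read the JSON reply back.
            proc = subprocess.run([os.path.join(self.bin_dir, name)],
                                  input=json.dumps(kwargs),
                                  capture_output=True, text=True, check=True)
            return json.loads(proc.stdout)
        return call


# Hypothetical usage: backend = BackendProxy("/opt/backend/bin")
#                     result = backend.count_vertices(path="/pmem/graph_store")
```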
{"title":"Towards Scalable Data Processing in Python with CLIPPy","authors":"P. Pirkelbauer, Seth Bromberger, Keita Iwabuchi, R. Pearce","doi":"10.1109/IA354616.2021.00013","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00013","url":null,"abstract":"The Python programming language has become a popular choice for data scientists. While easy to use, the Python language is not well suited to drive data science on large-scale systems. This paper presents a first prototype of CLIPPy (Command line interface plus Python), a user-side class in Python that connects to high-performance computing environments with nonvolatile memory (NVM). CLIPPy queries available executable files and prepares a Python API on the fly. The executables can connect to a backend that executes on a large-scale system. The executables can be implemented in any language, for example in C++. CLIPPy and the executables are loosely coupled and communicate through a JSON based interface. By storing data in NVM, executables can attach and detach to data structures without expensive format conversions. The Underlying Philosophy, Design Challenges, and a Prototype Implementation that Accesses Data Stored in Non-Volatile Memory Will Be Discussed.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131017149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mapping Irregular Computations for Molecular Docking to the SX-Aurora TSUBASA Vector Engine
Pub Date: 2021-11-01  DOI: 10.1109/IA354616.2021.00008
Leonardo Solis-Vasquez, E. Focht, Andreas Koch
Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely used molecular docking program with an irregular structure characterized by divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in the two local-search methods employed by AutoDock, Solis-Wets and ADADELTA, which take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).
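As background on why the control flow diverges, here is a generic Solis-Wets sketch (not AutoDock's implementation; the constants and update weights are illustrative): each step proposes a random perturbation, falls back to the opposite direction if that fails, and grows or shrinks the step size after runs of successes or failures, so different work items take different branches.

```python
# Generic Solis-Wets local-search sketch (not AutoDock's code): the
# accept / reverse / reject branching and the adaptive step size rho are what
# make control flow diverge between otherwise identical parallel work items.
import random


def solis_wets(score, x, rho=1.0, iters=300):
    bias = [0.0] * len(x)
    best = score(x)
    successes = failures = 0
    for _ in range(iters):
        dx = [random.gauss(b, rho) for b in bias]
        forward = [xi + di for xi, di in zip(x, dx)]
        f_val = score(forward)
        if f_val < best:                              # forward move improved
            x, best = forward, f_val
            bias = [0.6 * b + 0.4 * d for b, d in zip(bias, dx)]
            successes, failures = successes + 1, 0
        else:
            reverse = [xi - di for xi, di in zip(x, dx)]
            r_val = score(reverse)
            if r_val < best:                          # reverse move improved
                x, best = reverse, r_val
                bias = [b - 0.4 * d for b, d in zip(bias, dx)]
                successes, failures = successes + 1, 0
            else:                                     # both directions failed
                bias = [0.5 * b for b in bias]
                successes, failures = 0, failures + 1
        if successes > 5:                             # expand the search radius
            rho, successes = rho * 2.0, 0
        elif failures > 5:                            # contract the search radius
            rho, failures = rho * 0.5, 0
    return x, best


# Toy usage: minimize a simple quadratic "scoring function".
print(solis_wets(lambda v: sum(vi * vi for vi in v), [3.0, -2.0]))
```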
{"title":"Mapping Irregular Computations for Molecular Docking to the SX-Aurora TSUBASA Vector Engine","authors":"Leonardo Solis-Vasquez, E. Focht, Andreas Koch","doi":"10.1109/IA354616.2021.00008","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00008","url":null,"abstract":"Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely-used molecular docking program, having an irregular structure characterized by a divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in both local-search methods employed by AutoDock. These Solis-Wets and ADADELTA methods take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3 x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131346018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}