
Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms (Latest Publications)

An Efficient Data Layout Transformation Algorithm for Locality-Aware Parallel Sparse FFT
Cheng Wang, S. Chandrasekaran, B. Chapman
Fast Fourier Transform (FFT) is one of the most important numerical algorithms, widely used in numerous scientific and engineering computations. With the emergence of big data problems, however, it is challenging to acquire, process and store a sufficient amount of data to compute the FFT in the first place. The recently developed sparse FFT (sFFT) algorithm provides a solution to this problem. sFFT computes a compressed Fourier transform using only a small subset of the input data, thus achieving significant performance improvement. While the increase in the number of cores and in memory bandwidth on modern architectures provides an opportunity to improve performance through sophisticated parallel algorithm design, sFFT is inherently complex, and numerous challenges need to be addressed. Among these challenges, sFFT falls into the category of irregular applications, in which memory access patterns are indirect and irregular and exhibit poor data locality. In this paper, we explore data layout transformation algorithms to tackle this challenge. Our approach shows that an optimized, locality-aware parallel sFFT can run 7x faster than the original sequential sFFT library on a multicore platform. This optimized locality-aware parallel sFFT is also approximately 10x faster than parallel FFTW.
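The abstract does not reproduce the authors' transformation; as a rough, hypothetical illustration of the general idea of a locality-aware layout change, the C++ sketch below gathers an irregularly indexed subset of samples into one contiguous buffer so that later passes stream through memory sequentially. The index set `subset` stands in for whatever bucketization the real sFFT performs and is an assumption of this sketch.

```cpp
#include <complex>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical layout transformation: pack the irregularly indexed samples that a
// sparse-FFT-style pass will touch into one contiguous, cache-friendly buffer.
std::vector<std::complex<double>>
gather_contiguous(const std::vector<std::complex<double>>& signal,
                  const std::vector<std::size_t>& subset)
{
    std::vector<std::complex<double>> packed;
    packed.reserve(subset.size());
    for (std::size_t idx : subset)       // pay the irregular accesses once ...
        packed.push_back(signal[idx]);   // ... so every later pass is sequential
    return packed;
}

int main() {
    std::vector<std::complex<double>> signal(1024);
    for (std::size_t i = 0; i < signal.size(); ++i) signal[i] = {double(i), 0.0};
    std::vector<std::size_t> subset = {7, 131, 512, 900};   // irregular index set
    auto packed = gather_contiguous(signal, subset);
    std::printf("packed %zu samples contiguously\n", packed.size());
}
```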
DOI: 10.1145/3149704.3149769 (published 2017-11-12)
Citations: 0
Evaluation of Knight Landing High Bandwidth Memory for HPC Workloads
S. Salehian, Yonghong Yan
The Intel Knight Landing (KNL) manycore chip includes 3D-stacked memory named MCDRAM, also known as High Bandwidth Memory (HBM), for parallel applications that need to scale to high thread counts. In this paper, we provide a quantitative study of the KNL for HPC proxy applications, including Lulesh, HPCG, AMG, and Hotspot, when using DDR4 and MCDRAM. The results indicate that HBM significantly improves the performance of memory-intensive applications, by as much as three times over DDR4 in HPCG, with Lulesh and HPCG improving by as much as 40% and 200%. For the selected compute-intensive applications, the performance advantage of MCDRAM over DDR4 varies from 2% to 28%. We also observed that the cross-points, where MCDRAM starts outperforming DDR4, are around 8 to 16 threads.
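The paper does not specify the allocation mechanism here; as one common way to steer a working set into MCDRAM when KNL exposes it in flat mode, the sketch below uses the memkind library's hbw_malloc/hbw_free. The array size and the streaming loop are placeholders, not the paper's benchmarks.

```cpp
#include <hbwmalloc.h>   // memkind's high-bandwidth-memory allocator; link with -lmemkind
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = std::size_t(1) << 26;       // placeholder working set (~512 MiB)
    // With the default PREFERRED policy, hbw_malloc places the buffer in MCDRAM when
    // high-bandwidth memory is exposed to the OS and falls back to DDR4 otherwise.
    double* a = static_cast<double*>(hbw_malloc(n * sizeof(double)));
    if (!a) return 1;
    for (std::size_t i = 0; i < n; ++i) a[i] = 1.0;   // stream through the buffer
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += a[i];
    std::printf("sum = %.1f\n", sum);
    hbw_free(a);
}
```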
DOI: 10.1145/3149704.3149766 (published 2017-11-12)
Citations: 6
Progressive load balancing of asynchronous algorithms
Justs Zarins, Michèle Weiland
Synchronisation in the presence of noise and hardware performance variability is a key challenge that prevents applications from scaling to large problems and machines. Using asynchronous or semi-synchronous algorithms can help overcome this issue, but at the cost of reduced stability or convergence rate. In this paper we propose progressive load balancing to manage progress imbalance in asynchronous algorithms dynamically. In our technique the balancing is done over time, not instantaneously. Using Jacobi iterations as a test case, we show that, with CPU performance variability present, this approach leads to a higher iteration rate and lower progress imbalance between parts of the solution space. We also show that under these conditions the balanced asynchronous method outperforms synchronous, semi-synchronous and totally asynchronous implementations in terms of time to solution.
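As context for what "progress imbalance" means here, the following minimal sketch (not the paper's code) runs block-asynchronous Jacobi on a 1-D Laplace problem: every thread sweeps its own block for a fixed wall-clock budget with no barriers, and the per-thread sweep counters expose the imbalance that progressive load balancing would then redistribute. The balancing policy itself is not reproduced, and the problem size and time budget are arbitrary.

```cpp
#include <atomic>
#include <cstdio>
#include <vector>
#include <omp.h>                    // compile with -fopenmp

int main() {
    const int n = 1 << 16;
    const double budget = 0.5;      // seconds of asynchronous sweeping
    std::vector<std::atomic<double>> x(n);
    for (auto& v : x) v.store(0.0, std::memory_order_relaxed);
    x[0].store(1.0, std::memory_order_relaxed);          // fixed boundary values
    x[n - 1].store(1.0, std::memory_order_relaxed);
    std::vector<long> sweeps(omp_get_max_threads(), 0);  // per-thread progress

    #pragma omp parallel
    {
        const int t = omp_get_thread_num(), nt = omp_get_num_threads();
        const int lo = 1 + t * (n - 2) / nt, hi = 1 + (t + 1) * (n - 2) / nt;
        const double t0 = omp_get_wtime();
        while (omp_get_wtime() - t0 < budget) {          // no barrier between sweeps
            for (int i = lo; i < hi; ++i) {
                const double xl = x[i - 1].load(std::memory_order_relaxed);
                const double xr = x[i + 1].load(std::memory_order_relaxed);
                x[i].store(0.5 * (xl + xr), std::memory_order_relaxed);
            }
            ++sweeps[t];
        }
    }
    for (std::size_t t = 0; t < sweeps.size(); ++t)      // imbalance shows up here
        std::printf("thread %zu: %ld sweeps\n", t, sweeps[t]);
}
```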
DOI: 10.1145/3149704.3149765 (published 2017-11-12)
Citations: 2
Parallel Depth-First Search for Directed Acyclic Graphs
M. Naumov, A. Vrielink, M. Garland
Depth-First Search (DFS) is a pervasive algorithm, often used as a building block for topological sort, connectivity and planarity testing, among many other applications. We propose a novel work-efficient parallel algorithm for the DFS traversal of a directed acyclic graph (DAG). The algorithm traverses the entire DAG in a BFS-like fashion no more than three times. As a result, it finds the DFS pre-order (discovery) and post-order (finish time), as well as the parent relationship associated with every node in the DAG. We analyse the runtime and work complexity of this novel parallel algorithm. We also show that our algorithm is easy to implement and to optimize for performance. In particular, our experiments show that its CUDA implementation on the GPU outperforms sequential DFS on the CPU by up to 6x.
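The parallel algorithm itself is beyond a short listing; for reference, this small C++ sketch shows the sequential DFS semantics the paper's algorithm reproduces, namely the pre-order (discovery), post-order (finish) and parent of every node of a DAG. The example graph and single-root traversal are illustrative assumptions.

```cpp
#include <cstdio>
#include <vector>

// Reference (sequential) DFS over a DAG given as adjacency lists, recording the
// three outputs the parallel algorithm computes: pre-order, post-order, parent.
struct DFS {
    const std::vector<std::vector<int>>& adj;
    std::vector<int> pre, post, parent;
    int tick = 0;
    explicit DFS(const std::vector<std::vector<int>>& g)
        : adj(g), pre(g.size(), -1), post(g.size(), -1), parent(g.size(), -1) {}
    void visit(int u) {
        pre[u] = tick++;                       // discovery time
        for (int v : adj[u])
            if (pre[v] == -1) { parent[v] = u; visit(v); }
        post[u] = tick++;                      // finish time
    }
};

int main() {
    // small DAG: 0 -> 1 -> 3, 0 -> 2 -> 3
    std::vector<std::vector<int>> g = {{1, 2}, {3}, {3}, {}};
    DFS d(g);
    d.visit(0);
    for (int u = 0; u < 4; ++u)
        std::printf("node %d: pre=%d post=%d parent=%d\n", u, d.pre[u], d.post[u], d.parent[u]);
}
```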
DOI: 10.1145/3149704.3149764 (published 2017-11-12)
Citations: 14
Quantum Computing and Irregular Applications
F. Chong
In particular, I will discuss the opportunities and challenges for irregular applications on quantum machines. Quantum machines face challenges of communication and control which will be affected by irregularity. Physical connectivity faces scaling challenges and hierarchical structures may be needed. Yet, underneath these similarities with classical machines, quantum computers involve non-local interactions that make them uniquely different.
DOI: 10.1145/3149704.3149773 (published 2017-11-12)
Citations: 0
Pressure-Driven Hardware Managed Thread Concurrency for Irregular Applications
John D. Leidel, Xi Wang, Yong Chen
Given the increasing importance of efficient data intensive computing, we find that modern processor designs are not well suited to the irregular memory access patterns found in these algorithms. This research focuses on mapping the compiler's instruction cost scheduling logic to hardware managed concurrency controls in order to minimize pipeline stalls. In this manner, the hardware modules managing the low-latency thread concurrency can be directly understood by modern compilers. We introduce a thread context switching method that is managed directly via a set of hardware-based mechanisms that are coupled to the compiler instruction scheduler. As individual instructions from a thread execute, their respective cost is accumulated into a control register. Once the register reaches a pre-determined saturation point, the thread is forced to context switch. We evaluate the performance benefits of our approach using a series of 24 benchmarks that exhibit performance acceleration of up to 14.6X.
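The following is a purely software toy model of the mechanism sketched above, with invented instruction costs and an invented saturation threshold, making no claim about the actual hardware or ISA: each thread context carries a pressure register that accumulates per-instruction cost, and crossing the saturation point forces a context switch back to the ready queue.

```cpp
#include <cstddef>
#include <cstdio>
#include <queue>
#include <vector>

// Toy thread context: a pressure register accumulates per-instruction cost and a
// switch is forced once it crosses a saturation threshold (values are made up).
struct ThreadCtx {
    int id;
    std::vector<int> instr_cost;   // cost of each remaining instruction
    std::size_t pc = 0;
    int pressure = 0;
};

int main() {
    const int kSaturation = 10;                    // hypothetical saturation point
    std::queue<ThreadCtx> ready;
    ready.push({0, {1, 1, 8, 1, 1, 1}});           // thread 0: one long-latency access
    ready.push({1, {2, 2, 2, 2, 2}});              // thread 1: uniform costs

    while (!ready.empty()) {
        ThreadCtx t = ready.front(); ready.pop();
        t.pressure = 0;                            // pressure resets on switch-in
        bool switched = false;
        while (t.pc < t.instr_cost.size() && !switched) {
            t.pressure += t.instr_cost[t.pc++];    // accumulate instruction cost
            if (t.pressure >= kSaturation && t.pc < t.instr_cost.size()) {
                std::printf("thread %d switched out at pc=%zu\n", t.id, t.pc);
                ready.push(t);                     // back to the ready queue
                switched = true;
            }
        }
        if (!switched) std::printf("thread %d finished\n", t.id);
    }
}
```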
DOI: 10.1145/3149704.3149705 (published 2017-11-12)
Citations: 3
Enabling Work-Efficiency for High Performance Vertex-Centric Graph Analytics on GPUs
Farzad Khorasani, Keval Vora, Rajiv Gupta, L. Bhuyan
The massive parallel processing power of GPUs has attracted researchers to develop iterative vertex-centric graph processing frameworks for GPUs. Enabling work-efficiency in these solutions, however, is not straightforward and comes at the cost of SIMD-inefficiency and load imbalance. This paper offers techniques that overcome these challenges when processing the graph on a GPU. For the SIMD-efficient kernel operation of gathering neighbors and performing a reduction on them, we employ an effective task expansion strategy that avoids intra-warp thread underutilization. As recording vertex activeness requires additional data structures, to attenuate the graph storage overhead on limited GPU DRAM we introduce vertex grouping, a technique that enables a trade-off between memory consumption and work efficiency in our solution. Our experiments show that these techniques provide up to 5.46x speedup over the recently proposed WS-VR [4] framework across multiple algorithms and inputs.
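For orientation only, here is a CPU reference in C++ (not the paper's GPU code) of the gather-and-reduce step that a SIMD-efficient kernel would implement: for every vertex in the active frontier, gather a value from each neighbour and reduce it, here with a minimum as in connected-components-style label propagation. The CSR arrays, the reduction operator and the example graph are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// CPU reference of a frontier-based gather-and-reduce: for each active vertex,
// reduce (min) over the labels of its neighbours.  A GPU kernel would map the
// per-vertex gathers onto warps/threads; that mapping is what the paper optimizes.
void gather_reduce(const std::vector<int>& row_ptr,   // CSR offsets, size |V|+1
                   const std::vector<int>& col_idx,   // neighbour ids
                   const std::vector<int>& active,    // current frontier
                   std::vector<int>& label)           // per-vertex value
{
    for (int v : active) {
        int best = label[v];
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            best = std::min(best, label[col_idx[e]]);
        label[v] = best;
    }
}

int main() {
    // triangle 0-1-2 plus vertex 3 attached to vertex 2
    std::vector<int> row_ptr = {0, 2, 4, 7, 8};
    std::vector<int> col_idx = {1, 2, 0, 2, 0, 1, 3, 2};
    std::vector<int> label  = {0, 1, 2, 3};
    std::vector<int> active = {0, 1, 2, 3};
    gather_reduce(row_ptr, col_idx, active, label);
    for (int v = 0; v < 4; ++v) std::printf("label[%d] = %d\n", v, label[v]);
}
```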
DOI: 10.1145/3149704.3149762 (published 2017-11-12)
Citations: 8
Overcoming Load Imbalance for Irregular Sparse Matrices
Goran Flegar, H. Anzt
In this paper we propose a load-balanced GPU kernel for computing the sparse matrix-vector (SpMV) product. Making heavy use of the latest GPU programming features, we also achieve satisfactory performance for irregular and unbalanced matrices. In a performance comparison using 400 test matrices, we show that the new kernel is superior to the most popular SpMV implementations.
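The kernel itself is not shown in this listing; as a CPU sketch of the load-balancing principle involved (assigning equal numbers of nonzeros per worker rather than equal numbers of rows), the code below partitions the nonzero range evenly across OpenMP threads and accumulates partial row sums atomically. This illustrates the idea only and is not the paper's GPU implementation.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>
#include <omp.h>     // compile with -fopenmp

// Nonzero-balanced CSR SpMV: each worker gets an equal slice of the nonzeros, so
// a few very long rows cannot stall a single worker.  Rows split across slices are
// accumulated atomically.
void spmv_balanced(const std::vector<int>& row_ptr, const std::vector<int>& col,
                   const std::vector<double>& val, const std::vector<double>& x,
                   std::vector<double>& y)
{
    const int nnz = static_cast<int>(val.size());
    std::fill(y.begin(), y.end(), 0.0);
    #pragma omp parallel
    {
        const int nt = omp_get_num_threads(), t = omp_get_thread_num();
        const int lo = static_cast<int>((long long)nnz * t / nt);
        const int hi = static_cast<int>((long long)nnz * (t + 1) / nt);
        // first row whose nonzeros overlap this slice
        int r = int(std::upper_bound(row_ptr.begin(), row_ptr.end(), lo) - row_ptr.begin()) - 1;
        for (int k = lo; k < hi; ++k) {
            while (k >= row_ptr[r + 1]) ++r;          // advance to the owning row
            const double contrib = val[k] * x[col[k]];
            #pragma omp atomic
            y[r] += contrib;
        }
    }
}

int main() {
    // 3x3 example:  [10 0 0; 0 20 30; 40 0 50]
    std::vector<int> row_ptr = {0, 1, 3, 5}, col = {0, 1, 2, 0, 2};
    std::vector<double> val = {10, 20, 30, 40, 50}, x = {1, 1, 1}, y(3);
    spmv_balanced(row_ptr, col, val, x, y);
    for (double v : y) std::printf("%g\n", v);        // expect 10 50 90
}
```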
DOI: 10.1145/3149704.3149767 (published 2017-11-12)
Citations: 16
Spherical Region Queries on Multicore Architectures
Hao Lu, S. Seal, Wei Guo, J. Poplawsky
In this short paper, we report the performance of multiple thread-parallel algorithms for spherical region queries on multicore architectures, motivated by a challenging data analytics application in materials science. The performance of two tree-based algorithms and a naive algorithm is compared to identify the length scales at which these approaches perform optimally. The optimal algorithm is then used to scale the driving materials science application, which is shown to deliver a speedup of over 17X using 32 OpenMP threads on data sets containing many millions of atoms.
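For concreteness, here is a minimal thread-parallel version of the naive baseline, a brute-force scan for all points within radius r of a query centre parallelized with OpenMP; the tree-based variants evaluated in the paper are not reproduced, and the point type and example data are placeholders.

```cpp
#include <cstdio>
#include <vector>
#include <omp.h>     // compile with -fopenmp

struct Pt { double x, y, z; };

// Naive spherical region query: indices of all atoms within radius r of centre c.
// Each thread scans a slice of the point set and merges its hits at the end.
std::vector<int> query_sphere(const std::vector<Pt>& pts, Pt c, double r)
{
    const double r2 = r * r;
    std::vector<int> hits;
    #pragma omp parallel
    {
        std::vector<int> local;                       // avoid contention on `hits`
        #pragma omp for nowait
        for (int i = 0; i < (int)pts.size(); ++i) {
            const double dx = pts[i].x - c.x, dy = pts[i].y - c.y, dz = pts[i].z - c.z;
            if (dx * dx + dy * dy + dz * dz <= r2) local.push_back(i);
        }
        #pragma omp critical
        hits.insert(hits.end(), local.begin(), local.end());
    }
    return hits;
}

int main() {
    std::vector<Pt> pts = {{0, 0, 0}, {1, 0, 0}, {3, 0, 0}};
    auto hits = query_sphere(pts, {0, 0, 0}, 1.5);
    std::printf("%zu atoms in range\n", hits.size());  // expect 2
}
```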
DOI: 10.1145/3149704.3149772 (published 2017-11-12)
Citations: 2
A Case for Migrating Execution for Irregular Applications
P. Kogge, Shannon K. Kuntz
Modern supercomputers have millions of cores, each capable of executing one or more threads of program execution. In these computers the site of execution for program threads rarely, if ever, changes from the node in which they were born. This paper discusses the advantages that may accrue when thread states migrate freely from node to node, especially when migration is managed by hardware without requiring software intervention. Emphasis is on supporting the growing classes of algorithms where there is significant sparsity, irregularity, and lack of locality in the memory reference patterns. Evidence is drawn from reformulation of several kernels into a migrating thread context approximating that of an emerging architecture with such capabilities.
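To make the "move the thread, not the data" idea concrete, here is a deliberately simplified toy model with an invented node count, data layout and access pattern, and no relation to any real migrating-thread hardware: a tiny thread state hops between per-node work queues so that every array access is local to the node that owns the element.

```cpp
#include <cstdio>
#include <deque>
#include <vector>

// Minimal migratable thread state: where it will read next, and a partial sum.
struct Thread { int next_index; long sum; };

int main() {
    const int nodes = 4, per_node = 8, n = nodes * per_node;
    std::vector<int> data(n);
    for (int i = 0; i < n; ++i) data[i] = i;
    // irregular, pointer-chase style access pattern (a simple permutation)
    std::vector<int> next(n);
    for (int i = 0; i < n; ++i) next[i] = (i * 7 + 3) % n;

    std::vector<std::deque<Thread>> queue(nodes);     // one work queue per node
    queue[0].push_back({0, 0});                       // one thread starts on node 0
    int rounds = n;
    while (rounds-- > 0) {
        for (int nd = 0; nd < nodes; ++nd) {
            if (queue[nd].empty()) continue;
            Thread t = queue[nd].front(); queue[nd].pop_front();
            t.sum += data[t.next_index];              // access is local to this node
            const int dest = next[t.next_index] / per_node;   // owner of next element
            t.next_index = next[t.next_index];
            queue[dest].push_back(t);                 // migrate the thread, not the data
        }
    }
    for (int nd = 0; nd < nodes; ++nd)
        for (const Thread& t : queue[nd])
            std::printf("thread resting on node %d, partial sum %ld\n", nd, t.sum);
}
```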
DOI: 10.1145/3149704.3149770 (published 2017-11-12)
Citations: 10