ACM Transactions on Parallel Computing最新文献_第7页

PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs PowerLyra:歪斜图上的微分图计算和划分

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2019-01-23 DOI: 10.1145/3298989

Rong Chen, Jiaxin Shi, Yanzhe Chen, B. Zang, Haibing Guan, Haibo Chen

Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a “one-size-fits-all” design that uniformly processes all vertices, which either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communications and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve cache locality of inter-node graph accesses, PowerLyra further provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. A porting of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.

具有倾斜分布的自然图对分布式图的计算和划分提出了独特的挑战。现有的图并行系统通常使用“一刀切”的设计，统一处理所有顶点，这要么会导致明显的负载不平衡和高程度顶点(例如，Pregel和GraphLab)的高争用，要么即使对于低程度顶点(例如，PowerGraph和GraphX)也会产生高通信成本和内存消耗。在本文中，我们认为在自然图的歪斜分布中，也需要对高次顶点和低次顶点进行区分处理。然后我们介绍了PowerLyra，一个新的分布式图形处理系统，它包含了现有图形并行系统的两个世界的优点。具体来说，PowerLyra对低度顶点使用集中计算以避免频繁的通信，并对高度顶点分配计算以平衡工作负载。PowerLyra进一步提供了一种高效的混合图划分算法(即hybrid-cut)，它将边切(用于低度顶点)和顶点切(用于高度顶点)与启发式相结合。为了提高节点间图访问的缓存局域性，PowerLyra进一步提供了一个局域意识数据布局优化。PowerLyra是基于最新的GraphLab实现的，可以无缝地支持在同步和异步执行模式下运行的各种图形算法。使用各种图形分析和MLDM(机器学习和数据挖掘)应用程序对三个集群进行的详细评估表明，PowerLyra在实际和合成图形方面分别比PowerGraph高出5.53倍(从1.24倍)和3.26倍(从1.49倍)，并且比其他系统(如GraphX和Giraph)快得多，但内存消耗少得多。将hybrid-cut移植到GraphX进一步证实了PowerLyra的效率和通用性。

{"title":"PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs","authors":"Rong Chen, Jiaxin Shi, Yanzhe Chen, B. Zang, Haibing Guan, Haibo Chen","doi":"10.1145/3298989","DOIUrl":"https://doi.org/10.1145/3298989","url":null,"abstract":"Natural graphs with skewed distributions raise unique challenges to distributed graph computation and partitioning. Existing graph-parallel systems usually use a “one-size-fits-all” design that uniformly processes all vertices, which either suffer from notable load imbalance and high contention for high-degree vertices (e.g., Pregel and GraphLab) or incur high communication cost and memory consumption even for low-degree vertices (e.g., PowerGraph and GraphX). In this article, we argue that skewed distributions in natural graphs also necessitate differentiated processing on high-degree and low-degree vertices. We then introduce PowerLyra, a new distributed graph processing system that embraces the best of both worlds of existing graph-parallel systems. Specifically, PowerLyra uses centralized computation for low-degree vertices to avoid frequent communications and distributes the computation for high-degree vertices to balance workloads. PowerLyra further provides an efficient hybrid graph partitioning algorithm (i.e., hybrid-cut) that combines edge-cut (for low-degree vertices) and vertex-cut (for high-degree vertices) with heuristics. To improve cache locality of inter-node graph accesses, PowerLyra further provides a locality-conscious data layout optimization. PowerLyra is implemented based on the latest GraphLab and can seamlessly support various graph algorithms running in both synchronous and asynchronous execution modes. A detailed evaluation on three clusters using various graph-analytics and MLDM (Machine Learning and Data Mining) applications shows that PowerLyra outperforms PowerGraph by up to 5.53X (from 1.24X) and 3.26X (from 1.49X) for real-world and synthetic graphs, respectively, and is much faster than other systems like GraphX and Giraph, yet with much less memory consumption. A porting of hybrid-cut to GraphX further confirms the efficiency and generality of PowerLyra.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"5 1","pages":"13:1-13:39"},"PeriodicalIF":1.6,"publicationDate":"2019-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90205401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 323

Lock Contention Management in Multithreaded MPI 多线程MPI中的锁争用管理

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2019-01-23 DOI: 10.1145/3275443

A. Amer, Huiwei Lu, P. Balaji, Milind Chabbi, Yanjie Wei, J. Hammond, S. Matsuoka

In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes between lock acquisitions with respect to work being performed inside a critical section; productive vs. unproductive. Waiting for message reception without doing anything else inside a critical section is an example of unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.

在本文中，我们将研究基于锁的线程安全MPI库中的争用管理。具体来说，我们做了两个假设:(1)锁是保护通信路径时唯一的同步形式;(2)发生争用，因此序列化是不可避免的。我们的工作区分了在关键区域内执行的工作的锁获取;高效vs低效。在临界区中等待消息接收而不做任何其他事情是非生产性锁获取的一个例子。我们表明，现代可扩展锁定协议的高吞吐量特性转化为吞吐量密集型MPI通信的更好通信进度，但由于过度热心的非生产性锁获取，对延迟敏感通信产生负面影响。为了减少非生产性的锁获取，我们设计了一种方法，该方法使用通用的两级优先级锁定协议来促进具有生产性工作的线程。我们的结果表明，使用高吞吐量协议进行高效工作，使用公平协议进行低效率代码路径，确保了细粒度通信的最佳权衡，而公平协议则足以进行更粗粒度的通信。尽管这些努力得到了回报，但可伸缩性的退化仍然很严重。我们讨论了与纯锁定模型不同的技术，这些技术提供了进一步提高可伸缩性的潜力。

{"title":"Lock Contention Management in Multithreaded MPI","authors":"A. Amer, Huiwei Lu, P. Balaji, Milind Chabbi, Yanjie Wei, J. Hammond, S. Matsuoka","doi":"10.1145/3275443","DOIUrl":"https://doi.org/10.1145/3275443","url":null,"abstract":"In this article, we investigate contention management in lock-based thread-safe MPI libraries. Specifically, we make two assumptions: (1) locks are the only form of synchronization when protecting communication paths; and (2) contention occurs, and thus serialization is unavoidable. Our work distinguishes between lock acquisitions with respect to work being performed inside a critical section; productive vs. unproductive. Waiting for message reception without doing anything else inside a critical section is an example of unproductive lock acquisition. We show that the high-throughput nature of modern scalable locking protocols translates into better communication progress for throughput-intensive MPI communication but negatively impacts latency-sensitive communication because of overzealous unproductive lock acquisition. To reduce unproductive lock acquisitions, we devised a method that promotes threads with productive work using a generic two-level priority locking protocol. Our results show that using a high-throughput protocol for productive work and a fair protocol for less productive code paths ensures the best tradeoff for fine-grained communication, whereas a fair protocol is sufficient for more coarse-grained communication. Although these efforts have been rewarding, scalability degradation remains significant. We discuss techniques that diverge from the pure locking model and offer the potential to further improve scalability.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"35 1","pages":"12:1-12:21"},"PeriodicalIF":1.6,"publicationDate":"2019-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84422335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

An Autotuning Protocol to Rapidly Build Autotuners 快速构建自动调谐器的自动调谐协议

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2019-01-23 DOI: 10.1145/3291527

Junhong Liu, Guangming Tan, Yulong Luo, Jiajia Li, Z. Mo, Ninghui Sun

Automatic performance tuning (Autotuning) is an increasingly critical tuning technique for the high portable performance of Exascale applications. However, constructing an autotuner from scratch remains a challenge, even for domain experts. In this work, we propose a performance tuning and knowledge management suite (PAK) to help rapidly build autotuners. In order to accommodate existing autotuning techniques, we present an autotuning protocol that is composed of an extractor, producer, optimizer, evaluator, and learner. To achieve modularity and reusability, we also define programming interfaces for each protocol component as the fundamental infrastructure, which provides a customizable mechanism to deploy knowledge mining in the performance database. PAK’s usability is demonstrated by studying two important computational kernels: stencil computation and sparse matrix-vector multiplication (SpMV). Our proposed autotuner based on PAK shows comparable performance and higher productivity than traditional autotuners by writing just a few tens of code using our autotuning protocol.

自动性能调优(Autotuning)对于Exascale应用程序的高可移植性能来说是一项日益重要的调优技术。然而，从头开始构建自动调谐器仍然是一个挑战，即使对领域专家来说也是如此。在这项工作中，我们提出了一个性能调优和知识管理套件(PAK)来帮助快速构建自动调优器。为了适应现有的自动调优技术，我们提出了一个由提取器、生产者、优化器、评估器和学习者组成的自动调优协议。为了实现模块化和可重用性，我们还为每个协议组件定义了编程接口作为基础架构，为在性能数据库中部署知识挖掘提供了可定制的机制。通过研究两个重要的计算内核:模板计算和稀疏矩阵向量乘法(SpMV)，证明了PAK的可用性。我们提出的基于PAK的自动调谐器通过使用我们的自动调谐协议编写几十个代码，显示出与传统自动调谐器相当的性能和更高的生产力。

引用次数: 4

Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees 具有访问保证的移动设备动态并行工作负载调度

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2018-12-08 DOI: 10.1145/3291529

Antonio Fernández, D. Kowalski, Miguel A. Mosteiro, Prudence W. H. Wong

We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, Internet of Things systems, and others. Generically, we model the architecture as client mobile devices and static base stations. Each client “arrives” to the system to upload data to base stations by radio transmissions and then “leaves.” The problem, called Station Assignment, is to assign clients to stations so that every client uploads their data under some restrictions, including a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth for each station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited to maximum rate and burstiness of such arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment is studied under adversarial arrivals and departures.

研究了各种并行计算场景下的动态资源分配问题，如移动云计算、云计算系统、物联网系统等。一般来说，我们将架构建模为客户端移动设备和静态基站。每个客户端“到达”系统，通过无线电传输将数据上传到基站，然后“离开”。这个问题被称为“站点分配”，它是将客户端分配到站点，以便每个客户端在一定的限制下上传数据，包括站点的目标子集、传输之间的最大延迟、要上传的数据量以及每个站点的最大带宽。我们研究了在对手控制客户到达和离开的情况下，站点分配的可解性，限制在这些到达的最大速率和突发次数。我们显示了各种客户端到达时间表和协议类的速率和突发性的上限和下限。据我们所知，这是第一次在对抗到达和离开的情况下研究车站分配。

{"title":"Scheduling Dynamic Parallel Workload of Mobile Devices with Access Guarantees","authors":"Antonio Fernández, D. Kowalski, Miguel A. Mosteiro, Prudence W. H. Wong","doi":"10.1145/3291529","DOIUrl":"https://doi.org/10.1145/3291529","url":null,"abstract":"We study a dynamic resource-allocation problem that arises in various parallel computing scenarios, such as mobile cloud computing, cloud computing systems, Internet of Things systems, and others. Generically, we model the architecture as client mobile devices and static base stations. Each client “arrives” to the system to upload data to base stations by radio transmissions and then “leaves.” The problem, called Station Assignment, is to assign clients to stations so that every client uploads their data under some restrictions, including a target subset of stations, a maximum delay between transmissions, a volume of data to upload, and a maximum bandwidth for each station. We study the solvability of Station Assignment under an adversary that controls the arrival and departure of clients, limited to maximum rate and burstiness of such arrivals. We show upper and lower bounds on the rate and burstiness for various client arrival schedules and protocol classes. To the best of our knowledge, this is the first time that Station Assignment is studied under adversarial arrivals and departures.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"113 4","pages":"10:1-10:19"},"PeriodicalIF":1.6,"publicationDate":"2018-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1145/3291529","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72538506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code 应用于大规模生产天气预报代码的新型高性能GPGPU代码转换框架

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2018-02-16 DOI: 10.1145/3291523

Michel Müller, T. Aoki

We introduce “Hybrid Fortran,” a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation—an achievement comparable to more invasive GPGPU rewrites of other weather models.

我们介绍了“Hybrid Fortran”，这是一种允许高性能GPGPU端口用于结构化网格Fortran代码的新方法。这种技术只需要对以CPU为目标的代码库进行最小的更改，这在生产力方面是一个显著的进步。该方法已成功应用于日本150k多行代码的中尺度天气预报模式ASUCA的动力核心和物理过程。通过一个类似于ASUCA代码结构的最小天气应用程序，Hybrid Fortran既可以与性能模型进行比较，也可以与当今常用的方法OpenACC进行比较。结果表明，混合Fortran实现提供了与OpenACC相同或更好的性能，并且其性能在CPU和GPU上都与模型一致。在全面生产运行中，使用1581 × 1301 × 58单元的ASUCA网格和2km分辨率的真实天气数据，24个运行基于Hybrid fortran的GPU端口的NVIDIA Tesla P100被证明可以取代50多个运行参考实现的18核Intel Xeon Broadwell E5-2695 v4，这一成就可与更具侵入性的GPGPU重写其他天气模型相媲美。

{"title":"New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code","authors":"Michel Müller, T. Aoki","doi":"10.1145/3291523","DOIUrl":"https://doi.org/10.1145/3291523","url":null,"abstract":"We introduce “Hybrid Fortran,” a new approach that allows a high-performance GPGPU port for structured grid Fortran codes. This technique only requires minimal changes for a CPU targeted codebase, which is a significant advancement in terms of productivity. It has been successfully applied to both dynamical core and physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared to both a performance model as well as today’s commonly used method, OpenACC. As a result, the Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model both on CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data in 2km resolution, 24 NVIDIA Tesla P100 running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 running the reference implementation—an achievement comparable to more invasive GPGPU rewrites of other weather models.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"31 1","pages":"7:1-7:42"},"PeriodicalIF":1.6,"publicationDate":"2018-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79334909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 7

Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs 多gpu预条件共轭梯度自适应优化建模

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2016-12-26 DOI: 10.1145/2990849

Jiaquan Gao, Yu Wang, Jun Wang, Ronghua Liang

The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computations. GPU-accelerated PCG algorithms for large-sized problems have attracted considerable attention recently. However, on a specific multi-GPU platform, producing a highly parallel PCG implementation for any large-sized problem requires significant time because several manual steps are involved in adjusting the related parameters and selecting an appropriate storage format for the matrix block that is assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which mainly involves the following parts: (1) an optimization multi-GPU parallel framework of PCG and (2) the profile-based optimization modeling for each one of the main components of the PCG algorithm, including vector operation, inner product, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel but automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a specific multi-GPU platform by integrating existing storage formats and kernels. We take a vector operation kernel, an inner-product kernel, and five popular SpMV kernels for an example to present the idea of constructing the model. Given that our model is general, independent of the problems, and dependent on the resources of devices, this model is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.

预条件共轭梯度(PCG)算法是科学计算中求解稀疏线性系统的一种众所周知的迭代方法。gpu加速的大规模问题PCG算法近年来引起了人们的广泛关注。然而，在特定的多GPU平台上，为任何大型问题生成高度并行的PCG实现需要大量时间，因为涉及到调整相关参数和为分配给每个GPU的矩阵块选择适当的存储格式的几个手动步骤。这促使我们提出PCG在多gpu上的自适应优化建模，主要包括以下几个部分:(1)PCG的优化多gpu并行框架;(2)PCG算法各主要组成部分的基于轮廓的优化建模，包括向量运算、内积和稀疏矩阵向量乘法(SpMV)。我们的模型不构建新的存储格式或内核，而是通过集成现有的存储格式和内核，自动快速地为特定多gpu平台上的任何问题生成最优并行PCG算法。我们以一个向量运算核、一个内积核和五个流行的SpMV核为例来介绍构造模型的思想。考虑到我们的模型是通用的，独立于问题，并且依赖于设备的资源，这个模型只针对每种类型的GPU构建一次。实验验证了该模型的有效性。

{"title":"Adaptive Optimization Modeling of Preconditioned Conjugate Gradient on Multi-GPUs","authors":"Jiaquan Gao, Yu Wang, Jun Wang, Ronghua Liang","doi":"10.1145/2990849","DOIUrl":"https://doi.org/10.1145/2990849","url":null,"abstract":"The preconditioned conjugate gradient (PCG) algorithm is a well-known iterative method for solving sparse linear systems in scientific computations. GPU-accelerated PCG algorithms for large-sized problems have attracted considerable attention recently. However, on a specific multi-GPU platform, producing a highly parallel PCG implementation for any large-sized problem requires significant time because several manual steps are involved in adjusting the related parameters and selecting an appropriate storage format for the matrix block that is assigned to each GPU. This motivates us to propose adaptive optimization modeling of PCG on multi-GPUs, which mainly involves the following parts: (1) an optimization multi-GPU parallel framework of PCG and (2) the profile-based optimization modeling for each one of the main components of the PCG algorithm, including vector operation, inner product, and sparse matrix-vector multiplication (SpMV). Our model does not construct a new storage format or kernel but automatically and rapidly generates an optimal parallel PCG algorithm for any problem on a specific multi-GPU platform by integrating existing storage formats and kernels. We take a vector operation kernel, an inner-product kernel, and five popular SpMV kernels for an example to present the idea of constructing the model. Given that our model is general, independent of the problems, and dependent on the resources of devices, this model is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"26 1","pages":"16:1-16:33"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82616127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 5

Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations Damaris:解决千兆级模拟后数据管理中的性能变化

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2016-12-26 DOI: 10.1145/2987371

Matthieu Dorier, Gabriel Antoniu, F. Cappello, M. Snir, R. Sisneros, Orcun Yildiz, Shadi Ibrahim, T. Peterka, Leigh Orf

With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts the overall application performance at scale and its predictability over time. In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamic simulation on four platforms, including NICS’s Kraken and NCSA’s Blue Waters. Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, thus making simulation performance predictable; (2) it increases the sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches that fail to scale; and (4) it enables a seamless connection to the VisIt visualization software to perform in situ analysis and visualization in a way that impacts neither the performance of the simulation nor its variability. In addition, we extended our implementation of Damaris to also support the use of dedicated nodes and conducted a thorough comparison of the two approaches—dedicated cores and dedicated nodes—for I/O tasks with the aforementioned applications.

随着百亿亿级计算的出现，减少数据管理任务(存储、可视化、分析等)的性能变化正成为维持高性能的关键挑战。这种可变性在一定规模上显著影响应用程序的整体性能及其随时间的可预测性。在本文中，我们介绍Damaris，这是一个利用多核节点中的专用核心来卸载数据管理任务的系统，包括I/O、数据压缩、数据移动调度、原位分析和可视化。利用CM1大气模拟和Nek5000计算流体动力学模拟，在NICS的Kraken和NCSA的Blue Waters四个平台上对Damaris进行了评估。我们的研究结果表明:(1)Damaris完全隐藏了I/O可变性以及所有与I/O相关的成本，从而使仿真性能可预测;(2)与标准I/O方法相比，它将持续写入吞吐量提高了15倍;(3)它允许模拟的几乎完美的可扩展性高达9000多个核心，而不是最先进的方法，无法扩展;(4)它能够与VisIt可视化软件无缝连接，以一种既不影响模拟性能也不影响其可变性的方式执行原位分析和可视化。此外，我们扩展了Damaris的实现，以支持使用专用节点，并对使用上述应用程序处理I/O任务的两种方法(专用核心和专用节点)进行了彻底的比较。

{"title":"Damaris: Addressing Performance Variability in Data Management for Post-Petascale Simulations","authors":"Matthieu Dorier, Gabriel Antoniu, F. Cappello, M. Snir, R. Sisneros, Orcun Yildiz, Shadi Ibrahim, T. Peterka, Leigh Orf","doi":"10.1145/2987371","DOIUrl":"https://doi.org/10.1145/2987371","url":null,"abstract":"With exascale computing on the horizon, reducing performance variability in data management tasks (storage, visualization, analysis, etc.) is becoming a key challenge in sustaining high performance. This variability significantly impacts the overall application performance at scale and its predictability over time.\u0000 In this article, we present Damaris, a system that leverages dedicated cores in multicore nodes to offload data management tasks, including I/O, data compression, scheduling of data movements, in situ analysis, and visualization. We evaluate Damaris with the CM1 atmospheric simulation and the Nek5000 computational fluid dynamic simulation on four platforms, including NICS’s Kraken and NCSA’s Blue Waters. Our results show that (1) Damaris fully hides the I/O variability as well as all I/O-related costs, thus making simulation performance predictable; (2) it increases the sustained write throughput by a factor of up to 15 compared with standard I/O approaches; (3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches that fail to scale; and (4) it enables a seamless connection to the VisIt visualization software to perform in situ analysis and visualization in a way that impacts neither the performance of the simulation nor its variability.\u0000 In addition, we extended our implementation of Damaris to also support the use of dedicated nodes and conducted a thorough comparison of the two approaches—dedicated cores and dedicated nodes—for I/O tasks with the aforementioned applications.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"13 1","pages":"15:1-15:43"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91252020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 44

Transparently Space Sharing a Multicore Among Multiple Processes 在多个进程之间透明地共享多核空间

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2016-12-26 DOI: 10.1145/3001910

T. Creech, R. Barua

As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation. In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions. We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average. Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.

随着硬件变得越来越并行，可伸缩并行软件的可用性得到提高，管理多个多线程应用程序(进程)的问题变得非常重要。可塑进程可以在运行时改变所使用的线程数量，从而实现复杂而灵活的资源管理。尽管为具有并行运行时的smp并行化的许多现有应用程序实际上已经具有延展性，但部署的运行时环境既没有提供接口，也没有提供任何策略来智能地分配硬件线程，甚至防止过度订阅。先前的研究方法要么依赖于提前分析应用程序以做出关于分配的正确决策，要么根本不考虑流程效率，从而导致较差的性能。这些先前的方法都没有在实践中得到广泛的应用。本文介绍了带反馈的调度和分配(SCAF)系统:一个插入式运行时解决方案，它支持现有的可扩展应用程序根据观察到的效率做出智能分配决策，而无需对语义、程序修改、脱机分析甚至重新编译进行任何更改。我们现有的实现可以控制大多数未经修改的OpenMP应用程序。其他具有延展性的线程库也可以通过少量修改来支持，而不需要修改应用程序或重新编译。在本文中，我们介绍了SCAF守护进程和GNU OpenMP运行时的一个支持SCAF的端口。我们提出了一种使用可用硬件计数器在运行时评估进程效率的新技术，并证明了它在帮助分配决策方面的有效性。我们在五个商品并行平台上使用NAS NPB并行基准来评估SCAF，列举了架构特征及其对我们方案的影响。当同时运行所有基准对时，我们根据加速改进的总和(多程序环境的常用指标)来衡量SCAF的好处，并将其与均衡(文献中现有的最佳竞争方案)进行比较。我们发现SCAF在5台机器中的4台机器上的均分方面得到了改进，根据机器的不同，基准对的加速总和的平均改进系数为1.04到1.11倍，平均为1.09倍。由于我们不知道有任何广泛可用的均分工具，我们还比较了SCAF与使用未修改的OpenMP的多路编程，OpenMP是目前最终用户可用的唯一环境。SCAF在未修改的OpenMP运行时上对所有五台机器进行了改进，根据机器的不同，平均改进了1.08到2.07倍，平均改进了1.59倍。

{"title":"Transparently Space Sharing a Multicore Among Multiple Processes","authors":"T. Creech, R. Barua","doi":"10.1145/3001910","DOIUrl":"https://doi.org/10.1145/3001910","url":null,"abstract":"As hardware becomes increasingly parallel and the availability of scalable parallel software improves, the problem of managing multiple multithreaded applications (processes) becomes important. Malleable processes, which can vary the number of threads used as they run, enable sophisticated and flexible resource management. Although many existing applications parallelized for SMPs with parallel runtimes are in fact already malleable, deployed runtime environments provide no interface nor any strategy for intelligently allocating hardware threads or even preventing oversubscription. Prior research methods either depend on profiling applications ahead of time to make good decisions about allocations or do not account for process efficiency at all, leading to poor performance. None of these prior methods have been adapted widely in practice. This article presents the Scheduling and Allocation with Feedback (SCAF) system: a drop-in runtime solution that supports existing malleable applications in making intelligent allocation decisions based on observed efficiency without any changes to semantics, program modification, offline profiling, or even recompilation. Our existing implementation can control most unmodified OpenMP applications. Other malleable threading libraries can also easily be supported with small modifications without requiring application modification or recompilation.\u0000 In this work, we present the SCAF daemon and a SCAF-aware port of the GNU OpenMP runtime. We present a new technique for estimating process efficiency purely at runtime using available hardware counters and demonstrate its effectiveness in aiding allocation decisions.\u0000 We evaluated SCAF using NAS NPB parallel benchmarks on five commodity parallel platforms, enumerating architectural features and their effects on our scheme. We measured the benefit of SCAF in terms of sum of speedups improvement (a common metric for multiprogrammed environments) when running all benchmark pairs concurrently compared to equipartitioning—the best existing competing scheme in the literature. We found that SCAF improves on equipartitioning on four out of five machines, showing a mean improvement factor in sum of speedups of 1.04 to 1.11x for benchmark pairs, depending on the machine, and 1.09x on average.\u0000 Since we are not aware of any widely available tool for equipartitioning, we also compare SCAF against multiprogramming using unmodified OpenMP, which is the only environment available to end users today. SCAF improves on the unmodified OpenMP runtimes for all five machines, with a mean improvement of 1.08 to 2.07x, depending on the machine, and 1.59x on average.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"28 1","pages":"17:1-17:35"},"PeriodicalIF":1.6,"publicationDate":"2016-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88132169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Selecting Multiple Order Statistics with a Graphics Processing Unit 选择多个订单统计与图形处理单元

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2016-08-08 DOI: 10.1145/2948974

Jeffrey D. Blanchard, Erik Opavsky, Emircan Uysaler

Extracting a set of multiple order statistics from a huge data set provides important information about the distribution of the values in the full set of data. This article introduces an algorithm, bucketMultiSelect, for simultaneously selecting multiple order statistics with a graphics processing unit (GPU). Typically, when a large set of order statistics is desired, the vector is sorted. When the sorted version of the vector is not needed, bucketMultiSelect significantly reduces computation time by eliminating a large portion of the unnecessary operations involved in sorting. For large vectors, bucketMultiSelect returns thousands of order statistics in less time than sorting the vector while typically using less memory. For vectors containing 228 values of type double, bucketMultiSelect selects the 101 percentile order statistics in less than 95ms and is more than 8× faster than sorting the vector with a GPU optimized merge sort.

从一个庞大的数据集中提取一组多阶统计信息，可以提供关于整个数据集中值分布的重要信息。本文介绍了一种算法bucketMultiSelect，用于使用图形处理单元(GPU)同时选择多个顺序统计信息。通常，当需要大量顺序统计信息时，对向量进行排序。当不需要向量的排序版本时，bucketMultiSelect通过消除排序中涉及的大部分不必要的操作来显着减少计算时间。对于大的向量，bucketMultiSelect在比排序向量更短的时间内返回数千个顺序统计信息，同时通常使用更少的内存。对于包含228个double类型值的向量，bucketMultiSelect在不到95ms的时间内选择101个百分位数的顺序统计信息，并且比使用GPU优化的合并排序对向量进行排序快8倍以上。

引用次数: 1

Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory 为共享和分布式内存上的动态调度运行时编译仿射循环巢

IF 1.6 Q3 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Parallel Computing

Pub Date : 2016-08-08 DOI: 10.1145/2948975

Roshan Dathathri, Ravi Teja Mullapudi, Uday Bondhugula

Current de-facto parallel programming models like OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task parallel approaches that use point-to-point synchronization between dependent tasks in conjunction with dynamic scheduling dataflow runtimes are thus becoming attractive. Although good performance can be extracted for both shared and distributed memory using these approaches, there is little compiler support for them. In this article, we describe the design of compiler--runtime interaction to automatically extract coarse-grained dataflow parallelism in affine loop nests for both shared and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and generate components of the runtime that are used to dynamically schedule the generated tasks. The runtime includes a distributed decentralized scheduler that dynamically schedules tasks on a node. The schedulers on different nodes cooperate with each other through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, while running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6× (70.3× to 474.7× ) over the sequential version and a geometric mean speedup of 1.64× (1.04× to 2.42× ) over the state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and an emerging runtime (Intel Concurrent Collections) that demands higher programmer input and effort in parallelizing. To the best of our knowledge, ours is also the first automatic scheme that allows for dynamic scheduling of affine loop nests on a cluster of multicores.

当前事实上的并行编程模型(如OpenMP和MPI)很难提取任务级数据流并行性，而不是批量同步并行性。因此，在依赖任务之间使用点对点同步并结合动态调度数据流运行时的任务并行方法变得越来越有吸引力。尽管使用这些方法可以为共享和分布式内存提取良好的性能，但编译器对它们的支持很少。在本文中，我们描述了编译器-运行时交互的设计，以自动提取共享和分布式内存体系结构中仿射循环巢中的粗粒度数据流并行性。我们使用来自多面体编译器框架的技术来提取任务并生成用于动态调度生成的任务的运行时组件。运行时包括一个分布式的去中心化调度程序，它可以动态地调度节点上的任务。不同节点上的调度器通过异步点对点通信相互协作，所有这些都是由编译器自动生成的代码实现的。在一组具有代表性的六个仿射循环巢基准测试中，当运行在32个节点上，每个节点有8个线程时，我们的编译器辅助运行时比顺序版本产生了143.6倍(70.3倍到474.7倍)的几何平均加速，比使用批量同步的最先进的自动并行化方法产生了1.64倍(1.04倍到2.42倍)的几何平均加速。我们还将我们的系统与过去的工作进行了比较，这些工作解决了共享内存方面的一些挑战，以及新兴的运行时(Intel Concurrent Collections)，它要求程序员在并行化方面投入更多的精力和精力。据我们所知，我们的方案也是第一个允许在多核集群上动态调度仿射环路巢的自动方案。

{"title":"Compiling Affine Loop Nests for a Dynamic Scheduling Runtime on Shared and Distributed Memory","authors":"Roshan Dathathri, Ravi Teja Mullapudi, Uday Bondhugula","doi":"10.1145/2948975","DOIUrl":"https://doi.org/10.1145/2948975","url":null,"abstract":"Current de-facto parallel programming models like OpenMP and MPI make it difficult to extract task-level dataflow parallelism as opposed to bulk-synchronous parallelism. Task parallel approaches that use point-to-point synchronization between dependent tasks in conjunction with dynamic scheduling dataflow runtimes are thus becoming attractive. Although good performance can be extracted for both shared and distributed memory using these approaches, there is little compiler support for them.\u0000 In this article, we describe the design of compiler--runtime interaction to automatically extract coarse-grained dataflow parallelism in affine loop nests for both shared and distributed-memory architectures. We use techniques from the polyhedral compiler framework to extract tasks and generate components of the runtime that are used to dynamically schedule the generated tasks. The runtime includes a distributed decentralized scheduler that dynamically schedules tasks on a node. The schedulers on different nodes cooperate with each other through asynchronous point-to-point communication, and all of this is achieved by code automatically generated by the compiler. On a set of six representative affine loop nest benchmarks, while running on 32 nodes with 8 threads each, our compiler-assisted runtime yields a geometric mean speedup of 143.6× (70.3× to 474.7× ) over the sequential version and a geometric mean speedup of 1.64× (1.04× to 2.42× ) over the state-of-the-art automatic parallelization approach that uses bulk synchronization. We also compare our system with past work that addresses some of these challenges on shared memory, and an emerging runtime (Intel Concurrent Collections) that demands higher programmer input and effort in parallelizing. To the best of our knowledge, ours is also the first automatic scheme that allows for dynamic scheduling of affine loop nests on a cluster of multicores.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"17 1","pages":"12:1-12:28"},"PeriodicalIF":1.6,"publicationDate":"2016-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89724962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9