Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00109
Timothy A. K. Zakian, L. Capelli, Zhenjiang Hu
As the graphs in our world grow ever larger, the need for programmable, easy-to-use, and highly scalable graph processing has grown with them. One popular model, the vertex-centric computational model, addresses this need by distributing computation across the vertices of the input graph. Because the program is distributed to the vertices, the programmer "thinks like a vertex" when writing a graph computation: there is little or no shared memory, and almost all communication between on-vertex computations must be sent over the network. Given this inherent communication overhead, reducing the number of messages sent during a computation is central to any effort to optimize vertex-centric programs. While previous work has focused on reducing communication overhead by directly changing communication patterns, either by altering how the graph is partitioned and distributed or by altering the graph topology itself, this paper presents a different optimization strategy: a family of complementary compile-time program transformations that minimize communication overhead by changing both the messaging and computational structure of programs. In particular, we present and formalize a method by which a compiler can automatically incrementalize a vertex-centric program through a series of compile-time transformations, modifying the on-vertex computation and inter-vertex messaging so that messages represent patches to be applied to the receiving vertex's local state. We empirically evaluate these transformations on a set of common vertex-centric algorithms and graphs, achieving an average reduction of 2.7x in total computation time and 2.9x in the number of messages sent across all programs in the benchmark suite.
Furthermore, since these are compile-time program transformations alone, other prior optimization strategies for vertex-centric programs can work with the resulting vertex-centric program just as they would a non-incrementalized program.
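The patch-based messaging idea described above can be illustrated with a minimal Python sketch. All names here are hypothetical, and this is only a sketch of the general idea under simplifying assumptions (numeric vertex state, a single sender/receiver pair), not the paper's actual transformation output:

```python
# Sketch of patch-based (incremental) vertex messaging: instead of resending
# a vertex's full state each superstep, send only the change (a "patch")
# and let the receiver apply it to its cached copy of the sender's state.

class Vertex:
    def __init__(self, vid, value):
        self.vid = vid
        self.value = value
        self.last_sent = None          # state as of the last message we sent

    def outgoing_patch(self):
        """Return a delta against what neighbors already know, or None."""
        if self.last_sent is None:          # first superstep: send full value
            patch = ("full", self.value)
        elif self.value != self.last_sent:  # later: send only the difference
            patch = ("delta", self.value - self.last_sent)
        else:
            return None                     # unchanged: no message at all
        self.last_sent = self.value
        return patch

def apply_patch(cached, patch):
    kind, payload = patch
    return payload if kind == "full" else cached + payload

# Two supersteps on one sender/receiver pair.
v = Vertex(0, 10.0)
cache = apply_patch(None, v.outgoing_patch())   # superstep 1: full state
v.value = 10.5
cache = apply_patch(cache, v.outgoing_patch())  # superstep 2: delta only
```

Note how an unchanged vertex produces no message at all, which is one way such a transformation could reduce message counts.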
Title: "Incrementalization of Vertex-Centric Programs"
Venue: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00087
S. M. Ghazimirsaeed, S. Mirsadeghi, A. Afsahi
Neighborhood collectives were introduced in the MPI 3.0 standard to let users define their own communication patterns through MPI's process topology interface. In this paper, we propose a collaborative communication mechanism based on the common neighborhoods that may exist among groups of k processes. These common neighborhoods are used to decrease the number of communication stages through message combining. We show how the design of our desired communication pattern can be modeled as a maximum weighted matching problem on distributed hypergraphs, and propose a distributed algorithm to solve it. Moreover, we consider two design alternatives: topology-agnostic and topology-aware. The former ignores the physical topology of the system and the mapping of processes, whereas the latter takes both into account to further optimize the communication pattern. Our experimental results show improvements of up to 8x for various process topologies and up to 5.2x for an SpMM kernel.
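The message-combining idea above can be sketched in plain Python. This is a toy simulation under invented assumptions (two processes with an identical neighborhood, a designated leader), not MPI code and not the paper's matching-based algorithm:

```python
# Sketch of message combining over a common neighborhood: if processes p and
# q share the same neighbors, one of them (a "leader") can gather the group's
# payloads and send a single combined message to each shared neighbor,
# instead of each sender messaging every neighbor separately.

def naive_exchange(neighborhoods, payloads):
    """Every process sends its payload to each of its neighbors separately."""
    messages = []
    for src, nbrs in neighborhoods.items():
        for dst in nbrs:
            messages.append((src, dst, [payloads[src]]))
    return messages

def combined_exchange(neighborhoods, payloads, groups):
    """Each group of senders with an identical neighborhood elects a leader
    that gathers the group's payloads and sends one combined message."""
    messages, grouped = [], set()
    for group in groups:
        leader = group[0]
        combined = [payloads[p] for p in group]
        for dst in neighborhoods[leader]:
            messages.append((leader, dst, combined))
        # intra-group gather: one message from each non-leader to the leader
        for p in group[1:]:
            messages.append((p, leader, [payloads[p]]))
        grouped.update(group)
    for src, nbrs in neighborhoods.items():
        if src not in grouped:
            for dst in nbrs:
                messages.append((src, dst, [payloads[src]]))
    return messages

# Processes 0 and 1 share the same three neighbors {2, 3, 4}.
nbh = {0: [2, 3, 4], 1: [2, 3, 4]}
pay = {0: "a", 1: "b"}
n_naive = len(naive_exchange(nbh, pay))               # 6 separate messages
n_comb = len(combined_exchange(nbh, pay, [[0, 1]]))   # 3 combined + 1 gather
```

Even in this two-process toy, combining cuts six messages down to four; the savings grow with the neighborhood size and the group size k.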
Title: "An Efficient Collaborative Communication Mechanism for MPI Neighborhood Collectives"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00074
Wenyi Zhao, Quan Chen, Hao Lin, Jianfeng Zhang, Jingwen Leng, Chao Li, Wenli Zheng, Li Li, M. Guo
Predicting the performance degradation a GPU application suffers when co-located with other applications on a spatial multitasking GPU, without prior application knowledge, is essential in public clouds. Prior work mainly targets CPU co-location and is inaccurate and/or inefficient at predicting the performance of co-located applications on spatial multitasking GPUs. Our investigation shows that hardware event statistics caused by co-located applications, which can be collected with negligible overhead, strongly correlate with their slowdowns. Based on this observation, we present Themis, an online slowdown predictor that can precisely and efficiently predict application slowdown without prior application knowledge. We first train a precise slowdown model offline using hardware event statistics collected from representative co-locations. When new applications co-run, Themis collects event statistics and predicts their slowdowns simultaneously. Our evaluation shows that Themis has negligible runtime overhead and predicts application-level slowdown with an error smaller than 9.5%. Based on Themis, we also implement an SM allocation engine to rein in application slowdown at co-location. Case studies show that the engine successfully enforces fair sharing and QoS.
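The offline-training / online-prediction split can be sketched with a deliberately tiny model. The single hardware counter, the linear form, and all numbers below are invented for illustration; Themis's actual model is certainly richer than one-feature least squares:

```python
# Sketch of offline model fitting + online slowdown prediction:
# fit slowdown = a * event_rate + b from representative co-locations,
# then apply the model to counters collected from a new co-location.

def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Offline: representative co-locations (event rate -> measured slowdown).
event_rate = [0.1, 0.2, 0.4, 0.8]
slowdown   = [1.05, 1.10, 1.20, 1.40]
a, b = fit_linear(event_rate, slowdown)

# Online: a new application's counters arrive; predict its slowdown.
predicted = a * 0.5 + b
```

The point of the sketch is the workflow, not the model: training happens once offline, and the online path is just a cheap model evaluation, which is why such a predictor can have negligible runtime overhead.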
Title: "Themis: Predicting and Reining in Application-Level Slowdown on Spatial Multitasking GPUs"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00082
Philip Dexter, K. Chiu, Bedri Sendir
Consistency models for distributed data stores offer insights into, and paths to reasoning about, what a user of such a system can expect. However, consistency models are often defined or implemented at a coarse granularity, making it difficult to achieve precisely the consistency required. Further, many applications are already written to handle anomalies in distributed systems, yet they have little opportunity to express or take advantage of that leniency. We propose reflective consistency: an active solution that adapts an underlying data store to changing loads and resource availability to meet a given consistency level. We implement reflective consistency in Cassandra, an existing distributed data store supporting per-read and per-write consistency. Our implementation allows users to express their anomaly leniency directly, and the system reacts to the presence of anomalies, changing Cassandra's consistency level only when needed. Users of Reflective Cassandra can expect minimal overhead (from 1% to 14% depending on configuration) and a 50% decrease in the number of costly strong reads.
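The reactive behavior described above, escalating consistency only while observed anomalies exceed a user-declared leniency budget, can be sketched as a small state machine. The class, thresholds, and level names are illustrative (they echo Cassandra's ONE/QUORUM read consistency levels but are not its API), and this is not the paper's implementation:

```python
# Sketch of "reflective" consistency: count anomalies against a user-declared
# budget per window; escalate reads to a strong level only while the budget
# is exceeded, and relax again when a new window starts cleanly.

class ReflectiveReads:
    def __init__(self, anomaly_budget):
        self.budget = anomaly_budget   # anomalies tolerated per window
        self.anomalies = 0
        self.level = "ONE"             # cheap weak reads by default

    def observe_anomaly(self):
        self.anomalies += 1
        if self.anomalies > self.budget:
            self.level = "QUORUM"      # escalate to costly strong reads

    def end_window(self):
        self.anomalies = 0
        self.level = "ONE"             # relax once anomalies subside

r = ReflectiveReads(anomaly_budget=2)
for _ in range(3):
    r.observe_anomaly()
escalated = r.level        # strong reads after the budget is exceeded
r.end_window()
relaxed = r.level          # back to cheap reads
```

Because strong reads are only used while anomalies are actually being observed, most reads stay cheap, which is the source of the reported reduction in costly strong reads.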
Title: "An Error-Reflective Consistency Model for Distributed Data Stores"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00102
Matthias Hauck, M. Paradies, H. Fröning
Atomic operations are an important concept for indivisible updates in parallel computing. On most architectures they also provide ordering guarantees, which in practice can hurt performance. For associative and commutative updates, we present software buffering techniques that overcome this ordering problem by combining multiple updates in a temporary buffer and by prefetching addresses before updating them. As a result, our buffering techniques reduce contention and avoid unnecessary ordering constraints, increasing the amount of memory parallelism. We evaluate our techniques in different scenarios, including histogram and graph computations, and reason about their applicability to standard and multi-socket systems.
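The combining idea exploits exactly the associativity and commutativity the abstract names: deltas to the same address can be summed locally in any order before a single read-modify-write is issued. A minimal sketch (simulated "memory" and an invented flush threshold; the paper's techniques also involve prefetching, which is omitted here):

```python
# Sketch of software buffering for associative/commutative updates: instead
# of one atomic read-modify-write per update, accumulate updates to the same
# address in a small buffer and apply each address's combined total once.

def buffered_updates(updates, flush_threshold=4):
    """Combine (address, delta) updates in a buffer; count combined writes."""
    buffer, atomic_writes, memory = {}, 0, {}

    def flush():
        nonlocal atomic_writes
        for addr, total in buffer.items():
            memory[addr] = memory.get(addr, 0) + total  # one combined RMW
            atomic_writes += 1
        buffer.clear()

    for addr, delta in updates:
        buffer[addr] = buffer.get(addr, 0) + delta      # combine locally
        if len(buffer) >= flush_threshold:              # bound buffer size
            flush()
    flush()
    return memory, atomic_writes

# A histogram-style workload: many increments to few bins.
ups = [(0, 1), (1, 1), (0, 1), (0, 1), (1, 1), (0, 1)]
mem, writes = buffered_updates(ups)
```

Here six increments collapse into two combined writes; a naive implementation would issue six atomics, each with its ordering cost.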
Title: "Software-Based Buffering of Associative Operations on Random Memory Addresses"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00076
Kyung Hoon Kim, Priyank Devpura, Abhishek Nayyar, Andrew Doolittle, K. H. Yum, Eun Jung Kim
Graphics Processing Units (GPUs) have been widely adopted for diverse general-purpose applications due to their massive degree of parallelism. The demand for large-scale GPUs that process large volumes of data with high throughput has been rising rapidly. However, designing a bandwidth-efficient network for large-scale GPUs is challenging. Compression techniques are a practical remedy that effectively increases network bandwidth by reducing the size of the data transferred. We propose a simple new compression mechanism, Dual Pattern Compression (DPC), that compresses only two patterns with very low latency. The simplicity of compression and decompression is achieved through data remapping and data-type-aware preprocessing that exploits bit-level data redundancy; the data type is detected at runtime. We demonstrate that our compression scheme effectively mitigates network congestion in a large-scale GPU. It improves IPC by 33% on average (up to 126%) across various benchmarks, with average space savings of 61% in integer, 46% (up to 72%) in floating-point, and 23% (up to 57%) in character-type benchmarks.
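The "compress only two patterns" idea can be illustrated with a toy codec that encodes just all-zero words and repeats of the previous word, storing everything else as a literal. The encoding format below is invented for illustration and is not DPC's wire format or its data-remapping step:

```python
# Toy two-pattern compressor: pattern 1 = all-zero word, pattern 2 = word
# equal to the previous word; anything else is stored uncompressed. Keeping
# the pattern set this small is what keeps (de)compression latency low.

def dpc_compress(words):
    out, prev = [], None
    for w in words:
        if w == 0:
            out.append(("Z",))          # pattern 1: all-zero word
        elif w == prev:
            out.append(("R",))          # pattern 2: repeat of previous word
        else:
            out.append(("L", w))        # literal: stored uncompressed
        prev = w
    return out

def dpc_decompress(codes):
    words, prev = [], None
    for c in codes:
        if c[0] == "Z":
            w = 0
        elif c[0] == "R":
            w = prev
        else:
            w = c[1]
        words.append(w)
        prev = w
    return words

data = [0, 0, 7, 7, 7, 3, 0]
codes = dpc_compress(data)
```

On this sample only two of seven words need a literal; real GPU traffic is full of zero and repeated words, which is why two patterns capture so much of the redundancy.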
Title: "Dual Pattern Compression Using Data-Preprocessing for Large-Scale GPU Architectures"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00028
Bruno R. C. Magalhães, T. Sterling, F. Schürmann, M. Hines
Exposing parallelism in scientific applications has become a core requirement for running efficiently on modern distributed multicore SIMD compute architectures. The granularity of parallelism that can be attained is a key determinant of the achievable acceleration and time to solution. Motivated by a scientific use case that requires simulating long spans of time, the study of plasticity and learning in detailed models of brain tissue, we present a strategy that exposes and exploits multicore and SIMD micro-parallelism by unrolling flow dependencies and concurrent outputs in a large system of coupled ordinary differential equations (ODEs). We present an implementation of a parallel simulator running on the HPX runtime system for the ParalleX execution model, which provides dynamic task scheduling and asynchronous execution. The implementation was tested on different architectures using a previously published brain tissue model. Benchmarks of single neurons on a single compute node show a speed-up of roughly 4-7x over the state-of-the-art Single Instruction Multiple Data (SIMD) implementation and 13-40x over its Single Instruction Single Data (SISD) counterpart. Large-scale benchmarks suggest almost ideal strong scaling and a speed-up of 2-8x on a distributed architecture of 128 Cray X6 compute nodes.
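One way to read "exposing micro-parallelism from flow dependencies" is as levelizing the within-timestep dependency graph: equations whose inputs are already available can be evaluated concurrently. A generic sketch (the four-variable graph is invented, not a neuron model, and it assumes the within-step graph is acyclic, e.g. because reads of not-yet-updated variables use previous-step values):

```python
# Sketch of grouping an ODE system's updates into dependency "levels":
# every update in a level reads only values produced in earlier levels,
# so a whole level can run concurrently (SIMD lanes, tasks, ...).

def dependency_levels(deps):
    """deps[v] = set of variables v's update reads. Returns parallel levels."""
    levels = []
    remaining = set(deps)
    while remaining:
        ready = [v for v in remaining if not (deps[v] & remaining)]
        if not ready:
            raise ValueError("cyclic flow dependencies")
        levels.append(sorted(ready))
        remaining -= set(ready)
    return levels

# du reads nothing this step; dv and dw read u; dx reads v and w.
flow = {"u": set(), "v": {"u"}, "w": {"u"}, "x": {"v", "w"}}
lvls = dependency_levels(flow)
```

Here `v` and `w` land in the same level and can be evaluated in parallel, which is the kind of concurrency the unrolling strategy makes visible to the runtime.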
Title: "Exploiting Flow Graph of System of ODEs to Accelerate the Simulation of Biologically-Detailed Neural Networks"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00014
U. Agarwal, V. Ramachandran
We present new results for the distributed computation of all pairs shortest paths (APSP) in the CONGEST model on an n-node graph with moderate non-negative integer weights. Our methods can handle zero-weight edges, which are known to present difficulties for distributed APSP algorithms. The current best deterministic distributed algorithm in the CONGEST model that handles zero-weight edges is the Õ(n^(3/2))-round algorithm of Agarwal et al. [ARKP18], which works for arbitrary edge weights. Our new deterministic algorithms run in Õ(W^(1/4) ⋅ n^(5/4)) rounds on graphs with non-negative integer edge weights at most W, and in Õ(n ⋅ Δ^(1/3)) rounds for shortest-path distances at most Δ. These algorithms are built on top of a new pipelined algorithm we present for this problem that runs in at most 2n√Δ + 2n rounds. Additionally, we show that our techniques simplify some of the procedures in the earlier APSP algorithms for non-negative edge weights in [HNS17, ARKP18]. We also present new results for computing h-hop shortest paths from k given sources, and an Õ(n/ε^2)-round deterministic (1+ε)-approximation algorithm for graphs with non-negative poly(n) integer weights, improving results in [Nanongkai14, LP15] that hold only for positive integer weights.
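To make the round-based CONGEST setting concrete, here is a naive synchronous single-source simulation: in each round every node relays its current distance estimate over its edges, and we count rounds until the estimates stabilize. This only illustrates the cost model (and why zero-weight edges are legal inputs); the paper's pipelined APSP algorithm is far more refined than this broadcast:

```python
# Naive round-counting simulation of shortest-path propagation in a
# synchronous message-passing (CONGEST-style) model.

def congest_sssp(adj, src):
    """adj[u] = list of (v, w). Returns (distances, rounds used)."""
    INF = float("inf")
    dist = {u: INF for u in adj}
    dist[src] = 0
    rounds = 0
    while True:
        # One round: every node sends dist[u] over each incident edge.
        updates = {}
        for u in adj:
            if dist[u] == INF:
                continue
            for v, w in adj[u]:
                cand = dist[u] + w
                if cand < dist[v] and cand < updates.get(v, INF):
                    updates[v] = cand
        rounds += 1
        if not updates:          # no estimate changed: stable
            break
        dist.update(updates)
    return dist, rounds

# A 3-node path with a zero-weight edge between nodes 1 and 2.
graph = {0: [(1, 1)], 1: [(0, 1), (2, 0)], 2: [(1, 0)]}
d, r = congest_sssp(graph, 0)
```

Running one such computation per source is the naive route to APSP; pipelining lets many sources' messages share rounds instead of running back to back, which is where the improved round bounds come from.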
Title: "Distributed Weighted All Pairs Shortest Paths Through Pipelining"
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00021
Jordi Wolfson-Pou, Edmond Chow
Reducing synchronization in iterative methods for solving large sparse linear systems may become one of the most important goals for such solvers on exascale computers. Research in asynchronous iterative methods has primarily considered basic iterative methods. In this paper, we examine how multigrid methods can be executed asynchronously. We present models of asynchronous additive multigrid methods, and use these models to study the convergence properties of these methods. We also introduce two parallel algorithms for implementing asynchronous additive multigrid, the global-res and local-res algorithms. These two algorithms differ in how the fine grid residual is computed, where local-res requires less computation than global-res but converges more slowly. We compare two types of asynchronous additive multigrid methods: the asynchronous fast adaptive composite grid method with smoothing (AFACx) and additive variants of the classical multiplicative method (Multadd). We implement asynchronous versions of Multadd and AFACx in OpenMP and generate the prolongation and coarse grid matrices using the BoomerAMG package. Our experimental results show that asynchronous multigrid can exhibit grid-size independent convergence and can be faster than classical multigrid in terms of solve wall-clock time. We also show that asynchronous smoothing is the best choice of smoother for our test cases, even when only one smoothing sweep is used.
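The synchronous-versus-asynchronous contrast underlying this line of work can be shown on a single smoother sweep. In the "asynchronous" variant below each component simply reads whatever values are currently in place (Gauss-Seidel-like staleness) instead of waiting for a completed sweep; this illustrates basic asynchronous iterations only, not multigrid itself, and the 2x2 system is invented for the example:

```python
# Jacobi-style sweeps on a tiny diagonally dominant system A x = b.
# Synchronous: each sweep reads a snapshot of x (all components wait).
# Asynchronous: each component reads the freshest values available.

A = [[4.0, 1.0], [1.0, 3.0]]
b = [5.0, 4.0]   # exact solution: x = [1, 1]

def sweep(x, asynchronous):
    src = x if asynchronous else list(x)   # async reads in-place updates
    for i in range(len(x)):
        s = sum(A[i][j] * src[j] for j in range(len(x)) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

def solve(asynchronous, iters=50):
    x = [0.0, 0.0]
    for _ in range(iters):
        sweep(x, asynchronous)
    return x

x_sync = solve(False)
x_async = solve(True)
```

Both variants converge here; the asynchronous one needs no barrier between component updates, and removing exactly that kind of synchronization, at every grid level, is the goal of the asynchronous multigrid methods studied in the paper.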
Title: "Asynchronous Multigrid Methods"
Pub Date : 2019-05-20DOI: 10.1109/IPDPS.2019.00049
Zhichao Yan, Hong Jiang, Yujuan Tan, S. Skelton, Hao Luo
Lossless data reduction techniques, particularly compression and deduplication, have emerged as effective approaches to tackling the combined challenge of explosively growing data volumes and lagging network bandwidth, improving space and bandwidth efficiency in cloud storage environments. However, our observations reveal that traditional deduplication solutions are rendered essentially useless at detecting and removing redundant data from compressed packages in the cloud, which are poised to grow greatly in presence and popularity. This is because uncompressed, compressed, and differently compressed packages of exactly the same contents tend to have completely different byte-stream patterns, so their redundancy cannot be identified by comparing fingerprints. Compressed packages that mix different data yet still contain substantial duplicate content further exacerbate the problem in the cloud storage environment. To address this fundamental problem, we propose Z-Dedup, a novel deduplication system that detects and removes redundant data in compressed packages by exploiting key invariant information embedded in their metadata, such as per-file checksums and original file lengths. Our evaluations show that Z-Dedup can significantly improve both space and bandwidth efficiency over traditional approaches, eliminating 1.61% to 98.75% of the redundant data in a compressed package on our collected datasets, with even greater savings expected as storage servers accumulate more compressed contents.
{"title":"Z-Dedup: A Case for Deduplicating Compressed Contents in Cloud","authors":"Zhichao Yan, Hong Jiang, Yujuan Tan, S. Skelton, Hao Luo","doi":"10.1109/IPDPS.2019.00049","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00049","url":null,"abstract":"Lossless data reduction techniques, particularly compression and deduplication, have emerged as effective approaches to tackling the combined challenge of explosive growth in data volumes but lagging growth in network bandwidth, to improve space and bandwidth efficiency in the cloud storage environment. However, our observations reveal that traditional deduplication solutions are rendered essentially useless in detecting and removing redundant data from the compressed packages in the cloud, which are poised to greatly increase in their presence and popularity. This is because even uncompressed, compressed and differently compressed packages of the exact same contents tend to have completely different byte stream patterns, whose redundancy cannot be identified by comparing their fingerprints. This, combined with different compressed packets mixed with different data but containing significant duplicate data, will further exacerbate the problem in the cloud storage environment. To address this fundamental problem, we propose Z-Dedup, a novel deduplication system that is able to detect and remove redundant data in compressed packages, by exploiting some key invariant information embedded in the metadata of compressed packages such as file-based checksum and original file length information. 
Our evaluations show that Z-Dedup can significantly improve both space and bandwidth efficiency over traditional approaches by eliminating 1.61% to 98.75% redundant data of a compressed package based on our collected datasets, and even more storage space and bandwidth are expected to be saved after the storage servers have accumulated more compressed contents.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116142087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
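The key observation in the Z-Dedup abstract, that per-file checksums and original file lengths recorded in an archive's metadata are invariant under recompression, can be sketched for zip packages using Python's standard `zipfile` module, whose `ZipInfo` entries expose exactly these fields (`filename`, `CRC`, `file_size`). The function names and the matching policy below are illustrative assumptions, not the paper's actual system:

```python
import io
import zipfile

def content_keys(zip_bytes):
    # Read compression-invariant metadata from a zip package's central
    # directory: per-file name, CRC-32 checksum, and uncompressed size.
    # These fields are the same whether an entry is stored, deflated,
    # or compressed at a different level, so they can identify duplicate
    # contents that byte-level fingerprinting of the compressed streams
    # would miss.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {(info.filename, info.CRC, info.file_size)
                for info in zf.infolist()}

def redundant_fraction(pkg_a, pkg_b):
    # Fraction of pkg_a's entries whose content already appears in
    # pkg_b, judged by metadata keys rather than by comparing the
    # compression-dependent byte streams. (Hypothetical helper, not
    # Z-Dedup's actual matching logic.)
    a, b = content_keys(pkg_a), content_keys(pkg_b)
    return len(a & b) / len(a) if a else 0.0
```

Two archives of the same files built with `ZIP_STORED` and `ZIP_DEFLATED` have entirely different byte streams, so fingerprint-based deduplication finds no overlap between them, yet their (name, CRC-32, size) keys match completely.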