Data replication is a common technique for fault tolerance in reliable distributed systems. In geo-replicated systems and the cloud, it additionally provides low latency. Recently, causal consistency in such systems has received much attention. However, all existing works assume the data is fully replicated, which greatly simplifies the design of algorithms to implement causal consistency. In this paper, we argue that partial replication of data can be advantageous and propose two algorithms for achieving causal consistency in systems where the data is only partially replicated. This is the first work that explores causal consistency for partially replicated geo-replicated systems. We also give a special-case algorithm for causal consistency under full replication.
{"title":"Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication","authors":"Min Shen, A. Kshemkalyani, T. Hsu","doi":"10.1109/IPDPSW.2015.68","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.68","url":null,"abstract":"Data replication is a common technique used for fault-tolerance in reliable distributed systems. In geo-replicated systems and the cloud, it additionally provides low latency. Recently, causal consistency in such systems has received much attention. However, all existing works assume the data is fully replicated. This greatly simplifies the design of the algorithms to implement causal consistency. In this paper, we propose that it can be advantageous to have partial replication of data, and we propose two algorithms for achieving causal consistency in such systems where the data is only partially replicated. This is the first work that explores causal consistency for partially replicated geo-replicated systems. We also give a special case algorithm for causal consistency in the full-replication case.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122550351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With continued performance scaling through many cores per chip, off-chip memory has increasingly become a system bottleneck due to inter-thread contention. The memory access streams emerging from many cores and simultaneously executing threads exhibit increasingly limited locality. Large, high-density DRAMs contribute significantly to system power consumption and data overfetch. We develop a fine-grained Victim Row-Buffer (VRB) mechanism to increase memory system performance. The VRB mechanism helps reuse the data accessed from the memory banks, avoids unnecessary data transfers, and mitigates memory contention, and it can thus improve system throughput and fairness by decoupling row-buffer contention. Through full-system cycle-accurate simulations of many-threaded applications, we demonstrate that our proposed VRB technique achieves up to 19% (8.4% on average) system-level throughput improvement and up to 20% (7.2% on average) system fairness improvement, and saves 6.8% of power consumption across the whole suite.
{"title":"Decoupling Contention with Victim Row-Buffer on Multicore Memory Systems","authors":"Ke Gao, Dongrui Fan, Jie Wu, Zhiyong Liu","doi":"10.1109/IPDPSW.2015.30","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.30","url":null,"abstract":"With continued performance scaling of many cores per chip, an on-chip, off-chip memory has increasingly become a system bottleneck due to inter-thread contention. The memory access streams emerging from many cores and the simultaneously executed threads, exhibit increasingly limited locality. Large and high-density DRAMs contribute significantly to system power consumption and data over fetch. We develop a fine-grained Victim Row-Buffer (VRB) memory system to increase performance of the memory system. The VRB mechanism helps reuse the data accessed from the memory banks, avoids unnecessary data transfers, mitigates memory contentions, and thus can improve system throughput and system fairness by decoupling row-buffer contentions. Through full-system cycle-accurate simulations of many threads applications, we demonstrate that our proposed VRB technique achieves an up to 19% (8.4% on average) system-level throughput improvement, an up to 20% (7.2% on average) system fairness improvement, and it saves 6.8% of power consumption across the whole suite.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiayuan Meng, T. Uram, V. Morozov, V. Vishwanath, Kalyan Kumaran
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism and often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged to utilize the hardware efficiently. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g., reduction vs. forall) yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer-loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, and the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Ultimately, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
{"title":"Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism","authors":"Jiayuan Meng, T. Uram, V. Morozov, V. Vishwanath, Kalyan Kumaran","doi":"10.1109/IPDPSW.2015.55","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.55","url":null,"abstract":"Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scale to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, neigh boring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique to be integrated into future compilers or optimization frameworks for autotuning.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130093885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stream processing accelerators are often applied in MPSoCs for software-defined radios. Sharing these accelerators between different streams could improve their utilization and thereby reduce hardware cost, but it is challenging under real-time constraints. In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space, and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behaviour based on the sizes of the blocks of data used for multiplexing. We demonstrate that the required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities. The benefits of sharing accelerators are demonstrated using a multi-core system implemented on a Virtex 6 FPGA. In this system, a stereo audio stream from a PAL video signal is demodulated in real time, with two accelerators shared within and between two streams. Sharing reduces the number of accelerators by 75% and the number of logic cells by 63%.
{"title":"Real-Time Multiprocessor Architecture for Sharing Stream Processing Accelerators","authors":"B. Dekens, M. Bekooij, G. Smit","doi":"10.1109/IPDPSW.2015.147","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.147","url":null,"abstract":"Stream processing accelerators are often applied in MPSoCs for software defined radios. Sharing of these accelerators between different streams could improve their utilization and reduce thereby the hardware cost but is challenging under real-time constraints. In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behaviour based on the sizes of the blocks of data used for multiplexing. We demonstrate that required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities. The benefits of sharing accelerators are demonstrated using a multi-core system that is implemented on a Virtex 6 FPGA. A stereo audio stream from a PAL video signal is demodulated in this system in real-time where two accelerators are shared within and between two streams. In this system sharing reduces the number of accelerators by 75% and reduced the number of logic cells with 63%.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129720659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Several approaches to reduce the power consumption of data centers have been described in the literature, most of which aim to improve energy efficiency by trading off performance for reduced power consumption. However, these approaches do not always provide means for administrators and users to specify how they want to explore such trade-offs. This work provides techniques for assigning jobs to distributed resources, exploring energy-efficient resource provisioning. We use middleware-level mechanisms to adapt resource allocation according to energy-related events and user-defined rules. A proposed framework enables developers, users, and system administrators to specify and explore energy efficiency and performance trade-offs without detailed knowledge of the underlying hardware platform. Evaluation of the proposed solution under three scheduling policies shows gains of 25% in energy efficiency with minimal impact on the overall application performance. We also evaluate the reactivity of the adaptive resource provisioning.
{"title":"Energy-Aware Server Provisioning by Introducing Middleware-Level Dynamic Green Scheduling","authors":"Daniel Balouek-Thomert, E. Caron, L. Lefèvre","doi":"10.1109/IPDPSW.2015.121","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.121","url":null,"abstract":"Several approaches to reduce the power consumption of data enters have been described in the literature, most of which aim to improve energy efficiency by trading off performance for reducing power consumption. However, these approaches do not always provide means for administrators and users to specify how they want to explore such trade-offs. This work provides techniques for assigning jobs to distributed resources, exploring energy efficient resource provisioning. We use middleware-level mechanisms to adapt resource allocation according to energy-related events and user-defined rules. A proposed framework enables developers, users and system administrators to specify and explore energy efficiency and performance trade-offs without detailed knowledge of the underlying hardware platform. Evaluation of the proposed solution under three scheduling policies shows gains of 25% in energy-efficiency with minimal impact on the overall application performance. We also evaluate reactivity in the adaptive resource provisioning.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129563381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shuai Che, Gregory P. Rodgers, Bradford M. Beckmann, S. Reinhardt
Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications: their first step is often graph coloring/partitioning to obtain sets of independent vertices for subsequent parallel computations. However, parallelizing and optimizing coloring for GPUs has been a challenge for programmers. This paper studies approaches to implementing graph coloring on a GPU and characterizes their program behaviors with different graph structures. We also investigate load imbalance, which can be a main cause of performance bottlenecks. We evaluate the effectiveness of different optimization techniques, including the use of work stealing and the design of a hybrid algorithm. We are able to improve graph coloring performance by approximately 25% compared to a baseline GPU implementation on an AMD Radeon HD 7950 GPU. We also analyze some important factors affecting performance.
{"title":"Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance","authors":"Shuai Che, Gregory P. Rodgers, Bradford M. Beckmann, S. Reinhardt","doi":"10.1109/IPDPSW.2015.74","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.74","url":null,"abstract":"Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications is graph coloring/partitioning to obtain sets of independent vertices for subsequent parallel computations. However, parallelization and optimization of coloring for GPUs have been a challenge for programmers. This paper studies approaches to implementing graph coloring on a GPU and characterizes their program behaviors with different graph structures. We also investigate load imbalance, which can be the main cause for performance bottlenecks. We evaluate the effectiveness of different optimization techniques, including the use of work stealing and the design of a hybrid algorithm. We are able to improve graph coloring performance by approximately 25% compared to a baseline GPU implementation on an AMD Radeon HD 7950 GPU. We also analyze some important factors affecting performance.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116518469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. W. Bethel, David Camp, D. Donofrio, Mark Howison
Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the runtime performance gains that an alternative in-memory data layout provides by reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray-casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to what degree an alternative memory layout, when used by these key algorithms, results in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve, and we measure and report runtime along with various metrics and counters associated with memory system utilization. Our results show nearly uniform improvements in runtime performance and utilization of the memory hierarchy across varying levels of concurrency for the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.
{"title":"Improving Performance of Structured-Memory, Data-Intensive Applications on Multi-core Platforms via a Space-Filling Curve Memory Layout","authors":"E. W. Bethel, David Camp, D. Donofrio, Mark Howison","doi":"10.1109/IPDPSW.2015.71","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.71","url":null,"abstract":"Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large, their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the effects that an alternative in-memory data layout format has in terms of runtime performance gains resulting from reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to better understand to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniform improved runtime performance and improved utilization of the memory hierarchy across varying levels of concurrency the applications we tested. This approach is complementary to other memory optimization strategies like cache blocking, but may also be more general and widely applicable to a diverse set of applications.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126726405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The solution of real symmetric dense eigenvalue problems is one of the fundamental matrix computations. To date, several new high-performance eigensolvers have been developed for peta- and post-peta-scale systems. One of these, the EigenExa eigensolver, has been developed in Japan. EigenExa provides two routines: eigen_s, which is based on traditional tridiagonalization, and eigen_sx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of EigenExa using 4,800 nodes of the Oakleaf-FX supercomputer system. In this paper, we report the results of our evaluation, which mainly focuses on the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigen_sx over eigen_s, which will contribute to further performance improvement of EigenExa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to eigenvalue problems.
{"title":"Performance Evaluation of the Eigen Exa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization","authors":"Takeshi Fukaya, Toshiyuki Imamura","doi":"10.1109/IPDPSW.2015.128","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.128","url":null,"abstract":"The solution of real symmetric dense Eigen value problems is one of the fundamental matrix computations. To date, several new high-performance Eigen solvers have been developed for peta and postpeta scale systems. One of these, the Eigen Exa Eigen solver, has been developed in Japan. Eigen Exa provides two routines: eigens, which is based on traditional tridiagonalization, and eigensx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of Eigen Exa by using 4,800 nodes of the Oak leaf-FX supercomputer system. In this paper, we report the results of our evaluation, which is mainly focused on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigensx over eigens, which will contribute to further performance improvement of Eigen Exa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to Eigen value problems.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"69 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130439804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and the limits of GPU-resident memory for storing large graphs. The GraphReduce methods presented in this paper permit a GPU-based accelerator to operate on graphs that exceed its internal memory capacity. GraphReduce operates with a combination of edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model to achieve high degrees of parallelism, supported by methods that partition graphs across GPU and host memories and efficiently move graph data between the two. GraphReduce-based programming is performed via device functions that include gather map, gather reduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Experimental evaluations for a wide variety of graph inputs, algorithms, and system configurations demonstrate that GraphReduce outperforms other competing approaches.
{"title":"GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems","authors":"D. Sengupta, K. Agarwal, S. Song, K. Schwan","doi":"10.1109/IPDPSW.2015.16","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.16","url":null,"abstract":"Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithm and limitations in GPU-resident memory for storing large graphs. The Graph Reduce methods presented in this paper permit a GPU-based accelerator to operate on graphs that exceed its internal memory capacity. Graph Reduce operates with a combination of both edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model, to achieve high degrees of parallelism supported by methods that partition graphs across GPU and host memories and efficiently move graph data between both. Graph Reduce-based programming is performed via device functions that include gather map, gather reduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Experimental evaluations for a wide variety of graph inputs, algorithms, and system configuration demonstrate that Graph Reduce outperforms other competing approaches.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131837528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Contention-based protocols are commonly used for providing channel access to nodes wishing to communicate. Binary Exponential Backoff (BEB) is a well-known contention protocol implemented by the IEEE 802.11 standard. Despite its widespread use, Medium Access Control (MAC) protocols employing BEB struggle to concede channel access when the number of contending nodes increases. The main contribution of this work is a randomized contention protocol for the case where the contending stations have no collision-detection (NCD) capabilities. The proposed protocol, termed RNCD, explores the use of tone signaling to provide fair selection of a transmitter. We show that the task of selecting a single transmitter among n ≥ 2 NCD stations can be accomplished in 48n time slots with probability at least 1 - 2^(-1.5n). Furthermore, RNCD works without prior knowledge of the number of contending nodes. For comparison purposes, RNCD and BEB were implemented in the OMNeT++ simulator. For n = 256, the simulation results show that RNCD can deliver twice as many transmissions per second while channel access resolution takes less than 1% of the time needed by the BEB protocol. Unlike the exponential growth observed in the channel access time of the BEB implementation, RNCD exhibits a logarithmic tendency, allowing it to better comply with the QoS demands of real-time applications.
{"title":"A Fair Randomized Contention Resolution Protocol for Wireless Nodes without Collision Detection Capabilities","authors":"Marcos F. Caetano, J. Bordim","doi":"10.1109/IPDPSW.2015.86","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.86","url":null,"abstract":"Contention-based protocols are commonly used for providing channel access to the nodes wishing to communicate. The Binary Exponential Back off (BEB) is a well-known contention protocol implemented by the IEEE 802.11 standard. Despite its widespread use, Medium Access Control (MAC) protocols employing BEB struggle to concede channel access when the number of contending nodes increases. The main contribution of this work is to propose a randomized contention protocol to the case where the contending stations have no-collision detection (NCD) capabilities. The proposed protocol, termed RNCD, explores the use of tone signaling to provide fair selection of a transmitter. We show that the task of selecting a single transmitter, among n ≥ 2 NCD-stations, can be accomplished in 48n time slots with probability of at least 1 - 2-1.5n. Furthermore, RNCD works without previous knowledge on the number of contending nodes. For comparison purpose, RNCD and BEB were implemented in OMNeT++ Simulator. For n = 256, the simulation results show that RNCD can deliver twice as much transmissions per second while channel access resolution takes less than 1% of the time needed by the BEB protocol. Different from the exponential growth tendency observed in the channel access time of the BEB implementation, the RNCD has a logarithmic tendency allowing it to better comply with QoS demands of real-time applications.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131926484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}