
2015 IEEE International Parallel and Distributed Processing Symposium Workshop: latest publications

Causal Consistency for Geo-Replicated Cloud Storage under Partial Replication
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.68
Min Shen, A. Kshemkalyani, T. Hsu
Data replication is a common technique used for fault-tolerance in reliable distributed systems. In geo-replicated systems and the cloud, it additionally provides low latency. Recently, causal consistency in such systems has received much attention. However, all existing works assume the data is fully replicated. This greatly simplifies the design of the algorithms to implement causal consistency. In this paper, we propose that it can be advantageous to have partial replication of data, and we propose two algorithms for achieving causal consistency in such systems where the data is only partially replicated. This is the first work that explores causal consistency for partially replicated geo-replicated systems. We also give a special case algorithm for causal consistency in the full-replication case.
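The paper's two partial-replication algorithms are not reproduced in the abstract; as background, here is a minimal sketch of the classic full-replication approach they generalize, using per-replica vector clocks to delay delivery of a remote write until all of its causal dependencies have been seen (class and method names are illustrative, not from the paper):

```python
class Replica:
    """Toy causally consistent key-value replica (full replication).

    Illustrates the standard vector-clock delivery condition the paper
    builds on; this is NOT the paper's partial-replication algorithm.
    """

    def __init__(self, rid: int, n: int):
        self.rid = rid
        self.clock = [0] * n   # clock[i] = writes from replica i applied here
        self.store = {}

    def local_write(self, key, value):
        self.clock[self.rid] += 1
        self.store[key] = value
        # Ship (sender, key, value, dependency vector) to the other replicas.
        return self.rid, key, value, list(self.clock)

    def can_deliver(self, sender, dep):
        # Deliver only the next write from `sender`, and only once every
        # write it causally depends on has already been applied locally.
        if self.clock[sender] != dep[sender] - 1:
            return False
        return all(self.clock[i] >= dep[i]
                   for i in range(len(dep)) if i != sender)

    def deliver(self, sender, key, value, dep):
        assert self.can_deliver(sender, dep)
        self.store[key] = value
        self.clock[sender] = dep[sender]
```

Under partial replication this condition no longer suffices, because a replica may never receive the writes it transitively depends on; handling that gap is precisely what the paper's algorithms address.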
Citations: 29
Decoupling Contention with Victim Row-Buffer on Multicore Memory Systems
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.30
Ke Gao, Dongrui Fan, Jie Wu, Zhiyong Liu
With the continued performance scaling of many cores per chip, on-chip and off-chip memory has increasingly become a system bottleneck due to inter-thread contention. The memory access streams emerging from many cores and simultaneously executing threads exhibit increasingly limited locality. Large, high-density DRAMs contribute significantly to system power consumption and data overfetch. We develop a fine-grained Victim Row-Buffer (VRB) memory system to increase the performance of the memory system. The VRB mechanism helps reuse data accessed from the memory banks, avoids unnecessary data transfers, and mitigates memory contention, and thus can improve system throughput and system fairness by decoupling row-buffer contention. Through full-system cycle-accurate simulations of multithreaded applications, we demonstrate that our proposed VRB technique achieves up to a 19% (8.4% on average) system-level throughput improvement and up to a 20% (7.2% on average) system fairness improvement, and saves 6.8% of power consumption across the whole suite.
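The abstract does not detail the VRB policy; the following toy bank model sketches one plausible victim row-buffer behavior, where a single-entry victim buffer per bank caches the most recently evicted row so it can be re-served without a full activate (the policy and names here are assumptions for illustration, not the paper's exact design):

```python
class Bank:
    """Toy DRAM bank with a one-entry victim row-buffer (illustrative)."""

    def __init__(self):
        self.active_row = None   # row currently in the open row-buffer
        self.victim_row = None   # last evicted row, kept for reuse

    def access(self, row):
        if row == self.active_row:
            return "row-hit"
        if row == self.victim_row:
            # Serve from the victim buffer, swapping it with the active row
            # and avoiding a precharge/activate round trip.
            self.active_row, self.victim_row = row, self.active_row
            return "victim-hit"
        # Row conflict: the evicted row moves to the victim buffer so an
        # interleaved stream that returns to it soon still gets a fast hit.
        self.victim_row = self.active_row
        self.active_row = row
        return "row-miss"
```

Two threads ping-ponging between rows 1 and 2 would turn every access after warm-up into a victim-hit instead of a row-miss, which is the contention-decoupling effect the abstract describes.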
Citations: 2
Modeling Cooperative Threads to Project GPU Performance for Adaptive Parallelism
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.55
Jiayuan Meng, T. Uram, V. Morozov, V. Vishwanath, Kalyan Kumaran
Most accelerators, such as graphics processing units (GPUs) and vector processors, are particularly suitable for accelerating massively parallel workloads. On the other hand, conventional workloads are developed for multi-core parallelism, which often scales to only a few dozen OpenMP threads. When hardware threads significantly outnumber the degree of parallelism in the outer loop, programmers are challenged with efficient hardware utilization. A common solution is to further exploit the parallelism hidden deep in the code structure. Such parallelism is less structured: parallel and sequential loops may be imperfectly nested within each other, and neighboring inner loops may exhibit different concurrency patterns (e.g. Reduction vs. Forall), yet have to be parallelized in the same parallel section. Many input-dependent transformations have to be explored. A programmer often employs a larger group of hardware threads to cooperatively walk through a smaller outer loop partition and adaptively exploit any encountered parallelism. This process is time-consuming and error-prone, yet the risk of gaining little or no performance remains high for such workloads. To reduce risk and guide implementation, we propose a technique to model workloads with limited parallelism that can automatically explore and evaluate transformations involving cooperative threads. Eventually, our framework projects the best achievable performance and the most promising transformations without implementing GPU code or using physical hardware. We envision our technique being integrated into future compilers or optimization frameworks for autotuning.
Citations: 0
Real-Time Multiprocessor Architecture for Sharing Stream Processing Accelerators
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.147
B. Dekens, M. Bekooij, G. Smit
Stream processing accelerators are often applied in MPSoCs for software-defined radios. Sharing these accelerators between different streams could improve their utilization and thereby reduce hardware cost, but is challenging under real-time constraints. In this paper we introduce entry- and exit-gateways that are responsible for multiplexing blocks of data over accelerators under real-time constraints. These gateways check for the availability of sufficient data and space and thereby enable the derivation of a dataflow model of the application. The dataflow model is used to verify the worst-case temporal behaviour based on the sizes of the blocks of data used for multiplexing. We demonstrate that the required buffer capacities are non-monotone in the block size. Therefore, an ILP is presented to compute minimum block sizes and sufficient buffer capacities. The benefits of sharing accelerators are demonstrated using a multi-core system implemented on a Virtex 6 FPGA, in which a stereo audio stream from a PAL video signal is demodulated in real time while two accelerators are shared within and between two streams. In this system, sharing reduces the number of accelerators by 75% and the number of logic cells by 63%.
Citations: 3
Energy-Aware Server Provisioning by Introducing Middleware-Level Dynamic Green Scheduling
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.121
Daniel Balouek-Thomert, E. Caron, L. Lefèvre
Several approaches to reducing the power consumption of data centers have been described in the literature, most of which aim to improve energy efficiency by trading off performance for reduced power consumption. However, these approaches do not always provide means for administrators and users to specify how they want to explore such trade-offs. This work provides techniques for assigning jobs to distributed resources, exploring energy-efficient resource provisioning. We use middleware-level mechanisms to adapt resource allocation according to energy-related events and user-defined rules. The proposed framework enables developers, users and system administrators to specify and explore energy efficiency and performance trade-offs without detailed knowledge of the underlying hardware platform. Evaluation of the proposed solution under three scheduling policies shows gains of 25% in energy efficiency with minimal impact on overall application performance. We also evaluate the reactivity of the adaptive resource provisioning.
Citations: 8
Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.74
Shuai Che, Gregory P. Rodgers, Bradford M. Beckmann, S. Reinhardt
Graphics processing units (GPUs) have been increasingly used to accelerate irregular applications such as graph and sparse-matrix computation. Graph coloring is a key building block for many graph applications. The first step of many graph applications is graph coloring/partitioning to obtain sets of independent vertices for subsequent parallel computations. However, parallelization and optimization of coloring for GPUs have been a challenge for programmers. This paper studies approaches to implementing graph coloring on a GPU and characterizes their program behaviors with different graph structures. We also investigate load imbalance, which can be the main cause for performance bottlenecks. We evaluate the effectiveness of different optimization techniques, including the use of work stealing and the design of a hybrid algorithm. We are able to improve graph coloring performance by approximately 25% compared to a baseline GPU implementation on an AMD Radeon HD 7950 GPU. We also analyze some important factors affecting performance.
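As context for why coloring yields the independent vertex sets mentioned above: all vertices assigned the same color share no edges, so each color class can be processed in parallel. A sequential greedy coloring sketch (illustrative only, not the paper's GPU algorithm):

```python
def greedy_coloring(adj):
    """Greedy graph coloring.

    adj maps each vertex to an iterable of its neighbors. Returns a dict
    vertex -> color such that no edge connects two same-colored vertices;
    each color class is therefore an independent set usable as one
    parallel batch in a subsequent computation.
    """
    color = {}
    for v in adj:
        # Colors already taken by previously colored neighbors.
        used = {color[n] for n in adj[v] if n in color}
        c = 0
        while c in used:
            c += 1          # smallest color not used by any neighbor
        color[v] = c
    return color
```

GPU variants (e.g. Jones-Plassmann style) instead color whole independent sets per round to expose parallelism, which is where the load-imbalance issues the paper studies arise.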
Citations: 9
Improving Performance of Structured-Memory, Data-Intensive Applications on Multi-core Platforms via a Space-Filling Curve Memory Layout
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.71
E. W. Bethel, David Camp, D. Donofrio, Mark Howison
Many data-intensive algorithms -- particularly in visualization, image processing, and data analysis -- operate on structured data, that is, data organized in multidimensional arrays. While many of these algorithms are quite numerically intensive, by and large their performance is limited by the cost of memory accesses. As we move towards the exascale regime of computing, one central research challenge is finding ways to minimize data movement through the memory hierarchy, particularly within a node in a shared-memory parallel setting. We study the runtime performance gains that an alternative in-memory data layout format offers by reducing the amount of data moved through the memory hierarchy. We focus the study on shared-memory parallel implementations of two algorithms common in visualization and analysis: a stencil-based convolution kernel, which uses a structured memory access pattern, and ray casting volume rendering, which uses a semi-structured memory access pattern. The question we study is to what degree an alternative memory layout, when used by these key algorithms, will result in improved runtime performance and memory system utilization. Our approach uses a layout based on a Z-order (Morton-order) space-filling curve data organization, and we measure and report runtime and various metrics and counters associated with memory system utilization. Our results show nearly uniformly improved runtime performance and improved utilization of the memory hierarchy across varying levels of concurrency in the applications we tested.
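The Z-order (Morton-order) layout interleaves the bits of the array coordinates, so cells that are close in 2-D space land at nearby linear memory offsets. A minimal 2-D index sketch (the paper's actual implementation details are not reproduced here):

```python
def morton_index(x: int, y: int) -> int:
    """Compute the Z-order (Morton) linear index of cell (x, y).

    Bit i of x goes to bit 2i of the result and bit i of y to bit 2i+1,
    so small 2x2, 4x4, ... blocks occupy contiguous index ranges. Supports
    coordinates up to 2**16 - 1.
    """
    z = 0
    for i in range(16):
        z |= (x >> i & 1) << (2 * i)
        z |= (y >> i & 1) << (2 * i + 1)
    return z
```

A stencil kernel reading a cell and its neighbors therefore tends to touch fewer distinct cache lines than with a row-major layout, which is the mechanism behind the reported gains.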
Citations: 3
Performance Evaluation of the EigenExa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.128
Takeshi Fukaya, Toshiyuki Imamura
The solution of real symmetric dense eigenvalue problems is one of the fundamental matrix computations. To date, several new high-performance eigensolvers have been developed for peta- and post-peta-scale systems. One of these, the EigenExa eigensolver, has been developed in Japan. EigenExa provides two routines: eigen_s, which is based on traditional tridiagonalization, and eigen_sx, which employs a new method via a pentadiagonal matrix. Recently, we conducted a detailed performance evaluation of EigenExa using 4,800 nodes of the Oakleaf-FX supercomputer system. In this paper, we report the results of our evaluation, which mainly focuses on investigating the differences between the two routines. The results clearly indicate both the advantages and disadvantages of eigen_sx over eigen_s, which will contribute to further performance improvement of EigenExa. The obtained results are also expected to be useful for other parallel dense matrix computations, in addition to eigenvalue problems.
Citations: 8
GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.16
D. Sengupta, K. Agarwal, S. Song, K. Schwan
Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and the limited GPU-resident memory available for storing large graphs. The GraphReduce methods presented in this paper permit a GPU-based accelerator to operate on graphs that exceed its internal memory capacity. GraphReduce combines edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model to achieve high degrees of parallelism, supported by methods that partition graphs across GPU and host memories and efficiently move graph data between both. GraphReduce-based programming is performed via device functions that include gather map, gather reduce, apply, and scatter, implemented by programmers for the graph algorithms they wish to realize. Experimental evaluations for a wide variety of graph inputs, algorithms, and system configurations demonstrate that GraphReduce outperforms other competing approaches.
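A minimal CPU-side sketch of the Gather-Apply-Scatter loop the abstract refers to, instantiated for PageRank (illustrative only; GraphReduce's GPU partitioning, out-of-core streaming, and device functions are not modeled):

```python
def gas_pagerank(edges, n, iters=20, d=0.85):
    """PageRank expressed as a Gather-Apply-Scatter iteration.

    edges: list of (src, dst) pairs; n: vertex count. Every vertex that
    appears as a source must have out-degree >= 1 in this sketch.
    """
    out_deg = [0] * n
    for u, _ in edges:
        out_deg[u] += 1

    rank = [1.0 / n] * n
    for _ in range(iters):
        # Gather: each vertex sums contributions arriving on its in-edges.
        acc = [0.0] * n
        for u, v in edges:
            acc[v] += rank[u] / out_deg[u]
        # Apply: update per-vertex state from the gathered sum.
        rank = [(1 - d) / n + d * a for a in acc]
        # Scatter: push updated values along out-edges; here it is implicit,
        # since the next gather re-reads `rank` directly.
    return rank
```

Frameworks like the one described process this same gather/apply/scatter structure edge-by-edge or vertex-by-vertex per partition, which is what lets them stream graphs larger than GPU memory.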
{"title":"GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems","authors":"D. Sengupta, K. Agarwal, S. Song, K. Schwan","doi":"10.1109/IPDPSW.2015.16","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.16","url":null,"abstract":"Recent work on graph analytics has sought to leverage the high performance offered by GPU devices, but challenges remain due to the inherent irregularity of graph algorithms and the limited GPU-resident memory available for storing large graphs. The GraphReduce methods presented in this paper permit a GPU-based accelerator to operate on graphs that exceed its internal memory capacity. GraphReduce combines edge- and vertex-centric implementations of the Gather-Apply-Scatter programming model to achieve high degrees of parallelism, supported by methods that partition graphs across GPU and host memories and efficiently move graph data between the two. GraphReduce-based programming is performed via device functions (gather map, gather reduce, apply, and scatter) implemented by programmers for the graph algorithms they wish to realize. Experimental evaluations across a wide variety of graph inputs, algorithms, and system configurations demonstrate that GraphReduce outperforms competing approaches.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131837528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited: 5
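The Gather-Apply-Scatter model that GraphReduce builds on can be sketched with a tiny sequential PageRank iteration (a hypothetical single-machine illustration; the edge list, damping factor, and iteration count are made up, and GraphReduce itself runs these phases as GPU device functions over partitioned graph data):

```python
# Minimal vertex-centric Gather-Apply-Scatter (GAS) sketch:
# PageRank iterations over a small directed edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]  # (src, dst) pairs
n = 3
out_deg = [0] * n
for s, _ in edges:
    out_deg[s] += 1

rank = [1.0 / n] * n
damping = 0.85

for _ in range(50):
    # Gather: each vertex accumulates contributions over its in-edges.
    acc = [0.0] * n
    for s, d in edges:
        acc[d] += rank[s] / out_deg[s]
    # Apply: each vertex recomputes its state from the gathered value.
    rank = [(1 - damping) / n + damping * a for a in acc]
    # Scatter: a push-style engine would now propagate the updated
    # values along out-edges; this pull-style loop re-reads `rank`.

print([round(r, 3) for r in rank])
```

An out-of-core engine in the paper's spirit would run the same three phases per graph partition, streaming partitions between host and GPU memory rather than holding the whole edge list resident.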
A Fair Randomized Contention Resolution Protocol for Wireless Nodes without Collision Detection Capabilities
Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2015.86
Marcos F. Caetano, J. Bordim
Contention-based protocols are commonly used to provide channel access to nodes wishing to communicate. The Binary Exponential Backoff (BEB) is a well-known contention protocol implemented by the IEEE 802.11 standard. Despite its widespread use, Medium Access Control (MAC) protocols employing BEB struggle to concede channel access as the number of contending nodes increases. The main contribution of this work is a randomized contention protocol for the case where the contending stations have no collision-detection (NCD) capabilities. The proposed protocol, termed RNCD, explores the use of tone signaling to provide fair selection of a transmitter. We show that the task of selecting a single transmitter among n ≥ 2 NCD stations can be accomplished in 48n time slots with probability at least 1 - 2^(-1.5n). Furthermore, RNCD works without prior knowledge of the number of contending nodes. For comparison purposes, RNCD and BEB were implemented in the OMNeT++ simulator. For n = 256, the simulation results show that RNCD can deliver twice as many transmissions per second while channel-access resolution takes less than 1% of the time needed by the BEB protocol. In contrast to the exponential growth observed in the channel access time of the BEB implementation, RNCD exhibits a logarithmic tendency, allowing it to better comply with the QoS demands of real-time applications.
{"title":"A Fair Randomized Contention Resolution Protocol for Wireless Nodes without Collision Detection Capabilities","authors":"Marcos F. Caetano, J. Bordim","doi":"10.1109/IPDPSW.2015.86","DOIUrl":"https://doi.org/10.1109/IPDPSW.2015.86","url":null,"abstract":"Contention-based protocols are commonly used to provide channel access to nodes wishing to communicate. The Binary Exponential Backoff (BEB) is a well-known contention protocol implemented by the IEEE 802.11 standard. Despite its widespread use, Medium Access Control (MAC) protocols employing BEB struggle to concede channel access as the number of contending nodes increases. The main contribution of this work is a randomized contention protocol for the case where the contending stations have no collision-detection (NCD) capabilities. The proposed protocol, termed RNCD, explores the use of tone signaling to provide fair selection of a transmitter. We show that the task of selecting a single transmitter among n ≥ 2 NCD stations can be accomplished in 48n time slots with probability at least 1 - 2^(-1.5n). Furthermore, RNCD works without prior knowledge of the number of contending nodes. For comparison purposes, RNCD and BEB were implemented in the OMNeT++ simulator. For n = 256, the simulation results show that RNCD can deliver twice as many transmissions per second while channel-access resolution takes less than 1% of the time needed by the BEB protocol. In contrast to the exponential growth observed in the channel access time of the BEB implementation, RNCD exhibits a logarithmic tendency, allowing it to better comply with the QoS demands of real-time applications.","PeriodicalId":340697,"journal":{"name":"2015 IEEE International Parallel and Distributed Processing Symposium Workshop","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131926484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited: 1
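The kind of randomized, tone-based selection the abstract describes can be mimicked with a small splitting simulation (a simplified model, not the RNCD protocol itself: here each surviving station beeps with probability 1/2 per round, silent stations withdraw when a beep is heard, and an omniscient observer stops the loop, whereas RNCD bounds the process at 48n slots and lets stations detect termination via tone signaling):

```python
import random

def contention_round(active):
    """One splitting step: each active station beeps with prob. 1/2.
    If at least one beep is heard, silent stations withdraw; a round
    with no beep is void and leaves the active set unchanged."""
    beepers = [s for s in active if random.random() < 0.5]
    return beepers if beepers else active

def elect(n):
    """Repeat splitting rounds until a single station survives,
    returning the winner and the number of rounds used. Roughly
    log2(n) effective rounds are expected, matching the logarithmic
    tendency reported for RNCD."""
    active = list(range(n))
    rounds = 0
    while len(active) > 1:
        active = contention_round(active)
        rounds += 1
    return active[0], rounds

random.seed(1)
winner, rounds = elect(256)
print(winner, rounds)
```

Because every station applies the same coin flip in every round, each of the n stations is equally likely to win, which is the fairness property the protocol is after.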
Journal
2015 IEEE International Parallel and Distributed Processing Symposium Workshop