2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Papers, Page 9

An Adaptive Core-Specific Runtime for Energy Efficiency

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.114

Sridutt Bhalachandra, Allan Porterfield, Stephen L. Olivier, J. Prins

Energy efficiency in high performance computing (HPC) will be critical to limit operating costs and carbon footprints in future supercomputing centers. Energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn or by reducing power with a little increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvement in the average performance. This improvement in energy efficiency is obtained without changes to the application. The adaptation policy embedded in the runtime uses existing core-specific power controls like software-controlled clock modulation and per-core Dynamic Voltage Frequency Scaling (DVFS) introduced in Intel Haswell. Experiments on six standard MPI benchmarks and a real world application show an overall 20% improvement in energy efficiency with less than 1% increase in execution time on 32 nodes (1024 cores) using per-core DVFS. An improvement in energy efficiency of up to 42% is obtained with the real world application ParaDis through a combination of speedup and power reduction. For one configuration, ParaDis achieves an average speedup of 11%, while the power is lowered by about 31%. The average improvement in the performance seen is a direct result of the reduction in run-to-run variation and running at turbo frequencies.
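The speedup and power figures quoted for ParaDis can be related with a short calculation. This is a minimal sketch that assumes energy equals average power multiplied by run time and that an 11% speedup means the baseline time divided by the new time is 1.11. The function name `relative_energy` is illustrative and does not come from the paper, and the paper's own energy-efficiency metric may be defined differently.

```python
def relative_energy(speedup: float, power_reduction: float) -> float:
    """Return energy relative to the baseline run.

    Assumes energy = average power * run time (an assumption of this sketch).
    speedup: fractional speedup, e.g. 0.11 means baseline_time / new_time = 1.11.
    power_reduction: fractional drop in average power, e.g. 0.31 for 31% lower.
    """
    relative_time = 1.0 / (1.0 + speedup)
    relative_power = 1.0 - power_reduction
    return relative_power * relative_time


# Figures quoted in the abstract for one ParaDis configuration.
ratio = relative_energy(speedup=0.11, power_reduction=0.31)
print(f"energy relative to baseline: {ratio:.3f}")
```

Under these assumptions the run would use roughly 62% of the baseline energy, which illustrates how a speedup and a power reduction compound.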

Citations: 25

Model-Driven Sparse CP Decomposition for Higher-Order Tensors

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.80

Jiajia Li, Jee W. Choi, Ioakeim Perros, Jimeng Sun, R. Vuduc

Given an input tensor, its CANDECOMP/PARAFAC decomposition (or CPD) is a low-rank representation. CPDs are of particular interest in data analysis and mining, especially when the data tensor is sparse and of higher order (dimension). This paper focuses on the central bottleneck of a CPD algorithm, which is evaluating a sequence of matricized tensor times Khatri-Rao products (MTTKRPs). To speed up the MTTKRP sequence, we propose a novel, adaptive tensor memoization algorithm, AdaTM. Besides removing redundant computations within the MTTKRP sequence, which potentially reduces its overall asymptotic complexity, our technique also allows a user to make a space-time tradeoff by automatically tuning algorithmic and machine parameters using a model-driven framework. Our method improves as the tensor order grows, making its performance more scalable for higher-order data problems. We show speedups of up to 8× and 820× on real sparse data tensors with orders as high as 85 over the SPLATT package and Tensor Toolbox library, respectively; and on a full CPD algorithm (CP-ALS), AdaTM can be up to 8× faster than the state-of-the-art method implemented in SPLATT.
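To make the MTTKRP operation concrete, the following is a minimal sketch of a single MTTKRP on a sparse tensor stored in coordinate form. It is a plain baseline, not AdaTM: it does no memoization, so the partial products that AdaTM would reuse across the MTTKRP sequence are recomputed for every mode. The function name `mttkrp_coo` and the list-based data layout are assumptions made for illustration.

```python
def mttkrp_coo(coords, values, factors, mode):
    """Matricized tensor times Khatri-Rao product for one mode.

    coords:  list of index tuples, one per nonzero (length = tensor order).
    values:  list of nonzero values, aligned with coords.
    factors: list of factor matrices (lists of rows), one per mode,
             each with the same number of columns (the rank).
    mode:    the mode whose factor matrix is left out of the product.
    """
    rank = len(factors[0][0])
    out = [[0.0] * rank for _ in range(len(factors[mode]))]
    for coord, value in zip(coords, values):
        # Start from the nonzero value, then scale by the matching row
        # of every factor matrix except the one for the target mode.
        row = [float(value)] * rank
        for n, factor in enumerate(factors):
            if n == mode:
                continue
            factor_row = factor[coord[n]]
            row = [acc * f for acc, f in zip(row, factor_row)]
        target = out[coord[mode]]
        for r in range(rank):
            target[r] += row[r]
    return out


# A 2 x 2 x 2 tensor with three nonzeros and rank-2 factor matrices.
coords = [(0, 0, 0), (0, 1, 1), (1, 0, 1)]
values = [1.0, 2.0, 3.0]
A = [[1.0, 1.0], [1.0, 1.0]]  # not used when mode == 0
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
print(mttkrp_coo(coords, values, [A, B, C], mode=0))
```

The inner loop over factor matrices grows with the tensor order, which is why redundant work across the sequence of per-mode MTTKRPs becomes more costly for higher-order tensors.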

Citations: 53

Transparent Caching for RMA Systems

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.92

S. D. Girolamo, Flavio Vella, T. Hoefler

The constantly increasing gap between communication and computation performance emphasizes the importance of communication-avoidance techniques. Caching is a well-known concept used to reduce accesses to slow local memories. In this work, we extend the caching idea to MPI-3 Remote Memory Access (RMA) operations. Here, caching can avoid inter-node communications and achieve similar benefits for irregular applications as communication-avoiding algorithms for structured applications. We propose CLaMPI, a caching library layered on top of MPI-3 RMA, to automatically optimize code with minimum user intervention. We demonstrate how cached RMA improves the performance of a Barnes Hut simulation and a Local Clustering Coefficient computation up to a factor of 1.8x and 5x, respectively. Due to the low overheads in the cache miss case and the potential benefits, we expect that our ideas around transparent RMA caching will soon be an integral part of many MPI libraries.
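The idea of caching remote reads can be sketched independently of MPI. The example below is a hypothetical illustration only: it does not use the CLaMPI or MPI-3 RMA APIs, and the class `CachedRemoteReader`, its methods, and the `remote_get` callable are names invented here. It assumes that remote data stays unchanged until the caller explicitly invalidates the cache.

```python
class CachedRemoteReader:
    """Cache the results of remote reads keyed by (rank, offset, length).

    remote_get is any callable that performs the expensive remote access;
    in a real system this would be an RMA get followed by synchronization.
    """

    def __init__(self, remote_get):
        self._remote_get = remote_get
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, rank, offset, length):
        key = (rank, offset, length)
        if key in self._cache:
            # Cache hit: no inter-node communication is needed.
            self.hits += 1
            return self._cache[key]
        # Cache miss: fetch remotely and remember the result.
        self.misses += 1
        data = self._remote_get(rank, offset, length)
        self._cache[key] = data
        return data

    def invalidate(self):
        """Drop all cached data, e.g. once remote memory may have changed."""
        self._cache.clear()


# Stand-in for remote memory: rank 1 exposes 16 bytes.
remote_memory = {1: bytes(range(16))}


def fake_remote_get(rank, offset, length):
    return remote_memory[rank][offset:offset + length]


reader = CachedRemoteReader(fake_remote_get)
reader.get(1, 4, 4)  # miss, goes to "remote" memory
reader.get(1, 4, 4)  # hit, served locally
print(reader.hits, reader.misses)
```

Irregular applications such as the Barnes Hut simulation mentioned above tend to re-read the same remote data, which is the access pattern where a cache of this kind pays off.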

Citations: 3

PaPar: A Parallel Data Partitioning Framework for Big Data Applications

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.119

Hao Wang, Jing Zhang, Da Zhang, S. Pumma, Wu-chun Feng

Today, big data applications can generate large-scale data sets at an unprecedented rate; and scientists have turned to parallel and distributed systems for data analysis. Although many big data processing systems provide advanced mechanisms to partition data and tackle the computational skew, it is difficult to efficiently implement skew-resistant mechanisms, because the runtime of different partitions depends not only on input data size but also on the algorithms that will be applied to the data. As a result, many research efforts have been undertaken to explore user-defined partitioning methods for different types of applications and algorithms. However, manually writing application-specific partitioning methods requires significant coding effort, and finding the optimal data partitioning strategy is particularly challenging even for developers that have mastered sufficient application knowledge. In this paper, we propose PaPar, a Parallel data Partitioning framework for big data applications, to simplify the implementations of data partitioning algorithms. PaPar provides a set of computational operators and distribution strategies for programmers to describe desired data partitioning methods. Taking an input data configuration file and a workflow configuration file as the input, PaPar can automatically generate the parallel partitioning codes by formalizing the user-defined workflow as a sequence of key-value operations and matrix-vector multiplications, and efficiently mapping to the parallel implementations with MPI and MapReduce. We apply our approach to two applications: muBLAST, an MPI implementation of BLAST algorithms for biological sequence search; and PowerLyra, a computation and partitioning method for skewed graphs. The experimental results show that compared to the partitioning methods of applications, the codes generated by PaPar can produce the same data partitions with comparable or less partitioning time.

Citations: 4

SimProf: A Sampling Framework for Data Analytic Workloads

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.118

Jen-Cheng Huang, Lifeng Nai, Pranith Kumar, Hyojong Kim, Hyesoon Kim

Today, there is a steep rise in the amount of data being collected from diverse applications. Consequently, data analytic workloads are gaining popularity as a way to obtain insight that can benefit applications such as financial trading and social media analysis. To study the architectural behavior of the workloads, architectural simulation is one of the most common approaches. However, because of the long-running nature of the workloads, it is not trivial to identify which parts of the analysis to simulate. In the current work, we introduce SimProf, a sampling framework for data analytic workloads. Using this tool, we are able to select representative simulation points based on the phase behavior of the analysis at a method-level granularity. This provides a better understanding of the simulation point and also reduces the simulation time for different input sets. We present the framework for Apache Hadoop and Apache Spark; it can be easily extended to other data analytic workloads.

Citations: 1

Improving the Integration of Task Nesting and Dependencies in OpenMP

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.69

Josep M. Pérez, Vicenç Beltran, Jesús Labarta, E. Ayguadé

The tasking model of OpenMP 4.0 supports both nesting and the definition of dependences between sibling tasks. A natural way to parallelize many codes with tasks is to first taskify the high-level functions and then to further refine these tasks with additional subtasks. However, this top-down approach has some drawbacks since combining nesting with dependencies usually requires additional measures to enforce the correct coordination of dependencies across nesting levels. For instance, most non-leaf tasks need to include a taskwait at the end of their code. While these measures enforce the correct order of execution, as a side effect, they also limit the discovery of parallelism. In this paper we extend the OpenMP tasking model to improve the integration of nesting and dependencies. Our proposal builds on both formulas, nesting and dependencies, and benefits from their individual strengths. On one hand, it encourages a top-down approach to parallelizing codes that also enables the parallel instantiation of tasks. On the other hand, it allows the runtime to control dependencies at a fine grain that until now was only possible using a single domain of dependencies. Our proposal is realized through additions to the OpenMP task directive that ensure backward compatibility with current codes. We have implemented a new runtime with these extensions and used it to evaluate the impact on several benchmarks. Our initial findings show that our extensions improve performance in three areas. First, they expose more parallelism. Second, they uncover dependencies across nesting levels, which allows the runtime to make better scheduling decisions. And third, they allow the parallel instantiation of tasks with dependencies between them.

Citations: 34

FFQ: A Fast Single-Producer/Multiple-Consumer Concurrent FIFO Queue

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.41

Sergei Arnautov, P. Felber, C. Fetzer, Bohdan Trach

With the spread of multi-core architectures, operating systems and applications are becoming increasingly concurrent and their scalability is often limited by the primitives used to synchronize the different hardware threads. In this paper, we address the problem of how to optimize the throughput of a system with multiple producer and consumer threads. Such applications typically synchronize their threads via multi-producer/multi-consumer FIFO queues, but existing solutions have poor scalability, as we could observe when designing a secure application framework that requires high-throughput communication between many concurrent threads. In our target system, however, the items enqueued by different producers do not necessarily need to be FIFO ordered. Hence, we propose a fast FIFO queue, FFQ, that aims at maximizing throughput by specializing the algorithm for single-producer/multiple-consumer settings: each producer has its own queue from which multiple consumers can concurrently dequeue. Furthermore, while we provide a wait-free interface for producers, we limit ourselves to lock-free consumers to eliminate the need for helping. We also propose a multi-producer variant to show which synchronization operations we were able to remove by focusing on a single producer variant. Our evaluation analyses the performance using micro-benchmarks and compares our results with other state-of-the-art solutions: FFQ exhibits excellent performance and scalability.
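The queue topology described above, one FIFO per producer with consumers drawing from any of them, can be illustrated with a small sketch. This is not the FFQ algorithm: it offers none of FFQ's wait-free or lock-free guarantees, it is exercised here from a single thread, and the class name `PerProducerQueues` and its methods are hypothetical. It only shows the ordering property, namely that items from one producer stay in FIFO order while no order is imposed across producers.

```python
from collections import deque


class PerProducerQueues:
    """One FIFO per producer; consumers scan the queues round-robin."""

    def __init__(self, num_producers):
        self._queues = [deque() for _ in range(num_producers)]
        self._next = 0  # queue index where the next scan starts

    def enqueue(self, producer_id, item):
        # Each producer only ever appends to its own queue.
        self._queues[producer_id].append(item)

    def dequeue(self):
        """Return the next available item, or None if all queues are empty."""
        count = len(self._queues)
        for step in range(count):
            index = (self._next + step) % count
            try:
                item = self._queues[index].popleft()
            except IndexError:
                continue  # this producer's queue is empty, try the next
            self._next = (index + 1) % count
            return item
        return None


queues = PerProducerQueues(num_producers=2)
queues.enqueue(0, "a1")
queues.enqueue(0, "a2")
queues.enqueue(1, "b1")
print([queues.dequeue() for _ in range(4)])
```

Because a producer never contends with other producers for its own queue, the multi-producer synchronization is removed from the enqueue path, which is the design point the abstract describes.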

Citations: 12

Leader Election in Asymmetric Labeled Unidirectional Rings

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.23

K. Altisen, A. Datta, Stéphane Devismes, Anaïs Durand, L. Larmore

We study (deterministic) leader election in unidirectional rings of homonym processes that have no a priori knowledge of the number of processes. In this context, we show that there is no algorithm that solves process-terminating leader election for the class of asymmetric labeled rings. In particular, there is no process-terminating leader election algorithm in rings in which at least one label is unique. However, we show that process-terminating leader election is possible for the subclass of asymmetric rings, where multiplicity is bounded. We confirm this positive result by proposing two algorithms, which achieve the classical trade-off between time and space.
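The ring properties named in the abstract can be checked on small examples. The sketch below is a centralized check with full knowledge of the ring, not one of the paper's distributed algorithms. It assumes that a labeled ring is asymmetric when no non-trivial rotation maps the label sequence onto itself and that multiplicity is the largest number of processes sharing a label; the function names are illustrative.

```python
from collections import Counter


def is_asymmetric(labels):
    """True if no non-trivial rotation of the ring reproduces the labeling."""
    n = len(labels)
    return all(labels[k:] + labels[:k] != labels for k in range(1, n))


def max_multiplicity(labels):
    """Largest number of processes that share the same label."""
    return max(Counter(labels).values())


def centralized_leader(labels):
    """Pick the position whose rotation is lexicographically smallest.

    The choice is unique exactly when the ring is asymmetric. A distributed
    algorithm cannot inspect the whole ring this way; this only shows why
    asymmetry makes a unique leader well defined.
    """
    if not is_asymmetric(labels):
        raise ValueError("symmetric ring: no position is distinguishable")
    n = len(labels)
    return min(range(n), key=lambda k: labels[k:] + labels[:k])


ring = [1, 1, 2]  # homonyms allowed: two processes share label 1
print(is_asymmetric(ring), max_multiplicity(ring), centralized_leader(ring))
print(is_asymmetric([1, 2, 1, 2]))  # rotation by 2 maps the ring to itself
```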

Citations: 12

Fly-Over: A Light-Weight Distributed Power-Gating Mechanism for Energy-Efficient Networks-on-Chip

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.77

R. Boyapati, Jiayi Huang, Ningyuan Wang, Kyung Hoon Kim, K. H. Yum, Eun Jung Kim

Scalable Networks-on-Chip (NoCs) have become the de facto interconnection mechanism in large-scale Chip Multiprocessors. Not only are NoCs devouring a large fraction of the on-chip power budget but static NoC power consumption is becoming the dominant component as technology scales down. Hence reducing static NoC power consumption is critical for energy-efficient computing. Previous research has proposed to power-gate routers attached to inactive cores so as to save static power, but requires centralized control and global network knowledge. In this paper, we propose Fly-Over (FLOV), a light-weight distributed mechanism for power-gating routers, which encompasses FLOV router architecture, handshake protocols, and a partition-based dynamic routing algorithm to maintain network functionalities. With simple modifications to the baseline router architecture, FLOV can facilitate FLOV links over power-gated routers. Then we present two handshake protocols for FLOV routers: restricted FLOV, which can power-gate routers under restricted conditions, and generalized FLOV, which has more power saving capability. The proposed routing algorithm provides best-effort minimal path routing without the necessity for global network information. We evaluate our schemes using synthetic workloads as well as real workloads from the PARSEC 2.1 benchmark suite. Our full system evaluations show that FLOV reduces the total and static energy consumption by 18% and 22% respectively, on average across several benchmarks, compared to a state-of-the-art NoC power-gating mechanism while keeping the performance degradation minimal.

Citations: 7

One-Way Wave Equation Migration at Scale on GPUs Using Directive Based Programming

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.82

Kshitij Mehta, M. Hugues, Oscar R. Hernandez, D. Bernholdt, H. Calandra

One-Way Wave Equation Migration (OWEM) is a depth migration algorithm used for seismic imaging. A parallel version of this algorithm is widely implemented using MPI. Heterogeneous architectures that use GPUs have become popular in the Top 500 because of their performance/power ratio. In this paper, we discuss the methodology and code transformations used to port OWEM to GPUs using OpenACC, along with the code changes needed for scaling the application up to 18,400 GPUs (more than 98%) of the Titan leadership-class supercomputer at Oak Ridge National Laboratory. For the individual OpenACC kernels, we achieved an average speedup of 3X on a test dataset using one GPU as compared with an 8-core Intel Sandy Bridge CPU. The application was then run at large scale on the Titan supercomputer, achieving a peak of 1.2 petaflops using an average of 5.5 megawatts. After porting the application to GPUs, we discuss how we dealt with other challenges of running at scale, such as the application becoming more I/O bound and prone to silent errors. We believe this work will serve as valuable proof that directive-based programming models are a viable option for scaling HPC applications to heterogeneous architectures.
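
The abstract does not reproduce any of the OWEM kernels, so the following is only a minimal sketch of what directive-based offloading with OpenACC looks like in C. The `axpy` function, its parameters, and the choice of data clauses are assumptions made for illustration and are not taken from the paper. The point it shows is that the loop body stays ordinary C and a single directive asks the compiler to run it on an accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel, not from the OWEM code: y[i] = a * x[i] + y[i]. */
void axpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Precondition: non-empty inputs must be valid pointers. */
    assert(n == 0 || (x != NULL && y != NULL));

    /* copyin: x is only read on the device.
     * copy:   y is sent to the device and copied back afterwards.
     * A compiler without OpenACC support ignores the pragma and runs
     * the loop serially on the CPU, with the same result. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Built with an OpenACC-capable compiler (for example `gcc -fopenacc` or `nvc -acc`), the loop may be offloaded to a GPU; built without that support, the same source still compiles and runs on the host, which is part of the appeal of the directive-based approach the abstract argues for.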



Citations: 1