
2011 International Conference on Parallel Architectures and Compilation Techniques: Latest Publications

A Hierarchical Approach to Maximizing MapReduce Efficiency
Zhiwei Xiao, Haibo Chen, B. Zang
MapReduce has been widely recognized for its elastic scalability and fault tolerance, while its efficiency has been relatively disregarded; efficiency, however, is equally important in "pay-as-you-go" cloud systems such as Amazon's Elastic MapReduce. This paper argues that there are multiple levels of data locality and parallelism in typical multicore clusters that affect performance. By characterizing the performance limitations of typical MapReduce applications on multicore-based Hadoop clusters, we show that the current JVM-based runtime (i.e., Task Worker) fails to exploit data locality and task parallelism at the single-node level. Based on this study, we extend Hadoop with a hierarchical MapReduce model and seamlessly integrate an efficient multicore MapReduce runtime into Hadoop, resulting in a system we call Azwraith. This hierarchical scheme enables MapReduce applications to exploit locality and parallelism at both the cluster level and the single-node level. To reuse data across job boundaries, we also extend Azwraith with an effective in-memory cache scheme that significantly reduces network and disk traffic. Performance evaluation on a small-scale cluster shows that Azwraith, combined with the optimizations, outperforms Hadoop by 1.4x to 3.5x.
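The hierarchical idea can be illustrated with a toy word count: each node first runs a full map/combine pass over its local shard (exploiting single-node locality and parallelism), so only compact per-node partial sums cross the network for the cluster-level merge. A minimal single-process sketch; the function names are illustrative, not Azwraith's API:

```python
from collections import Counter

def node_mapreduce(lines):
    """Node-level pass: map and combine locally, so only compact
    partial sums (not raw records) leave the node."""
    local = Counter()
    for line in lines:
        for word in line.split():
            local[word] += 1
    return local

def cluster_reduce(partials):
    """Cluster-level pass: merge the per-node partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Two "nodes", each holding a shard of the input.
shards = [["a b a", "c a"], ["b b", "a c"]]
result = cluster_reduce(node_mapreduce(s) for s in shards)
```

The combine step at node level is what shrinks cross-node traffic: each node ships one counter instead of one record per word occurrence.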
Citations: 6
Building Retargetable and Efficient Compilers for Multimedia Instruction Sets
S. Guelton, A. Guinet, R. Keryell
Multimedia instruction sets were introduced more than 20 years ago to speed up multimedia processing on general-purpose processors. However, to take advantage of these instructions, developers have to cope with low-level assembly or the equivalent C interfaces, which hinders code portability and increases development costs. An alternative is to let the compiler automatically generate optimized versions: ICC generates relatively efficient code for its supported platforms but does not target other processors. GCC, on the other hand, targets a wide range of devices but generates less efficient code. In this paper, we present a retargetable compiler infrastructure for SIMD architectures based on three key points: a generic multimedia instruction set, the combination of loop vectorization with superword-level parallelism, and memory transfer optimizations. Three compilers, for SSE, AVX, and NEON, have been built using this common infrastructure.
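The "generic multimedia instruction set" idea can be illustrated by a per-target lowering table that maps one generic vector operation onto each ISA's intrinsic. A toy sketch: the generic name `vadd.i32` is made up, but the three intrinsics are the actual packed 32-bit integer adds on SSE2, AVX2, and NEON:

```python
# Toy lowering table: one generic vector op per target intrinsic.
LOWERING = {
    "sse":  {"vadd.i32": "_mm_add_epi32"},     # SSE2 128-bit add
    "avx":  {"vadd.i32": "_mm256_add_epi32"},  # AVX2 256-bit add
    "neon": {"vadd.i32": "vaddq_s32"},         # NEON 128-bit add
}

def lower(generic_op, target):
    """Select the target-specific intrinsic for a generic vector op;
    retargeting the compiler means adding one more table column."""
    return LOWERING[target][generic_op]
```

A real infrastructure would of course also model vector widths and alignment, but the table captures why one front end can feed three back ends.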
Citations: 5
Beforehand Migration on D-NUCA Caches
Javier Lira, Timothy M. Jones, Carlos Molina, Antonio González
Determining the best placement for data in the NUCA cache at any particular moment during program execution is crucial for exploiting the benefits that this architecture provides. Dynamic NUCA (D-NUCA) allows data to be mapped to multiple banks within the NUCA cache, and then uses data migration to adapt data placement to the program's behavior. Although the standard migration scheme is effective in moving data to its optimal position within the cache, half the hits still occur within non-optimal banks. This proposal reduces this number by migrating data beforehand.
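The contrast between standard gradual migration and "beforehand" migration can be sketched with a toy bank model (illustrative only, not the paper's mechanism): a hit in a far bank promotes the line one bank closer, while beforehand migration moves a line predicted to be accessed next all the way to the closest bank before the access happens.

```python
class ToyDNUCA:
    """Toy D-NUCA: banks[0] is closest to the core."""
    def __init__(self, nbanks):
        self.banks = [set() for _ in range(nbanks)]

    def insert(self, line, bank):
        self.banks[bank].add(line)

    def access(self, line):
        """Return the bank where the hit occurred (None on a miss),
        then promote the line one bank closer (gradual migration)."""
        for i, bank in enumerate(self.banks):
            if line in bank:
                if i > 0:
                    bank.remove(line)
                    self.banks[i - 1].add(line)
                return i
        return None

    def prefetch_migrate(self, line):
        """'Beforehand' migration: move a line predicted to be used
        next to the closest bank before it is accessed."""
        for i, bank in enumerate(self.banks):
            if line in bank and i > 0:
                bank.remove(line)
                self.banks[0].add(line)
                return
```

With gradual migration alone, the first accesses still hit in non-optimal banks; migrating beforehand makes the next access hit in the closest bank immediately.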
Citations: 2
Linear-time Modeling of Program Working Set in Shared Cache
Xiaoya Xiang, Bin Bao, C. Ding, Yaoqing Gao
Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.
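The quantity being computed can be defined by brute force: the average footprint for window length w is the mean number of distinct data items over all n-w+1 length-w windows of the trace. A naive sketch of that definition follows (O(n·w) per window length; the paper's contribution is computing it for every w in overall linear time):

```python
def avg_footprint(trace, w):
    """Average footprint for window length w, by brute force over
    all n-w+1 windows of the access trace."""
    n = len(trace)
    sizes = [len(set(trace[i:i + w])) for i in range(n - w + 1)]
    return sum(sizes) / len(sizes)

# Toy access trace over three data items.
trace = ["a", "b", "a", "c", "b", "a"]
```

For example, the four length-3 windows of this trace touch 2, 3, 3, and 3 distinct items, so the average footprint at w=3 is 2.75.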
Citations: 67
Divergence Analysis and Optimizations
Bruno Coutinho, Diogo Sampaio, Fernando Magno Quintão Pereira, Wagner Meira Jr
The growing interest in GPU programming has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, the model also brings restrictions. In particular, processing elements (PEs) execute in lock-step and may lose performance due to divergences caused by conditional branches. In the face of divergence, some PEs execute while others wait, and this alternation ends when they reach a synchronization point. In this paper we introduce divergence analysis, a static analysis that determines which program variables will have the same value for every PE. This analysis is useful in three different ways: it improves the translation of SIMD code to non-SIMD CPUs, it helps developers manually improve their SIMD applications, and it guides the compiler in optimizing SIMD programs. We demonstrate this last point by introducing branch fusion, a new compiler optimization that identifies, via a gene-sequencing algorithm, chains of similarities between divergent program paths and weaves these paths together as much as possible. Our implementation has been accepted into the Ocelot open-source CUDA compiler and is publicly available. We have tested it on many industrial-strength GPU benchmarks, including Rodinia and Nvidia's SDK. Our divergence analysis has a 34% false-positive rate compared to the results of a dynamic profiler. Our automatic optimization adds a 3% speed-up to parallel quicksort, a heavily optimized benchmark. Our manual optimizations extend this number to over 10%.
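The cost of divergence can be sketched with a toy lock-step model (illustrative only, not the paper's analysis): when the branch condition differs across lanes, both paths are serialized under complementary masks, so the hardware pays for two passes; when the condition is uniform, one pass suffices.

```python
def simd_if(cond_per_lane, then_fn, else_fn, values):
    """Lock-step execution of a branch: each lane keeps the result
    of its own path, but a divergent condition serializes the two
    paths, doubling the passes the hardware must execute."""
    divergent = len(set(cond_per_lane)) > 1
    out = [then_fn(v) if c else else_fn(v)
           for c, v in zip(cond_per_lane, values)]
    passes = 2 if divergent else 1
    return out, passes
```

A static divergence analysis that proves a condition uniform for every PE lets the compiler assume the single-pass case.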
Citations: 104
TIDeFlow: A Parallel Execution Model for High Performance Computing Programs
Daniel A. Orozco
The popularity of serial execution paradigms in the High Performance Computing (HPC) field greatly hinders the ability of computational scientists to develop and support massively parallel programs. Programmers are left with languages that are inadequate for expressing parallel constructs and are forced to make decisions that are not directly related to the programs they write. Computer architects are forced to support sequential memory semantics only because serial languages require them, and operating system designers are forced to support slow synchronization operations. This poster addresses the development and execution of HPC programs on many-core architectures by introducing the Time Iterated Dependency Flow (TIDeFlow) execution model.
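The core dataflow idea behind such execution models can be sketched in a few lines (a generic toy scheduler, not TIDeFlow's actual runtime): a task becomes ready as soon as all of its predecessors have executed, with no global sequential order imposed.

```python
from collections import deque

def dataflow_run(tasks, deps):
    """Toy dataflow scheduler: fire each task once its dependency
    count drops to zero (deps are (producer, consumer) edges)."""
    indeg = {t: 0 for t in tasks}
    succs = {t: [] for t in tasks}
    for a, b in deps:
        indeg[b] += 1
        succs[a].append(b)
    ready = deque(t for t in tasks if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return order

order = dataflow_run(["load", "fft", "store"],
                     [("load", "fft"), ("fft", "store")])
```

Independent tasks in the ready queue could be dispatched to any core; only the dependency edges, not program order, constrain execution.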
Citations: 7
A Heterogeneous Parallel Framework for Domain-Specific Languages
Kevin J. Brown, Arvind K. Sujeeth, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, K. Olukotun
Computing systems are becoming increasingly parallel and heterogeneous, and therefore new applications must be capable of exploiting parallelism in order to continue achieving high performance. However, targeting these emerging devices often requires using multiple disparate programming models and making decisions that can limit forward scalability. In previous work we proposed the use of domain-specific languages (DSLs) to provide high-level abstractions that enable transformations to high performance parallel code without degrading programmer productivity. In this paper we present a new end-to-end system for building, compiling, and executing DSL applications on parallel heterogeneous hardware, the Delite Compiler Framework and Runtime. The framework lifts embedded DSL applications to an intermediate representation (IR), performs generic, parallel, and domain-specific optimizations, and generates an execution graph that targets multiple heterogeneous hardware devices. Finally we present results comparing the performance of several machine learning applications written in OptiML, a DSL for machine learning that utilizes Delite, to C++ and MATLAB implementations. We find that the implicitly parallel OptiML applications achieve single-threaded performance comparable to C++ and outperform explicitly parallel MATLAB in nearly all cases.
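The "lift to an IR, then optimize generically" flow can be sketched with a toy embedded DSL (illustrative only, not Delite's API): DSL operations build IR nodes instead of executing immediately, and a generic pass such as map fusion rewrites the graph before execution, eliminating an intermediate array.

```python
class Map:
    """IR node for a data-parallel map over a source (a nested Map
    or a plain list acting as the leaf input)."""
    def __init__(self, fn, src):
        self.fn, self.src = fn, src

def fuse(node):
    """Generic optimization: rewrite map(g, map(f, xs)) into a
    single map(g . f, xs)."""
    if isinstance(node, Map) and isinstance(node.src, Map):
        inner = fuse(node.src)
        return Map(lambda x, f=inner.fn, g=node.fn: g(f(x)), inner.src)
    return node

def execute(node):
    """Walk the IR; here sequentially, but each Map could be
    dispatched to any parallel device."""
    if isinstance(node, Map):
        return [node.fn(x) for x in execute(node.src)]
    return node

prog = Map(lambda x: x * 2, Map(lambda x: x + 1, [1, 2, 3]))
fused = fuse(prog)
```

Because the program is a graph rather than eagerly executed code, the same IR can also be handed to different back ends (CPU threads, GPU kernels) at the execution-graph stage.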
Citations: 201
Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling
Jungseob Lee, V. Sathish, M. Schulte, Katherine Compton, N. Kim
State-of-the-art graphics processing units (GPUs) can offer very high computational throughput for highly parallel applications using hundreds of integrated cores. In general, the peak throughput of a GPU is proportional to the product of the number of cores and their frequency. However, this product is often limited by a power constraint. Although throughput can be increased with more cores for some applications, it cannot for others, because application parallelism and/or the bandwidth of on-chip interconnects/caches and off-chip memory are limited. In this paper, we first demonstrate that adjusting the number of operating cores and the voltage/frequency of cores and/or on-chip interconnects/caches for different applications can improve the throughput of GPUs under a power constraint. Second, we show that dynamically scaling the number of operating cores and the voltages/frequencies of both cores and on-chip interconnects/caches at runtime can improve application throughput even further. Our experimental results show that a GPU adopting our runtime dynamic voltage/frequency and core scaling technique can provide up to 38% (and nearly 20% on average) higher throughput than the baseline GPU under the same power constraint.
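The optimization being searched can be sketched as follows (toy models with made-up coefficients, not the paper's power model): throughput scales with cores × frequency until a bandwidth ceiling, dynamic power grows roughly cubically with frequency, and the scheduler picks the best (cores, frequency) pair that fits under the cap.

```python
def best_config(core_counts, freqs, power_cap, mem_bw,
                c_dyn=0.4, c_static=0.2):
    """Exhaustive search for the (cores, frequency) pair maximizing
    throughput under a power cap.  Toy models: per-core power is
    c_dyn*f^3 + c_static, and throughput is cores*f capped by
    memory bandwidth (all coefficients are illustrative)."""
    best = None
    for n in core_counts:
        for f in freqs:
            power = n * (c_dyn * f ** 3 + c_static)
            if power > power_cap:
                continue
            tput = min(n * f, mem_bw)
            if best is None or tput > best[0]:
                best = (tput, n, f)
    return best

best = best_config([4, 8, 16], [0.5, 1.0, 1.5], power_cap=10, mem_bw=12)
```

Note how the winner here is many cores at a moderate frequency: the cubic dynamic-power term makes the highest frequency unaffordable under the cap, and the bandwidth ceiling makes extra raw compute useless beyond it.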
Citations: 73
DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, N. Honarmand, S. Adve, Vikram S. Adve, N. Carter, Ching-Tsun Chou
For parallelism to become tractable for mass programmers, shared-memory languages and environments must evolve to enforce disciplined practices that ban "wild" shared-memory behaviors, e.g., unstructured parallelism, arbitrary data races, and ubiquitous non-determinism. This software evolution is a rare opportunity for hardware designers to rethink hardware from the ground up to exploit opportunities exposed by such disciplined software models. Such a co-designed effort is more likely to achieve many-core scalability than a software-oblivious hardware evolution. This paper presents DeNovo, a hardware architecture motivated by these observations. We show how a disciplined parallel programming model greatly simplifies cache coherence and consistency while enabling a more efficient communication and cache architecture. The DeNovo coherence protocol is simple because it eliminates transient states: verification using model checking shows 15X fewer reachable states than a state-of-the-art implementation of the conventional MESI protocol. The DeNovo protocol is also more extensible: adding two sophisticated optimizations, flexible communication granularity and direct cache-to-cache transfers, did not introduce additional protocol states (unlike MESI). Finally, DeNovo shows better cache hit rates and network traffic, translating to better performance and energy efficiency. Overall, a disciplined shared-memory programming model allows DeNovo to seamlessly integrate message-passing-like interactions within a global address space for improved design complexity, performance, and efficiency.
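The flavor of coherence without invalidation traffic can be sketched with a heavily simplified toy (illustrative only, abstracting away almost all of the real protocol): a write registers the value with a directory instead of invalidating sharers, and at each synchronization point a core self-invalidates every address it does not own, re-fetching the registered value on its next read.

```python
class ToyRegistrationDir:
    """Toy registration-based coherence directory."""
    def __init__(self):
        self.registered = {}            # addr -> (owner core, value)

    def write(self, core, addr, value, caches):
        """Register the write; no invalidation messages to sharers."""
        self.registered[addr] = (core, value)
        caches[core][addr] = value      # owner keeps a valid copy

    def sync(self, core, caches):
        """At a synchronization point, self-invalidate every cached
        address this core does not own."""
        mine = {a for a, (o, _) in self.registered.items() if o == core}
        caches[core] = {a: v for a, v in caches[core].items() if a in mine}

    def read(self, core, addr, caches):
        """On a miss, fetch the registered (up-to-date) value."""
        if addr not in caches[core]:
            caches[core][addr] = self.registered[addr][1]
        return caches[core][addr]
```

The disciplined programming model is what makes this safe: data-race freedom guarantees that stale copies are only ever read before the synchronization point that would have invalidated them.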
Citations: 173
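The DeNovo abstract above hinges on one idea: under disciplined (data-race-free, phase-structured) parallelism, a write can simply register ownership at a directory, and readers self-invalidate possibly-stale data at phase boundaries — so the protocol needs no transient states or invalidation acknowledgements. A toy single-process sketch of that idea follows; the class names, the three-state simplification, and the `phase_boundary` API are illustrative assumptions, not the paper's actual protocol:

```python
class Core:
    """Toy model of DeNovo-style per-word coherence with three states:
    Invalid, Valid (clean readable copy), Registered (up-to-date owner)."""
    INVALID, VALID, REGISTERED = range(3)

    def __init__(self, cid, registry, memory):
        self.cid = cid
        self.registry = registry  # shared "directory": addr -> owning Core
        self.memory = memory      # shared backing store: addr -> value
        self.cache = {}           # private cache: addr -> [state, value]

    def write(self, addr, value):
        # Registering ownership is a single directory update; no
        # invalidation acks are collected, hence no transient states.
        self.registry[addr] = self
        self.cache[addr] = [Core.REGISTERED, value]

    def read(self, addr):
        state, value = self.cache.get(addr, (Core.INVALID, None))
        if state == Core.INVALID:
            owner = self.registry.get(addr)
            value = owner.cache[addr][1] if owner is not None else self.memory[addr]
            self.cache[addr] = [Core.VALID, value]
        return self.cache[addr][1]

    def phase_boundary(self, maybe_written):
        # Self-invalidate Valid copies of data another core may have
        # written during the last parallel phase; Registered data stays.
        for addr in maybe_written:
            entry = self.cache.get(addr)
            if entry and entry[0] == Core.VALID:
                del self.cache[addr]


# Demo: races within a phase are forbidden by the disciplined model, so
# a stale Valid copy is only ever observed until the next phase boundary.
registry, memory = {}, {"x": 0}
c0, c1 = Core(0, registry, memory), Core(1, registry, memory)
c0.write("x", 1)
assert c1.read("x") == 1
c0.write("x", 2)            # next phase: c0 re-registers ownership
c1.phase_boundary({"x"})    # c1 self-invalidates its stale Valid copy
assert c1.read("x") == 2
```

The point of the sketch is what is absent: no per-write invalidation messages to sharers and no transient wait-for-acks states, which is the source of the verification simplicity the abstract reports.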
Dynamic Fine-Grain Scheduling of Pipeline Parallelism
Daniel Sánchez, David Lo, Richard M. Yoo, J. Sugerman, C. Kozyrakis
Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data-parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per-stage queues and queuing policies, coupled with a backpressure mechanism, allow us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that the GRAMPS scheduler outperforms several other commonly used scheduling approaches. 
Specifically, while a typical task-stealing scheduler performs on par with GRAMPS on simple graphs, it does significantly worse on complex ones, a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints, and a typical static, streaming scheduler achieves somewhat better locality, but suffers significant load imbalance on a general-purpose multi-core due to fine-grain architecture variability (e.g., cache misses and SMT).
{"title":"Dynamic Fine-Grain Scheduling of Pipeline Parallelism","authors":"Daniel Sánchez, David Lo, Richard M. Yoo, J. Sugerman, C. Kozyrakis","doi":"10.1109/PACT.2011.9","DOIUrl":"https://doi.org/10.1109/PACT.2011.9","url":null,"abstract":"Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or the underlying hardware causes load imbalance, hindering performance. On the other hand, existing schemes for dynamic fine-grain load balancing (such as task-stealing) do not work well on pipeline-parallel programs: they cannot guarantee memory footprint bounds, and do not adequately schedule complex graphs or graphs with ordered queues. We present a scheduler implementation for pipeline-parallel programs that performs fine-grain dynamic load balancing efficiently. Specifically, we implement the first real runtime for GRAMPS, a recently proposed programming model that focuses on supporting irregular pipeline and data-parallel applications (in contrast to classical stream programming models and schedulers, which require programs to be regular). Task-stealing with per-stage queues and queuing policies, coupled with a backpressure mechanism, allow us to maintain strict footprint bounds, and a buffer management scheme based on packet-stealing allows low-overhead and locality-aware dynamic allocation of queue data. We evaluate our runtime on a multi-core SMP and find that it provides low-overhead scheduling of irregular workloads while maintaining locality. We also show that the GRAMPS scheduler outperforms several other commonly used scheduling approaches. 
Specifically, while a typical task-stealing scheduler performs on par with GRAMPS on simple graphs, it does significantly worse on complex ones, a canonical GPGPU scheduler cannot exploit pipeline parallelism and suffers from large memory footprints, and a typical static, streaming scheduler achieves somewhat better locality, but suffers significant load imbalance on a general-purpose multi-core due to fine-grain architecture variability (e.g., cache misses and SMT).","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126203212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
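The scheduling idea in the GRAMPS abstract — per-stage bounded queues with backpressure, so a producer cannot outrun a full consumer queue — can be sketched in a few lines. This single-threaded toy (the stage names, the `capacity` parameter, and the drain-downstream-first policy are illustrative assumptions, not the paper's runtime) shows how backpressure bounds the footprint of a linear pipeline:

```python
from collections import deque


class Stage:
    """One pipeline stage with a bounded FIFO input queue."""
    def __init__(self, name, fn, capacity):
        self.name = name
        self.fn = fn
        self.capacity = capacity  # bound on this stage's input queue
        self.inq = deque()


def run_pipeline(stages, inputs):
    """Scheduler loop: prefer runnable work in the most-downstream stage,
    and refuse to run a producer whose consumer's queue is full
    (backpressure), so in-flight items never exceed the queue bounds."""
    stages[0].inq.extend(inputs)
    sink = []
    while any(st.inq for st in stages):
        for i in range(len(stages) - 1, -1, -1):  # downstream first
            st = stages[i]
            if not st.inq:
                continue
            downstream = stages[i + 1] if i + 1 < len(stages) else None
            if downstream and len(downstream.inq) >= downstream.capacity:
                continue  # backpressure: consumer is full, try elsewhere
            result = st.fn(st.inq.popleft())
            if downstream:
                downstream.inq.append(result)
            else:
                sink.append(result)
            break  # rescan so downstream work drains before new is made
    return sink


# Demo: a two-stage pipeline; the consumer queue holds at most one item.
pipeline = [Stage("produce", lambda x: 2 * x, capacity=4),
            Stage("consume", lambda x: x + 1, capacity=1)]
assert run_pipeline(pipeline, [1, 2, 3]) == [3, 5, 7]
```

Because each queue is FIFO and bounded, ordering is preserved and the in-flight footprint never exceeds the sum of stage capacities — the kind of strict bound the abstract refers to; the real runtime additionally load-balances by stealing tasks and packets across worker threads.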
Journal
2011 International Conference on Parallel Architectures and Compilation Techniques