Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces the on-chip bandwidth requirements and dynamic energy (power) consumption by an average of 33.3% and 31.2% (20.2%) respectively. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than that at the L2 level.
"Improving support for locality and fine-grain sharing in chip multiprocessors", Hemayet Hossain, S. Dwarkadas, Michael C. Huang. DOI: 10.1145/1454115.1454138. In: 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
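The sharing patterns ARMCO adapts to (migratory, producer-consumer, multiple-reader, multiple read-write) can be illustrated with a toy software classifier over a cache line's access history. This is a sketch for intuition only: the pattern names come from the abstract, but the heuristics and the history representation are invented here and are not the paper's hardware mechanism.

```python
# Toy sharing-pattern classifier for one cache line. Illustrative only:
# the detection heuristics below are hypothetical, not ARMCO's hardware.

def classify_sharing(history):
    """history: list of (core_id, op) with op in {'R', 'W'}."""
    writers = {c for c, op in history if op == 'W'}
    readers = {c for c, op in history if op == 'R'}
    if not writers:
        return 'multiple-reader' if len(readers) > 1 else 'private'
    if len(writers) == 1:
        # One writer, other cores only read: classic producer-consumer.
        return 'producer-consumer' if readers - writers else 'private'
    # Several writers: call it migratory if the line visits each core in
    # one read-modify-write burst and never returns; else multiple read-write.
    cores_in_order = []
    for c, _ in history:
        if not cores_in_order or cores_in_order[-1] != c:
            cores_in_order.append(c)
    if len(cores_in_order) == len(set(cores_in_order)):
        return 'migratory'
    return 'multiple read-write'
```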
Marc González, Nikola Vujic, X. Martorell, E. Ayguadé, A. Eichenberger, Tong Chen, Zehra Sura, Tao Zhang, K. O'Brien, Kathryn M. O'Brien
Ease of programming is one of the main impediments to the broad acceptance of multi-core systems that lack hardware support for transparent data transfer between local and global memories. A software cache is a robust approach to providing the user with a transparent view of the memory architecture, but it can suffer from poor performance. In this paper, we propose a hierarchical, hybrid software-cache architecture that classifies memory accesses at compile time into two classes: high-locality and irregular. Our approach then steers the memory references toward one of two specific cache structures optimized for their respective access pattern. The specific cache structures are optimized to enable high-level compiler optimizations to aggressively unroll loops, reorder cache references, and/or transform surrounding loops so as to practically eliminate the software-cache overhead in the innermost loop. Performance evaluation indicates that the improvements due to the optimized software-cache structures, combined with the proposed code optimizations, translate into speedup factors of 3.5 to 8.4 compared to a traditional software-cache approach. As a result, we demonstrate that the Cell BE processor can be a competitive alternative to a modern server-class multi-core such as the IBM Power5 processor for a set of parallel NAS applications.
"Hybrid access-specific software cache techniques for the Cell BE architecture", Marc González et al. DOI: 10.1145/1454115.1454156. PACT 2008.
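The split between high-locality and irregular accesses can be sketched in plain Python: a direct-mapped line cache serves accesses tagged high-locality, and a tiny fully-associative LRU cache serves irregular ones. All sizes, the line width, and the per-access tagging flag are illustrative assumptions, not the paper's Cell implementation.

```python
# Minimal sketch of a hybrid, access-specific software cache: two
# structures, chosen per access by a (hypothetical) compile-time tag.

from collections import OrderedDict

LINE = 4  # words per cache line (arbitrary for the sketch)

class HybridSoftwareCache:
    def __init__(self, memory, dm_lines=8, fa_lines=4):
        self.mem = memory
        self.dm = [None] * dm_lines   # direct-mapped: slot -> (tag, data)
        self.fa = OrderedDict()       # fully-associative, LRU order
        self.fa_capacity = fa_lines
        self.misses = 0

    def _fetch(self, line_no):
        self.misses += 1
        base = line_no * LINE
        return list(self.mem[base:base + LINE])

    def load(self, addr, high_locality):
        line_no, off = divmod(addr, LINE)
        if high_locality:             # regular path: direct-mapped lookup
            slot = line_no % len(self.dm)
            entry = self.dm[slot]
            if entry is None or entry[0] != line_no:
                entry = (line_no, self._fetch(line_no))
                self.dm[slot] = entry
            return entry[1][off]
        if line_no not in self.fa:    # irregular path: small LRU cache
            if len(self.fa) >= self.fa_capacity:
                self.fa.popitem(last=False)   # evict least recently used
            self.fa[line_no] = self._fetch(line_no)
        self.fa.move_to_end(line_no)
        return self.fa[line_no][off]
```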
Technology scaling in integrated circuits has consistently provided dramatic performance improvements in modern microprocessors. However, increasing device counts and decreasing on-chip voltage levels have made transient errors a first-order design constraint that can no longer be ignored. Several proposals have provided fault detection and tolerance through redundantly executing a program on an additional hardware thread or core. While such techniques can provide high fault coverage, they at best provide performance equivalent to the original execution and at worst incur a slowdown due to error checking, contention for shared resources, and synchronization overheads. This work achieves a similar goal of detecting transient errors by redundantly executing a program on an additional processor core; however, it speeds up (rather than slows down) program execution compared to the unprotected baseline case. It makes the observation that a small number of instructions are detrimental to overall performance, and selectively skipping them enables one core to advance far ahead of the other to obtain prefetching and large-instruction-window benefits. We highlight the modest incremental hardware required to support skewed redundancy and demonstrate a speedup of 6%/54% for a collection of integer/floating-point benchmarks while still providing 100% error detection coverage within our sphere of replication. Additionally, we show that a third core can further improve performance while adding error recovery capabilities.
"Skewed redundancy", Gordon B. Bell, Mikko H. Lipasti. DOI: 10.1145/1454115.1454126. PACT 2008.
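The core idea, a leading core that skips costly instructions and a trailing core that executes everything and cross-checks, can be modeled with a toy trace interpreter. Everything below (the trace format, the "slow" flag marking detrimental instructions) is a hypothetical stand-in for illustration, not the paper's microarchitecture.

```python
# Toy model of skewed redundant execution: the leader skips instructions
# marked slow (e.g. cache-miss loads); the trailer executes all of them
# and compares results wherever the leader produced one.

def run_core(trace, skip_slow=False):
    """trace: list of (name, fn, slow). Returns {name: result or None}."""
    out = {}
    for name, fn, slow in trace:
        out[name] = None if (skip_slow and slow) else fn()
    return out

def check_redundant(trace):
    leader = run_core(trace, skip_slow=True)    # runs ahead, skips slow ops
    trailer = run_core(trace, skip_slow=False)  # full, checked execution
    return [n for n in trailer
            if leader[n] is not None and leader[n] != trailer[n]]
```

An empty returned list means no transient error was detected within this toy "sphere of replication".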
Matthew Curtis-Maury, Ankur Shah, F. Blagojevic, Dimitrios S. Nikolopoulos, B. Supinski, M. Schulz
Power has become a primary concern for HPC systems. Dynamic voltage and frequency scaling (DVFS) and dynamic concurrency throttling (DCT) are two software tools (or knobs) for reducing the dynamic power consumption of HPC systems. To date, few works have considered the synergistic integration of DVFS and DCT in performance-constrained systems, and, to the best of our knowledge, no prior research has developed application-aware simultaneous DVFS and DCT controllers in real systems and parallel programming frameworks. We present a multi-dimensional, online performance predictor, which we deploy to address the problem of simultaneous runtime optimization of DVFS and DCT on multi-core systems. We present results from an implementation of the predictor in a runtime library linked to the Intel OpenMP environment and running on an actual dual-processor quad-core system. We show that our predictor derives near-optimal settings of the power-aware program adaptation knobs that we consider. Our overall framework achieves significant reductions in energy (19% mean) and ED2 (40% mean), through simultaneous power savings (6% mean) and performance improvements (14% mean). We also find that our framework outperforms earlier solutions that adapt only DVFS or DCT, as well as one that sequentially applies DCT then DVFS. Further, our results indicate that prediction-based schemes for runtime adaptation compare favorably and typically improve upon heuristic search-based approaches in both performance and energy savings.
"Prediction models for multi-dimensional power-performance optimization on many cores", Matthew Curtis-Maury et al. DOI: 10.1145/1454115.1454151. PACT 2008.
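The knob-selection problem the predictor solves can be sketched as a search over (frequency, thread-count) pairs that minimizes predicted ED2. The analytic time and power models below are invented stand-ins (Amdahl-style time, cubic-in-frequency power), not the paper's hardware-counter-based predictor.

```python
# Sketch of simultaneous DVFS + DCT knob selection. The models are
# hypothetical: time ~ serial + parallel/(threads*freq), power ~ t*f^3.

def predict_time(freq, threads, serial=1.0, parallel=8.0):
    return serial + parallel / (threads * freq)

def predict_power(freq, threads, base=0.5):
    return base + threads * freq ** 3

def best_knobs(freqs, thread_counts):
    def ed2(cfg):
        f, t = cfg
        d = predict_time(f, t)
        # ED2 = energy * delay^2 = (power * delay) * delay^2 = P * D^3
        return predict_power(f, t) * d ** 3
    return min(((f, t) for f in freqs for t in thread_counts), key=ed2)
```

With these stand-in models, scaling frequency down while keeping all threads active wins, mirroring the paper's point that the two knobs must be considered jointly.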
The emergence of power as a first-class design constraint has fueled the proposal of a growing number of run-time power optimizations. Many of these optimizations trade off power-saving opportunity for a variable performance loss that depends on application characteristics and program phase. Furthermore, the potential benefits of these optimizations are sometimes non-additive, and it can be difficult to identify which combinations of these optimizations to apply. Trial-and-error approaches have been proposed to adaptively tune a processor. However, in a chip multiprocessor, the cost of individually configuring each core under a wide range of optimizations would be prohibitive under simple trial-and-error approaches. In this work, we introduce an adaptive, multi-optimization power saving strategy for multi-core power management. Specifically, we solve the problem of meeting a global chip-wide power budget through run-time adaptation of highly configurable processor cores. Our approach applies analytic modeling to reduce exploration time and decrease the reliance on trial-and-error methods. We also introduce risk evaluation to balance the benefit of various power saving optimizations against the potential performance loss. Overall, we find that our approach can significantly reduce processor power consumption compared to alternative optimization strategies.
"Multi-Optimization power management for chip multiprocessors", Ke Meng, R. Joseph, R. Dick, L. Shang. DOI: 10.1145/1454115.1454141. PACT 2008.
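The budget-fitting problem can be illustrated with a greedy heuristic: start every core at its fastest configuration and repeatedly downgrade the core that loses the least performance per watt saved. Both the heuristic and the configuration tables are invented for illustration; the paper uses analytic modeling plus risk evaluation rather than this simple rule.

```python
# Sketch of meeting a chip-wide power budget by per-core adaptation.

def fit_power_budget(cores, budget):
    """cores: one list per core of (power, perf) configs, sorted from
    highest to lowest power. Returns the chosen config index per core,
    or None if the budget is infeasible even at the lowest settings."""
    choice = [0] * len(cores)    # start everyone at the fastest config
    def total_power():
        return sum(cores[i][choice[i]][0] for i in range(len(cores)))
    while total_power() > budget:
        best, best_cost = None, None
        for i, cfgs in enumerate(cores):
            if choice[i] + 1 >= len(cfgs):
                continue         # core already at its lowest-power config
            p0, f0 = cfgs[choice[i]]
            p1, f1 = cfgs[choice[i] + 1]
            cost = (f0 - f1) / (p0 - p1)   # perf lost per watt saved
            if best_cost is None or cost < best_cost:
                best, best_cost = i, cost
        if best is None:
            return None          # out of downgrades: budget infeasible
        choice[best] += 1
    return choice
```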
Bingsheng He, Wenbin Fang, Qiong Luo, N. Govindaraju, Tuyong Wang
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures are those of special-purpose co-processors and their programming interfaces are typically designed for graphics applications. As the first attempt to harness the GPU's power for MapReduce, we developed Mars on an NVIDIA G80 GPU, which contains over one hundred processors, and evaluated it against Phoenix, the state-of-the-art MapReduce framework on multi-core CPUs. Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface. It is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
"Mars: A MapReduce Framework on graphics processors", Bingsheng He et al. DOI: 10.1145/1454115.1454152. PACT 2008.
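The MapReduce contract that Mars exposes (map emits key/value pairs, the runtime groups them by key, reduce folds each group) can be shown with a sequential pure-Python skeleton; none of Mars's GPU machinery is modeled here.

```python
# Minimal sequential MapReduce skeleton mirroring the two-function
# user interface; Mars runs the map/reduce instances on the GPU instead.

from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):     # Map: emit (key, value) pairs
            groups[key].append(value)
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}   # Reduce

# Word count, the canonical MapReduce example:
def wc_map(line):
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    return sum(counts)
```

The framework keeps this same two-function contract regardless of where the map and reduce instances actually execute, which is what lets Mars hide the GPU behind it.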
Manman Ren, Ji Young Park, M. Houston, A. Aiken, W. Dally
Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine's particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of inter-dependent parameters that require tuning, and manually searching the resultant large, non-linear space of program parameters is a tedious process of trial and error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3s.
"A tuning framework for software-managed memory hierarchies", Manman Ren et al. DOI: 10.1145/1454115.1454155. PACT 2008.
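The kind of search such a tuning framework automates can be sketched as greedy descent over inter-dependent parameters against a measured cost. Here the cost function is a made-up stand-in for "compile, run, and time the program", and the single tile-size parameter is purely illustrative.

```python
# Sketch of an autotuning search loop: hill-climb over configurations,
# treating cost() as an opaque measurement of the generated program.

def autotune(cost, start, neighbors, max_steps=100):
    """Greedy descent: repeatedly move to the best neighboring
    configuration until no neighbor improves the measured cost."""
    current, best = start, cost(start)
    for _ in range(max_steps):
        cands = [(cost(n), n) for n in neighbors(current)]
        if not cands:
            break
        c, n = min(cands)
        if c >= best:
            break            # local minimum: no neighbor improves
        current, best = n, c
    return current, best
```

Real frameworks replace this inner loop with smarter strategies (pruning via models, sampling), but the outer structure, propose-measure-keep-the-best, is the same.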
H. Hayashizaki, Yutaka Sugawara, M. Inaba, K. Hiraki
Massively parallel machines that integrate a large number of simple processors and small scratch-pad memories (SPMs) into a single chip can achieve a high peak performance per watt of power. In these machines, communication optimizations are important because the communication bandwidth tends to be a bottleneck. Previously proposed communication optimizations using copy candidates, which have been shown to be effective, detect frequently reused array regions by compile-time analysis and copy the regions to scratch-pad memories nearer to the processors. However, they have been proposed for uniprocessor systems or small parallel machines with one or more layers of scratch-pad memories, and the analysis time increases when they are applied to massively parallel machines. In this paper, we propose Multilayer Copy-candidate Analysis for Massively Parallel machines (MCAMP), a communication optimization method for massively parallel machines. MCAMP re-formalizes the framework used in earlier works and improves the scalability of the analysis by assuming the homogeneity of the target systems. We implemented an MCAMP optimizer, which takes an input program that consists of perfectly nested loops containing array references and computation codes, and generates optimized communication. We measured the performance of the output programs of the MCAMP optimizer by executing them on a real massively parallel machine GRAPE-DR using a software tool chain that we also implemented. We showed that MCAMP can achieve optimal data transfer patterns and comparable performance to that of hand-optimized codes with a short analysis time.
"MCAMP: Communication optimization on Massively Parallel Machines with hierarchical scratch-pad memory", H. Hayashizaki et al. DOI: 10.1145/1454115.1454132. PACT 2008.
Hyunchul Park, Kevin Fan, S. Mahlke, Taewook Oh, Heeseok Kim, Hong-Seok Kim
Coarse-grained reconfigurable architectures (CGRAs) present an appealing hardware platform by providing the potential for high computation throughput, scalability, low cost, and energy efficiency. CGRAs consist of an array of function units and register files, often organized as a two-dimensional grid. The most difficult challenge in deploying CGRAs is compiler scheduling technology that can efficiently map software implementations of compute-intensive loops onto the array. Traditional schedulers focus on the placement of operations in time and space. With CGRAs, the challenge of placement is compounded by the need to explicitly route operands from producers to consumers. To systematically attack this problem, we take an edge-centric approach to modulo scheduling that makes the routing problem its primary objective. With edge-centric modulo scheduling (EMS), placement is a by-product of the routing process, and the schedule is developed by routing each edge in the dataflow graph. Routing cost metrics provide the scheduler with a global perspective to guide selection. Experiments on a wide variety of compute-intensive loops from the multimedia domain show that EMS improves throughput by 25% over traditional iterative modulo scheduling, and achieves 98% of the throughput of simulated annealing techniques at a fraction of the compilation time.
{"title":"Edge-centric modulo scheduling for coarse-grained reconfigurable architectures","authors":"Hyunchul Park, Kevin Fan, S. Mahlke, Taewook Oh, Heeseok Kim, Hong-Seok Kim","doi":"10.1145/1454115.1454140","DOIUrl":"https://doi.org/10.1145/1454115.1454140","publicationDate":"2008-10-25"}
Divya Gulati, Changkyu Kim, S. Sethumadhavan, S. Keckler, D. Burger
While technology trends have ushered in the age of chip multiprocessors (CMP), a fundamental question is what size to make each core. Most current commercial designs are symmetric CMPs (SCMP) in which all cores are identical, ranging from simple RISC processors to complex out-of-order x86 processors. Some researchers have proposed asymmetric CMPs (ACMP) consisting of multiple types of cores. While less of an issue for ACMPs, the fixed nature of both these architectures makes them vulnerable to mismatches between the granularity of the cores and the parallelism in the workload, which can cause inefficient execution. To remedy this weakness, recent research has proposed flexible-core CMPs (FCMP), which have the capability of aggregating multiple small processing cores to form larger logical processors. FCMPs introduce a new resource allocation and scheduling problem which must determine how many logical processors should be configured, how powerful each processor should be, and where/when each task should run. This paper introduces and motivates this problem, describes the challenges associated with it, and evaluates algorithms appropriate for multitasking on FCMPs. We also evaluate static-core CMPs of various configurations and compare them to FCMPs for various multitasking workloads.
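The allocation problem the abstract describes, deciding how many logical processors to compose and how many base cores each should aggregate, can be made concrete with a small sketch. This is purely illustrative and not the paper's algorithm: it assumes each task's scalability is summarized by an Amdahl's-law parallel fraction, and hands out cores greedily by marginal speedup:

```python
def allocate_cores(tasks, total_cores):
    """Greedy core allocation for a flexible-core CMP (illustrative only).

    tasks: dict task_name -> parallel fraction f, with assumed speedup
           S(n) = 1 / ((1 - f) + f / n) on a logical processor of n cores.
    Every task gets its own logical processor with at least one base core;
    each remaining core goes to the task with the largest marginal gain.
    """
    def speedup(f, n):
        return 1.0 / ((1.0 - f) + f / n)

    alloc = {name: 1 for name in tasks}           # one core per logical processor
    for _ in range(total_cores - len(tasks)):
        # marginal benefit of composing one more core into each logical processor
        best = max(tasks, key=lambda t: speedup(tasks[t], alloc[t] + 1)
                                        - speedup(tasks[t], alloc[t]))
        alloc[best] += 1
    return alloc
```

For example, with a mostly serial task (f = 0.1) and a highly parallel one (f = 0.9) sharing eight base cores, the greedy rule steers nearly all the extra cores to the parallel task. A real FCMP scheduler must also decide this online, pay reconfiguration costs, and handle tasks arriving and departing, which is where the challenges the paper evaluates come in.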
{"title":"Multitasking workload scheduling on flexible-core chip multiprocessors","authors":"Divya Gulati, Changkyu Kim, S. Sethumadhavan, S. Keckler, D. Burger","doi":"10.1145/1454115.1454142","DOIUrl":"https://doi.org/10.1145/1454115.1454142","publicationDate":"2008-10-25"}