Qiong Cai, José González, R. Rakvic, G. Magklis, P. Chaparro, Antonio González
We present a novel mechanism, called meeting point thread characterization, to dynamically detect critical threads in a parallel region. We define the critical thread as the one with the longest completion time in the parallel region. Knowing the criticality of each thread has many potential applications. In this work, we propose two applications: thread delaying for multi-core systems and thread balancing for simultaneous multi-threaded (SMT) cores. Thread delaying saves energy by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing non-critical threads. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. Our experiments on a detailed microprocessor simulator with the Recognition, Mining, and Synthesis applications from Intel's research laboratories reveal that thread delaying can achieve energy savings of more than 40% with negligible performance loss, and that thread balancing can improve performance by 1% to 20%.
{"title":"Meeting points: Using thread criticality to adapt multicore hardware to parallel regions","authors":"Qiong Cai, José González, R. Rakvic, G. Magklis, P. Chaparro, Antonio González","doi":"10.1145/1454115.1454149","DOIUrl":"https://doi.org/10.1145/1454115.1454149","url":null,"abstract":"We present a novel mechanism, called meeting point thread characterization, to dynamically detect critical threads in a parallel region. We define the critical thread the one with the longest completion time in the parallel region. Knowing the criticality of each thread has many potential applications. In this work, we propose two applications: thread delaying for multi-core systems and thread balancing for simultaneous multi-threaded (SMT) cores. Thread delaying saves energy consumptions by running the core containing the critical thread at maximum frequency while scaling down the frequency and voltage of the cores containing non-critical threads. Thread balancing improves overall performance by giving higher priority to the critical thread in the issue queue of an SMT core. Our experiments on a detailed microprocessor simulator with the Recognition, Mining, and Synthesis applications from Intel research laboratory reveal that thread delaying can achieve energy savings up to more than 40% with negligible performance loss. Thread balancing can improve performance from 1% to 20%.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114158041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The document was not made available for publication as part of the conference proceedings.
{"title":"(How) can programmers conquer the multicore menace?","authors":"Saman P. Amarasinghe","doi":"10.1145/1454115.1454117","DOIUrl":"https://doi.org/10.1145/1454115.1454117","url":null,"abstract":"The document was not made available for publication as part of the conference proceedings.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123824006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, chip multiprocessors (CMPs) have arisen as the de facto design for modern high-performance processors, with increasing core counts. An important property of CMPs is that remote, but on-chip, L2 cache accesses are less costly than off-chip accesses; this is in contrast to earlier chip-to-chip or board-to-board multiprocessors, where an access to a remote node is just as costly as, if not more costly than, a main memory access. This motivates on-chip cache migration as a means to retain more data on-chip. However, previously proposed techniques do not scale to high core counts: they do not leverage the on-chip caches of all cores, nor do they have a scalable migration mechanism. In this paper we propose a scalable in-network migration technique which uses hints embedded within the router microarchitecture to steer L2 cache evictions towards free/invalid cache slots in any on-chip core cache, rather than evicting the data off-chip. We show that our technique can provide an average 19% reduction in the number of off-chip memory accesses over the state-of-the-art, beating the performance of a pseudo-optimal migration technique. This is achieved with negligible area overhead and a manageable traffic overhead of 13.4%.
{"title":"Leveraging on-chip networks for data cache migration in chip multiprocessors","authors":"Noel Eisley, L. Peh, L. Shang","doi":"10.1145/1454115.1454144","DOIUrl":"https://doi.org/10.1145/1454115.1454144","url":null,"abstract":"Recently, chip multiprocessors (CMPs) have arisen as the de facto design for modern high-performance processors, with increasing core counts. An important property of CMPs is that remote, but on-chip, L2 cache accesses are less costly than off-chip accesses; this is in contrast to earlier chip-to-chip or board-to-board multiprocessors, where an access to a remote node is just as costly if not more so than a main memory access. This motivates on-chip cache migration as a means to retain more data on-chip. However, previously proposed techniques do not scale to high core counts: they do not leverage the on-chip caches of all cores nor have a scalable migration mechanism. In this paper we propose ascalable in-network migration technique which uses hints embedded within the router microarchitecture to steer L2 cache evictions towards free/invalid cache slots in any on-chip core cache, rather than evicting it off-chip. We show that our technique can provide an average of a 19% reduction in the number of off-chip memory accesses over the state-of-the-art, beating the performance of a pseudo-optimal migration technique. This can be done with negligible area overhead and a manageable traffic overhead of 13.4%.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126314601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Henry Wong, Anne Bracy, E. Schuchman, Tor M. Aamodt, Jamison D. Collins, P. Wang, G. Chinya, Ankur Khandelwal Groen, Hong Jiang, Hong Wang
Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated to 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain, shared-memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully functional synthesizable RTL based on the production-quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware occupies the area of 9 GPU cores and consumes the total power of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on an FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose, non-graphics workloads demonstrates speedups of up to 8.8×.
{"title":"Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor","authors":"Henry Wong, Anne Bracy, E. Schuchman, Tor M. Aamodt, Jamison D. Collins, P. Wang, G. Chinya, Ankur Khandelwal Groen, Hong Jiang, Hong Wang","doi":"10.1145/1454115.1454125","DOIUrl":"https://doi.org/10.1145/1454115.1454125","url":null,"abstract":"Moore's Law and the drive towards performance efficiency have led to the on-chip integration of general-purpose cores with special-purpose accelerators. Pangaea is a heterogeneous CMP design for non-rendering workloads that integrates IA32 CPU cores with non-IA32 GPU-class multi-cores, extending the current state-of-the-art CPU-GPU integration that physically “fuses” existing CPU and GPU designs. Pangaea introduces (1) a resource repartitioning of the GPU, where the hardware budget dedicated for 3D-specific graphics processing is used to build more general-purpose GPU cores, and (2) a 3-instruction extension to the IA32 ISA that supports tighter architectural integration and fine-grain shared memory collaborative multithreading between the IA32 CPU cores and the non-IA32 GPU cores. We implement Pangaea and the current CPU-GPU designs in fully-functional synthesizable RTL based on the production quality RTL of an IA32 CPU and an Intel GMA X4500 GPU. On a 65 nm ASIC process technology, the legacy graphics-specific fixed-function hardware has the area of 9 GPU cores and total power consumption of 5 GPU cores. With the ISA extensions, the latency from the time an IA32 core spawns a GPU thread to the time the thread begins execution is reduced from thousands of cycles to fewer than 30 cycles. Pangaea is synthesized on a FPGA-based prototype and runs off-the-shelf IA32 OSes. A set of general-purpose non-graphics workloads demonstrate speedups of up to 8.8×.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115709234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for selecting the best-performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes, where we believe most published studies fail to correctly predict performance. With the emergence of multi-core processors, we believe techniques similar to ours will be incorporated into communication libraries or language runtimes for performance reasons. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability, and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on a large-scale IBM cluster. Very positive results have also been obtained on large-scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the network interface cards.
{"title":"Runtime optimization of vector operations on large scale SMP clusters","authors":"Costin Iancu, S. Hofmeyr","doi":"10.1145/1454115.1454134","DOIUrl":"https://doi.org/10.1145/1454115.1454134","url":null,"abstract":"“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125544044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work proposes and evaluates improvements to previously known algorithms for redundancy elimination.
{"title":"Redundancy elimination revisited","authors":"K. Cooper, J. Eckhardt, K. Kennedy","doi":"10.1145/1454115.1454120","DOIUrl":"https://doi.org/10.1145/1454115.1454120","url":null,"abstract":"This work proposes and evaluates improvements to previously known algorithms for redundancy elimination.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127520062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as the Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer-loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer-loop vectorization has traditionally been performed by interchanging an outer loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer-loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost-loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost-loop vectorization, when running on Cell BE SPU and PowerPC 970 processors respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
{"title":"Outer-loop vectorization - revisited for short SIMD architectures","authors":"Dorit Nuzman, A. Zaks","doi":"10.1145/1454115.1454119","DOIUrl":"https://doi.org/10.1145/1454115.1454119","url":null,"abstract":"Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, effectively executing their iterations concurrently as much as possible. Outer loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer loop vectorization has traditionally been performed by interchanging an outer-loop with the innermost loop, followed by vectorizing it at the innermost position. A more direct unroll-and-jam approach can be used to vectorize an outer-loop without involving loop interchange, which can be especially suitable for short SIMD architectures. In this paper we revisit the method of outer loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves speedup factors of 3.13 and 2.77 on average across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost loop vectorization, when running on a Cell BE SPU and PowerPC970 processors respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122769472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the last several years GPU devices have started to evolve into supercomputers. New, non-graphics features are rapidly appearing, along with new, more general programming languages. One reason for the quick pace of change is that games and hardware evolve together: hardware vendors review the most popular games, looking for places to add hardware, while game developers review new hardware, looking for places to add more realism. Today, we see both GPU devices and games moving from a model of “looks real” to one of “acts real”. One consequence of “acts real” is that evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs. We will review the difference between a CPU and a GPU. Then we will describe hardware changes added to the current generation of AMD graphics processors, including the introduction of traditional compute operations such as double precision, scatter/gather and local memory. Along with new features, we have added new metrics like performance/watt and performance/dollar. The current AMD GPU processor delivers 9 gigaflops/watt and 5 gigaflops/dollar. For the last two generations, each AMD GPU has provided double the performance/watt of the prior machine. We believe the software community needs to become more aware of and appreciate these metrics. Because this has been a kind of co-evolution and not a process of radical change, current GPU devices have retained a number of odd-sounding transitional features, including fixed functions like memory systems that can do filtering, depth buffers, a rasterizer and the like. Today, each of these remains because it is important for graphics performance. Software on GPU devices also shows transitional features. As AI/physics virtual reality starts to become important, development frameworks have started to shift. Graphics APIs have added compute shaders. Finally, there has been a set of transitional programs implemented by graphics programmers whose only real connection with graphics is that the result is rendered. One early example is Toy Shop, which contains a weak physical simulation of rain on a window (it looks great, but the random number generator would not pass any kind of test). A more recent and better-acting program is March of the Froblins, an AI program related to robotic path calculations. This program both simulates large crowds of independent creatures and shows how massively parallel compute can benefit character-centric entertainment.
{"title":"GPU evolution: Will graphics morph into compute?","authors":"Norman Rubin","doi":"10.1145/1454115.1454116","DOIUrl":"https://doi.org/10.1145/1454115.1454116","url":null,"abstract":"In the last several years GPU devices have started to evolve into supercomputers. New, non-graphics, features are rapidly appearing along with new more general programming languages. One reason for the quick pace of change is that, games and hardware evolve together: Hardware vendors review the most popular games, looking for places to add hardware while game developers review new hardware, looking for places to add more realism. Today, we see both GPU devices and games moving from a model of looks real to one of acts real. One consequence of acts real is that evaluating physics, simulations, and artificial intelligence on a GPU is becoming an element of future game programs. We will review the difference between a CPU and a GPU. Then we will describe hardware changes added to the current generation of AMD graphics processors, including the introduction of traditional compute operations such as double precision, scatter/gather and local memory. Along with new features, we have added new metrics like performance/watt and performance/dollar. The current AMD GPU processor delivers 9 gigaflops/watt and 5 gigaflops/dollar. For the last two generations, each AMD GPU has provided double the performance/watt of the prior machine. We believe the software community needs to become more aware and appreciate these metrics.\u0000 Because this has been a kind of co-evolution and not a process of radical change, current GPU devices have retained a number of odd sounding transitional features, including fixed functions like memory systems that can do filtering, depth buffers, a rasterizer and the like. Today, each of these remain because they are important for graphics performance. Software on GPU devices also shows transitional features. As AI/physics virtual reality starts to become important, development frameworks have started to shift. Graphics APIs have added compute shaders. Finally, there has been a set of transitional programs implemented by graphics programmers but whose only real connection with graphics is that the result is rendered. One early example is toy shop which contains a weak physical simulation of rain on window (it looks great but the random number generator would not pass any kind of test). A more recent and better acting program is March of the Froblins an AI program related to robotic path calculations. This program both simulates large crowds of independent creatures and shows how massively parallel compute can benefit character-centric entertainment.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124198098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents Distributed Cooperative Caching, a scalable and energy-efficient scheme to manage chip multiprocessor (CMP) cache resources. The proposed configuration is based on the Cooperative Caching framework [3], but it is intended for large-scale CMPs. Both the centralized and the distributed configurations have the advantage of combining the benefits of private and shared caches. In our proposal, the Coherence Engine has been redesigned to allow its partitioning and thus eliminate the size constraints imposed by the duplication of all tags. At the same time, a global replacement mechanism has been added to improve the usage of cache space. Our framework uses several Distributed Coherence Engines spread across all the nodes to improve scalability. The distribution permits a better balance of the network traffic over the entire chip, avoiding bottlenecks and increasing performance for a 32-core CMP by 21% over a traditional shared-memory configuration and by 57% over the Cooperative Caching scheme. Furthermore, we have reduced the power consumption of the entire system by using a different tag allocation method and by reducing the number of tags compared on each request. For a 32-core CMP, the Distributed Cooperative Caching framework provides an average improvement in the power/performance relation (MIPS³/W) of 3.66× over a traditional shared-memory configuration and 4.30× over Cooperative Caching.
{"title":"Distributed Cooperative Caching","authors":"E. Herrero, José González, R. Canal","doi":"10.1145/1454115.1454136","DOIUrl":"https://doi.org/10.1145/1454115.1454136","url":null,"abstract":"This paper presents the Distributed Cooperative Caching, a scalable and energy-efficient scheme to manage chip multiprocessor (CMP) cache resources. The proposed configuration is based in the Cooperative Caching framework [3] but it is intended for large scale CMPs. Both centralized and distributed configurations have the advantage of combining the benefits of private and shared caches. In our proposal, the Coherence Engine has been redesigned to allow its partitioning and thus, eliminate the size constraints imposed by the duplication of all tags. At the same time, a global replacement mechanism has been added to improve the usage of cache space. Our framework uses several Distributed Coherence Engines spread across all the nodes to improve scalability. The distribution permits a better balance of the network traffic over the entire chip avoiding bottlenecks and increasing performance for a 32-core CMP by 21% over a traditional shared memory configuration and by 57% over the Cooperative Caching scheme. Furthermore, we have reduced the power consumption of the entire system by using a different tag allocation method and by reducing the number of tags compared on each request. For a 32-core CMP the Distributed Cooperative Caching framework provides an average improvement of the power/performance relation (MIPS3/W) of 3.66× over a traditional shared memory configuration and 4.30× over Cooperative Caching.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116895861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Son, Sai Prashanth Muralidhara, O. Ozturk, M. Kandemir, I. Kolcu, Mustafa Karaköy
I/O prefetching has been employed in the past as one of the mechanisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks, because prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness drops significantly as the number of CPUs increases; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources of this performance loss as the number of CPUs grows; and (iii) propose and experimentally evaluate a profiler- and compiler-assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread to each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, when 8 CPUs are used, our proposed scheme improves performance, on average, by 19.9%, 11.9% and 10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU performing compiler-directed I/O prefetching independently), and with one-CPU prefetching (one CPU reserved for prefetching on behalf of others), respectively.
{"title":"Profiler and compiler assisted adaptive I/O prefetching for shared storage caches","authors":"S. Son, Sai Prashanth Muralidhara, O. Ozturk, M. Kandemir, I. Kolcu, Mustafa Karaköy","doi":"10.1145/1454115.1454133","DOIUrl":"https://doi.org/10.1145/1454115.1454133","url":null,"abstract":"I/O prefetching has been employed in the past as one of the mechanisms to hide large disk latencies. However, I/O prefetching in parallel applications is problematic when multiple CPUs share the same set of disks due to the possibility that prefetches from different CPUs can interact on shared memory caches in the I/O nodes in complex and unpredictable ways. In this paper, we (i) quantify the impact of compiler-directed I/O prefetching - developed originally in the context of sequential execution - on shared caches at I/O nodes. The experimental data collected shows that while I/O prefetching brings benefits, its effectiveness reduces significantly as the number of CPUs is increased; (ii) identify inter-CPU misses due to harmful prefetches as one of the main sources for this reduction in performance with the increased number of CPUs; and (iii) propose and experimentally evaluate a profiler and compiler assisted adaptive I/O prefetching scheme targeting shared storage caches. The proposed scheme obtains inter-thread data sharing information using profiling and, based on the captured data sharing patterns, divides the threads into clusters and assigns a separate (customized) I/O prefetcher thread for each cluster. In our approach, the compiler generates the I/O prefetching threads automatically. We implemented this new I/O prefetching scheme using a compiler and the PVFS file system running on Linux, and the empirical data collected clearly underline the importance of adapting I/O prefetching based on program phases. Specifically, our proposed scheme improves performance, on average, by 19.9%, 11.9% and 10.3% over the cases without I/O prefetching, with independent I/O prefetching (each CPU is performing compiler-directed I/O prefetching independently), and with one CPU prefetching (one CPU is reserved for prefetching on behalf of others), respectively, when 8 CPUs are used.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122295806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}