
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

BTL: A Framework for Measuring and Modeling Energy in Memory Hierarchies
I. Manousakis, Dimitrios S. Nikolopoulos
Understanding the energy efficiency of computing systems is paramount. Although processors remain the dominant energy consumers and the focal target of energy-aware optimization in computing systems, the memory subsystem dissipates substantial amounts of power, which at high densities may exceed 50% of total system power. The failure of DRAM to keep up with increasing processor speeds creates a two-pronged bottleneck for overall system energy efficiency. This paper presents a high-performance, autonomic power instrumentation setup to measure energy consumption in computing systems and accurately attribute energy to processors and components of the memory hierarchy. We provide a set of carefully engineered microbenchmarks that reveal the energy efficiency under different memory access patterns and stress the importance of minimizing costly data transfers that involve multiple levels of the system's memory hierarchy. Lastly, we present BTL (Bottom Line), a processor-specific model for deriving lower bounds on energy consumption. BTL predicts the minimum dynamic energy consumption for any workload, thus uncovering opportunities for energy optimization.
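The abstract does not spell out BTL's formulation, but the core idea of a per-processor lower bound can be illustrated with a minimal sketch. The C fragment below charges every instruction a base dynamic cost and every memory reference the cost of a hit in the cheapest level of the hierarchy; the coefficient names and values are hypothetical stand-ins for the constants BTL would calibrate with microbenchmarks.

```c
#include <stdio.h>

/* Hypothetical per-event dynamic energy costs in nanojoules.  A
 * real model would calibrate such coefficients per processor with
 * microbenchmarks; these numbers are illustrative only. */
#define E_INSTR_NJ 0.1   /* base cost of one instruction */
#define E_L1_NJ    0.5   /* one L1 cache access          */

/* Lower bound on dynamic energy: every instruction pays its base
 * cost and every memory reference is charged as if it hit in L1,
 * the cheapest level of the hierarchy. */
double btl_lower_bound_nj(long long n_instr, long long n_mem_refs)
{
    return n_instr * E_INSTR_NJ + n_mem_refs * E_L1_NJ;
}

int main(void)
{
    long long instr = 1000000000LL;   /* 10^9 instructions        */
    long long refs  =  300000000LL;   /* assume 30% memory refs   */
    printf("minimum dynamic energy: %.2f mJ\n",
           btl_lower_bound_nj(instr, refs) / 1e6);
    return 0;
}
```

Any real execution, in which some references miss and travel further down the hierarchy, can only consume more dynamic energy, which is what makes such a model a lower bound.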
Citations: 14
Efficient Sorting on the Tilera Manycore Architecture
Alessandro Morari, Antonino Tumeo, Oreste Villa, Simone Secchi, M. Valero
We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast networks-on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor's sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the-art implementations on x86 processors and graphics processing units.
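For reference, the baseline algorithm being tuned is ordinary least-significant-digit radix sort. The sequential C sketch below sorts 32-bit keys with four counting-sort passes over 8-bit digits; the paper's contribution lies in how the histogram and scatter phases are mapped onto the 64 tiles and the on-chip networks, which this sketch does not attempt to reproduce.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort on 8-bit digits: four counting-sort passes over
 * 32-bit keys.  Sequential form, for illustration only. */
void radix_sort_u32(uint32_t *keys, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return;
    for (int shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++)          /* histogram    */
            count[(keys[i] >> shift) & 0xFF]++;
        size_t pos = 0;                          /* prefix sums  */
        for (int d = 0; d < 256; d++) {
            size_t c = count[d];
            count[d] = pos;
            pos += c;
        }
        for (size_t i = 0; i < n; i++)           /* stable scatter */
            tmp[count[(keys[i] >> shift) & 0xFF]++] = keys[i];
        memcpy(keys, tmp, n * sizeof *keys);
    }
    free(tmp);
}
```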
Citations: 12
Beyond CPU Frequency Scaling for a Fine-grained Energy Control of HPC Systems
Ghislain Landry Tsafack Chetsa, L. Lefèvre, J. Pierson, P. Stolf, Georges Da Costa
Modern high-performance computing (HPC) subsystems - including processor, network, memory, and IO - are provided with power management mechanisms. These include dynamic speed scaling and dynamic resource sleeping. Understanding the behavioral patterns of high-performance computing systems at runtime can lead to a multitude of optimization opportunities, including controlling and limiting their energy usage. In this paper, we present a general-purpose methodology for optimizing the energy performance of HPC systems, considering the processor, disk, and network. We rely on the concept of an execution vector, along with a partial phase recognition technique, for on-the-fly dynamic management without any a priori knowledge of the workload. We demonstrate the effectiveness of our management policy under two real-life workloads. Experimental results show that, in comparison with baseline unmanaged execution, our management policy saves up to 24% of energy with less than 4% performance overhead for our real-life workloads.
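The abstract leaves the details of execution vectors and phase recognition to the paper body; as a hedged illustration of the general mechanism, the C sketch below compares normalized per-interval counter vectors against the representative vector of the current phase. The counter set, the Euclidean metric, and the threshold are all assumptions for the example.

```c
#include <math.h>

#define NCOUNTERS 4   /* e.g. instructions, LLC misses, disk bytes,
                         network bytes per interval; the actual
                         counter set is an assumption */

/* Euclidean distance between two execution vectors. */
static double ev_distance(const double a[NCOUNTERS],
                          const double b[NCOUNTERS])
{
    double s = 0.0;
    for (int i = 0; i < NCOUNTERS; i++)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return sqrt(s);
}

/* A new phase begins when the current vector strays from the
 * representative of the last recognized phase by more than a
 * tuning threshold; the runtime would then pick power knobs
 * (DVFS, disk/NIC sleep states) suited to the new phase. */
int phase_changed(const double cur[NCOUNTERS],
                  const double rep[NCOUNTERS], double threshold)
{
    return ev_distance(cur, rep) > threshold;
}
```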
Citations: 19
Divergence Analysis with Affine Constraints
Diogo Sampaio, R. M. Souza, Caroline Collange, Fernando Magno Quintão Pereira
The rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same value on every thread, or divergent, if they might not. This paper introduces a new kind of divergence analysis that is able to represent variables as affine functions of thread identifiers. We have implemented this analysis in Ocelot, an open-source compiler, and use it to analyze a suite of 177 CUDA kernels from well-known benchmarks. We can mark about one fourth of all program variables as affine functions of thread identifiers. In addition to the novel divergence analysis, we also introduce the notion of a divergence-aware register allocator. This allocator uses information from our analysis either to rematerialize affine variables or to move uniform variables to shared memory. As a testimony of its effectiveness, our divergence-aware allocator produces GPU code that is 29.70% faster than the code produced by Ocelot's register allocator. Divergence analysis with affine constraints has been publicly available in the Ocelot compiler since June 2012.
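A minimal reconstruction of the analysis' abstract domain can make the idea concrete: each variable is tracked as stride*tid + base when possible, with uniform values being the stride == 0 case. The C sketch below shows plausible transfer functions for addition and multiplication; it is a simplification for illustration, not the implementation shipped in Ocelot.

```c
#include <stdbool.h>
#include <stdio.h>

/* Abstract value of a simplified divergence lattice: a variable is
 * tracked as the affine form  stride*tid + base  when possible and
 * as plain divergent otherwise. */
typedef struct {
    bool divergent;   /* true: unknown function of the thread id */
    long stride;      /* coefficient of tid                      */
    long base;        /* constant term                           */
} AbsVal;

static AbsVal affine(long s, long b) { AbsVal v = { false, s, b }; return v; }
static AbsVal top(void)              { AbsVal v = { true, 0, 0 };  return v; }
static AbsVal abs_tid(void)          { return affine(1, 0); } /* tid      */
static AbsVal abs_const(long c)      { return affine(0, c); } /* literal  */

/* Sums of affine values stay affine. */
static AbsVal abs_add(AbsVal a, AbsVal b)
{
    if (a.divergent || b.divergent) return top();
    return affine(a.stride + b.stride, a.base + b.base);
}

/* A product stays affine only if one factor is uniform; otherwise
 * a tid*tid term appears and the result is divergent. */
static AbsVal abs_mul(AbsVal a, AbsVal b)
{
    if (a.divergent || b.divergent) return top();
    if (a.stride == 0) return affine(a.base * b.stride, a.base * b.base);
    if (b.stride == 0) return affine(b.base * a.stride, a.base * b.base);
    return top();
}

int main(void)
{
    /* idx = 4*tid + 256 is affine, so e.g. consecutive threads touch
     * consecutive words and the allocator may rematerialize idx. */
    AbsVal idx = abs_add(abs_mul(abs_const(4), abs_tid()), abs_const(256));
    printf("divergent=%d stride=%ld base=%ld\n",
           idx.divergent, idx.stride, idx.base);
    return 0;
}
```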
Citations: 15
Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean
The race for Exascale computing has naturally led current technologies to converge to multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available software computes, usually before the execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid, and it cannot adapt the execution to possible variations of the system or of the application's load. We propose a solution that is orthogonal to the above: extensions of the Xkaapi software stack that make it possible to exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks by using a standard work stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers and task executions on the current generation of GPUs. We demonstrate that the overlapping capability is at least as important as computing a scheduling decision to reduce the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain the peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU memory. With eight GPUs, we achieve a speedup of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
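The scheduling core named in the abstract is standard work stealing: each worker owns a double-ended queue, pops tasks from one end, and steals from the other end of a victim's queue when idle. The C sketch below is a deliberately simplified, mutex-protected deque for illustration; Xkaapi's actual deque is lock-free, and its GPU workers additionally pipeline transfers with kernel execution.

```c
#include <pthread.h>

/* Simplified work-stealing deque: the owner pushes and pops GPU
 * tasks at the bottom, idle workers steal from the top.  A single
 * mutex keeps the sketch short and obviously correct. */
typedef void (*task_fn)(void *arg);
typedef struct { task_fn fn; void *arg; } Task;

#define DEQUE_CAP 1024

typedef struct {
    Task buf[DEQUE_CAP];
    int top, bottom;          /* valid tasks live in [top, bottom) */
    pthread_mutex_t lock;
} Deque;

void deque_init(Deque *d)
{
    d->top = d->bottom = 0;
    pthread_mutex_init(&d->lock, NULL);
}

int push_bottom(Deque *d, Task t)           /* owner thread */
{
    pthread_mutex_lock(&d->lock);
    int ok = d->bottom < DEQUE_CAP;
    if (ok) d->buf[d->bottom++] = t;
    pthread_mutex_unlock(&d->lock);
    return ok;
}

int pop_bottom(Deque *d, Task *out)         /* owner thread */
{
    pthread_mutex_lock(&d->lock);
    int ok = d->bottom > d->top;
    if (ok) *out = d->buf[--d->bottom];
    pthread_mutex_unlock(&d->lock);
    return ok;
}

int steal_top(Deque *d, Task *out)          /* thief thread */
{
    pthread_mutex_lock(&d->lock);
    int ok = d->bottom > d->top;
    if (ok) *out = d->buf[d->top++];
    pthread_mutex_unlock(&d->lock);
    return ok;
}
```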
Citations: 29
Data and Instruction Uniformity in Minimal Multi-threading
Teo Milanez, Caroline Collange, Fernando Magno Quintão Pereira, Wagner Meira Jr, R. Ferreira
Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same instruction fetching unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a recently proposed technique to share instructions and execution between threads in an SMT machine. In this paper we propose new ways to explore redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Second, we demonstrate the existence of substantial regularity in inter-thread memory access patterns. We validate our results on the four data-parallel applications present in the PARSEC benchmark suite. The new thread reconvergence heuristic is, on average, 82% more efficient than MMT's original reconvergence method. Furthermore, about 69% to 87% of all memory addresses are either the same for all threads or affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.
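The 69% to 87% figure refers to how often a load or store issues, across the threads, either one shared address or addresses affine in the thread id. A small checker for exactly that property, shown below in C, makes the measured classification concrete; the enum names are made up for the example.

```c
#include <stdint.h>

/* Classification of one memory access across the SMT threads:
 * every thread issues the same address (uniform), the addresses
 * form base + stride*t in the thread id t (affine), or neither.
 * This is a checker for the regularity the paper measures, not
 * the proposed hardware itself. */
enum access_kind { UNIFORM, AFFINE, IRREGULAR };

enum access_kind classify_access(const uintptr_t addr[], int nthreads)
{
    if (nthreads < 2)
        return UNIFORM;
    intptr_t stride = (intptr_t)(addr[1] - addr[0]);
    for (int t = 2; t < nthreads; t++)
        if ((intptr_t)(addr[t] - addr[t - 1]) != stride)
            return IRREGULAR;
    return stride == 0 ? UNIFORM : AFFINE;
}
```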
Citations: 3
ACCGen: An Automatic ArchC Compiler Generator
R. Auler, P. Centoducatte, E. Borin
The current level of circuit integration has led to complex designs encompassing full systems on a single chip, known as System-on-a-Chip (SoC). In order to predict the best design options and reduce design costs, designers are required to perform a large design space exploration in the early stages of the design. To speed up this process, Electronic Design Automation (EDA) tools are employed to model and experiment with the system. ArchC is an Architecture Description Language (ADL) and a set of tools that can be leveraged to automatically build SoC simulators based on high-level system models, enabling easy and fast design space exploration in the early stages of the design. Currently, ArchC is capable of automatically generating hardware simulators, assemblers, and linkers for a given architecture model. In this work, we present ACCGen, an automatic compiler generator for ArchC and the missing link in the automatic generation of compiler tool chains for ArchC. Our experimental results show that compilers generated by ACCGen are correct for MiBench applications. We also compare the quality of the generated code with that of LLVM and gcc, two well-known open-source compilers. Finally, we show that ACCGen is fast and has little impact on the design space exploration turnaround time, allowing the designer, using an easy and fully automated workflow, to completely assess the outcome of architectural changes in less than 2 minutes.
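The generate-tools-from-a-description workflow can be sketched in miniature: given a table of instruction descriptions standing in for an ArchC model, emit one encoder function per instruction. The toy C generator below prints such functions for a fictitious three-instruction ISA; ACCGen itself emits a full compiler backend rather than this kind of one-off encoder, so this only conveys the flavor of ADL-driven generation.

```c
#include <stdio.h>

/* Stand-in for an ADL instruction table; mnemonics and opcodes
 * are invented for the example. */
struct insn { const char *mnemonic; unsigned opcode; };

static const struct insn isa[] = {
    { "add", 0x00 }, { "sub", 0x01 }, { "lw", 0x23 },
};

int main(void)
{
    /* Emit one C encoder function per described instruction, the
     * way a generated assembler backend might be produced. */
    for (size_t i = 0; i < sizeof isa / sizeof isa[0]; i++)
        printf("unsigned encode_%s(unsigned rd, unsigned rs)\n"
               "{ return (%uu << 26) | (rd << 21) | (rs << 16); }\n\n",
               isa[i].mnemonic, isa[i].opcode);
    return 0;
}
```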
Citations: 7
VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters
Peng Lu, B. Ravindran, Changsoo Kim
A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, interconnected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Checkpointing (VPC), a lightweight, globally consistent checkpointing mechanism, which checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each checkpointing interval, VPC further reduces the solo VM downtime compared with traditional incremental checkpointing approaches. In addition, VPC uses a globally consistent checkpointing algorithm, which preserves the global consistency of the VMs' execution and communication states, and only saves the updated memory pages during each checkpointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC checkpointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45% under the NPB benchmark, and reduces the entire VC downtime by as much as 50% under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.
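The incremental part of the mechanism reduces to saving only pages dirtied since the previous epoch. The toy C sketch below shows that step; VPC's contributions on top of it, predicting which clean pages will be write-faulted in the next interval and coordinating a globally consistent cut across VMs, are not modeled, and the memory layout here is invented for the example.

```c
#include <stdio.h>

#define PAGE_SIZE 4096
#define NPAGES    1024

/* Toy incremental checkpoint: only pages whose dirty bit was set
 * since the last epoch are written out. */
static unsigned char memory[NPAGES][PAGE_SIZE];
static unsigned char dirty[NPAGES];          /* set by write traps */

size_t checkpoint_epoch(FILE *out)
{
    size_t saved = 0;
    for (int p = 0; p < NPAGES; p++) {
        if (!dirty[p])
            continue;                 /* unchanged since last epoch */
        fwrite(&p, sizeof p, 1, out); /* page number, then contents */
        fwrite(memory[p], 1, PAGE_SIZE, out);
        dirty[p] = 0;
        saved++;
    }
    return saved;    /* pages written in this checkpointing interval */
}
```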
Citations: 9
Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs
Alberto Ros, Ricardo Fernández Pascual, M. Acacio
Direct coherence protocols have recently been proposed as an alternative to directory-based protocols to keep cache coherence in many-core CMPs. Differently from directory-based protocols, in direct coherence the cache responsible for providing the requested data in case of a miss (i.e., the owner cache) is also tasked with keeping the updated directory information and serializing the different accesses to the block by all cores. This way, these protocols send requests directly to the owner cache, thus avoiding the indirection caused by accessing a separate directory (usually in the home node). A hints mechanism ensures a high hit rate when predicting the current owner of a block for sending requests, but at the price of significantly increasing network traffic and, consequently, energy consumption. In this work, we show how using a heterogeneous interconnection network composed of two kinds of links is enough to drastically reduce the energy consumed by hint messages, obtaining significant improvements in energy efficiency.
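The hints mechanism can be pictured as a small per-core prediction table consulted on every miss. The C sketch below is a guess at its shape, not the paper's hardware: a direct-mapped table maps a block address to a predicted owner, falling back to the home node when no hint matches. A stale hint only costs an extra network hop, which is why the heterogeneous-network proposal can afford to carry hint traffic over the cheaper, lower-power links.

```c
#include <stdint.h>

#define HINT_ENTRIES 512          /* illustrative table size */

typedef struct { uint64_t tag; int owner; } HintEntry;
static HintEntry hints[HINT_ENTRIES];

static unsigned hint_index(uint64_t block) { return block % HINT_ENTRIES; }

/* On a miss, send the request straight to the predicted owner tile
 * instead of indirecting through the home node's directory. */
int predict_owner(uint64_t block, int home_node)
{
    HintEntry *e = &hints[hint_index(block)];
    return (e->tag == block) ? e->owner : home_node;
}

/* Corrected whenever a reply reveals the actual current owner. */
void update_hint(uint64_t block, int observed_owner)
{
    HintEntry *e = &hints[hint_index(block)];
    e->tag = block;
    e->owner = observed_owner;
}
```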
Citations: 2
Parallelizing Information Set Generation for Game Tree Search Applications
M. Richards, Abhishek K. Gupta, O. Sarood, L. Kalé
Information Set Generation (ISG) is the identification of the set of paths in an imperfect-information game tree that are consistent with a player's observations. The ability to reason about the possible histories of a game is critical to the performance of game-playing agents. ISG represents a class of combinatorial search problems that is computationally intensive but challenging to parallelize efficiently. In this paper, we address the parallelization of information set generation in the context of Kriegspiel (partially observable chess). We implement the algorithm on top of a general-purpose combinatorial search engine and discuss its performance using datasets from real game instances in addition to benchmarks. Further, we demonstrate the effect of load balancing strategies, problem sizes, and computational granularity (grain size parameters) on performance. We achieve speedups of over 500 on 1,024 processors, far exceeding previous scalability results for game tree search applications.
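Sequentially, ISG amounts to a depth-first search that prunes any branch whose emitted observation contradicts what the player actually saw; the surviving root-to-leaf paths form the information set. The C sketch below counts consistent histories over a hypothetical node layout; in the parallel version, subtrees become independent tasks for the combinatorial search engine.

```c
#include <stddef.h>

/* Hypothetical game-tree node: each move emits an observation to
 * the player (in Kriegspiel, e.g., a referee announcement). */
typedef struct Node {
    int observation;            /* what this move reveals to us */
    int nchildren;
    struct Node **children;
} Node;

/* Count root-to-leaf paths whose observation sequence matches the
 * player's recorded observations obs[0..maxdepth-1]. */
static long count_consistent(const Node *n, const int *obs,
                             int depth, int maxdepth)
{
    if (n->observation != obs[depth])
        return 0;                      /* inconsistent: prune      */
    if (depth + 1 == maxdepth || n->nchildren == 0)
        return 1;                      /* one consistent history   */
    long total = 0;
    for (int i = 0; i < n->nchildren; i++)
        total += count_consistent(n->children[i], obs,
                                  depth + 1, maxdepth);
    return total;
}
```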
Citations: 1