2015 IEEE International Symposium on Workload Characterization: Latest Papers (Page 2)

3D Workload Subsetting for GPU Architecture Pathfinding

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.24

V. George

Growth of high-end 3D gaming, expansion of gaming to new devices like tablets and phones, and evolution of multiple graphics APIs like Direct3D 10+ and OpenGL 3.0+ have led to an explosion in the number of workloads that need to be evaluated for GPU architecture path-finding. To decide on the optimal architecture configuration, the workloads need to be simulated on a wide range of architecture designs, which incurs huge cost in both time and resources. To reduce the simulation cost of path-finding, extracting workload subsets from 3D workloads is essential. This paper presents a methodology to find representative workload subsets from 3D workloads by combining clustering and phase detection. In the first part, this paper presents a methodology to group draw-calls based on performance similarity by clustering on their microarchitecture-independent characteristics. Across 717 frames encompassing 828K draw-calls, the clustering solution obtained an average performance prediction error per frame of 1.0% at an average clustering efficiency of 65.8%. The clustering quality is additionally evaluated by counting cluster outliers, which are clusters with an intra-cluster prediction error greater than 20%; this measure indicates the performance similarity within individual clusters. Across the spectrum of frames, we found that on average only 3.0% of the clusters are outliers, which indicates high clustering quality. To detect repetitive behavior in 3D workloads, we propose characterizing frame intervals using shader vectors and then using shader vector equality to extract the repeating patterns. We show that phases exist in each game in the BioShock series, enabling extraction of small representative subsets from the workloads. The performance improvement of the workload subsets, which are less than one percent of the parent workload, under GPU frequency scaling is highly correlated (correlation coefficient = 99.7%+) with the performance improvement of the parent workload.
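
The two steps described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: the first draw-call of each cluster is used as its representative, clustering efficiency is taken to be the fraction of draw-calls that no longer need simulation, and a "shader vector" is modeled as the set of shader IDs used within a fixed-length frame interval. All function and parameter names are hypothetical.

```python
from collections import defaultdict

OUTLIER_THRESHOLD = 0.20  # intra-cluster error above 20% marks an outlier (from the abstract)


def evaluate_clustering(draw_times, labels):
    """Score a draw-call clustering for one frame.

    draw_times: measured time per draw-call (hypothetical units).
    labels: cluster id per draw-call, same length as draw_times.
    """
    clusters = defaultdict(list)
    for t, c in zip(draw_times, labels):
        clusters[c].append(t)

    predicted_total = 0.0
    outliers = 0
    for times in clusters.values():
        representative = times[0]  # assumption: first member stands in for the cluster
        predicted = representative * len(times)
        actual = sum(times)
        predicted_total += predicted
        if abs(predicted - actual) / actual > OUTLIER_THRESHOLD:
            outliers += 1

    actual_total = sum(draw_times)
    return {
        "frame_error": abs(predicted_total - actual_total) / actual_total,
        # assumption: efficiency = share of draw-calls that need not be simulated
        "efficiency": 1.0 - len(clusters) / len(draw_times),
        "outlier_fraction": outliers / len(clusters),
    }


def repeating_intervals(frame_shaders, interval_len):
    """Group fixed-length frame intervals whose shader vectors are equal.

    frame_shaders: one collection of shader ids per frame.
    Returns lists of interval start indices that share the same shader set.
    """
    groups = defaultdict(list)
    for start in range(0, len(frame_shaders) - interval_len + 1, interval_len):
        used = set().union(*frame_shaders[start:start + interval_len])
        groups[frozenset(used)].append(start)
    return [starts for starts in groups.values() if len(starts) > 1]
```

In this sketch, any group returned by `repeating_intervals` is a candidate phase: simulating one interval from the group could stand in for the others, which is the intuition behind extracting a small representative subset.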

Citations: 3

On Power-Performance Characterization of Concurrent Throughput Kernels

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.17

Nilanjan Goswami, Yuhai Li, Amer Qouneh, Chao Li, Tao Li

Growing deployment of power- and energy-efficient throughput accelerators (GPUs) in data centers pushes the envelope of the power-performance co-optimization capabilities of GPUs. Realizing exascale computing with accelerators demands further improvements in power efficiency. With hardwired kernel concurrency enabled in accelerators, simultaneous inter- and intra-workload kernel computation is predicted to increase throughput at a lower energy budget. To improve the performance-per-watt metric of these architectures, a systematic empirical study of real-world throughput workloads (with simultaneous kernel execution) is required. To this end, we propose a multi-kernel throughput workload generation framework that will facilitate aggressive energy and performance management of exascale data centers and will stimulate synergistic power-performance co-optimization of throughput architectures.

Citations: 0

Revealing Critical Loads and Hidden Data Locality in GPGPU Applications

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.23

Gunjae Koo, Hyeran Jeon, M. Annavaram

In graphics processing units (GPUs), memory access latency is one of the most critical performance hurdles. Several warp schedulers and memory prefetching algorithms have been proposed to hide the long memory access latency. Prior application characterization studies shed light on the interaction between applications, GPU microarchitecture, and memory subsystem behavior. Most of these studies, however, only present aggregate statistics on how the memory system behaves over the entire application run. In particular, they do not consider how individual load instructions in a program contribute to the aggregate memory system behavior. The analysis presented in this paper shows that there are two distinct classes of load instructions, categorized as deterministic and non-deterministic loads. Using a combination of profiling data from a real GPU card and cycle-accurate simulation data, we show that there is a significant disparity in performance impact when executing these two types of loads. We discuss and suggest several approaches to treat these two load categories differently within the GPU microarchitecture to optimize memory system performance.

Citations: 17

Characterization of Shared Library Access Patterns of Android Applications

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.19

Xiaowan Dong, S. Dwarkadas, A. Cox

We analyze the instruction access patterns of Android applications. Although Android applications are ordinarily written in Java, we find that native-code shared libraries play a large role in their instruction footprint. Specifically, averaging over a wide range of applications, we find that 60% of the instruction pages accessed belong to native-code shared libraries and 72% of the instruction fetches are from these same pages. Moreover, given the extensive use of native-code shared libraries, we find that, for any pair of applications, on average 28% of the overall instruction pages accessed by one of the applications are also accessed by the other. These results suggest the possibility of optimizations targeting shared libraries in order to improve instruction access efficiency and overall performance.
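
The pairwise sharing figure in this abstract can be made concrete with a short Python sketch. It assumes, purely for illustration, that each application's instruction footprint is available as a set of `(file, page_index)` tuples. The trace format, the function names, and the sample data in the usage note are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def native_share(pages, native_pages):
    """Fraction of an app's accessed instruction pages that belong to
    native-code shared libraries."""
    return len(pages & native_pages) / len(pages)


def mean_pairwise_overlap(pages_by_app):
    """Average, over ordered app pairs (a, b), of the fraction of a's
    instruction pages that b also accesses."""
    fractions = []
    for a, b in permutations(pages_by_app, 2):
        pages_a, pages_b = pages_by_app[a], pages_by_app[b]
        fractions.append(len(pages_a & pages_b) / len(pages_a))
    return sum(fractions) / len(fractions)
```

For example, if app `a` touches four pages and app `b` touches two, with one `libc.so` page in common, the ordered-pair fractions are 1/4 and 1/2, and `mean_pairwise_overlap` returns their mean.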

Citations: 4

Characterizing Data Analytics Workloads on Intel Xeon Phi

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.20

Biwei Xie, Xu Liu, Jianfeng Zhan, Zhen Jia, Yuqing Zhu, Lei Wang, Lixin Zhang

With the growing computation demands of data analytics, heterogeneous architectures have become popular for their support of high parallelism. Intel Xeon Phi, a many-core coprocessor originally designed for high-performance computing applications, is promising for data analytics workloads. However, to the best of our knowledge, there is no prior work systematically characterizing the performance of data analytics workloads on Xeon Phi. It is difficult to design a benchmark suite that represents the behavior of data analytics workloads on Xeon Phi. The main challenge lies in fully exploiting Xeon Phi's features, such as long SIMD instructions, simultaneous multithreading, and a complex memory hierarchy. To address this issue, we develop BigDataBench-Phi, which consists of seven representative data analytics workloads. All of these benchmarks are optimized for Xeon Phi and are able to characterize Xeon Phi's support for data analytics workloads. Compared with a 24-core Xeon E5-2620 machine, BigDataBench-Phi achieves reasonable speedups for most of its benchmarks, ranging from 1.5X to 23.4X. Our experiments show that workloads operating on high-dimensional matrices can significantly benefit from instruction- and thread-level parallelism on Xeon Phi.

Citations: 6

Differential Fault Injection on Microarchitectural Simulators

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.28

Manolis Kaliorakis, Sotiris Tselonis, Athanasios Chatzidimitriou, N. Foutris, D. Gizopoulos

Fault injection on microarchitectural structures modeled in performance simulators is an effective method for assessing microprocessor reliability in early design stages. Compared to lower-level fault injection approaches, it is orders of magnitude faster and allows execution of large portions of workloads to study the effect of faults on the final program output. Moreover, for many important hardware components it delivers accurate reliability estimates compared to analytical methods, which are fast but are known to significantly overestimate a structure's vulnerability to faults. This paper investigates the effectiveness of microarchitectural fault injection for x86 and ARM microprocessors in a differential way: by developing and comparing two fault injection frameworks on top of the most popular performance simulators, MARSS and Gem5. The injectors, called MaFIN and GeFIN (for MARSS-based and Gem5-based Fault Injector, respectively), are designed for accurate reliability studies and deliver several contributions, among which: (a) reliability studies for a wide set of fault models on major hardware structures (for different sizes and organizations), (b) a study of the reliability sensitivity of microarchitectural structures for the same ISA (x86) implemented on two different simulators, and (c) a study of the reliability of workloads and microarchitectures for the two most popular ISAs (ARM vs. x86). For the workloads of our experimental study, we analyze the common trends observed in the CPU reliability assessments produced by the two injectors. We also explain the sources of difference when the tools provide diverging reliability reports. Both the common trends and the differences are attributed to fundamental implementation choices of the simulators and are supported by benchmark runtime statistics. The insights of our analysis can guide the selection of the most appropriate tool for hardware reliability studies (and thus decision-making for protection mechanisms) on certain microarchitectures for the popular x86 and ARM ISAs.
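
To make the idea of a simulator-level fault injection campaign concrete, here is a toy Python sketch. It is not MaFIN or GeFIN and uses no MARSS or Gem5 interface. The "register file", the `workload` function, and the three outcome classes (masked, silent data corruption, crash) are simplifications assumed for illustration. Each trial flips one bit in one entry (a transient single-bit fault), reruns the toy workload, and compares the result with the fault-free (golden) run.

```python
import random


def workload(regs):
    """Toy program: regs[0] indexes a table, regs[1:4] are summed,
    and the remaining entries are never read (dead state)."""
    table = [3, 1, 4, 1, 5, 9, 2, 6]
    return table[regs[0]] + sum(regs[1:4])


def inject_and_classify(golden_regs, entry, bit):
    """Flip one bit in one entry and classify the outcome against the golden run."""
    regs = list(golden_regs)
    regs[entry] ^= 1 << bit  # transient single-bit flip
    try:
        output = workload(regs)
    except IndexError:
        return "crash"  # corrupted index leaves the table bounds
    return "masked" if output == workload(golden_regs) else "sdc"


def campaign(golden_regs, bits=8, trials=1000, seed=0):
    """Run `trials` random injections and count outcomes per class."""
    rng = random.Random(seed)
    counts = {"masked": 0, "sdc": 0, "crash": 0}
    for _ in range(trials):
        entry = rng.randrange(len(golden_regs))
        bit = rng.randrange(bits)
        counts[inject_and_classify(golden_regs, entry, bit)] += 1
    return counts
```

With a golden state such as `[2, 10, 20, 30, 0, 0, 0, 0]`, flips in the four unread entries are always masked, which mirrors in miniature why the fraction of live state in a structure matters for its measured vulnerability.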

Citations: 64

Performance Characterization of High-Level Programming Models for GPU Graph Analytics

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.13

Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, John Douglas Owens

We identify several factors that are critical to high-performance GPU graph analytics: efficient building block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytic frameworks: Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated through real-world and synthetic graphs. We show that efficient building block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.
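
The notion of "building block operators" can be sketched with a frontier-based breadth-first search written in plain Python. The `advance` and `filter_unvisited` functions below are loosely modeled on the advance/filter style of frontier operators. They are hypothetical, sequential stand-ins and are not the API of Gunrock, MapGraph, or VertexAPI2. On a GPU, each operator call would roughly correspond to one or more kernel launches followed by a global synchronization.

```python
def advance(graph, frontier):
    """Expand every frontier vertex to its neighbors (duplicates allowed)."""
    return [v for u in frontier for v in graph[u]]


def filter_unvisited(candidates, depth, level):
    """Keep each unvisited vertex once and label it with the current level."""
    next_frontier = []
    for v in candidates:
        if v not in depth:
            depth[v] = level
            next_frontier.append(v)
    return next_frontier


def bfs(graph, source):
    """Level-synchronous BFS expressed as repeated advance + filter steps.

    graph: dict mapping each vertex to a list of its neighbors.
    Returns a dict of vertex -> BFS depth for all reachable vertices.
    """
    depth = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        frontier = filter_unvisited(advance(graph, frontier), depth, level)
    return depth
```

Fusing the labeling step into the filter, as done here, is one small example of how a richer operator can reduce the number of separate passes (and hence kernel invocations and synchronizations) per BFS level.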

Citations: 23

Power Aware NUMA Scheduler in VMware's ESXi Hypervisor

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.30

Qasim Ali, Haoqiang Zheng, Tim Mann, Raghunathan Srinivasan

Virtualized platforms have emerged as the top solution for cloud computing, especially in today's power-constrained data centers. Virtualization helps save power and energy by allowing physical machines to be replaced by virtual machines (VMs) and then consolidated onto a smaller number of physical hosts. The number of physical hosts that are powered on can even be varied dynamically, as with VMware's Distributed Power Management (DPM) feature. At a lower level, it remains valuable to manage power usage within each individual host, and typical systems, including VMware's ESXi hypervisor, do so by adjusting each processor's P-states (frequency and voltage states) and C-states (idle states) according to the demands of the current workload. With current NUMA systems, however, an intermediate level of power management is possible that has gone largely unexplored. In this paper we propose to optimize the placement of virtual machines on NUMA-enabled systems such that the overall energy consumption of the virtualized system is reduced with minimal impact on VM performance. Our heuristics exploit a relatively new CPU hardware feature called independent package C-states. To the best of our knowledge, this paper presents the first work on making a NUMA scheduler power-aware by exploiting independent package C-states. We implemented a simple heuristic in ESXi and observed power savings of up to 26% and energy efficiency improvements of up to 30% using four realistic workloads and two micro-benchmarks.
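
The placement idea can be illustrated with a small Python sketch. It is not the heuristic implemented in ESXi, which the abstract does not describe. The sketch assumes a simplified model in which each VM has a vCPU demand, each NUMA node (CPU package) has a fixed vCPU capacity, and a package with no VMs placed on it is free to enter a deep package C-state. It contrasts first-fit-decreasing consolidation with a spreading baseline and ignores memory locality and performance effects. All names are hypothetical.

```python
def place_vms(vm_demands, num_nodes, node_capacity, consolidate=True):
    """Assign VMs to NUMA nodes and report which nodes stay empty.

    vm_demands: dict of VM name -> vCPU demand.
    consolidate=True packs VMs onto the lowest-numbered node with room
    (first-fit decreasing); consolidate=False picks the least-loaded node.
    """
    loads = [0] * num_nodes
    placement = {}
    # Place the largest VMs first so they are not stranded later.
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: -kv[1]):
        order = range(num_nodes) if consolidate else sorted(
            range(num_nodes), key=lambda n: loads[n])
        candidates = [n for n in order if loads[n] + demand <= node_capacity]
        if not candidates:
            raise ValueError(f"no NUMA node can host {vm}")
        node = candidates[0]
        loads[node] += demand
        placement[vm] = node
    idle_nodes = [n for n in range(num_nodes) if loads[n] == 0]
    return placement, idle_nodes
```

With three small VMs on a two-node host, the consolidating policy leaves one package empty while the spreading policy keeps both packages busy. The trade-off in a real scheduler is that packing VMs more tightly can cost performance, which is why the abstract stresses minimal impact on VM performance.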

Citations: 6

CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.11

Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, O. Khan

Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack graph applications, which must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multithreaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.

Citation count: 103

Exploring Parallel Programming Models for Heterogeneous Computing Systems

2015 IEEE International Symposium on Workload Characterization

Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.16

Mayank Daga, Zachary S. Tschirhart, Chip Freitag

Parallel systems that employ CPUs and GPUs as two heterogeneous computational units have become immensely popular due to their ability to maximize performance under restrictive thermal budgets. However, programming heterogeneous systems via traditional programming models like OpenCL or CUDA involves rewriting large portions of application code. These models also lead to code that is not performance-portable across different architectures or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These emerging programming paradigms require minimal code changes and rely on compilers to interact with the low-level hardware language, thereby producing performance-portable code from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance as compared to traditional programming models.

Citation count: 17