Growth of high-end 3D gaming, expansion of gaming to new devices such as tablets and phones, and the evolution of multiple graphics APIs such as Direct3D 10+ and OpenGL 3.0+ have led to an explosion in the number of workloads that need to be evaluated for GPU architecture path-finding. To decide on the optimal architecture configuration, the workloads need to be simulated on a wide range of architecture designs, which incurs a huge cost in both time and resources. To reduce the simulation cost of path-finding, extracting workload subsets from 3D workloads is essential. This paper presents a methodology to find representative workload subsets from 3D workloads by combining clustering and phase detection. In the first part, this paper presents a methodology to group draw-calls based on performance similarity by clustering on their microarchitecture-independent characteristics. Across 717 frames encompassing 828K draw-calls, the clustering solution obtained an average performance prediction error per frame of 1.0% at an average clustering efficiency of 65.8%. The clustering quality is additionally evaluated by counting cluster outliers, which are clusters with an intra-cluster prediction error greater than 20%. The clustering quality, measured using cluster outliers, indicates the performance similarity within the individual clusters. Across the spectrum of frames, we found that on average only 3.0% of the clusters are outliers, which indicates high clustering quality. To detect repetitive behavior in 3D workloads, we propose characterizing frame intervals using shader vectors and then using shader vector equality to extract the repeating patterns. We show that phases exist in each game in the BioShock series, enabling extraction of small representative subsets from the workloads. The performance improvement of the workload subsets, which are less than one percent of the parent workload, under GPU frequency scaling correlates highly (correlation coefficient of 99.7% or more) with the performance improvement of the parent workload.
V. George. "3D Workload Subsetting for GPU Architecture Pathfinding." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.24
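The per-frame prediction error, clustering efficiency, and cluster outliers reported in the 3D workload subsetting abstract can be illustrated with a small sketch. The abstract does not define these metrics, so the definitions here are assumptions: each cluster is represented by its first draw-call, the frame time is predicted as the sum of (representative time × cluster size), efficiency is taken as the fraction of draw-calls that no longer need simulation, and an outlier is a cluster whose own prediction error exceeds 20%. The function name `evaluate_clustering` and its inputs are hypothetical.

```python
from collections import defaultdict


def evaluate_clustering(times, labels, outlier_threshold=0.20):
    """Score one frame's draw-call clustering.

    times  : measured time of each draw-call (positive floats)
    labels : cluster id assigned to each draw-call
    All metric definitions below are assumptions, not the paper's.
    """
    clusters = defaultdict(list)
    for t, label in zip(times, labels):
        clusters[label].append(t)

    actual = sum(times)
    predicted = 0.0
    outliers = 0
    for members in clusters.values():
        # Assumption: the first draw-call stands in for the whole cluster.
        cluster_prediction = members[0] * len(members)
        cluster_actual = sum(members)
        predicted += cluster_prediction
        if abs(cluster_prediction - cluster_actual) / cluster_actual > outlier_threshold:
            outliers += 1

    return {
        "predicted": predicted,
        "error": abs(predicted - actual) / actual,
        # Assumption: efficiency = share of draw-calls that are not simulated.
        "efficiency": 1.0 - len(clusters) / len(times),
        "outlier_fraction": outliers / len(clusters),
    }


if __name__ == "__main__":
    frame_times = [10.0, 11.0, 9.0, 40.0, 42.0, 5.0]
    frame_labels = [0, 0, 0, 1, 1, 2]
    print(evaluate_clustering(frame_times, frame_labels))
```

Under these assumptions, a tighter clustering lowers the error but usually needs more clusters, which lowers the efficiency; the outlier fraction flags clusters whose members are not actually performance-similar.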
Nilanjan Goswami, Yuhai Li, Amer Qouneh, Chao Li, Tao Li
Growing deployment of power- and energy-efficient throughput accelerators (GPUs) in data centers pushes the envelope of the power-performance co-optimization capabilities of GPUs. Realizing exascale computing with accelerators demands further improvements in power efficiency. With hardware support for kernel concurrency in accelerators, simultaneous execution of kernels within and across workloads promises increased throughput at a lower energy budget. To improve the performance-per-watt of these architectures, a systematic empirical study of real-world throughput workloads (with simultaneous kernel execution) is required. To this end, we propose a multi-kernel throughput workload generation framework that will facilitate aggressive energy and performance management of exascale data centers and will stimulate synergistic power-performance co-optimization of throughput architectures.
Nilanjan Goswami, Yuhai Li, Amer Qouneh, Chao Li, Tao Li. "On Power-Performance Characterization of Concurrent Throughput Kernels." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.17
In graphics processing units (GPUs), memory access latency is one of the most critical performance hurdles. Several warp schedulers and memory prefetching algorithms have been proposed to hide the long memory access latency. Prior application characterization studies shed light on the interaction between applications, GPU microarchitecture, and memory subsystem behavior. Most of these studies, however, only present aggregate statistics on how the memory system behaves over the entire application run. In particular, they do not consider how individual load instructions in a program contribute to the aggregate memory system behavior. The analysis presented in this paper shows that there are two distinct classes of load instructions, categorized as deterministic and non-deterministic loads. Using a combination of profiling data from a real GPU card and cycle-accurate simulation data, we show that there is a significant disparity in performance impact when executing these two types of loads. We discuss and suggest several approaches to treat these two load categories differently within the GPU microarchitecture to optimize memory system performance.
Gunjae Koo, Hyeran Jeon, M. Annavaram. "Revealing Critical Loads and Hidden Data Locality in GPGPU Applications." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.23
We analyze the instruction access patterns of Android applications. Although Android applications are ordinarily written in Java, we find that native-code shared libraries play a large role in their instruction footprint. Specifically, averaging over a wide range of applications, we find that 60% of the instruction pages accessed belong to native-code shared libraries and 72% of the instruction fetches are from these same pages. Moreover, given the extensive use of native-code shared libraries, we find that, for any pair of applications, on average 28% of the overall instruction pages accessed by one of the applications are also accessed by the other. These results suggest the possibility of optimizations targeting shared libraries in order to improve instruction access efficiency and overall performance.
Xiaowan Dong, S. Dwarkadas, A. Cox. "Characterization of Shared Library Access Patterns of Android Applications." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.19
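The pairwise sharing figure in the Android abstract (on average 28% of the instruction pages accessed by one application are also accessed by the other) can be read as an average over ordered application pairs. The sketch below shows one way to compute such a figure from per-application page sets. The function name, the page identifiers, and the choice to average over ordered pairs are assumptions for illustration; the abstract does not specify how the authors computed their number.

```python
from itertools import permutations


def average_pairwise_overlap(pages_by_app):
    """Mean, over ordered pairs (a, b), of the share of a's pages that b also touches.

    pages_by_app maps an application name to the set of instruction pages
    it accessed. Page identifiers are hypothetical; any hashable value works,
    for example a (library path, page index) tuple.
    """
    ratios = []
    for app_a, app_b in permutations(pages_by_app, 2):
        pages_a = pages_by_app[app_a]
        pages_b = pages_by_app[app_b]
        if pages_a:  # skip applications with no recorded pages
            ratios.append(len(pages_a & pages_b) / len(pages_a))
    return sum(ratios) / len(ratios) if ratios else 0.0


if __name__ == "__main__":
    traces = {
        "browser": {("libc.so", 0), ("libc.so", 1), ("libskia.so", 7), ("app", 3)},
        "mail": {("libc.so", 0), ("libc.so", 1)},
        "game": {("libc.so", 1), ("libGLES.so", 2)},
    }
    print(f"average pairwise overlap: {average_pairwise_overlap(traces):.1%}")
```

The ratio is asymmetric by design: a small application can share most of its pages with a large one while the reverse share stays low, which is why ordered pairs are used here.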
With the growing computation demands of data analytics, heterogeneous architectures have become popular for their support of high parallelism. Intel Xeon Phi, a many-core coprocessor originally designed for high-performance computing applications, is promising for data analytics workloads. However, to the best of our knowledge, there is no prior work systematically characterizing the performance of data analytics workloads on Xeon Phi. It is difficult to design a benchmark suite that represents the behavior of data analytics workloads on Xeon Phi. The main challenge lies in fully exploiting Xeon Phi's features, such as long SIMD instructions, simultaneous multithreading, and a complex memory hierarchy. To address this issue, we develop BigDataBench-Phi, which consists of seven representative data analytics workloads. All of these benchmarks are optimized for Xeon Phi and are able to characterize Xeon Phi's support for data analytics workloads. Compared with a 24-core Xeon E5-2620 machine, BigDataBench-Phi achieves reasonable speedups for most of its benchmarks, ranging from 1.5X to 23.4X. Our experiments show that workloads operating on high-dimensional matrices can benefit significantly from instruction- and thread-level parallelism on Xeon Phi.
Biwei Xie, Xu Liu, Jianfeng Zhan, Zhen Jia, Yuqing Zhu, Lei Wang, Lixin Zhang. "Characterizing Data Analytics Workloads on Intel Xeon Phi." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.20
Manolis Kaliorakis, Sotiris Tselonis, Athanasios Chatzidimitriou, N. Foutris, D. Gizopoulos
Fault injection on microarchitectural structures modeled in performance simulators is an effective method for assessing microprocessor reliability in early design stages. Compared to lower-level fault injection approaches, it is orders of magnitude faster and allows execution of large portions of workloads to study the effect of faults on the final program output. Moreover, for many important hardware components it delivers accurate reliability estimates compared to analytical methods, which are fast but are known to significantly overestimate a structure's vulnerability to faults. This paper investigates the effectiveness of microarchitectural fault injection for x86 and ARM microprocessors in a differential way: by developing and comparing two fault injection frameworks on top of the most popular performance simulators, MARSS and Gem5. The injectors, called MaFIN and GeFIN (for MARSS-based and Gem5-based Fault Injector, respectively), are designed for accurate reliability studies and deliver several contributions, among which are: (a) reliability studies for a wide set of fault models on major hardware structures (for different sizes and organizations), (b) a study of the reliability sensitivity of microarchitectural structures for the same ISA (x86) implemented on two different simulators, and (c) a study of the reliability of workloads and microarchitectures for the two most popular ISAs (ARM vs. x86). For the workloads of our experimental study, we analyze the common trends observed in the CPU reliability assessments produced by the two injectors. We also explain the sources of difference when the tools provide diverging reliability reports. Both the common trends and the differences are attributed to fundamental implementation choices of the simulators and are supported by benchmark runtime statistics. The insights of our analysis can guide the selection of the most appropriate tool for hardware reliability studies (and thus decision-making for protection mechanisms) on certain microarchitectures for the popular x86 and ARM ISAs.
Manolis Kaliorakis, Sotiris Tselonis, Athanasios Chatzidimitriou, N. Foutris, D. Gizopoulos. "Differential Fault Injection on Microarchitectural Simulators." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.28
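The general idea behind injection-based reliability assessment, as described in the fault injection abstract, can be shown with a toy campaign: flip one bit in a modeled structure, rerun the workload, and compare the output against a fault-free ("golden") run. The sketch below is a generic illustration only and does not reflect how MaFIN or GeFIN work internally. The register-file model, the workload, and all function names are hypothetical, and only a transient single-bit-flip fault model is covered.

```python
import random


def run_workload(regs):
    """Toy workload: only the first four registers reach the output."""
    return sum(regs[:4]) & 0xFFFFFFFF


def inject_bit_flip(regs, index, bit):
    """Return a copy of the register file with one bit flipped."""
    faulty = list(regs)
    faulty[index] ^= 1 << bit
    return faulty


def masked_fraction(regs, trials, seed=0):
    """Estimate the share of random single-bit flips that leave the output unchanged."""
    rng = random.Random(seed)  # fixed seed keeps the campaign repeatable
    golden = run_workload(regs)
    masked = 0
    for _ in range(trials):
        index = rng.randrange(len(regs))
        bit = rng.randrange(32)
        if run_workload(inject_bit_flip(regs, index, bit)) == golden:
            masked += 1
    return masked / trials


if __name__ == "__main__":
    register_file = [3, 5, 7, 11, 0, 0, 0, 0]
    print(f"masked faults: {masked_fraction(register_file, trials=10_000):.1%}")
```

In this toy model, flips in the four unused registers are always masked and flips in the four live registers always corrupt the output, so the estimate settles near one half. A real campaign would also separate crashes and timeouts from silent output corruption.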
Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, John Douglas Owens
We identify several factors that are critical to high-performance GPU graph analytics: efficient building block operators, synchronization and data movement, workload distribution and load balancing, and memory access patterns. We analyze the impact of these critical factors through three GPU graph analytics frameworks: Gunrock, MapGraph, and VertexAPI2. We also examine their effect on different workloads: four common graph primitives from multiple graph application domains, evaluated on real-world and synthetic graphs. We show that efficient building block operators enable more powerful operations for fast information propagation and result in fewer device kernel invocations, less data movement, and fewer global synchronizations, and thus are key focus areas for efficient large-scale graph analytics on the GPU.
Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, John Douglas Owens. "Performance Characterization of High-Level Programming Models for GPU Graph Analytics." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.13
Qasim Ali, Haoqiang Zheng, Tim Mann, Raghunathan Srinivasan
Virtualized platforms have emerged as the top solution for cloud computing, especially in today's power-constrained data centers. Virtualization helps save power and energy by allowing physical machines to be replaced by virtual machines (VMs) and then consolidated onto a smaller number of physical hosts. The number of physical hosts that are powered on can even be varied dynamically, as with VMware's Distributed Power Management (DPM) feature. At a lower level, it remains valuable to manage power usage within each individual host, and typical systems, including VMware's ESXi hypervisor, do so by adjusting each processor's P-states (frequency and voltage states) and C-states (idle states) according to the demands of the current workload. With current NUMA systems, however, there is an intermediate level of power management possible that has gone largely unexplored. In this paper we propose to optimize the placement of virtual machines on NUMA-enabled systems such that the overall energy consumption of the virtualized system is reduced with minimal impact on VM performance. Our heuristics exploit a relatively new CPU hardware feature called independent package C-states. To the best of our knowledge, this paper presents the first work on making a NUMA scheduler power-aware by exploiting independent package C-states. We implemented a simple heuristic in ESXi and observed power savings of up to 26% and energy efficiency improvements of up to 30% using four realistic workloads and two micro-benchmarks.
Qasim Ali, Haoqiang Zheng, Tim Mann, Raghunathan Srinivasan. "Power Aware NUMA Scheduler in VMware's ESXi Hypervisor." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.30
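The intuition behind power-aware NUMA placement in the ESXi abstract is that packing VMs onto fewer NUMA nodes leaves whole packages idle, so those packages can enter deep package C-states. The sketch below illustrates that intuition with a first-fit-decreasing packing. It is not the heuristic implemented in ESXi, which the abstract does not describe; the function name, the vCPU-only capacity model, and the example sizes are assumptions, and memory locality and performance impact are ignored.

```python
def pack_vms(vm_vcpus, cores_per_node, num_nodes):
    """Place VMs on as few NUMA nodes as possible (first-fit decreasing).

    vm_vcpus maps a VM name to its vCPU count. Returns (placement, idle_nodes),
    where placement maps each VM to a node index and idle_nodes lists nodes
    that host no VM and could therefore sit in a deep package C-state.
    """
    free = [cores_per_node] * num_nodes
    placement = {}
    # Largest VMs first, so big VMs are not stranded by earlier small ones.
    for vm, vcpus in sorted(vm_vcpus.items(), key=lambda item: -item[1]):
        for node in range(num_nodes):
            if free[node] >= vcpus:
                free[node] -= vcpus
                placement[vm] = node
                break
        else:
            raise ValueError(f"VM {vm!r} does not fit on any NUMA node")
    idle_nodes = [n for n in range(num_nodes) if free[n] == cores_per_node]
    return placement, idle_nodes


if __name__ == "__main__":
    vms = {"vm1": 4, "vm2": 2, "vm3": 2}
    placement, idle = pack_vms(vms, cores_per_node=8, num_nodes=2)
    print(placement)  # every VM lands on node 0
    print(idle)       # node 1 stays empty
```

A load-balancing scheduler would instead spread these three VMs across both nodes, keeping both packages awake. A practical policy has to weigh the power saved by consolidation against contention on the packed node.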
Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, O. Khan
Algorithms operating on graphs are known to be highly irregular and unstructured. This leads to workload imbalance and data locality challenges when these algorithms are parallelized and executed on evolving multicore processors. Previous parallel benchmark suites for shared-memory multicores have focused on various workload domains, such as scientific, graphics, vision, financial, and media processing. However, these suites lack graph applications, which must be evaluated in the context of architectural design space exploration for futuristic multicores. This paper presents CRONO, a benchmark suite composed of multi-threaded graph algorithms for shared-memory multicore processors. We analyze and characterize these benchmarks using a multicore simulator as well as a real multicore machine setup. CRONO uses both synthetic and real-world graphs. Our characterization shows that graph benchmarks are diverse and challenging in the context of scaling efficiency. They exhibit low locality due to unstructured memory access patterns and incur fine-grain communication between threads. Energy overheads also occur due to nondeterministic memory and synchronization patterns on network connections. Our characterization reveals that these challenges remain in state-of-the-art graph algorithms, and in this context CRONO can be used to identify, analyze, and develop novel architectural methods to mitigate their efficiency bottlenecks in futuristic multicore processors.
Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, O. Khan. "CRONO: A Benchmark Suite for Multithreaded Graph Algorithms Executing on Futuristic Multicores." 2015 IEEE International Symposium on Workload Characterization, 2015-10-04. https://doi.org/10.1109/IISWC.2015.11
Parallel systems that employ CPUs and GPUs as two heterogeneous computational units have become immensely popular due to their ability to maximize performance under restrictive thermal budgets. However, programming heterogeneous systems via traditional programming models like OpenCL or CUDA involves rewriting large portions of application code. These models also lead to code that is not performance portable across different architectures or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These emerging programming paradigms require minimal code changes and rely on compilers to interact with the low-level hardware language, thereby producing performance-portable code from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance as compared to traditional programming models.
{"title":"Exploring Parallel Programming Models for Heterogeneous Computing Systems","authors":"Mayank Daga, Zachary S. Tschirhart, Chip Freitag","doi":"10.1109/IISWC.2015.16","DOIUrl":"https://doi.org/10.1109/IISWC.2015.16","url":null,"PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"82 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116943006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}