R. Clapp, Martin Dimitrov, Karthik Kumar, Vish Viswanathan, Thomas Willhalm
In recent years, DRAM technology improvements have scaled at a much slower pace than processors. While server processor core counts grow by 33% to 50% on a yearly cadence, DDR3/4 memory channel bandwidth has grown at a slower rate, and memory latency has remained relatively flat for some time. Combined with new computing paradigms such as big data analytics, which involves analyzing massive volumes of data in real time, there is a trend of increasing pressure on the memory subsystem. This makes it important for computer architects to understand the sensitivity of the performance of big data workloads to memory bandwidth and latency, and how these workloads compare to more conventional workloads. To address this, we present straightforward analytic equations to quantify the impact of memory bandwidth and latency on workload performance, leveraging measured data from performance counters on real systems. We demonstrate how the values of the components of these equations can be used to classify different workloads according to their inherent bandwidth requirement and latency sensitivity. Using this performance model, we show the relative sensitivities of big data, high-performance computing, and enterprise workload classes to changes in memory bandwidth and latency.
"Quantifying the Performance Impact of Memory Latency and Bandwidth for Big Data Workloads." 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 4, 2015. DOI: 10.1109/IISWC.2015.32.
Emerging system-on-a-chip (SoC)-based microservers promise higher energy efficiency by drastically reducing power consumption, albeit at the expense of performance. In this paper, we thoroughly evaluate the performance and energy efficiency of two 64-bit eight-core ARM and x86 SoCs on a number of parallel scale-out benchmarks and high-performance computing benchmarks. We characterize the workloads on these servers and elaborate the impact of the SoC architecture, memory hierarchy, and system design on the performance and energy efficiency outcomes. We also contrast the results against those of standard x86 servers.
"How Good Are Low-Power 64-Bit SoCs for Server-Class Workloads?" R. Azimi, Xin Zhan, S. Reda. 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 4, 2015. DOI: 10.1109/IISWC.2015.21.
A vast majority of smartphones use eMMC (embedded multimedia card) devices as their storage subsystems. Recent studies reveal that the storage subsystem is a significant contributor to the performance of smartphone applications. Nevertheless, smartphone applications' block-level I/O characteristics and their implications for eMMC design are still poorly understood. In this research, we collect and analyze block-level I/O traces from 18 common applications (e.g., Email and Twitter) on a Nexus 5 smartphone. We observe some I/O characteristics from which several implications for eMMC design are derived. For example, we find that in 15 out of the 18 traces the majority of requests (44.9%-57.4%) are small single-page (4KB) requests. The implication is that small requests should be served rapidly so that the overall performance of an eMMC device can be boosted. Next, we conduct a case study to demonstrate how to apply the implications to optimize eMMC design. Inspired by two implications, we propose a hybrid-page-size (HPS) eMMC. Experimental results show that the HPS scheme can reduce mean response time by up to 86% while improving space utilization by up to 24.2%.
"I/O Characteristics of Smartphone Applications and Their Implications for eMMC Design." Deng Zhou, Wen Pan, Wei Wang, T. Xie. 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 4, 2015. DOI: 10.1109/IISWC.2015.8.
With the advent of cloud computing and online services, large enterprises rely heavily on their data centers to serve end users. Among different server components, hard disk drives are known to contribute significantly to server failures. Disk failures, as well as their impact on the performance of storage systems and on operating costs, are becoming an increasingly important concern for data center designers and operators. However, there is very little understanding of the characteristics of disk failures in data centers. Effective disk failure management and data recovery also require a deep understanding of the nature of disk failures. In this paper, we present a systematic approach to provide a holistic and insightful view of disk failures. We study a large-scale storage system from a production data center. We categorize disk failures based on their distinctive manifestations and properties. Then we characterize the degradation of disk errors to failures by deriving the degradation signatures for each failure category. The influence of disk health attributes on failure degradation is also quantified. We discuss leveraging the derived degradation signatures to forecast disk failures even in their early stages. To the best of our knowledge, this is the first work that shows how to discover the categories of disk failures and characterize their degradation processes in a production data center.
"Characterizing Disk Failures with Quantified Disk Degradation Signatures: An Early Experience." Song Huang, Song Fu, Quan Zhang, Weisong Shi. 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 4, 2015. DOI: 10.1109/IISWC.2015.26.
Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, S. Kaxiras, D. Black-Schaffer
Cycle-level microarchitectural simulation is the de facto standard for estimating the performance of next-generation platforms. Unfortunately, the level of detail needed for accurate simulation requires complex, and therefore slow, simulation models that run thousands of times slower than native execution. With the introduction of sampled simulation, it has become possible to simulate only the key, representative portions of a workload in a reasonable amount of time and reliably estimate its overall performance. These sampling methodologies provide the ability to identify regions for detailed execution, and through microarchitectural state checkpointing, one can quickly and easily determine the performance characteristics of a workload for a variety of microarchitectural changes. While this strategy of sampling simulations to generate checkpoints performs well for static applications, more complex scenarios involving hardware-software co-design (such as co-optimizing both a Java virtual machine and the microarchitecture it is running on) cause this methodology to break down, as new microarchitectural checkpoints are needed for each memory hierarchy configuration and software version. Solutions are therefore needed to enable fast and accurate simulation that also address the needs of hardware-software co-design and exploration. In this work we present a methodology to enhance checkpoint-based sampled simulation. Our solution integrates hardware virtualization to provide near-native speed, virtualized fast-forwarding to regions of interest, and parallel detailed simulation. However, as we cannot warm the simulated caches during virtualized fast-forwarding, we develop a novel approach to estimate the error introduced by limited cache warming, through the use of optimistic and pessimistic warming simulations.
Using virtualized fast-forwarding (which operates at 90% of native speed on average), we demonstrate a parallel sampling simulator that can be used to accurately estimate the IPC of standard workloads with an average error of 2.2% while still reaching an execution rate of 2.0 GIPS (63% of native) on average. Additionally, we demonstrate that our parallelization strategy scales almost linearly and simulates one core at up to 93% of its native execution rate, 19,000x faster than detailed simulation, while using 8 cores.
"Full Speed Ahead: Detailed Architectural Simulation at Near-Native Speed." Andreas Sandberg, Nikos Nikoleris, Trevor E. Carlson, Erik Hagersten, S. Kaxiras, D. Black-Schaffer. 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 4, 2015. DOI: 10.1109/IISWC.2015.29.
D. Z. Tootaghaj, F. Farhat, M. Arjomand, P. Faraboschi, M. Kandemir, A. Sivasubramaniam, C. Das
The combined impact of node architecture and workload characteristics on off-chip network traffic, together with performance/cost analysis, has not been investigated before in the context of emerging cloud applications. Motivated by this observation, this paper performs a thorough characterization of twelve cloud workloads using a full-system datacenter simulation infrastructure. We first study the inherent network characteristics of emerging cloud applications, including message inter-arrival times, packet sizes, inter-node communication overhead, self-similarity, and traffic volume. Then, we study the effect of hardware architectural metrics on network traffic. Our experimental analysis reveals that (1) the message inter-arrival times and packet-size distributions vary across different cloud applications, (2) the inter-arrival times imply a large amount of self-similarity as the number of nodes increases, (3) the node architecture can play a significant role in shaping the overall network traffic, and finally, (4) the applications we study can be broadly divided into those that perform better in a scale-out configuration and those that perform better in a scale-up configuration at the node level, and, separately, into those with long-duration, low-burst flows and those with short-duration, high-burst flows. Using the results of (3) and (4), the paper discusses the performance/cost trade-offs of scale-out and scale-up approaches and proposes an analytical model that can be used to predict the communication and computation demand for different configurations. It is shown that the difference in performance per dollar between two node architectures (with the same number of cores system-wide) can be as high as 154 percent, which underscores the need to characterize cloud applications accurately rather than waste precious cloud resources by allocating the wrong architecture.
The results of this study can be used for system modeling, capacity planning and managing heterogeneous resources for large-scale system designs.
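The scale-up versus scale-out performance-per-dollar comparison the paper makes can be sketched as follows. This is not the authors' analytical model; it is a hypothetical illustration with made-up throughputs, prices, and overhead factors, holding the system-wide core count fixed as the paper does.

```python
# Hypothetical scale-up vs. scale-out performance/cost comparison
# at equal total core count. All numbers are illustrative.

def perf_per_dollar(throughput_per_node, node_cost, num_nodes,
                    network_overhead=0.0):
    """System throughput per dollar, discounted by inter-node overhead."""
    total = throughput_per_node * num_nodes * (1.0 - network_overhead)
    return total / (node_cost * num_nodes)

# 64 cores total: 16 small nodes of 4 cores vs. 4 big nodes of 16 cores.
# Scale-out pays more inter-node communication overhead.
scale_out = perf_per_dollar(throughput_per_node=100.0, node_cost=2000.0,
                            num_nodes=16, network_overhead=0.25)
scale_up = perf_per_dollar(throughput_per_node=450.0, node_cost=9000.0,
                           num_nodes=4, network_overhead=0.05)
```

Which configuration wins depends on the workload's flow characteristics (long-duration, low-burst versus short-duration, high-burst), which is exactly why the paper argues for characterizing an application before allocating resources to it.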
"Evaluating the Combined Impact of Node Architecture and Cloud Workload Characteristics on Network Traffic and Performance/Cost." D. Z. Tootaghaj, F. Farhat, M. Arjomand, P. Faraboschi, M. Kandemir, A. Sivasubramaniam, C. Das. 2015 IEEE International Symposium on Workload Characterization (IISWC), Oct. 1, 2015. DOI: 10.1109/IISWC.2015.31.