
Latest publications: 2015 IEEE International Symposium on Workload Characterization

A Taxonomy of GPGPU Performance Scaling
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.22
Abhinandan Majumdar, Gene Y. Wu, K. Dev, J. Greathouse, Indrani Paul, Wei Huang, Arjun Venugopal, Leonardo Piga, Chip Freitag, Sooraj Puthoor
Graphics processing units (GPUs) range from small, embedded designs to large, high-powered discrete cards. While the performance of graphics workloads is generally understood, there has been little study of the performance of GPGPU applications across a variety of hardware configurations. This work presents performance scaling data gathered for 267 GPGPU kernels from 97 programs run on 891 hardware configurations of a modern GPU. We study the performance of these kernels across a 5× change in core frequency, 8.3× change in memory bandwidth, and 11× difference in compute units. We illustrate that many kernels scale in intuitive ways, such as those that scale directly with added computational capabilities or memory bandwidth. We also find a number of kernels that scale in non-obvious ways, such as losing performance when more processing units are added or plateauing as frequency and bandwidth are increased. In addition, we show that a number of current benchmark suites do not scale to modern GPU sizes, implying that either new benchmarks or new inputs are warranted.
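As an illustration of how such scaling data might be bucketed, here is a minimal sketch that classifies one kernel's response to scaling a single resource (core frequency, bandwidth, or compute units). The thresholds and category names are hypothetical, not the paper's actual taxonomy:

```python
def classify_scaling(base_time, scaled_time, scale_factor, tol=0.1):
    """Classify how a kernel responds when one resource (e.g. core
    frequency or CU count) is scaled up by `scale_factor`.

    Hypothetical rules, not the paper's exact taxonomy:
      - 'linear'   : speedup tracks the resource increase
      - 'plateau'  : little or no speedup
      - 'degraded' : performance got worse
      - 'partial'  : somewhere in between
    """
    speedup = base_time / scaled_time
    if speedup < 1.0 - tol:
        return "degraded"
    if speedup >= scale_factor * (1.0 - tol):
        return "linear"
    if speedup <= 1.0 + tol:
        return "plateau"
    return "partial"
```

For example, a kernel that runs in 10 ms at base frequency and 5 ms at double frequency would be classified as scaling linearly, while one that barely improves would land in the plateau bucket.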
Citations: 13
Source Mark: A Source-Level Approach for Identifying Architecture and Optimization Agnostic Regions for Performance Analysis
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.27
Abhinav Agrawal, Bagus Wibowo, James Tuck
Computer architects often evaluate performance on only parts of a program and not the entire program due to long simulation times that could take weeks or longer to finish. However, choosing regions of a program to evaluate in a way that is consistent and correct with respect to different compilers and different architectures is very challenging and has not received sufficient attention. The need for such tools is growing in importance given the diversity of architectures and compilers in use today. In this work, we propose a technique that identifies regions of a desired granularity for performance evaluation. We use a source-to-source compiler that inserts software marks into the program's source code to divide the execution into regions with a desired dynamic instruction count. An evaluation framework chooses from among a set of candidate marks to find ones that are both consistent across different architectures or compilers and can yield a low run-time instruction overhead. Evaluated on a set of SPEC applications, with a region size of about 100 million instructions, our technique has a dynamic instruction overhead as high as 3.3%, with an average overhead of 0.47%. We also demonstrate the scalability of our technique by evaluating the dynamic instruction overhead for regions of finer granularity and show similarly small overheads; of the applications we studied, only 462.libquantum and 444.namd lacked suitable fine-grained regions. Our technique is an effective alternative to traditional binary-level approaches. We have demonstrated that a source-level approach is robust, that it can achieve low overhead, and that it reduces the effort for bringing up new architectures or compilers into an existing evaluation framework.
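The core idea of splitting a dynamic execution into regions of a target instruction count can be sketched as follows. This is a toy model of mark placement over an instruction-count trace; the paper's tool actually operates on source code via a source-to-source compiler:

```python
def place_marks(block_counts, region_size):
    """Given a dynamic trace as a list of (block_id, instr_count)
    pairs, return the indices of blocks at which a mark would fire,
    splitting execution into regions of roughly `region_size`
    dynamic instructions."""
    marks, executed = [], 0
    for i, (_block, count) in enumerate(block_counts):
        executed += count
        if executed >= region_size:
            marks.append(i)
            executed = 0  # start counting the next region
    return marks
```

With a region size of 100 instructions, a trace whose blocks execute 60, 50, 120, 30, and 80 instructions would fire marks after the second, third, and fifth blocks.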
Citations: 0
A Retrospective Look Back on the Road Towards Energy Proportionality
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.18
Daniel Wong, Julia Chen, M. Annavaram
In this paper, we take a retrospective look back at the road taken towards improving energy proportionality, in order to find out where we are currently, and how we got here. Through statistical regression of published SPECpower results, we were able to identify and quantify the sources of past EP improvements.
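For context, one common way to score energy proportionality (EP) from a power-versus-utilization curve compares the area under the measured curve with the ideal linear curve. This is a standard formulation, not a reproduction of the paper's regression setup:

```python
def ep_score(power, peak=None):
    """Energy-proportionality score for power samples taken at evenly
    spaced utilization levels from idle (index 0) to full load.

    EP = 1 - (A_actual - A_ideal) / A_ideal, where the areas are
    trapezoidal integrals of the normalized curves and the ideal
    curve rises linearly from zero to peak power. A perfectly
    proportional server scores 1.0; a flat power draw scores 0.0.
    """
    peak = peak or power[-1]
    n = len(power) - 1
    def trapz(ys):
        return sum((ys[i] + ys[i + 1]) / 2.0 for i in range(n)) / n
    a_actual = trapz([p / peak for p in power])
    a_ideal = trapz([i / n for i in range(n + 1)])  # always 0.5
    return 1.0 - (a_actual - a_ideal) / a_ideal
```

A regression of such scores against publication year (and component-level power breakdowns, where available) is the kind of analysis the abstract describes.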
Citations: 3
GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.15
Joel Hestness, S. Keckler, D. Wood
Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors. However, understanding potential application optimizations and adaptations is critical for directing heterogeneous processor programming model and architectural development. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported and simulated a broad set of benchmarks originally developed for discrete GPUs to remove memory copies, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and support for efficient producer-consumer data handling and synchronization within caches.
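The benefit of eliminating explicit memory copies can be illustrated with a toy analytical model. This is an assumption-laden sketch (fully serialized stages, copies removed entirely under unified memory), not the paper's analytical models:

```python
def pipeline_times(copy_in, kernel, copy_out):
    """Toy model of one GPU software-pipeline iteration.

    On a discrete GPU the host-to-device copy, kernel execution, and
    device-to-host copy serialize; on a cache-coherent heterogeneous
    processor with a unified memory space the explicit copies go away.
    Returns (discrete_time, unified_time, fraction_of_time_saved).
    """
    discrete = copy_in + kernel + copy_out
    unified = kernel
    return discrete, unified, (discrete - unified) / discrete
```

For a 6 ms kernel bracketed by 2 ms copies each way, this model predicts 40% of the iteration time disappears with unified memory, which is the kind of application-level headroom the paper quantifies.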
Citations: 35
Fast Computational GPU Design with GT-Pin
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.14
Melanie Kambadur, Sunpyo Hong, Juan Cabral, H. Patil, C. Luk, S. Sajid, Martha A. Kim
As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).
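One crude way to trade accuracy for simulation speed is to simulate only the hottest kernels until they cover a target fraction of the dynamic instruction count. This simple coverage heuristic merely illustrates the accuracy/speed knob; the paper's actual selection method is more sophisticated:

```python
def pick_representatives(kernel_instrs, coverage=0.9):
    """Pick the hottest kernels until they cover `coverage` of the
    total dynamic instruction count, as a stand-in for choosing
    small simulation surrogates.

    `kernel_instrs` maps kernel name -> dynamic instruction count.
    """
    total = sum(kernel_instrs.values())
    chosen, covered = [], 0
    for name, instrs in sorted(kernel_instrs.items(),
                               key=lambda kv: -kv[1]):
        if covered >= coverage * total:
            break
        chosen.append(name)
        covered += instrs
    return chosen
```

Loosening `coverage` shrinks the simulated subset (faster, less accurate), which mirrors the 35X-at-0.3%-error versus 223X-at-3.0%-error trade-off the abstract reports.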
Citations: 23
Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.12
S. Beamer, K. Asanović, D. Patterson
Graph processing is an increasingly important application domain and is typically communication-bound. In this work, we analyze the performance characteristics of three high-performance graph algorithm codebases using hardware performance counters on a conventional dual-socket server. Unlike many other communication-bound workloads, graph algorithms struggle to fully utilize the platform's memory bandwidth and so increasing memory bandwidth utilization could be just as effective as decreasing communication. Based on our observations of simultaneous low compute and bandwidth utilization, we find there is substantial room for a different processor architecture to improve performance without requiring a new memory system.
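Bandwidth utilization of the kind measured here is typically derived from hardware performance counters, roughly as follows. This is a sketch; a careful methodology would also count writebacks and prefetch traffic:

```python
def bandwidth_utilization(llc_misses, seconds, peak_gbs, line_bytes=64):
    """Estimate achieved memory bandwidth from last-level-cache miss
    counts (each miss fetching one cache line) over a measurement
    window, and compare it against the platform's peak bandwidth.

    Returns (achieved_gb_per_s, fraction_of_peak).
    """
    achieved_gbs = llc_misses * line_bytes / seconds / 1e9
    return achieved_gbs, achieved_gbs / peak_gbs
```

A graph workload sustaining only half of peak bandwidth while also leaving cores idle is exactly the simultaneous under-utilization the abstract points to.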
Citations: 148
Energy-Performance Trade-offs on Energy-Constrained Devices with Multi-component DVFS
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.10
R. Begum, David Werner, Mark Hempstead, Guru Prasad Srinivasa, Geoffrey Challen
Battery lifetime continues to be a top complaint about smart phones. Dynamic voltage and frequency scaling (DVFS) has existed for mobile device CPUs for some time, and provides a trade-off between energy and performance. Dynamic frequency scaling is beginning to be applied to memory as well to make more energy-performance tradeoffs possible. We present the first characterization of the behavior of the optimal frequency settings of workloads running both under energy constraints and on systems capable of CPU DVFS and memory DFS, an environment representative of next-generation mobile devices. Our results show that continuously using the optimal frequency settings results in a large number of frequency transitions, which end up hurting performance. However, by permitting a small loss in performance, transition overhead can be reduced and end-to-end performance and energy consumption improved. We introduce the idea of inefficiency as a way of constraining task energy consumption relative to the most energy-efficient settings, and characterize the performance of multiple workloads running under different inefficiency settings.
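The inefficiency idea can be sketched as a filter over measured per-setting energies (the numbers and setting names below are assumed, for illustration only):

```python
def settings_within_budget(energies, budget):
    """Given per-setting energy measurements (setting -> joules), keep
    the settings whose inefficiency -- the energy overhead relative
    to the most efficient setting, E / E_min - 1 -- stays within
    `budget`. Returns a dict of setting -> inefficiency.
    """
    e_min = min(energies.values())
    return {s: e / e_min - 1.0
            for s, e in energies.items()
            if e / e_min - 1.0 <= budget}
```

With a 15% inefficiency budget, a setting costing 11 J passes (10% over the 10 J optimum) while one costing 13 J is rejected, leaving the scheduler free to pick the faster of the surviving settings.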
Citations: 32
Characterization and Throttling-Based Mitigation of Memory Interference for Heterogeneous Smartphones
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.9
Davesh Shingari, A. Arunkumar, Carole-Jean Wu
The availability of a wide range of general purpose as well as accelerator cores on modern smart phones means that a significant number of applications can be executed on a smart phone simultaneously, resulting in an ever increasing demand on the memory subsystem. While the increased computation capability is intended for improving user experience, memory requests from each concurrent application exhibit unique memory access patterns as well as specific timing constraints. If not considered, this could lead to significant memory contention and result in lowered user experience. In this paper, we design experiments to analyze the performance degradation caused by the interference at the memory subsystem for a broad range of commonly-used smart phone applications. The characterization studies are performed on a real smart phone device -- Google Nexus5 -- running an Android operating system. Our results show that user-centric smart phone applications, such as web browsing and media player, suffer up to 34% and 21% performance degradation, respectively, from shared resource contention at the application processor's last-level cache, the communication fabric, and the main memory. Taking a step further, we demonstrate the feasibility and effectiveness of a frequency throttling-based memory interference mitigation technique. At the expense of performance degradation of interfering applications, frequency throttling is an effective technique for mitigating memory interference, leading to better QoS and user experience, for user-centric applications.
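A frequency-throttling mitigation can be sketched as a simple reactive policy. The threshold, step, and floor below are hypothetical; the paper evaluates throttling on a real Nexus 5, not this policy:

```python
def throttle_decision(fg_slowdown, bg_freqs, step_ghz=0.2,
                      floor_ghz=0.6, threshold=0.10):
    """Toy throttling policy: if the foreground app's measured
    memory-interference slowdown exceeds `threshold`, step every
    background core's frequency down by `step_ghz`, never dropping
    below `floor_ghz`. Returns the new background frequencies (GHz).
    """
    if fg_slowdown <= threshold:
        return bg_freqs
    return [max(floor_ghz, f - step_ghz) for f in bg_freqs]
```

This captures the trade the abstract describes: background (interfering) applications lose performance so that the user-facing application regains its QoS.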
Citations: 19
Big or Little: A Study of Mobile Interactive Applications on an Asymmetric Multi-core Platform
Pub Date: 2015-10-04 DOI: 10.1109/IISWC.2015.7
Wonik Seo, Daegil Im, Jeongim Choi, Jaehyuk Huh
This paper characterizes a commercial mobile platform on an asymmetric multi-core processor, investigating its available thread-level parallelism (TLP) and the impact of core asymmetry on applications. This paper explores three critical aspects of asymmetric mobile systems, asymmetric hardware platform, application behavior, and the impact of scheduling and power management. First, this paper presents the performance and energy characteristics of a commercial asymmetric multi-core architecture with two core types. The comparison between big and little cores shows the potential benefit of asymmetric multi-cores for improving energy efficiency. Second, the paper investigates the available thread-level parallelism and core utilization behaviors of mobile interactive applications. Using popular mobile applications for the Android system, this paper analyzes the distinct TLP and CPU usage patterns of interactive applications. Third, the paper explores the impact of power governor and CPU scheduler on the asymmetric system. Multiple cores with heterogeneous core types complicate scheduling and frequency scaling schemes, since the scheduler must migrate threads to different core types, in addition to traditional load balancing. This study shows that the current mobile applications are not fully utilizing the asymmetric multi-cores due to the lack of TLP and low computational requirement for big cores.
Citations: 25
PC Design, Use, and Purchase Relations
Pub Date : 2015-10-04 DOI: 10.1109/IISWC.2015.25
Al M. Rashid, B. Kuhn, B. Arbab, D. Kuck
For 25 years, industry-standard benchmarks have proliferated, attempting to approximate user activities. This has helped drive the success of PCs to commodity levels by characterizing applications for designers and offering performance information to users. However, the many new configurations in each PC release cycle often leave users unsure how to choose one. This paper takes a different approach, using tools based on new metrics to analyze real usage by millions of people. Our goal is to develop a methodology for a deeper understanding of usage that can help designers satisfy users. These metrics demonstrate that usage differs uniformly between high- and low-end CPU-based systems, regardless of why a user bought a given system. We outline how this data can be used to partition markets and make more effective hardware (hw) and software (sw) design decisions, tailoring systems for prospective markets.
Citations: 2
Journal: 2015 IEEE International Symposium on Workload Characterization