
2013 IEEE International Symposium on Workload Characterization (IISWC): Latest Publications

Revisiting the management control plane in virtualized cloud computing infrastructure
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704680
V. Soundararajan, Lawrence Spracklen
Previous research in virtualization has demonstrated that the management of a virtualized datacenter is a workload by itself, over and above the applications running in the virtualized datacenter. Virtualization has become more prevalent as the backbone of various cloud computing environments, and the workflows used in cloud computing are slightly different from typical datacenter workflows. In this paper, we profile the management workload induced by cloud-computing environments. Specifically, we analyze results from two real-world self-service cloud computing setups. Our results show that, when the most recent virtualization techniques are used to conserve data bandwidth in clouds, the management control plane becomes a significant limiting factor in deploying cloud resources. We demonstrate that the rate of VM provisioning in clouds demands more aggressive means of performing previously infrequent operations like cloud reconfiguration, and these demands may influence virtualized datacenter design.
Citations: 1
A structured approach to the simulation, analysis and characterization of smartphone applications
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704677
Dam Sunwoo, William Wang, Mrinmoy Ghosh, Chander Sudanthi, G. Blake, C. D. Emmons, N. Paver
Full-system simulators are invaluable tools for designing new architectures due to their ability to simulate full applications as well as capture operating system behavior, virtual machine or hypervisor behavior, and interference between concurrently-running applications. However, the systems under investigation and applications under test have become increasingly complicated, leading to prohibitively long simulation times for a single experiment. This problem is compounded when many permutations of system design parameters and workloads are tested to investigate system sensitivities and full-system effects with confidence. In this paper, we propose a methodology to tractably explore the processor design space and to characterize applications in a full-system simulation environment. We combine SimPoint, Principal Component Analysis and Fractional Factorial experimental designs to substantially reduce the simulation effort needed to characterize and analyze workloads. We also present a non-invasive user-interface automation tool to allow us to study all types of workloads in a simulation environment. While our methodology is generally applicable to many simulators and workloads, we demonstrate the application of our proposed flow on smartphone applications running on the Android operating system within the gem5 simulation environment.
Citations: 35
Characterizing the efficiency of data deduplication for big data storage management
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704674
Ruijin Zhou, Ming Liu, Tao Li
The demand for data storage and processing is increasing rapidly in the big data era. Such a tremendous amount of data pushes the limit on storage capacity and on the storage network. A significant portion of the dataset in big data workloads is redundant. As a result, deduplication technology, which removes replicas, becomes an attractive solution to save disk space and traffic in a big data environment. However, the overhead of extra CPU computation (hash indexing) and IO latency introduced by deduplication should be considered. Therefore, the net effect of using deduplication for big data workloads needs to be examined. To this end, we characterize the redundancy of typical big data workloads to justify the need for deduplication. We analyze and characterize the performance and energy impact brought by deduplication under various big data environments. In our experiments, we identify three sources of redundancy in big data workloads: 1) deploying more nodes, 2) expanding the dataset, and 3) using replication mechanisms. We elaborate on the advantages and disadvantages of different deduplication layers, locations, and granularities. In addition, we uncover the relation between energy overhead and the degree of redundancy. Furthermore, we investigate the deduplication efficiency in an SSD environment for big data workloads.
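As an illustrative aside (not code from the paper), the sketch below shows the chunk-and-index scheme behind the "hash indexing" cost mentioned above: each chunk is fingerprinted and looked up in an index, and chunks already present are dropped instead of stored. FNV-1a, 8-byte chunks, and a 1024-slot table are assumptions standing in for the cryptographic fingerprints and kilobyte-scale chunks a real deduplication system uses.

```c
/* Hedged sketch of fixed-size-chunk deduplication with a hash index. */
#include <stdio.h>

#define CHUNK       8      /* toy chunk size */
#define INDEX_SLOTS 1024

static unsigned long long fnv1a(const unsigned char *p, size_t n) {
    unsigned long long h = 1469598103934665603ULL;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

static unsigned long long index_tab[INDEX_SLOTS];   /* 0 marks an empty slot */

/* Returns 1 if the chunk is new (store it), 0 if it is a duplicate. */
static int dedup_insert(const unsigned char *chunk, size_t n) {
    unsigned long long h = fnv1a(chunk, n);
    size_t slot = h % INDEX_SLOTS;
    while (index_tab[slot] != 0) {
        if (index_tab[slot] == h)
            return 0;                      /* fingerprint seen before */
        slot = (slot + 1) % INDEX_SLOTS;   /* linear probing */
    }
    index_tab[slot] = h;
    return 1;
}

int main(void) {
    const unsigned char data[] = "abcdefghabcdefghABCDEFGHabcdefgh";
    size_t total = 0, stored = 0;
    for (size_t off = 0; off + CHUNK <= sizeof data - 1; off += CHUNK) {
        total++;
        stored += (size_t)dedup_insert(data + off, CHUNK);
    }
    printf("chunks: %zu, stored after dedup: %zu\n", total, stored);
    return 0;
}
```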
Citations: 39
(Mis)understanding the NUMA memory system performance of multithreaded workloads
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704666
Z. Majó, T. Gross
An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence on performance on systems with aggressive prefetcher units. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: Programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard to predict by processor prefetcher units. The memory system performance as observed for a given program on a specific platform depends also on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.
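As one hedged example (not the paper's code) of the "simple algorithmic changes coupled with commonly available OS functionality" mentioned above: on Linux, the default first-touch policy places each page on the NUMA node of the thread that first writes it, so initializing an array with the same static OpenMP schedule that later processes it keeps each thread's pages local. The array size is arbitrary; build with -fopenmp.

```c
/* Hedged sketch of first-touch-aware initialization on a NUMA machine. */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)   /* arbitrary array size for the example */

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* Parallel, statically scheduled initialization: each thread first-touches
     * (and therefore places) exactly the pages it will read in the next loop.
     * A serial init here would place every page on the initializing node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = (double)i;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i] * a[i];

    printf("sum = %g\n", sum);
    free(a);
    return 0;
}
```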
Citations: 38
Do C and Java programs scale differently on Hardware Transactional Memory?
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704668
Rei Odaira, J. Castaños, T. Nakaike
People program in many different programming languages in the multi-core era, but how does each programming language affect application scalability with transactional memory? As commercial implementations of Hardware Transactional Memory (HTM) enter the market, the HTM support in two major programming languages, C and Java, is of critical importance to the industry. We studied the scalability of the same transactional memory applications written in C and Java, using the STAMP benchmarks. We performed our HTM experiments on an IBM mainframe zEnterprise EC12. We found that in 4 of the 10 STAMP benchmarks Java was more scalable than C. The biggest factor in this higher scalability was the efficient thread-local memory allocator in our Java VM. In two of the STAMP benchmarks C was more scalable because in C padding can be inserted efficiently among frequently updated fields to avoid false sharing. We also found Java VM services could cause severe aborts. By fixing or avoiding these problems, we confirmed that C and Java had similar HTM scalability for the STAMP benchmarks.
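The padding technique the abstract credits for C's better scalability in two benchmarks looks roughly like the sketch below (an illustration, not the STAMP code); the 64-byte cache-line size and per-thread counters are assumptions. Build with -pthread.

```c
/* Hedged sketch: pad frequently updated per-thread fields to one cache line
 * each so threads updating their own field do not invalidate each other's
 * lines (false sharing). */
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64
#define NTHREADS   4
#define ITERS      10000000L

struct padded_counter {
    long count;
    char pad[CACHE_LINE - sizeof(long)];  /* keep neighbors on separate lines */
};

static struct padded_counter counters[NTHREADS];  /* drop 'pad' to see false sharing */

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].count++;   /* each thread touches only its own line */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    long total = 0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        total += counters[i].count;
    }
    printf("total = %ld (expect %ld)\n", total, NTHREADS * ITERS);
    return 0;
}
```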
Citations: 5
Semantic characterization of MapReduce workloads
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704673
Zhihong Xu, Martin Hirzel, G. Rothermel
MapReduce is a platform for analyzing large amounts of data on clusters of commodity machines. MapReduce is popular, in part thanks to its apparent simplicity. However, there are unstated requirements for the semantics of MapReduce applications that can affect their correctness and performance. MapReduce implementations do not check whether user code satisfies these requirements, leading to time-consuming debugging sessions, performance problems, and, worst of all, silently corrupt results. This paper makes these requirements explicit, framing them as semantic properties and assumed outcomes. It describes a black-box approach for testing for these properties, and uses the approach to characterize the semantics of 23 non-trivial MapReduce workloads. Surprisingly, we found that for most requirements, there is at least one workload that violates it. This means that MapReduce may be simple to use, but it is not as simple to use correctly. Based on our results, we provide insights to users on how to write higher-quality MapReduce code, and insights to system and language designers on ways to make their platforms more robust.
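The abstract does not enumerate the individual semantic properties, but an order-sensitive reducer is a classic example of the kind of unstated requirement it describes. The hypothetical sketch below (not from the paper) shows a reduce function whose result changes with the delivery order of values, an order the framework is free to choose.

```c
/* Hypothetical example of a reducer that violates order-insensitivity. */
#include <stdio.h>

/* "Reduce": subtract every later value from the first one seen. Order-sensitive. */
static int order_sensitive_reduce(const int *values, int n) {
    int acc = values[0];
    for (int i = 1; i < n; i++)
        acc -= values[i];
    return acc;
}

int main(void) {
    int run1[] = {10, 3, 2};   /* one legal delivery order */
    int run2[] = {2, 3, 10};   /* another legal delivery order */
    printf("run 1: %d\n", order_sensitive_reduce(run1, 3));   /* 10-3-2  =   5 */
    printf("run 2: %d\n", order_sensitive_reduce(run2, 3));   /*  2-3-10 = -11 */
    return 0;
}
```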
Citations: 11
Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704679
D. Pandiyan, Shin-Ying Lee, Carole-Jean Wu
In this paper, we explore key microarchitectural features of mobile computing platforms that are crucial to the performance of smart phone applications. We create and use a selection of representative smart phone applications, which we call MobileBench, to aid in this analysis. We also evaluate the effectiveness of the current memory subsystem on mobile platforms. Furthermore, by instrumenting the Android framework, we perform energy characterization for MobileBench on an existing Samsung Galaxy S III smart phone. Based on our energy analysis, we find that application cores on modern smart phones consume a significant amount of energy. This motivates our detailed performance analysis centered on the application cores. Based on our detailed performance studies, we reach several key findings. (i) Using a more sophisticated tournament branch predictor can improve branch prediction accuracy, but this does not translate to an observable performance gain. (ii) Smart phone applications show distinct TLB capacity needs. Larger TLBs can improve performance by an avg. of 14%. (iii) The current L2 cache on most smart phone platforms experiences poor utilization because of the fast-changing memory requirements of smart phone applications. Using a more effective cache management scheme improves the L2 cache utilization by as much as 29.3% and by an avg. of 12%. (iv) Smart phone applications are prefetching-friendly. Using a simple stride prefetcher can improve performance across MobileBench applications by an avg. of 14%. (v) Lastly, the memory bandwidth requirements of MobileBench applications are moderate and well under the current smart phone memory bandwidth capacity of 8.3 GB/s. With these insights into smart phone application characteristics, we hope to guide the design of future smart phone platforms toward lower power consumption through simpler architectures while achieving high performance.
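As a hedged illustration of the "simple stride prefetcher" in finding (iv), and not the simulated hardware itself, the sketch below keeps a small per-PC table of last address and stride and predicts the next address once the same stride has been observed twice; the table size, confidence scheme, and trigger policy are assumptions.

```c
/* Hedged sketch of a per-PC stride prefetcher table. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 256

static struct entry {
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;   /* saturates at 2 */
} table[TABLE_SIZE];

/* Called on every load; returns the address to prefetch, or 0 for none. */
static uint64_t stride_prefetch(uint64_t pc, uint64_t addr) {
    struct entry *e = &table[pc % TABLE_SIZE];
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride == e->stride) {
        if (e->confidence < 2) e->confidence++;
    } else {
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;
    return e->confidence >= 2 ? addr + (uint64_t)e->stride : 0;
}

int main(void) {
    /* One load instruction (PC 0x400) walking an array of 8-byte elements. */
    for (uint64_t a = 0x1000; a < 0x1000 + 8 * 10; a += 8) {
        uint64_t pf = stride_prefetch(0x400, a);
        if (pf)
            printf("access 0x%" PRIx64 " -> prefetch 0x%" PRIx64 "\n", a, pf);
    }
    return 0;
}
```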
Citations: 68
Hardware-independent application characterization
S. Pakin, P. McCormick
The trend in high-performance computing is to include computational accelerators such as GPUs or Xeon Phis in each node of a large-scale system. Qualitatively, such accelerators tend to favor codes that perform large numbers of floating-point and integer operations per branch; that exhibit high degrees of memory locality; and that are highly data-parallel. The question we address in this work is how to quantify those characteristics. To that end we developed an application-characterization tool called Byfl that provides a set of "software performance counters". These are analogous to the hardware performance counters provided by most modern processors but are implemented via code instrumentation: the equivalent of adding flops = flops + 1 after every floating-point operation, but in fact implemented by modifying the compiler's internal representation of the code.
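The sketch below makes the "flops = flops + 1" analogy concrete as hand-written source instrumentation of a toy kernel; Byfl itself inserts the equivalent counters by rewriting the compiler's internal representation rather than the source, and the counter names and the dot() example here are illustrative only, not Byfl's interface.

```c
/* Hedged sketch of what "software performance counters" amount to,
 * written as manual source instrumentation. */
#include <stdio.h>

static unsigned long flops = 0;   /* floating-point operations executed */
static unsigned long loads = 0;   /* memory loads executed */

static double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double a = x[i];   loads++;
        double b = y[i];   loads++;
        sum += a * b;      flops += 2;   /* one multiply, one add */
    }
    return sum;
}

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    printf("dot   = %g\n", dot(x, y, 4));
    printf("flops = %lu, loads = %lu\n", flops, loads);
    return 0;
}
```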
Citations: 18
Modeling virtual machines misprediction overhead
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704681
D. C. S. Lucas, R. Auler, Rafael Dalibera, S. Rigo, E. Borin, G. Araújo
Virtual machines are versatile systems that can support innovative solutions to many problems. These systems usually rely on emulation techniques, such as interpretation and dynamic binary translation, to execute guest application code. Usually, in order to select the best emulation technique for each code segment, the system must predict whether the code is worth compiling (frequently executed) or not, known as hotness prediction. In this paper we show that the threshold-based hot code predictor frequently mispredicts the code hotness and, as a result, VM emulation performance becomes dominated by miscompilations. To do so, we developed a mathematical model to simulate the behavior of such a predictor, and using it we quantify and characterize the impact of mispredictions in several benchmarks. We also show how the threshold choice can affect the predictor, what the major overhead components are, and how using SPEC to analyze a VM's performance can lead to misleading results.
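A minimal sketch of the threshold-based hot-code predictor being analyzed is shown below; the threshold value, region mix, and counters are assumptions, not the paper's model. Each region is interpreted and counted until its counter crosses the threshold, after which it is compiled; a region that crosses the threshold but is then rarely re-executed is the kind of misprediction whose cost the paper models.

```c
/* Hedged sketch of a threshold-based hotness predictor. */
#include <stdio.h>

#define NREGIONS  3
#define THRESHOLD 1000   /* assumed hotness threshold */

enum mode { INTERPRET, COMPILED };

static struct region {
    unsigned long exec_count;
    enum mode mode;
} regions[NREGIONS];

/* Called once per execution of region r; returns the mode used this time. */
static enum mode execute_region(int r) {
    struct region *reg = &regions[r];
    if (reg->mode == INTERPRET && ++reg->exec_count >= THRESHOLD)
        reg->mode = COMPILED;   /* predicted hot: JIT-compile from now on */
    return reg->mode;
}

int main(void) {
    /* Region 0 is genuinely hot, region 1 barely crosses the threshold,
     * region 2 never does: only region 0 repays its compilation cost. */
    unsigned long runs[NREGIONS] = {100000, 1001, 999};
    for (int r = 0; r < NREGIONS; r++) {
        unsigned long compiled = 0;
        for (unsigned long i = 0; i < runs[r]; i++)
            if (execute_region(r) == COMPILED)
                compiled++;
        printf("region %d: %lu executions, %lu after compilation\n",
               r, runs[r], compiled);
    }
    return 0;
}
```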
Citations: 5
Pannotia: Understanding irregular GPGPU graph applications
Pub Date : 2013-09-01 DOI: 10.1109/IISWC.2013.6704684
Shuai Che, Bradford M. Beckmann, S. Reinhardt, K. Skadron
GPUs have become popular recently to accelerate general-purpose data-parallel applications. However, most existing work has focused on GPU-friendly applications with regular data structures and access patterns. While a few prior studies have shown that some irregular workloads can also achieve speedups on GPUs, this domain has not been investigated thoroughly. Graph applications are one such set of irregular workloads, used in many commercial and scientific domains. In particular, graph mining -as well as web and social network analysis- are promising applications that GPUs could accelerate. However, implementing and optimizing these graph algorithms on SIMD architectures is challenging because their data-dependent behavior results in significant branch and memory divergence. To address these concerns and facilitate research in this area, this paper presents and characterizes a suite of GPGPU graph applications, Pannotia, which is implemented in OpenCL and contains problems from diverse and important graph application domains. We perform a first-step characterization and analysis of these benchmarks and study their behavior on real hardware. We also use clustering analysis to illustrate the similarities and differences of the applications in the suite. Finally, we make architectural and scheduling suggestions that will improve their execution efficiency on GPUs.
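The Pannotia kernels are written in OpenCL; as a plain-C stand-in (not Pannotia code), the sketch below shows the CSR-based, data-dependent loop shape that produces the branch and memory divergence discussed above: per-vertex out-degrees vary (divergent branches and trip counts) and neighbor indices are scattered (divergent, data-dependent memory accesses). The tiny graph and the PageRank-like update are illustrative only.

```c
/* Hedged sketch of an irregular CSR graph traversal; the per-vertex body is
 * what each GPU work-item would run. */
#include <stdio.h>

int main(void) {
    /* CSR graph with 4 vertices and out-degrees 3, 1, 0, 2. */
    int   row_ptr[] = {0, 3, 4, 4, 6};
    int   col_idx[] = {1, 3, 2,  0,  1, 2};
    float rank[]    = {0.25f, 0.25f, 0.25f, 0.25f};
    float next[4]   = {0};

    for (int v = 0; v < 4; v++) {                 /* one "work-item" per vertex */
        int degree = row_ptr[v + 1] - row_ptr[v]; /* varies per vertex */
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; e++) {
            int u = col_idx[e];                   /* irregular, data-dependent index */
            next[u] += rank[v] / degree;
        }
    }
    for (int v = 0; v < 4; v++)
        printf("vertex %d: %.3f\n", v, next[v]);
    return 0;
}
```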
Citations: 178