
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS): latest publications

Estimation-based profiling for code placement optimization in sensor network programs
Lipeng Wan, Qing Cao, Wenjun Zhou
In this work, we focus on applying profiling guided code placement to programs running on resource-constrained sensor motes. Specifically, we model the execution of sensor network programs under nondeterministic inputs as discrete-time Markov processes, and propose a novel approach named Code Tomography to estimate parameters of the Markov models that reflect sensor network programs' dynamic execution behavior by only using end-to-end timing information measured at start and end points of each procedure. The parameters estimated by Code Tomography are fed back to compilers to optimize the code placement so that branch misprediction rate can be reduced.
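The estimation idea can be sketched with a toy example: if per-block costs are known, a hidden branch probability can be recovered from nothing but end-to-end procedure timings. This is a minimal illustration of the principle, not the paper's Code Tomography estimator; all constants below are invented.

```python
import random

# Toy illustration of the estimation idea (NOT the paper's Code Tomography
# estimator): per-block costs are assumed known, and a hidden branch
# probability is recovered purely from end-to-end procedure timings.
T_ENTRY, T_THEN, T_ELSE, T_EXIT = 2.0, 7.0, 3.0, 1.0
TRUE_P = 0.3  # hidden probability of the "then" branch (unknown to the estimator)

def run_once(rng):
    """Simulate one execution; only the total time is observable."""
    taken = rng.random() < TRUE_P
    return T_ENTRY + (T_THEN if taken else T_ELSE) + T_EXIT

rng = random.Random(42)
n = 100_000
mean_t = sum(run_once(rng) for _ in range(n)) / n

# E[T] = T_ENTRY + p*T_THEN + (1-p)*T_ELSE + T_EXIT, solved for p:
p_hat = (mean_t - T_ENTRY - T_ELSE - T_EXIT) / (T_THEN - T_ELSE)
print(f"estimated branch probability: {p_hat:.3f}")
```

With enough samples the estimate converges near the true probability; the paper generalizes this to full discrete-time Markov models over procedure control flow.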
DOI: 10.1109/ISPASS.2015.7095799
Citations: 2
Self-monitoring overhead of the Linux perf_event performance counter interface
Vincent M. Weaver
Most modern CPUs include hardware performance counters: architectural registers that allow programmers to gain low-level insight into system performance. Low-overhead access to these counters is necessary for accurate performance analysis, making the operating system interface critical to providing low-latency performance data. We investigate the overhead of self-monitoring performance counter measurements on the Linux perf_event interface. We find that default code (such as that used by PAPI) implementing the perf_event self-monitoring interface can have large overhead: up to an order of magnitude larger than the previously used perfctr and perfmon2 performance counter implementations. We investigate the causes of this overhead and find that with proper coding this overhead can be greatly reduced on recent Linux kernels.
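The measurement methodology is simple to sketch: time a large batch of back-to-back counter reads and divide by the iteration count. In the sketch below, Python's monotonic clock stands in for a perf_event counter read; the paper's actual measurements wrap `read()` on a perf_event file descriptor in C.

```python
import time

# Sketch of the self-monitoring overhead methodology: time a large batch of
# back-to-back "counter read" calls and report the average per-call cost.
# Here the counter is just Python's monotonic clock, standing in for a
# read() on a perf_event file descriptor.
def measure_read_overhead(read_counter, iterations=100_000):
    start = time.perf_counter_ns()
    for _ in range(iterations):
        read_counter()
    return (time.perf_counter_ns() - start) / iterations  # avg ns per read

ns_per_read = measure_read_overhead(time.perf_counter_ns)
print(f"~{ns_per_read:.0f} ns per counter read")
```

Amortizing over many iterations hides one-time setup costs, which is exactly why the paper separates first-read latency from steady-state read overhead.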
DOI: 10.1109/ISPASS.2015.7095789
Citations: 39
Nyami: a synthesizable GPU architectural model for general-purpose and graphics-specific workloads
Jeffrey T. Bush, Philip Dexter, Timothy N. Miller, A. Carpenter
Graphics processing units (GPUs) continue to grow in popularity for general-purpose, highly parallel, high-throughput systems. This has forced GPU vendors to increase their focus on general purpose workloads, sometimes at the expense of the graphics-specific workloads. Using GPUs for general-purpose computation is a departure from the driving forces behind programmable GPUs that were focused on a narrow subset of graphics rendering operations. Rather than focus on purely graphics-related or general-purpose use, we have designed and modeled an architecture that optimizes for both simultaneously to efficiently handle all GPU workloads. In this paper, we present Nyami, a co-optimized GPU architecture and simulation model with an open-source implementation written in Verilog. This approach allows us to more easily explore the GPU design space in a synthesizable, cycle-precise, modular environment. An instruction-precise functional simulator is provided for co-simulation and verification. Overall, we assume a GPU may be used as a general-purpose GPU (GPGPU) or a graphics engine and account for this in the architecture's construction and in the options and modules selectable for synthesis and simulation. To demonstrate Nyami's viability as a GPU research platform, we exploit its flexibility and modularity to explore the impact of a set of architectural decisions. These include sensitivity to cache size and associativity, barrel and switch-on-stall multithreaded instruction scheduling, and software vs. hardware implementations of rasterization. Through these experiments, we gain insight into commonly accepted GPU architecture decisions, adapt the architecture accordingly, and give examples of the intended use as a GPU research tool.
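One of the explored decisions, sensitivity to cache size and associativity, can be illustrated with a stock set-associative cache model. This is a standalone sketch, unrelated to Nyami's Verilog implementation; the geometry and access pattern are invented.

```python
from collections import OrderedDict

# Minimal set-associative cache model with LRU replacement, of the kind used
# to explore cache-size/associativity sensitivity (a sketch, not Nyami itself).
class Cache:
    def __init__(self, n_sets, ways, line_bytes=64):
        self.n_sets, self.ways, self.line = n_sets, ways, line_bytes
        self.sets = [OrderedDict() for _ in range(n_sets)]
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        tag, index = divmod(addr // self.line, self.n_sets)
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)       # refresh LRU position
            self.hits += 1
        else:
            if len(s) >= self.ways:  # evict least-recently-used way
                s.popitem(last=False)
            s[tag] = True

    @property
    def hit_rate(self):
        return self.hits / self.accesses

# Same capacity, different associativity: a three-line conflict pattern
# thrashes the direct-mapped cache but fits in the 4-way one.
direct, assoc4 = Cache(64, 1), Cache(16, 4)
for _ in range(1000):
    for addr in (0, 64 * 64, 2 * 64 * 64):  # all map to set 0 in both caches
        direct.access(addr)
        assoc4.access(addr)
print(direct.hit_rate, assoc4.hit_rate)
```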
DOI: 10.1109/ISPASS.2015.7095803
Citations: 18
Synchrotrace: synchronization-aware architecture-agnostic traces for light-weight multicore simulation
Siddharth Nilakantan, K. Sangaiah, A. More, G. Salvador, B. Taskin, Mark Hempstead
Trace-driven simulation of chip multiprocessor (CMP) systems offers many advantages over execution-driven simulation, such as reduced simulation time and complexity, portability, and scalability. However, trace-based simulation approaches have encountered difficulty capturing and accurately replaying multi-threaded traces due to the inherent non-determinism in the execution of multi-threaded programs. In this work, we present SynchroTrace, a scalable, flexible, and accurate trace-based multi-threaded simulation methodology. The methodology captures synchronization- and dependency-aware, architecture-agnostic, multi-threaded traces and uses a replay mechanism that plays back these traces correctly. By recording synchronization events and dependencies in the traces, independent of the host architecture, the methodology is able to accurately model the non-determinism of multi-threaded programs for different platforms. We validate the SynchroTrace simulation flow by successfully achieving the equivalent results of a constraint-based design space exploration with the Gem5 Full-System simulator. The results from simulating benchmarks from PARSEC 2.1 and Splash-2 show that our trace-based approach with trace filtering has a peak speedup of up to 18.4x over simulation in Gem5 Full-System with an average of about 7.5x speedup. We are also able to compress traces up to 74% of their original size with almost no impact on accuracy.
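The core replay idea, retiring a trace event only after its cross-thread dependencies have retired, can be sketched as follows. The event tuples and names are invented for illustration; this is not SynchroTrace's trace format.

```python
from collections import deque

# Sketch of synchronization-aware trace replay: each event may depend on
# events from other threads; an event retires only once its dependencies
# have retired (assumes the dependency graph is acyclic).
# Events are (thread, event_id, depends_on) with depends_on a set of ids.
trace = [
    ("T0", "e0", set()),   # T0: produce data, then release
    ("T1", "e1", {"e0"}),  # T1: acquire, consume after T0's release
    ("T0", "e2", set()),
    ("T1", "e3", {"e2"}),
]

def replay(events):
    retired, order = set(), []
    pending = deque(events)
    while pending:
        thread, eid, deps = pending.popleft()
        if deps <= retired:
            retired.add(eid)
            order.append(eid)
        else:
            pending.append((thread, eid, deps))  # not ready yet; retry later
    return order

print(replay(trace))
```

Because ordering is enforced by recorded dependencies rather than by the host's scheduler, the same trace replays deterministically on any simulated platform, which is the property the paper relies on.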
DOI: 10.1109/ISPASS.2015.7095813
Citations: 22
Graph-matching-based simulation-region selection for multiple binaries
Charles R. Yount, H. Patil, M. S. Islam, Aditya Srikanth
Comparison of simulation-based performance estimates of program binaries built with different compiler settings or targeted at variants of an instruction set architecture is essential for software/hardware co-design and similar engineering activities. Commonly-used sampling techniques for selecting simulation regions do not ensure that samples from the various binaries being compared represent the same source-level work, leading to biased speedup estimates and difficulty in comparative performance debugging. The task of creating equal-work samples is made difficult by differences between the structure and execution paths across multiple binaries such as variations in libraries, in-lining, and loop-iteration counts. Such complexities are addressed in this work by first applying an existing graph-matching technique to call and loop graphs for multiple binaries for the same source program. Then, a new sequence-alignment algorithm is applied to execution traces from the various binaries, using the graph-matching results to define intervals of equal work. A basic-block profile generated for these matched intervals can then be used for phase-detection and simulation-region selection across all binaries simultaneously. The resulting selected simulation regions match both in number and the work done across multiple binaries. The application of this technique is demonstrated on binaries compiled for different Intel 64 Architecture instruction-set extensions. Quality metrics for speedup estimation and an example of applying the data for performance debugging are presented.
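The alignment step can be illustrated with a stock sequence aligner over two symbolic traces: matching runs of symbols mark intervals of equal source-level work, while the unmatched region corresponds to binary-specific code. The traces and the use of `difflib` are illustrative; the paper applies its own sequence-alignment algorithm to real execution traces.

```python
from difflib import SequenceMatcher

# Align two execution traces to find intervals of equal work (a sketch using
# a stock aligner, not the paper's algorithm). Each symbol stands for a
# matched source-level region from the graph-matching step.
trace_a = list("ABCABCXYZABC")  # binary 1: contains an extra X/Y/Z region
trace_b = list("ABCABCABC")     # binary 2: no X/Y/Z region

matcher = SequenceMatcher(a=trace_a, b=trace_b, autojunk=False)
for m in matcher.get_matching_blocks():
    if m.size:
        print(f"equal work: a[{m.a}:{m.a + m.size}] == b[{m.b}:{m.b + m.size}]")
```

Basic-block profiles gathered over such matched intervals can then drive phase detection consistently across all binaries.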
DOI: 10.1109/ISPASS.2015.7095784
Citations: 7
Characterization and cross-platform analysis of high-throughput accelerators
Keitarou Oka, Wenhao Jia, M. Martonosi, Koji Inoue
Today's computer systems often employ high-throughput accelerators (such as Intel Xeon Phi coprocessors and NVIDIA Tesla GPUs) to improve the performance of some applications or portions of applications. While such accelerators are useful for suitable applications, it remains challenging to predict which workloads will run well on these platforms and to predict the resulting performance trends for varying input. This paper provides detailed characterizations on such platforms across a range of programs and input sizes. Furthermore, we show opportunities for cross-platform performance analysis and comparison between Xeon Phi and Tesla. Our cross-platform comparison has three steps. First, we build Xeon Phi performance regression models as a function of important Xeon Phi performance counters to identify critical architectural resources that highly affect a benchmark's performance. Then, cross-platform Tesla performance regression models are built to relate the Tesla performance trends of the benchmark to the Xeon Phi performance counter measurements of the benchmark. Finally, we compare the counters most important for Xeon Phi models to those most important for Tesla's models; this reveals similarities and distinctions of dynamic application behaviors on the two platforms.
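The first modeling step, regressing performance on counter readings, can be sketched with a one-variable ordinary-least-squares fit. The counter values and runtimes below are invented for illustration; the paper fits multi-counter models.

```python
# Sketch of a performance-counter regression model: fit runtime as a linear
# function of one counter reading via ordinary least squares.
def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope  # (intercept, slope)

# Hypothetical counter readings (e.g. vector instructions retired, in
# millions) against measured runtimes in seconds.
counters = [10, 20, 30, 40, 50]
runtimes = [1.1, 1.9, 3.2, 3.9, 5.1]
intercept, slope = ols(counters, runtimes)
print(f"runtime ≈ {intercept:.2f} + {slope:.3f} × counter")
```

Counters with the largest fitted coefficients flag the architectural resources that most affect a benchmark, which is how the paper identifies critical resources before the cross-platform comparison.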
DOI: 10.1109/ISPASS.2015.7095797
Citations: 1
A study of mobile device utilization
Cao Gao, Anthony Gutierrez, M. Rajan, R. Dreslinski, T. Mudge, Carole-Jean Wu
Mobile devices are becoming more powerful and versatile than ever, calling for better embedded processors. Following the trend in desktop CPUs, microprocessor vendors are trying to meet such needs by increasing the number of cores in mobile device SoCs. However, increasing the core count does not translate proportionally into performance gains and power reduction. In the past, studies have shown that there exists little parallelism to be exploited by a multi-core processor in desktop platform applications, and many cores sit idle during runtime. In this paper, we investigate whether the same is true for current mobile applications. We analyze the behavior of a broad range of commonly used mobile applications on real devices. We measure their Thread Level Parallelism (TLP), which is the machine utilization over the non-idle runtime. Our results demonstrate that mobile applications utilize fewer than 2 cores on average, even with background applications running concurrently. We observe a diminishing return on TLP with increasing numbers of cores, and low TLP even in heavy-load scenarios. These studies suggest that having many powerful cores is over-provisioning. Further analysis of TLP behavior and big-little core energy efficiency suggests that current mobile workloads can benefit from an architecture that has the flexibility to accommodate both high performance and good energy-efficiency for different application phases.
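TLP as defined here, machine utilization over the non-idle runtime, reduces to a short computation over per-sample active-core counts. The utilization trace below is hypothetical.

```python
# Thread Level Parallelism: average number of active cores over the
# non-idle runtime. samples[i] = cores active during sample i; fully idle
# samples (0 active cores) are excluded from the average.
def tlp(samples):
    busy = [s for s in samples if s > 0]
    return sum(busy) / len(busy)

# Hypothetical utilization trace for a 4-core SoC: mostly 1-2 cores active.
trace = [0, 1, 1, 2, 1, 0, 2, 1, 1, 3]
print(f"TLP = {tlp(trace):.2f}")
```

A TLP well below the core count, as the paper reports for real mobile workloads, is the signature of over-provisioned parallel hardware.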
DOI: 10.1109/ISPASS.2015.7095808
Citations: 60
Can RDMA benefit online data processing workloads on memcached and MySQL?
D. Shankar, Xiaoyi Lu, Jithin Jose, Md. Wasi-ur-Rahman, Nusrat S. Islam, D. Panda
At the onset of the widespread usage of social networking services in the Web 2.0/3.0 era, leveraging a distributed and scalable caching layer like Memcached is often invaluable to application server performance. Since a majority of existing clusters today are equipped with modern high-speed interconnects such as InfiniBand, which offer high-bandwidth and low-latency communication, there is potential to improve the response time and throughput of application servers by taking advantage of advanced features like RDMA. We explore the potential of employing RDMA to improve the performance of Online Data Processing (OLDP) workloads on MySQL using Memcached for real-world web applications.
DOI: 10.1109/ISPASS.2015.7095796
Citations: 6
Analyzing communication models for distributed thread-collaborative processors in terms of energy and time
Benjamin Klenk, Lena Oden, H. Fröning
Accelerated computing has become pervasive for increasing the computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with the highest demands, for instance high performance computing, data warehousing, and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that are operating with hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from general-purpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction.
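The data-movement argument can be made concrete with a toy energy model for a transfer: energy is static power times transfer time plus a per-byte cost, so a communication model that moves the payload twice (e.g. staging through host memory) roughly doubles both terms. All constants below are invented for illustration, not measurements from the paper.

```python
# Toy energy model for a data transfer (illustrative constants):
# energy = static power × transfer time + per-byte transfer energy.
def transfer_energy(bytes_moved, bandwidth_gbs, static_power_w, nj_per_byte):
    t = bytes_moved / (bandwidth_gbs * 1e9)  # transfer time in seconds
    return static_power_w * t + nj_per_byte * 1e-9 * bytes_moved  # joules

payload = 64 * 2**20  # 64 MiB

# Staging through host memory moves the payload twice; a direct
# device-to-device path moves it once.
staged = 2 * transfer_energy(payload, 6.0, 50.0, 0.5)
direct = transfer_energy(payload, 6.0, 50.0, 0.5)
print(f"staged {staged:.3f} J vs direct {direct:.3f} J")
```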
{"title":"Analyzing communication models for distributed thread-collaborative processors in terms of energy and time","authors":"Benjamin Klenk, Lena Oden, H. Fröning","doi":"10.1109/ISPASS.2015.7095817","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095817","url":null,"abstract":"Accelerated computing has become pervasive for increasing the computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with highest demands, for instance high performance computing, data warehousing and high performance analytics, accelerators like GPUs or Intel's MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement will be the main contributor to energy consumption, we are entering an era of communication-centric heterogeneous systems that are operating with hard power constraints. In this work, we analyze data movement optimizations for distributed heterogeneous systems based on CPUs and GPUs. Thread-collaborative processors like GPUs differ significantly in their execution model from generalpurpose processors like CPUs, but available communication models are still designed and optimized for CPUs. Similar to heterogeneity in processing, heterogeneity in communication can have a huge impact on energy and time. To analyze this impact, we use multiple workloads with distinct properties regarding computational intensity and communication characteristics. We show for which workloads tailored communication models are essential, not only reducing execution time but also saving energy. Exposing the impact in terms of energy and time for communication-centric heterogeneous systems is crucial for future optimizations, and this work is a first step in this direction.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114776271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Hierarchical cycle accounting: a new method for application performance tuning
A. Nowak, D. Levinthal, W. Zwaenepoel
To address the growing difficulty of performance debugging on modern processors with increasingly complex micro-architectures, we present Hierarchical Cycle Accounting (HCA), a structured, hierarchical, architecture-agnostic methodology for the identification of performance issues in workloads running on these modern processors. HCA reports to the user the cost of a number of execution components, such as load latency, memory bandwidth, instruction starvation, and branch misprediction. A critical novel feature of HCA is that all cost components are presented in the same unit, core pipeline cycles. Their relative importance can therefore be compared directly. These cost components are furthermore presented in a hierarchical fashion, with architecture-agnostic components at the top levels of the hierarchy and architecture-specific components at the bottom. This hierarchical structure is useful in guiding the performance debugging effort to the places where it can be the most effective. For a given architecture, the cost components are computed based on the observation of architecture-specific events, typically provided by a performance monitoring unit (PMU), and using a set of formulas to attribute a certain cost in cycles to each event. The selection of what PMU events to use, their validation, and the derivation of the formulas are done offline by an architecture expert, thereby freeing the non-expert from the burdensome and error-prone task of directly interpreting PMU data. We have implemented the HCA methodology in Gooda, a publicly available tool. We describe the application of Gooda to the analysis of several workloads in wide use, showing how HCA's features facilitated performance debugging for these applications. We also describe the discovery of relevant bugs in Intel hardware and the Linux Kernel as a result of using HCA.
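The core mechanism — converting architecture-specific PMU event counts into cycle costs via per-event formulas and rolling them up into an architecture-agnostic hierarchy — can be sketched as follows. The event names, cycle costs, and hierarchy below are invented for illustration; they are not the paper's actual formulas:

```python
# Minimal sketch of the idea behind HCA (all numbers and names hypothetical):
# architecture-specific PMU event counts are converted into cycle costs and
# rolled up into architecture-agnostic buckets, so every component is
# expressed in the same unit -- core pipeline cycles -- and directly comparable.

PMU_COUNTS = {                # raw counts read from a hypothetical PMU
    "llc_miss": 1_000,
    "branch_mispredict": 2_000,
    "icache_miss": 500,
}

CYCLES_PER_EVENT = {          # per-event cost formulas, reduced to constants
    "llc_miss": 200,
    "branch_mispredict": 15,
    "icache_miss": 30,
}

HIERARCHY = {                 # arch-agnostic parents over arch-specific leaves
    "memory_stalls": ["llc_miss"],
    "frontend_stalls": ["branch_mispredict", "icache_miss"],
}

def account(counts, costs, hierarchy):
    """Roll architecture-specific event costs up into cycle buckets."""
    return {
        parent: sum(counts[e] * costs[e] for e in events)
        for parent, events in hierarchy.items()
    }

buckets = account(PMU_COUNTS, CYCLES_PER_EVENT, HIERARCHY)
print(buckets)  # → {'memory_stalls': 200000, 'frontend_stalls': 45000}
```

In HCA the event selection, validation, and formulas are prepared offline by an architecture expert; the user only ever sees the cycle buckets at the top of the hierarchy.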
{"title":"Hierarchical cycle accounting: a new method for application performance tuning","authors":"A. Nowak, D. Levinthal, W. Zwaenepoel","doi":"10.1109/ISPASS.2015.7095790","DOIUrl":"https://doi.org/10.1109/ISPASS.2015.7095790","url":null,"abstract":"To address the growing difficulty of performance debugging on modern processors with increasingly complex micro-architectures, we present Hierarchical Cycle Accounting (HCA), a structured, hierarchical, architecture-agnostic methodology for the identification of performance issues in workloads running on these modern processors. HCA reports to the user the cost of a number of execution components, such as load latency, memory bandwidth, instruction starvation, and branch misprediction. A critical novel feature of HCA is that all cost components are presented in the same unit, core pipeline cycles. Their relative importance can therefore be compared directly. These cost components are furthermore presented in a hierarchical fashion, with architecture-agnostic components at the top levels of the hierarchy and architecture-specific components at the bottom. This hierarchical structure is useful in guiding the performance debugging effort to the places where it can be the most effective. For a given architecture, the cost components are computed based on the observation of architecture-specific events, typically provided by a performance monitoring unit (PMU), and using a set of formulas to attribute a certain cost in cycles to each event. The selection of what PMU events to use, their validation, and the derivation of the formulas are done offline by an architecture expert, thereby freeing the non-expert from the burdensome and error-prone task of directly interpreting PMU data. We have implemented the HCA methodology in Gooda, a publicly available tool. We describe the application of Gooda to the analysis of several workloads in wide use, showing how HCA's features facilitated performance debugging for these applications. We also describe the discovery of relevant bugs in Intel hardware and the Linux Kernel as a result of using HCA.","PeriodicalId":189378,"journal":{"name":"2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128912372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Journal
2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)