ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software: Latest Publications

Trace-based Performance Analysis on Cell BE

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510753

M. Biberstein, U. Shvadron, Javier Turek, Bilha Mendelson, Moon-Seok Chang

The transition to multicore architectures creates significant challenges for programming systems. Taking advantage of specialized processing cores such as those in the Cell BE processor and managing all the required data movement inside the processor cannot be done efficiently without help from the software infrastructure. Alongside new programming models and compiler support for multicores, programmers need performance evaluation and analysis tools. In this paper, we present tools that help analyze the performance of applications executing on the Cell platform. The performance debugging tool (PDT) provides a means for recording significant events during program execution, maintaining the sequential order of events, and preserving important runtime information such as core assignment and relative timing of events. The trace analyzer (TA) reads and visualizes the PDT traces. We describe the architecture of the PDT and present several important use cases demonstrating the usage of PDT and TA to understand the performance of several workloads. We also discuss the overhead of tracing and its impact on the benchmark execution and performance analysis.

Citations: 11

Explaining the Impact of Network Transport Protocols on SIP Proxy Performance

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510740

K. Ram, Ian C. Fedeli, A. Cox, S. Rixner

This paper characterizes the impact that the use of UDP versus TCP has on the performance and scalability of the OpenSER SIP proxy server. The session initiation protocol (SIP) is an application-layer signaling protocol that is widely used for establishing voice-over-IP (VoIP) phone calls. SIP can utilize a variety of transport protocols, including UDP and TCP. Despite the advantages of TCP, such as reliable delivery and congestion control, the common practice is to use UDP. This is a result of the belief that UDP's lower processor and network overhead results in improved performance and scalability of SIP services. This paper argues against this conventional wisdom. This paper shows that the principal reasons for OpenSER's poor performance using TCP lie in the server's design, and not in the low-level performance of UDP versus TCP. Specifically, OpenSER's architecture for handling concurrent calls is responsible for most of the difference. Moreover, once these issues are addressed, OpenSER's performance using TCP is much more competitive with its performance using UDP.

Citations: 35

An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510735

Ayose Falcón, P. Faraboschi, Daniel Ortega

Computer clusters are a very cost-effective approach for high performance computing, but simulating a complete cluster is still an open research problem. The obvious approach - to parallelize individual node simulators - is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up to two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application) we show an acceleration factor of 26x over the deterministic "ground truth" simulation, at less than a 1% accuracy error.
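
The idea of adjusting synchronization boundaries at run time can be illustrated with a small sketch. The abstract does not give the actual policy, so everything here is an assumption made for illustration: the quantum is taken to grow while no inter-node packets are observed and to shrink as soon as traffic appears, and the names `next_quantum` and `simulate`, the growth and shrink factors, and the bounds are all hypothetical.

```python
def next_quantum(quantum_us, packets_seen, min_us=1.0, max_us=1000.0,
                 grow=2.0, shrink=0.1):
    """Return the next synchronization quantum in microseconds.

    Hypothetical policy (not taken from the paper): relax synchronization
    while the nodes are not communicating, tighten it when they are.
    """
    if packets_seen > 0:
        candidate = quantum_us * shrink   # traffic seen: synchronize more often
    else:
        candidate = quantum_us * grow     # quiet phase: synchronize less often
    # Keep the quantum inside the configured bounds.
    return max(min_us, min(max_us, candidate))


def simulate(traffic, start_us=1.0):
    """Apply the policy to per-quantum packet counts; return the quanta used."""
    quanta = []
    q = start_us
    for packets in traffic:
        quanta.append(q)
        q = next_quantum(q, packets)
    return quanta


if __name__ == "__main__":
    # Three quiet quanta, one with traffic, then quiet again.
    print(simulate([0, 0, 0, 5, 0]))
```

A larger quantum means fewer barrier synchronizations between node simulators, which is where the speedup would come from; the cost is that a packet may be delivered later in simulated time than it would be under a strict, fine-grained quantum.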

Citations: 31

Dynamic Thermal Management through Task Scheduling

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510751

Jun Yang, Xiuyi Zhou, M. Chrobak, Youtao Zhang, Lingling Jin

The evolution of microprocessors has been hindered by their increasing power consumption and the heat generation speed on-die. High temperature impairs the processor's reliability and reduces its lifetime. While hardware level dynamic thermal management (DTM) techniques, such as voltage and frequency scaling, can effectively lower the chip temperature when it surpasses the thermal threshold, they inevitably come at the cost of performance degradation. We propose an OS level technique that performs thermal-aware job scheduling to reduce the number of thermal trespasses. Our scheduler reduces the amount of hardware DTMs and achieves higher performance while keeping the temperature low. Our methods leverage the natural discrepancies in thermal behavior among different workloads, and schedule them to keep the chip temperature below a given budget. We develop a heuristic algorithm based on the observation that there is a difference in the resulting temperature when a hot and a cool job are executed in a different order. To evaluate our scheduling algorithms, we developed a lightweight runtime temperature monitor to enable informed scheduling decisions. We have implemented our scheduling algorithm and the entire temperature monitoring framework in the Linux kernel. Our proposed scheduler can remove 10.5-73.6% of the hardware DTMs in various combinations of workloads in a medium thermal environment. As a result, the CPU throughput was improved by up to 7.6% (4.1% on average) even under a severe thermal environment.
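
The observation that job order changes the resulting temperature can be reproduced with a toy model. The sketch below assumes a first-order thermal model in which, during one time slice, the chip temperature moves a fixed fraction of the way toward the steady-state temperature of the running job. The model, the `decay` factor, and the function names are illustrative assumptions, not the thermal model or scheduler used in the paper.

```python
def step(temp, steady_temp, decay=0.5):
    """Temperature after one time slice of a job.

    steady_temp is the temperature the chip would settle at if the job ran
    forever; decay is the fraction of the gap that remains after one slice.
    """
    return steady_temp + (temp - steady_temp) * decay


def run_sequence(start_temp, steady_temps, decay=0.5):
    """Run jobs in the given order; return (peak, final) temperature."""
    temp = start_temp
    peak = start_temp
    for steady in steady_temps:
        temp = step(temp, steady, decay)
        peak = max(peak, temp)
    return peak, temp


if __name__ == "__main__":
    HOT, COOL = 100.0, 40.0          # hypothetical steady-state temperatures
    print(run_sequence(60.0, [HOT, COOL]))   # hot job first
    print(run_sequence(60.0, [COOL, HOT]))   # cool job first
```

Under these assumed numbers, the two orders produce different peak and final temperatures even though the same work is done, which is the kind of difference a thermal-aware scheduler can exploit to stay under a temperature budget.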

Citations: 138

Configurational Workload Characterization

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510747

H. H. Najaf-abadi, E. Rotenberg

Although the best processor design for executing a specific workload does depend on the characteristics of the workload, it cannot be determined without factoring in the effect of the interdependencies between different architectural subcomponents. Consequently, workload characteristics alone do not provide accurate indication of which workloads can perform close-to-optimal on the same architectural configuration. The primary goal of this paper is to demonstrate that, in the design of a heterogeneous CMP, reducing the set of essential benchmarks based on relative similarity in raw workload behavior may direct the design process towards options that result in sub-optimality of the ultimate design. It is shown that the design parameters of the customized processor configurations, what we refer to as the configurational characteristics, can yield a more accurate indication of the best way to partition the workload space for the cores of a heterogeneous system to be customized to. In order to automate the extraction of the configurational characteristics of workloads, a design exploration tool based on the SimpleScalar timing simulator and the CACTI modeling tool is presented. Results from this tool are used to display how a systematic methodology can be employed to determine the optimal set of core configurations for a heterogeneous CMP under different design objectives. In addition, it is shown that reducing the set of workloads based on even a single widely documented benchmark similarity (between bzip and gzip) can lead to a slowdown in the overall performance of a heterogeneous-CMP design.

Citations: 12

Scientific Computing Applications on a Stream Processor

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510743

Y. Zhang, Xuejun Yang, Guibin Wang, Ian Rogers, Gen Li, Yu Deng, Xiaobo Yan

Stream processors, developed for the stream programming model, perform well on media applications. In this paper we examine the applicability of a stream processor to scientific computing applications. Eight scientific applications, each having different performance characteristics, are mapped to a stream processor. Due to the novelty of the stream programming model, we show how to map programs in a traditional language, such as FORTRAN. In a stream processor system, the management of system resources is the programmers' responsibility. We present several optimizations, which enable mapped programs to exploit various aspects of the stream processor architecture. Finally, we analyze the performance of the stream processor and the presented optimizations on a set of scientific computing applications. The stream programs are from 1.67 to 32.5 times faster than the corresponding FORTRAN programs on an Itanium 2 processor, with the optimizations playing an important role in realizing the performance improvement.

Citations: 7

Computer Aided Engineering of Cluster Computers

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510737

W. Dieter, H. Dietz

There are many scientific and engineering applications that require the resources of a dedicated supercomputer: drug design, weather prediction, simulating vehicle crashes, fluid dynamics simulations of aircraft or even consumer products. Cluster supercomputers can leverage commodity parts with standard interfaces that allow them to be used interchangeably to build supercomputers customized for these and other applications. However, the best design for one application is not necessarily the best design for other applications. Supercomputer design is challenging, but this problem is harder due to the huge range of possible configurations, volatile component availability and pricing, and constraints on available power, cooling, and floor space. Cluster design rules (CDR) is a computer-aided engineering tool that uses resource constraints and application performance models to identify the few best designs among the trillions of designs that could be constructed using parts from a given database. It uses a branch-and-bound strategy based on cluster design principles that can eliminate many inferior designs from the search without evaluating them. For the millions of designs that remain, CDR measures fitness by one of several user-specified application performance models. New application performance models can be added by means of a programming interface. This paper details the concepts and mechanisms inside CDR and shows how it facilitates model-based engineering of custom clusters.
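
A generic branch-and-bound search over a parts database can be sketched as follows. This is not CDR's algorithm or its pruning rules, which the abstract does not spell out; it is a minimal illustration that assumes one part is chosen per category, that cost is additive and limited by a budget, and that fitness is a simple sum of per-part scores. The part names, costs, and scores are invented.

```python
def best_design(categories, budget):
    """Pick one part per category to maximize total score within the budget.

    categories: list of lists of (name, cost, score) tuples.
    Returns (score, [names]); score is -inf and names is None if no design fits.
    """
    # Optimistic bound: the best score still obtainable from category i onward.
    max_rest = [0] * (len(categories) + 1)
    for i in range(len(categories) - 1, -1, -1):
        max_rest[i] = max_rest[i + 1] + max(s for _, _, s in categories[i])

    best = [float("-inf"), None]

    def search(i, cost, score, chosen):
        if cost > budget:
            return                      # infeasible: over budget
        if score + max_rest[i] <= best[0]:
            return                      # bound: cannot beat the incumbent
        if i == len(categories):
            best[0], best[1] = score, list(chosen)
            return
        # Try high-scoring parts first so a good incumbent is found early.
        for name, part_cost, part_score in sorted(categories[i],
                                                  key=lambda p: -p[2]):
            chosen.append(name)
            search(i + 1, cost + part_cost, score + part_score, chosen)
            chosen.pop()

    search(0, 0, 0, [])
    return best[0], best[1]


if __name__ == "__main__":
    parts = [
        [("cpu_a", 100, 10), ("cpu_b", 200, 18)],
        [("net_a", 50, 5), ("net_b", 150, 9)],
    ]
    print(best_design(parts, budget=300))
```

In this small example the whole `cpu_a` subtree is discarded without evaluating any complete design, because even its optimistic bound cannot beat the design already found. The same mechanism is what lets a search skip large numbers of inferior designs.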

Citations: 0

Characterization of SPEC CPU2006 and SPEC OMP2001: Regression Models and their Transferability

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510750

ElMoustapha Ould-Ahmed-Vall, K. Doshi, Charles R. Yount, J. Woodlee

Analysis of workload execution and identification of software and hardware performance barriers provide critical engineering benefits; these include guidance on software optimization, hardware design tradeoffs, configuration tuning, and comparative assessments for platform selection. This paper uses model trees to build statistical regression models for the SPEC CPU2006 and the SPEC OMP2001 suites. These models link performance to key microarchitectural events. The models provide detailed recipes for identifying the key performance factors for each suite and for determining the contribution of each factor to performance. The paper discusses how the models can be used to understand the behaviors of the two suites on a modern processor. These models are applied to obtain a detailed performance characterization of each benchmark suite and its member workloads and to identify the commonalities and distinctions among the performance factors that affect each of the member workloads within the two suites. This paper also addresses the issue of model transferability. It explores the question: How useful is an existing performance model (built on a given suite of workloads) to study the performance of different workloads or suites of workloads? A performance model built using data from workload suite P is considered transferable to workload suite Q if it can be used to accurately study the performance of workload suite Q. Statistical methodologies to assess model transferability are discussed. In particular, the paper explores the use of two-sample hypothesis tests and prediction accuracy analysis techniques to assess model transferability. It is found that a model trained using only 10% of the SPEC CPU2006 data is transferable to the remaining data. This finding holds also for SPEC OMP2001. In contrast, it is found that the SPEC CPU2006 model is not transferable to SPEC OMP2001 and vice versa.
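
One way to picture a two-sample test for transferability is to compare a model's prediction errors on held-out data from suite P with its errors on suite Q. The abstract does not say which test the authors use, so the sketch below assumes Welch's t statistic purely for illustration. The threshold in `looks_transferable` is a hypothetical rule of thumb, not a value from the paper, and the error samples are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's two-sample t statistic (samples may have unequal variances)."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(var_a / n_a + var_b / n_b)


def looks_transferable(errors_p, errors_q, threshold=2.0):
    """Hypothetical decision rule: errors on Q resemble errors on P."""
    return abs(welch_t(errors_p, errors_q)) < threshold


if __name__ == "__main__":
    errors_on_p = [1.0, 2.0, 3.0]    # invented prediction errors on suite P
    errors_on_q = [4.0, 5.0, 6.0]    # invented prediction errors on suite Q
    print(welch_t(errors_on_p, errors_on_q))
    print(looks_transferable(errors_on_p, errors_on_q))
```

A rigorous test would convert the statistic to a p-value using the Welch-Satterthwaite degrees of freedom. The fixed threshold here only keeps the sketch dependency-free.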

Citations: 17

Full-System Critical Path Analysis

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510739

A. Saidi, N. Binkert, S. Reinhardt, T. Mudge

Many interesting workloads today are limited not by CPU processing power but by the interactions between the CPU, memory system, I/O devices, and the complex software that ties all the components together. Optimizing these workloads requires identifying performance bottlenecks across concurrent hardware components and across multiple layers of software. Common software profiling techniques cannot account for hardware bottlenecks or situations where software overheads are hidden due to overlap with hardware operations. Critical-path analysis is a powerful approach for identifying bottlenecks in highly concurrent systems, but typically requires detailed domain knowledge to construct the required event dependence graphs. As a result, to date it has been applied only to isolated system layers (e.g., processor microarchitectures or message-passing applications). In this paper we present a novel technique for applying critical-path analysis to complex systems composed of numerous interacting state machines. We avoid tedious up-front modeling by using control-flow tracing to expose implicit software state machines automatically, and iterative refinement to add necessary manual annotations with minimal effort. By applying our technique within a full-system simulator, we achieve an integrated trace of hardware and software events with minimal perturbation. As a result, we can perform this analysis across the user/kernel and hardware/software boundaries and even across multiple systems. We apply this technique to analyzing network performance, and show that we are able to find performance bottlenecks in both hardware and software, including some surprising bottlenecks in the Linux 2.6.13 kernel.
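
Once an event dependence graph is available, the critical path is the longest chain of dependent events through it. The sketch below shows that textbook computation on a small directed acyclic graph. The event names and durations are invented, and the code does not reflect how the authors build the graph from control-flow traces.

```python
from graphlib import TopologicalSorter


def critical_path(durations, deps):
    """Return (length, path) of the longest dependency chain.

    durations: {event: time}; deps: {event: [events it must wait for]}.
    """
    graph = {event: deps.get(event, []) for event in durations}
    finish = {}   # earliest finish time of each event
    prev = {}     # predecessor on the longest chain into each event
    # static_order() yields every event after all of its predecessors.
    for event in TopologicalSorter(graph).static_order():
        last = max(graph[event], key=lambda p: finish[p], default=None)
        start = finish[last] if last is not None else 0
        finish[event] = start + durations[event]
        prev[event] = last
    # Walk back from the event that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]


if __name__ == "__main__":
    durations = {"nic_rx": 2, "dma": 5, "irq": 1,
                 "tcp_stack": 4, "copy_to_user": 3}
    deps = {"dma": ["nic_rx"], "irq": ["nic_rx"],
            "tcp_stack": ["dma", "irq"], "copy_to_user": ["tcp_stack"]}
    print(critical_path(durations, deps))
```

Events that are off the critical path, such as the hypothetical `irq` step here, overlap with longer operations. Speeding them up would not shorten the end-to-end time, which is why a flat profile of time per component can mislead.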



Citations: 21

Quick Performance Models Quickly: Closely-Coupled Partitioned Simulation on FPGAs

ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software

Pub Date : 2008-04-20 DOI: 10.1109/ISPASS.2008.4510733

Michael Pellauer, M. Vijayaraghavan, Michael Adler, Arvind, J. Emer

In this paper we explore microprocessor performance models implemented on FPGAs. While FPGAs can help with simulation speed, the increased implementation complexity can degrade model development time. We assess whether a simulator split into closely-coupled timing and functional partitions can address this by easing the development of timing models while retaining fine-grained parallelism. We give the semantics of our simulator partitioning, and discuss the architecture of its implementation on an FPGA. We describe how three timing models of vastly different target processors can use the same functional partition, and assess their performance.


Citations: 42