Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510753
M. Biberstein, U. Shvadron, Javier Turek, Bilha Mendelson, Moon-Seok Chang
The transition to multicore architectures creates significant challenges for programming systems. Taking advantage of specialized processing cores such as those in the Cell BE processor and managing all the required data movement inside the processor cannot be done efficiently without help from the software infrastructure. Alongside new programming models and compiler support for multicores, programmers need performance evaluation and analysis tools. In this paper, we present tools that help analyze the performance of applications executing on the Cell platform. The performance debugging tool (PDT) provides a means for recording significant events during program execution, maintaining the sequential order of events, and preserving important runtime information such as core assignment and relative timing of events. The trace analyzer (TA) reads and visualizes the PDT traces. We describe the architecture of the PDT and present several important use cases demonstrating the usage of PDT and TA to understand the performance of several workloads. We also discuss the overhead of tracing and its impact on the benchmark execution and performance analysis.
{"title":"Trace-based Performance Analysis on Cell BE","authors":"M. Biberstein, U. Shvadron, Javier Turek, Bilha Mendelson, Moon-Seok Chang","doi":"10.1109/ISPASS.2008.4510753","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510753","url":null,"abstract":"The transition to multicore architectures creates significant challenges for programming systems. Taking advantage of specialized processing cores such as those in the Cell BE processor and managing all the required data movement inside the processor cannot be done efficiently without help from the software infrastructure. Alongside new programming models and compiler support for multicores, programmers need performance evaluation and analysis tools. In this paper, we present tools that help analyze the performance of applications executing on the Cell platform. The performance debugging tool (PDT) provides a means for recording significant events during program execution, maintaining the sequential order of events, and preserving important runtime information such as core assignment and relative timing of events. The trace analyzer (TA) reads and visualizes the PDT traces. We describe the architecture of the PDT and present several important use cases demonstrating the usage of PDT and TA to understand the performance of several workloads. We also discuss the overhead of tracing and its impact on the benchmark execution and performance analysis.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125219025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510740
K. Ram, Ian C. Fedeli, A. Cox, S. Rixner
This paper characterizes the impact that the use of UDP versus TCP has on the performance and scalability of the OpenSER SIP proxy server. The session initiation protocol (SIP) is an application-layer signaling protocol that is widely used for establishing voice-over-IP (VoIP) phone calls. SIP can utilize a variety of transport protocols, including UDP and TCP. Despite the advantages of TCP, such as reliable delivery and congestion control, the common practice is to use UDP. This is a result of the belief that UDP's lower processor and network overhead results in improved performance and scalability of SIP services. This paper argues against this conventional wisdom. This paper shows that the principal reasons for OpenSER's poor performance using TCP are caused by the server's design, and not the low-level performance of UDP versus TCP. Specifically, OpenSER's architecture for handling concurrent calls is responsible for most of the difference. Moreover, once these issues are addressed, OpenSER's performance using TCP is much more competitive with its performance using UDP.
{"title":"Explaining the Impact of Network Transport Protocols on SIP Proxy Performance","authors":"K. Ram, Ian C. Fedeli, A. Cox, S. Rixner","doi":"10.1109/ISPASS.2008.4510740","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510740","url":null,"abstract":"This paper characterizes the impact that the use of UDP versus TCP has on the performance and scalability of the OpenSER SIP proxy server. The session initiation protocol (SIP) is an application-layer signaling protocol that is widely used for establishing voice-over-IP (VoIP) phone calls. SIP can utilize a variety of transport protocols, including UDP and TCP. Despite the advantages of TCP, such as reliable delivery and congestion control, the common practice is to use UDP. This is a result of the belief that UDP's lower processor and network overhead results in improved performance and scalability of SIP services. This paper argues against this conventional wisdom. This paper shows that the principal reasons for OpenSER's poor performance using TCP are caused by the server's design, and not the low-level performance of UDP versus TCP. Specifically, OpenSER's architecture for handling concurrent calls is responsible for most of the difference. Moreover, once these issues are addressed, OpenSER's performance using TCP is much more competitive with its performance using UDP.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129786337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510735
Ayose Falcón, P. Faraboschi, Daniel Ortega
Computer clusters are a very cost-effective approach for high performance computing, but simulating a complete cluster is still an open research problem. The obvious approach - to parallelize individual node simulators - is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application) we show an acceleration factor of 26x over the deterministic "ground truth" simulation, at less than a 1% accuracy error.
{"title":"An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters","authors":"Ayose Falcón, P. Faraboschi, Daniel Ortega","doi":"10.1109/ISPASS.2008.4510735","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510735","url":null,"abstract":"Computer clusters are a very cost-effective approach for high performance computing, but simulating a complete cluster is still an open research problem. The obvious approach - to parallelize individual node simulators - is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application) we show an acceleration factor of 26x over the deterministic \"ground truth\" simulation, at less than a 1% accuracy error.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128291699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510751
Jun Yang, Xiuyi Zhou, M. Chrobak, Youtao Zhang, Lingling Jin
The evolution of microprocessors has been hindered by their increasing power consumption and the heat generation speed on-die. High temperature impairs the processor's reliability and reduces its lifetime. While hardware level dynamic thermal management (DTM) techniques, such as voltage and frequency scaling, can effectively lower the chip temperature when it surpasses the thermal threshold, they inevitably come at the cost of performance degradation. We propose an OS level technique that performs thermal- aware job scheduling to reduce the number of thermal trespasses. Our scheduler reduces the amount of hardware DTMs and achieves higher performance while keeping the temperature low. Our methods leverage the natural discrepancies in thermal behavior among different workloads, and schedule them to keep the chip temperature below a given budget. We develop a heuristic algorithm based on the observation that there is a difference in the resulting temperature when a hot and a cool job are executed in a different order. To evaluate our scheduling algorithms, we developed a lightweight runtime temperature monitor to enable informed scheduling decisions. We have implemented our scheduling algorithm and the entire temperature monitoring framework in the Linux kernel. Our proposed scheduler can remove 10.5-73.6% of the hardware DTMs in various combinations of workloads in a medium thermal environment. As a result, the CPU throughput was improved by up to 7.6% (4.1% on average) even under a severe thermal environment.
{"title":"Dynamic Thermal Management through Task Scheduling","authors":"Jun Yang, Xiuyi Zhou, M. Chrobak, Youtao Zhang, Lingling Jin","doi":"10.1109/ISPASS.2008.4510751","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510751","url":null,"abstract":"The evolution of microprocessors has been hindered by their increasing power consumption and the heat generation speed on-die. High temperature impairs the processor's reliability and reduces its lifetime. While hardware level dynamic thermal management (DTM) techniques, such as voltage and frequency scaling, can effectively lower the chip temperature when it surpasses the thermal threshold, they inevitably come at the cost of performance degradation. We propose an OS level technique that performs thermal- aware job scheduling to reduce the number of thermal trespasses. Our scheduler reduces the amount of hardware DTMs and achieves higher performance while keeping the temperature low. Our methods leverage the natural discrepancies in thermal behavior among different workloads, and schedule them to keep the chip temperature below a given budget. We develop a heuristic algorithm based on the observation that there is a difference in the resulting temperature when a hot and a cool job are executed in a different order. To evaluate our scheduling algorithms, we developed a lightweight runtime temperature monitor to enable informed scheduling decisions. We have implemented our scheduling algorithm and the entire temperature monitoring framework in the Linux kernel. Our proposed scheduler can remove 10.5-73.6% of the hardware DTMs in various combinations of workloads in a medium thermal environment. As a result, the CPU throughput was improved by up to 7.6% (4.1% on average) even under a severe thermal environment.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120955716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510747
H. H. Najaf-abadi, E. Rotenberg
Although the best processor design for executing a specific workload does depend on the characteristics of the workload, it can not be determined without factoring-in the effect of the interdependencies between different architectural subcomponents. Consequently, workload characteristics alone do not provide accurate indication of which workloads can perform close-to-optimal on the same architectural configuration. The primary goal of this paper is to demonstrate that, in the design of a heterogeneous CMP, reducing the set of essential benchmarks based on relative similarity in raw workload behavior may direct the design process towards options that result in sub-optimality of the ultimate design. It is shown that the design parameters of the customized processor configurations, what we refer to as the configurational characteristics, can yield a more accurate indication of the best way to partition the workload space for the cores of a heterogeneous system to be customized to. In order to automate the extraction of the configurational- characteristics of workloads, a design exploration tool based on the Simplescalar timing simulator and the CACTI modeling tool is presented. Results from this tool are used to display how a systematic methodology can be employed to determine the optimal set of core configurations for a heterogeneous CMP under different design objectives. In addition, it is shown that reducing the set of workloads based on even a single widely documented benchmark similarity (between bzip and gzip) can lead to a slowdown in the overall performance of a heterogeneous-CMP design.
{"title":"Configurational Workload Characterization","authors":"H. H. Najaf-abadi, E. Rotenberg","doi":"10.1109/ISPASS.2008.4510747","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510747","url":null,"abstract":"Although the best processor design for executing a specific workload does depend on the characteristics of the workload, it can not be determined without factoring-in the effect of the interdependencies between different architectural subcomponents. Consequently, workload characteristics alone do not provide accurate indication of which workloads can perform close-to-optimal on the same architectural configuration. The primary goal of this paper is to demonstrate that, in the design of a heterogeneous CMP, reducing the set of essential benchmarks based on relative similarity in raw workload behavior may direct the design process towards options that result in sub-optimality of the ultimate design. It is shown that the design parameters of the customized processor configurations, what we refer to as the configurational characteristics, can yield a more accurate indication of the best way to partition the workload space for the cores of a heterogeneous system to be customized to. In order to automate the extraction of the configurational- characteristics of workloads, a design exploration tool based on the Simplescalar timing simulator and the CACTI modeling tool is presented. Results from this tool are used to display how a systematic methodology can be employed to determine the optimal set of core configurations for a heterogeneous CMP under different design objectives. In addition, it is shown that reducing the set of workloads based on even a single widely documented benchmark similarity (between bzip and gzip) can lead to a slowdown in the overall performance of a heterogeneous-CMP design.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129991269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510743
Y. Zhang, Xuejun Yang, Guibin Wang, Ian Rogers, Gen Li, Yu Deng, Xiaobo Yan
Stream processors, developed for the stream programming model, perform well on media applications. In this paper we examine the applicability of a stream processor to scientific computing applications. Eight scientific applications, each having different performance characteristics, are mapped to a stream processor. Due to the novelty of the stream programming model, we show how to map programs in a traditional language, such as FORTRAN. In a stream processor system, the management of system resources is the programmers' responsibility. We present several optimizations, which enable mapped programs to exploit various aspects of the stream processor architecture. Finally, we analyze the performance of the stream processor and the presented optimizations on a set of scientific computing applications. The stream programs are from 1.67 to 32.5 times faster than the corresponding FORTRAN programs on an Itanium 2 processor, with the optimizations playing an important role in realizing the performance improvement.
{"title":"Scientific Computing Applications on a Stream Processor","authors":"Y. Zhang, Xuejun Yang, Guibin Wang, Ian Rogers, Gen Li, Yu Deng, Xiaobo Yan","doi":"10.1109/ISPASS.2008.4510743","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510743","url":null,"abstract":"Stream processors, developed for the stream programming model, perform well on media applications. In this paper we examine the applicability of a stream processor to scientific computing applications. Eight scientific applications, each having different performance characteristics, are mapped to a stream processor. Due to the novelty of the stream programming model, we show how to map programs in a traditional language, such as FORTRAN. In a stream processor system, the management of system resources is the programmers' responsibility. We present several optimizations, which enable mapped programs to exploit various aspects of the stream processor architecture. Finally, we analyze the performance of the stream processor and the presented optimizations on a set of scientific computing applications. The stream programs are from 1.67 to 32.5 times faster than the corresponding FORTRAN programs on an Itanium 2 processor, with the optimizations playing an important role in realizing the performance improvement.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115675875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510737
W. Dieter, H. Dietz
There are many scientific and engineering applications that require the resources of a dedicated supercomputer: drug design, weather prediction, simulating vehicle crashes, fluid dynamics simulations of aircraft or even consumer products. Cluster supercomputers can leverage commodity parts with standard interfaces that allow them to be used interchangeably to build supercomputers customized for these and other applications. However, the best design for one application is not necessarily the best design for other applications. Supercomputer design is challenging, but this problem is harder due to the huge range of possible configurations, volatile component availability and pricing, and constraints on available power, cooling, and floor space. Cluster design rules (CDR) is a computer-aided engineering tool that uses resource constraints and application performance models to identify the few best designs among the trillions of designs that could be constructed using parts from a given database. It uses a branch-and-bound strategy based on cluster design principles that can eliminate many inferior designs from the search without evaluating them. For the millions of designs that remain, CDR measures fitness by one of several user-specified application performance models. New application performance models can be added by means of a programming interface. This paper details the concepts and mechanisms inside CDR and shows how it facilitates model-based engineering of custom clusters.
{"title":"Computer Aided Engineering of Cluster Computers","authors":"W. Dieter, H. Dietz","doi":"10.1109/ISPASS.2008.4510737","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510737","url":null,"abstract":"There are many scientific and engineering applications that require the resources of a dedicated supercomputer: drug design, weather prediction, simulating vehicle crashes, fluid dynamics simulations of aircraft or even consumer products. Cluster supercomputers can leverage commodity parts with standard interfaces that allow them to be used interchangeably to build supercomputers customized for these and other applications. However, the best design for one application is not necessarily the best design for other applications. Supercomputer design is challenging, but this problem is harder due to the huge range of possible configurations, volatile component availability and pricing, and constraints on available power, cooling, and floor space. Cluster design rules (CDR) is a computer-aided engineering tool that uses resource constraints and application performance models to identify the few best designs among the trillions of designs that could be constructed using parts from a given database. It uses a branch-and-bound strategy based on cluster design principles that can eliminate many inferior designs from the search without evaluating them. For the millions of designs that remain, CDR measures fitness by one of several user-specified application performance models. New application performance models can be added by means of a programming interface. This paper details the concepts and mechanisms inside CDR and shows how it facilitates model-based engineering of custom clusters.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129001529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510750
ElMoustapha Ould-Ahmed-Vall, K. Doshi, Charles R. Yount, J. Woodlee
Analysis of workload execution and identification of software and hardware performance barriers provide critical engineering benefits; these include guidance on software optimization, hardware design tradeoffs, configuration tuning, and comparative assessments for platform selection. This paper uses Model trees to build statistical regression models for the SPEC1 CPU2006 and the SPEC OMP2001 suites. These models link performance to key microarchitectural events. The models provide detailed recipes for identifying the key performance factors for each suite and for determining the contribution of each factor to performance. The paper discusses how the models can be used to understand the behaviors of the two suites on a modern processor. These models are applied to obtain a detailed performance characterization of each benchmark suite and its member workloads and to identify the commonalities and distinctions among the performance factors that affect each of the member workloads within the two suites. This paper also addresses the issue of model transferability. It explores the question: How useful is an existing performance model (built on a given suite of workloads) to study the performance of different workloads or suites of workloads? A performance model built using data from workload suite P is considered transferable to workload suite Q if it can be used to accurately study the performance of workload suite Q. Statistical methodologies to assess model transferability are discussed. In particular, the paper explores the use of two-sample hypothesis tests and prediction accuracy analysis techniques to assess model transferability. It is found that a model trained using only 10% of the SPEC CPU2006 data is transferable to the remaining data. This finding holds also for SPEC OMP2001. In contrast, it is found that the SPEC CPU2006 model is not transferable to SPEC OMP2001 and vice versa.
{"title":"Characterization of SPEC CPU2006 and SPEC OMP2001: Regression Models and their Transferability","authors":"ElMoustapha Ould-Ahmed-Vall, K. Doshi, Charles R. Yount, J. Woodlee","doi":"10.1109/ISPASS.2008.4510750","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510750","url":null,"abstract":"Analysis of workload execution and identification of software and hardware performance barriers provide critical engineering benefits; these include guidance on software optimization, hardware design tradeoffs, configuration tuning, and comparative assessments for platform selection. This paper uses Model trees to build statistical regression models for the SPEC1 CPU2006 and the SPEC OMP2001 suites. These models link performance to key microarchitectural events. The models provide detailed recipes for identifying the key performance factors for each suite and for determining the contribution of each factor to performance. The paper discusses how the models can be used to understand the behaviors of the two suites on a modern processor. These models are applied to obtain a detailed performance characterization of each benchmark suite and its member workloads and to identify the commonalities and distinctions among the performance factors that affect each of the member workloads within the two suites. This paper also addresses the issue of model transferability. It explores the question: How useful is an existing performance model (built on a given suite of workloads) to study the performance of different workloads or suites of workloads? A performance model built using data from workload suite P is considered transferable to workload suite Q if it can be used to accurately study the performance of workload suite Q. Statistical methodologies to assess model transferability are discussed. In particular, the paper explores the use of two-sample hypothesis tests and prediction accuracy analysis techniques to assess model transferability. It is found that a model trained using only 10% of the SPEC CPU2006 data is transferable to the remaining data. This finding holds also for SPEC OMP2001. In contrast, it is found that the SPEC CPU2006 model is not transferable to SPEC OMP2001 and vice versa.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128240467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510739
A. Saidi, N. Binkert, S. Reinhardt, T. Mudge
Many interesting workloads today are limited not by CPU processing power but by the interactions between the CPU, memory system, I/O devices, and the complex software that ties all the components together. Optimizing these workloads requires identifying performance bottlenecks across concurrent hardware components and across multiple layers of software. Common software profiling techniques cannot account for hardware bottlenecks or situations where software overheads are hidden due to overlap with hardware operations. Critical-path analysis is a powerful approach for identifying bottlenecks in highly concurrent systems, but typically requires detailed domain knowledge to construct the required event dependence graphs. As a result, to date it has been applied only to isolated system layers (e.g., processor microarchitectures or message-passing applications). In this paper we present a novel technique for applying critical-path analysis to complex systems composed of numerous interacting state machines. We avoid tedious up-front modeling by using control-flow tracing to expose implicit software state machines automatically, and iterative refinement to add necessary manual annotations with minimal effort. By applying our technique within a full-system simulator, we achieve an integrated trace of hardware and software events with minimal perturbation. As a result, we can perform this analysis across the user/kernel and hardware/software boundaries and even across multiple systems. We apply this technique to analyzing network performance, and show that we are able to find performance bottlenecks in both hardware and software, including some surprising bottlenecks in the Linux 2.6.13 kernel.
{"title":"Full-System Critical Path Analysis","authors":"A. Saidi, N. Binkert, S. Reinhardt, T. Mudge","doi":"10.1109/ISPASS.2008.4510739","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510739","url":null,"abstract":"Many interesting workloads today are limited not by CPU processing power but by the interactions between the CPU, memory system, I/O devices, and the complex software that ties all the components together. Optimizing these workloads requires identifying performance bottlenecks across concurrent hardware components and across multiple layers of software. Common software profiling techniques cannot account for hardware bottlenecks or situations where software overheads are hidden due to overlap with hardware operations. Critical-path analysis is a powerful approach for identifying bottlenecks in highly concurrent systems, but typically requires detailed domain knowledge to construct the required event dependence graphs. As a result, to date it has been applied only to isolated system layers (e.g., processor microarchitectures or message-passing applications). In this paper we present a novel technique for applying critical-path analysis to complex systems composed of numerous interacting state machines. We avoid tedious up-front modeling by using control-flow tracing to expose implicit software state machines automatically, and iterative refinement to add necessary manual annotations with minimal effort. By applying our technique within a full-system simulator, we achieve an integrated trace of hardware and software events with minimal perturbation. As a result, we can perform this analysis across the user/kernel and hardware/software boundaries and even across multiple systems. We apply this technique to analyzing network performance, and show that we are able to find performance bottlenecks in both hardware and software, including some surprising bottlenecks in the Linux 2.6.13 kernel.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129482635","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2008-04-20DOI: 10.1109/ISPASS.2008.4510733
Michael Pellauer, M. Vijayaraghavan, Michael Adler, Arvind, J. Emer
In this paper we explore microprocessor performance models implemented on FPGAs. While FPGAs can help with simulation speed, the increased implementation complexity can degrade model development time. We assess whether a simulator split into closely-coupled timing and functional partitions can address this by easing the development of timing models while retaining fine-grained parallelism. We give the semantics of our simulator partitioning, and discuss the architecture of its implementation on an FPGA. We describe how three timing models of vastly different target processors can use the same functional partition, and assess their performance.
{"title":"Quick Performance Models Quickly: Closely-Coupled Partitioned Simulation on FPGAs","authors":"Michael Pellauer, M. Vijayaraghavan, Michael Adler, Arvind, J. Emer","doi":"10.1109/ISPASS.2008.4510733","DOIUrl":"https://doi.org/10.1109/ISPASS.2008.4510733","url":null,"abstract":"In this paper we explore microprocessor performance models implemented on FPGAs. While FPGAs can help with simulation speed, the increased implementation complexity can degrade model development time. We assess whether a simulator split into closely-coupled timing and functional partitions can address this by easing the development of timing models while retaining fine-grained parallelism. We give the semantics of our simulator partitioning, and discuss the architecture of its implementation on an FPGA. We describe how three timing models of vastly different target processors can use the same functional partition, and assess their performance.","PeriodicalId":137239,"journal":{"name":"ISPASS 2008 - IEEE International Symposium on Performance Analysis of Systems and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131110831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}