Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620799
Michael Van Biesbrouck, L. Eeckhout, B. Calder
Commercial processors have support for simultaneous multithreading (SMT), yet little work has been done to provide representative simulation results for SMT. Given a workload, current simulation techniques typically run one combination of those programs from a specific starting offset, or just run one combination of samples across the benchmarks. We have found that the architecture behavior and overall throughput seen can vary drastically based upon the starting points of the different benchmarks. Therefore, to completely evaluate the effect of an SMT architecture optimization on a workload, one would need to simulate many or all of the program combinations from different starting offsets. But exhaustively running all program combinations from many starting offsets is infeasible - even running single programs to completion is often infeasible with modern benchmarks. In this paper we propose an SMT simulation methodology that estimates the average performance over all possible starting points when running multiple programs concurrently on an SMT processor. This is based on our prior co-phase matrix phase analysis and simulation infrastructure. This approach samples all of the unique phase combinations for a set of benchmarks to be run together. Once these phase combinations are sampled, our approach uses these samples, along with a trace of the phase behavior for each program, to provide a CPI estimate of all starting points. This all starting point CPI estimate is precisely calculated in just minutes.
{"title":"Considering all starting points for simultaneous multithreading simulation","authors":"Michael Van Biesbrouck, L. Eeckhout, B. Calder","doi":"10.1109/ISPASS.2006.1620799","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620799","url":null,"abstract":"Commercial processors have support for simultaneous multithreading (SMT), yet little work has been done to provide representative simulation results for SMT. Given a workload, current simulation techniques typically run one combination of those programs from a specific starting offset, or just run one combination of samples across the benchmarks. We have found that the architecture behavior and overall throughput seen can vary drastically based upon the starting points of the different benchmarks. Therefore, to completely evaluate the effect of an SMT architecture optimization on a workload, one would need to simulate many or all of the program combinations from different starting offsets. But exhaustively running all program combinations from many starting offsets is infeasible - even running single programs to completion is often infeasible with modern benchmarks. In this paper we propose an SMT simulation methodology that estimates the average performance over all possible starting points when running multiple programs concurrently on an SMT processor. This is based on our prior co-phase matrix phase analysis and simulation infrastructure. This approach samples all of the unique phase combinations for a set of benchmarks to be run together. Once these phase combinations are sampled, our approach uses these samples, along with a trace of the phase behavior for each program, to provide a CPI estimate of all starting points. This all starting point CPI estimate is precisely calculated in just minutes.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124611952","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620798
Greg Hamerly, Erez Perelman, B. Calder
SimPoint is a technique used to pick what parts of the program's execution to simulate in order to have a complete picture of execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Together these samples represent an accurate picture of the complete execution of the program. SimPoint is based on the k-means clustering algorithm; recent work proposed using a different clustering method based on multinomial models, but only provided a preliminary comparison and analysis. In this work we provide a detailed comparison of using k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach in the areas of feature reduction and the picking of simulation points which allow multinomial clustering to perform as well as k-means. We then conclude by examining how to potentially combine multinomial clustering with k-means.
{"title":"Comparing multinomial and k-means clustering for SimPoint","authors":"Greg Hamerly, Erez Perelman, B. Calder","doi":"10.1109/ISPASS.2006.1620798","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620798","url":null,"abstract":"SimPoint is a technique used to pick what parts of the program's execution to simulate in order to have a complete picture of execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Together these samples represent an accurate picture of the complete execution of the program. SimPoint is based on the k-means clustering algorithm; recent work proposed using a different clustering method based on multinomial models, but only provided a preliminary comparison and analysis. In this work we provide a detailed comparison of using k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach in the areas of feature reduction and the picking of simulation points which allow multinomial clustering to perform as well as k-means. We then conclude by examining how to potentially combine multinomial clustering with k-means.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124830852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620806
D. Feitelson, Dan Tsafrir
The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current productions systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the "real" workload which we wish to use in the evaluation, and the other class contaminates the log with "bogus" data. We provide several examples of this situation, including a previously unrecognized type of anomaly we call "workload flurries": surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing the design decision for the anomalous workload at the expense of the normal workload. Thus we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is actually an unjustifiable approach.
{"title":"Workload sanitation for performance evaluation","authors":"D. Feitelson, Dan Tsafrir","doi":"10.1109/ISPASS.2006.1620806","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620806","url":null,"abstract":"The performance of computer systems depends, among other things, on the workload. Performance evaluations are therefore often done using logs of workloads on current productions systems, under the assumption that such real workloads are representative and reliable; likewise, workload modeling is typically based on real workloads. We show, however, that real workloads may also contain anomalies that make them non-representative and unreliable. This is a special case of multi-class workloads, where one class is the \"real\" workload which we wish to use in the evaluation, and the other class contaminates the log with \"bogus\" data. We provide several examples of this situation, including a previously unrecognized type of anomaly we call \"workload flurries\": surges of activity with a repetitive nature, caused by a single user, that dominate the workload for a relatively short period. Using a workload with such anomalies in effect emphasizes rare and unique events (e.g. occurring for a few days out of two years of logged data), and risks optimizing the design decision for the anomalous workload at the expense of the normal workload. Thus we claim that such anomalies should be removed from the workload before it is used in evaluations, and that ignoring them is actually an unjustifiable approach.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130820245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620791
A. Joshi, J. Yi, R. Bell, L. Eeckhout, L. John, D. Lilja
Recent research has proposed statistical simulation as a technique for fast performance evaluation of superscalar microprocessors. The idea in statistical simulation is to measure a program's key performance characteristics, generate a synthetic trace with these characteristics, and simulate the synthetic trace. Due to the probabilistic nature of statistical simulation the performance estimate quickly converges to a solution, making it an attractive technique to efficiently cull a large microprocessor design space. In this paper, we evaluate the efficacy of statistical simulation in exploring the design space. Specifically, we characterize the following aspects of statistical simulation: (i) fidelity of performance bottlenecks, with respect to cycle-accurate simulation of the program, (ii) ability' to track design changes, and (Hi) trade-off between accuracy and complexity in statistical simulation models. In our characterization experiments, we use the Plackett & Burman (P&B) design to systematically stress statistical simulation by creating different performance bottlenecks. The key results from this paper are: (1) Synthetic traces stress at least the same 10 most significant processor performance bottlenecks as the original workload, (2) Statistical simulation can effectively track design changes to identify feasible design points in a large design space of aggressive microarchitectures, (3) Our evaluation of 4 statistical simulation models shows that although a very detailed model is needed to achieve a good absolute accuracy in performance estimation, a simple model is sufficient to achieve good relative accuracy, and (4) The P&B design technique can be used to quickly identify areas to focus on to improve the accuracy of the statistical simulation model.
{"title":"Evaluating the efficacy of statistical simulation for design space exploration","authors":"A. Joshi, J. Yi, R. Bell, L. Eeckhout, L. John, D. Lilja","doi":"10.1109/ISPASS.2006.1620791","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620791","url":null,"abstract":"Recent research has proposed statistical simulation as a technique for fast performance evaluation of superscalar microprocessors. The idea in statistical simulation is to measure a program's key performance characteristics, generate a synthetic trace with these characteristics, and simulate the synthetic trace. Due to the probabilistic nature of statistical simulation the performance estimate quickly converges to a solution, making it an attractive technique to efficiently cull a large microprocessor design space. In this paper, we evaluate the efficacy of statistical simulation in exploring the design space. Specifically, we characterize the following aspects of statistical simulation: (i) fidelity of performance bottlenecks, with respect to cycle-accurate simulation of the program, (ii) ability' to track design changes, and (Hi) trade-off between accuracy and complexity in statistical simulation models. In our characterization experiments, we use the Plackett & Burman (P&B) design to systematically stress statistical simulation by creating different performance bottlenecks. The key results from this paper are: (1) Synthetic traces stress at least the same 10 most significant processor performance bottlenecks as the original workload, (2) Statistical simulation can effectively track design changes to identify feasible design points in a large design space of aggressive microarchitectures, (3) Our evaluation of 4 statistical simulation models shows that although a very detailed model is needed to achieve a good absolute accuracy in performance estimation, a simple model is sufficient to achieve good relative accuracy, and (4) The P&B design technique can be used to quickly identify areas to focus on to improve the accuracy of the statistical simulation model.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121051973","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}