
Latest publications from the 2006 IEEE International Symposium on Performance Analysis of Systems and Software

Aestimo: a feedback-directed optimization evaluation tool
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620809
Paul Berube, J. N. Amaral
Published studies that use feedback-directed optimization (FDO) techniques use either a single input for both training and performance evaluation, or one input for training and another for evaluation. Thus, an important question is whether the FDO results published in the literature are sensitive to the selection of training and testing inputs. Aestimo is a new evaluation tool that uses a workload of inputs to evaluate the sensitivity of specific code transformations to the choice of inputs in the training and testing phases. Aestimo uses optimization logs to isolate the effects of individual code transformations. It incorporates metrics to determine the effect of training input selection on individual compiler decisions. Besides describing the structure of Aestimo, this paper presents a case study that uses SPEC CINT2000 benchmark programs with the Open Research Compiler (ORC) to investigate the effect of training/testing input selection on inlining and if-conversion. The experimental results indicate that: (1) training input selection affects the compiler decisions made for these code transformations; (2) the choice of training/testing inputs can have a significant impact on measured performance.
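
A minimal sketch of the cross-input comparison this enables: the log format below (a map from transformation site to the decision the compiler made when trained on one input) and the agreement metric are assumptions made for illustration, not Aestimo's actual interface.

```python
def decision_agreement(logs):
    """Fraction of sites where every training input led to the same decision.

    logs: dict mapping training-input name -> {site_id: decision}.
    """
    all_sites = set().union(*(set(d) for d in logs.values()))
    stable = sum(
        1 for site in all_sites
        if len({d.get(site) for d in logs.values()}) == 1
    )
    return stable / len(all_sites)

# Two hypothetical training inputs disagree on one of three inlining sites.
logs = {
    "train_A": {"call@foo:12": "inline", "call@bar:40": "inline",    "call@baz:7": "no-inline"},
    "train_B": {"call@foo:12": "inline", "call@bar:40": "no-inline", "call@baz:7": "no-inline"},
}
print(decision_agreement(logs))  # ~0.67: training input selection changed a decision
```
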
Citations: 22
A statistical multiprocessor cache model
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620793
Erik Berg, Håkan Zeffer, Erik Hagersten
The introduction of general-purpose microprocessors running multiple threads will put the focus on methods and tools that help programmers write efficient parallel applications. Such a tool should be fast enough to meet a software developer's need for short turn-around time, but also accurate and flexible enough to provide trend-correct and intuitive feedback. This paper presents a novel sample-based method for analyzing the data locality of a multithreaded application. Very sparse data is collected during a single execution of the studied application. The architecture-independent information collected during the execution is fed to a mathematical memory-system model that predicts the cache miss ratio. The sparse data can be used to characterize the application's data locality with respect to almost any possible memory system, such as complicated multiprocessor multilevel cache hierarchies. Any combination of cache size, cache-line size and degree of sharing can be modeled. Each modeled design point takes only a fraction of a second to evaluate, even though the application from which the sampled data was collected may have executed for hours. This makes the tool usable not just for software developers, but also for hardware developers who need to evaluate a huge memory-system design space. The accuracy of the method is evaluated using a large number of commercial and technical multithreaded applications. The results produced by the algorithm are shown to be consistent with results from a traditional (and much slower) architecture simulation.
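
The flavor of such a model can be shown with a StatCache-style fixed-point solve. Assumptions for this sketch: a fully associative cache of L lines with random replacement, and sampled reuse distances counted in intervening memory accesses, so an access with reuse distance d misses with probability 1 − (1 − 1/L)^(r·d) and the miss ratio r solves the resulting self-consistency equation. The paper's multiprocessor model goes well beyond this.

```python
def estimate_miss_ratio(reuse_dists, cache_lines, iters=100):
    """Solve r = mean(1 - (1 - 1/L)^(r * d)) by fixed-point iteration.

    Cold (first-reference) misses are ignored to keep the sketch short.
    """
    keep = 1.0 - 1.0 / cache_lines      # P(a cached line survives one miss)
    r = 0.5                             # initial guess for the miss ratio
    for _ in range(iters):
        r = sum(1.0 - keep ** (r * d) for d in reuse_dists) / len(reuse_dists)
    return r

# The same sparse sample is re-evaluated for several cache sizes in
# microseconds -- this is what makes large design-space sweeps cheap.
sample = [10, 50, 200, 1000, 4000, 20000]   # sampled reuse distances
for lines in (512, 4096, 32768):
    print(lines, round(estimate_miss_ratio(sample, lines), 3))
```
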
Citations: 64
Power efficient resource scaling in partitioned architectures through dynamic heterogeneity
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620794
Naveen Muralimanohar, K. Ramani, R. Balasubramonian
The ever-increasing demand for high clock speeds and the desire to exploit abundant transistor budgets have resulted in alarming increases in processor power dissipation. Partitioned (or clustered) architectures have been proposed in recent years to address scalability concerns in future billion-transistor microprocessors. Our analysis shows that increasing processor resources in a clustered architecture results in a linear increase in power consumption, while providing diminishing improvements in single-thread performance. To preserve high performance-to-power ratios, we claim that the power consumption of additional resources should be in proportion to the performance improvements they yield. Hence, in this paper, we propose the implementation of heterogeneous clusters that have varying delay and power characteristics. A cluster's performance and power characteristics are tuned by scaling its frequency, and novel policies dynamically assign frequencies to clusters while attempting either to meet a fixed power budget or to minimize a metric such as Energy × Delay² (ED²). By increasing resources in a power-efficient manner, we observe an 11% improvement in ED² and a 22.4% average reduction in peak temperature, when compared to a processor with homogeneous units. Our proposed processor model also provides strategies to handle thermal emergencies that have a relatively low impact on performance.
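
A toy version of the frequency-assignment search, under textbook first-order assumptions (per-cluster throughput roughly proportional to frequency; dynamic power roughly proportional to f³ under combined voltage/frequency scaling) that stand in for the paper's calibrated models:

```python
from itertools import product

FREQS = (1.0, 1.5, 2.0)   # available per-cluster frequencies (GHz)

def ed2(assignment, work, power_budget):
    """Energy * Delay^2 for one frequency assignment, or None if over budget."""
    power = sum(f ** 3 for f in assignment)         # ~ sum of C * f^3
    if power > power_budget:
        return None
    delay = max(w / f for w, f in zip(work, assignment))
    return (power * delay) * delay ** 2             # energy = power * delay

def best_assignment(work, power_budget):
    scored = {a: ed2(a, work, power_budget) for a in product(FREQS, repeat=len(work))}
    feasible = {a: s for a, s in scored.items() if s is not None}
    return min(feasible, key=feasible.get)

# The busy cluster earns the high frequency; the lightly loaded ones are
# slowed to stay inside the budget -- dynamic heterogeneity in miniature.
print(best_assignment(work=(4.0, 1.0, 1.0), power_budget=12.0))  # (2.0, 1.0, 1.0)
```
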
Citations: 14
Assessing the impact of reactive workloads on the performance of Web applications
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620805
A. Pereira, Leonardo Silva, Wagner Meira Jr, W. Santos
Designing systems with better performance and scalability is a real need in order to fulfill user demands and generate profitable Web services. Being able to mimic user behavior and the workload users generate on the servers is fundamental to evaluating the performance of systems and their improvements. One aspect that is usually neglected by workload generators is user reactivity, that is, how users react to variable server response time. Further, it is not clear how the reactivity-related changes in the user-generated workload affect the server, and how these dependencies converge. This paper addresses this problem by proposing, implementing, and validating a workload generator that accounts for reactivity while interacting with servers. Our workload generator is used, for instance, to generate workloads based on the TPC-W benchmark. These workloads are used to assess the impacts of reactivity on the performance of a Web application. The results show significant changes in throughput and response time across the experiments, raising the possibility of improving the performance of Web systems by considering user reactivity.
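
The core of reactivity can be illustrated with a think-time function that depends on the response time the emulated user just observed. The thresholds and distributions below are invented for illustration; the paper derives its reactivity model from real user behavior.

```python
import random

def next_action(response_time_s):
    """Pick the emulated user's reaction to the last observed response time."""
    if response_time_s > 8.0:
        # Patience exhausted: abandon the session or retry immediately.
        return ("abandon", 0.0) if random.random() < 0.5 else ("retry", 0.0)
    if response_time_s > 2.0:
        # Sluggish response: users hesitate, stretching their think time.
        return ("continue", random.expovariate(1 / 10.0))
    # Fast response: normal browsing rhythm (mean 4 s think time).
    return ("continue", random.expovariate(1 / 4.0))

# A non-reactive generator ignores response_time_s entirely; feeding the
# observed latency back in is what shifts throughput and response time.
for rt in (0.3, 3.5, 12.0):
    print(rt, next_action(rt))
```
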
Citations: 15
Acquisition and evaluation of long DDR2-SDRAM access sequences
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620808
Simon Albert, Sven Kalms, Christian Weiss, A. Schramm
Trace-driven simulation is extensively used in memory-system evaluation. Traditional measurement equipment such as logic analyzers currently lacks the capability to record long memory access sequences (e.g., multiple seconds or even entire benchmark runs) without altering system behavior, due to limited sampling depth. This paper presents a system that is capable of recording long access sequences in real time without affecting system operation. For the first time, a classification of the SPEC CPU2000 benchmark suite along main-memory access criteria is provided. Furthermore, the impact of shared-memory graphics on system performance, which affects future system-simulation methodology, is investigated.
Citations: 0
Branch trace compression for snapshot-based simulation
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620787
K. Barr, K. Asanović
We present a scheme to compress branch trace information for use in snapshot-based microarchitecture simulation. The compressed trace can be used to warm any arbitrary branch predictor's state before detailed microarchitecture simulation of the snapshot. We show that compressed branch traces require less space than snapshots of concrete predictor state. Our branch-predictor based compression (BPC) technique uses a software branch predictor to provide an accurate model of the input branch trace, requiring only mispredictions to be stored in the compressed trace file. The decompressor constructs a matching software branch predictor to help reconstruct the original branch trace from the record of mispredictions. Evaluations using traces from the Journal of ILP branch predictor competition show we achieve compression rates of 0.013-0.72 bits/branch (depending on workload), which is up to 210× better than gzip; up to 52× better than the best general-purpose compression techniques; and up to 4.4× better than recently-published, more general trace compression techniques. Moreover, BPC-compressed traces can be decompressed in less time than corresponding traces compressed with other methods.
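
The compression loop is easy to sketch: compressor and decompressor run the same deterministic software predictor, so only the positions of mispredictions need to be stored. The predictor below is a toy 2-bit-counter table rather than the stronger predictors BPC uses, and the sketch assumes the branch PC sequence is reproduced elsewhere (e.g., from the instruction trace):

```python
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [2] * entries            # counters start weakly taken
    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2
    def update(self, pc, taken):
        i = pc % len(self.table)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

def compress(trace):
    """trace: list of (pc, taken). Returns indices where the predictor missed."""
    p, misses = TwoBitPredictor(), []
    for i, (pc, taken) in enumerate(trace):
        if p.predict(pc) != taken:
            misses.append(i)
        p.update(pc, taken)
    return misses

def decompress(pcs, misses):
    """Rebuild the outcome stream; both sides keep identical predictor state."""
    p, out, misses = TwoBitPredictor(), [], set(misses)
    for i, pc in enumerate(pcs):
        taken = p.predict(pc)
        if i in misses:                       # recorded misprediction: flip it
            taken = not taken
        out.append((pc, taken))
        p.update(pc, taken)
    return out

trace = [(0x40, True)] * 6 + [(0x40, False), (0x80, True)] * 3
misses = compress(trace)
assert decompress([pc for pc, _ in trace], misses) == trace
print(f"{len(misses)} mispredictions stored for {len(trace)} branches")
```
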
Citations: 16
Simulation sampling with live-points
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620785
T. Wenisch, Roland E. Wunderlich, B. Falsafi, J. Hoe
Current simulation-sampling techniques construct accurate model state for each measurement by continuously warming large microarchitectural structures (e.g., caches and the branch predictor) while functionally simulating the billions of instructions between measurements. This approach, called functional warming, is the main performance bottleneck of simulation sampling and requires hours of runtime while the detailed simulation of the sample requires only minutes. Existing simulators can avoid functional simulation by jumping directly to particular instruction stream locations with architectural state checkpoints. To replace functional warming, these checkpoints must additionally provide microarchitectural model state that is accurate and reusable across experiments while meeting tight storage constraints. In this paper, we present a simulation-sampling framework that replaces functional warming with live-points without sacrificing accuracy. A live-point stores the bare minimum of functionally-warmed state for accurate simulation of a limited execution window while placing minimal restrictions on microarchitectural configuration. Live-points can be processed in random rather than program order, allowing simulation results and their statistical confidence to be reported while simulations are in progress. Our framework matches the accuracy of prior simulation-sampling techniques (i.e., ±3% error with 99.7% confidence), while estimating the performance of an 8-way out-of-order superscalar processor running SPEC CPU2000 in 91 seconds per benchmark, on average, using a 12 GB live-point library.
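
Random-order processing is what makes on-the-fly confidence reporting possible: after each simulated live-point the running mean and confidence interval can be updated, and simulation can stop early once the interval is tight enough. A sketch with made-up per-sample IPC measurements (z = 3 corresponds to roughly 99.7% confidence):

```python
import math
import random

def sample_until_confident(measurements, rel_error=0.03, z=3.0, min_n=30):
    """Process samples in random order; stop once the CI half-width is small."""
    random.shuffle(measurements)        # live-points are order-independent
    total = total_sq = 0.0
    for n, ipc in enumerate(measurements, start=1):
        total += ipc
        total_sq += ipc * ipc
        if n >= min_n:
            mean = total / n
            var = (total_sq - n * mean * mean) / (n - 1)
            half_width = z * math.sqrt(max(var, 0.0) / n)
            if half_width <= rel_error * mean:
                return mean, half_width, n     # confident: stop simulating
    return total / len(measurements), None, len(measurements)

ipcs = [random.gauss(1.4, 0.2) for _ in range(10_000)]  # fake per-sample IPCs
print(sample_until_confident(ipcs))  # typically stops after a few hundred samples
```
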
Citations: 58
Accelerating architectural exploration using canonical instruction segments
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620786
Rose F. Liu, K. Asanović
Detailed microarchitectural simulators are not well suited for exploring large design spaces due to their excessive simulation times. We introduce AXCIS, a framework for fast and accurate design space exploration. AXCIS achieves fast simulation times by exploiting repetitions in program behavior to reduce the number of instructions simulated. For each dynamic instruction encountered during an initial full run of a benchmark, AXCIS builds an instruction segment, which concisely represents performance-critical information. AXCIS then compresses the string of dynamic segments into a table of canonical instruction segments (CIST) to give a compact representation of the entire benchmark trace. Given a precompiled CIST and a target microarchitecture configuration, AXCIS can quickly and accurately estimate performance metrics such as instructions per cycle (IPC). For the SPEC CPU2000 benchmarks and all simulated configurations, AXCIS achieves an average IPC error of 2.6%. While cycle-accurate simulators can take many hours to simulate billions of dynamic instructions, AXCIS can complete the same simulation on the corresponding CIST within seconds.
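
The compression step can be pictured as canonicalization plus counting: identical segments collapse into one CIST entry with a frequency, and a target configuration is evaluated by weighting a per-segment cost model with those frequencies. The segment fields and cost model below are invented for illustration and are far simpler than AXCIS's:

```python
from collections import Counter

def build_cist(dynamic_segments):
    """Collapse the dynamic segment stream into canonical segments + counts."""
    return Counter(dynamic_segments)

def estimate_cpi(cist, cost_model):
    """Weighted CPI over canonical segments for one target configuration."""
    total = sum(cist.values())
    return sum(count * cost_model(seg) for seg, count in cist.items()) / total

# Invented segment encoding: (opcode class, cache outcome, producer distance).
segs = [("alu", "hit", 1)] * 900 + [("load", "miss", 4)] * 100
cist = build_cist(segs)
cpi = estimate_cpi(cist, lambda s: 1 if s[1] == "hit" else 51)  # toy cost model
print(len(cist), "canonical segments for", sum(cist.values()), "instructions; CPI =", cpi)
```
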
Citations: 5
Compiler-based adaptive fetch throttling for energy-efficiency
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620795
Huaping Wang, Yao Guo, I. Koren, C. M. Krishna
Front-end instruction delivery accounts for a significant fraction of energy consumption in dynamically scheduled superscalar processors. Different front-end throttling techniques have been introduced to reduce the chip-wide energy consumption caused by redundant fetching. Hardware-based techniques, such as flow-based throttling, can reduce energy consumption considerably, but at a high performance cost. On the other hand, compiler-based IPC-estimation-driven software fetch throttling (CFT) techniques result in relatively low performance degradation, which is desirable for high-performance processors. However, their energy savings are limited by the fact that they typically use a predefined, fixed low IPC threshold to control throttling. In this paper, we propose a compiler-based adaptive fetch throttling (CAFT) technique that allows the throttling threshold to change dynamically at runtime. Instead of using a fixed threshold, our technique uses the decode/issue difference (DID) to assist the fetch-throttling decision based on the statically estimated IPC. Changing the threshold dynamically makes it possible to throttle at a higher estimated IPC, thus increasing the throttling opportunities and resulting in larger energy savings. We demonstrate that CAFT can increase the energy savings significantly compared to CFT, while preserving its benefit of low performance loss. Our simulation results show that the proposed technique doubles the energy-delay product (EDP) savings compared to fixed-threshold throttling and achieves a 6.1% average EDP saving.
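
The decision itself is cheap, which is the point. A sketch of an adaptive threshold driven by the statically estimated IPC and the runtime decode/issue difference, with illustrative constants that are not taken from the paper:

```python
BASE_THRESHOLD = 1.0   # a fixed-threshold CFT scheme would stop here
DID_HIGH = 24          # large decode/issue backlog: front end runs ahead

def should_throttle_fetch(static_ipc_estimate, did):
    """Gate fetch when the compiler's IPC estimate is below the threshold,
    raising the threshold when many decoded instructions await issue."""
    threshold = BASE_THRESHOLD + (0.5 if did > DID_HIGH else 0.0)
    return static_ipc_estimate < threshold

# A fixed threshold keeps fetching in a region estimated at IPC 1.3; with a
# full issue backlog, the adaptive scheme throttles that region as well.
print(should_throttle_fetch(1.3, did=8))    # False -> fetch normally
print(should_throttle_fetch(1.3, did=40))   # True  -> gate fetch this cycle
```
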
Citations: 9
Automatic testcase synthesis and performance model validation for high performance PowerPC processors
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620800
R. Bell, Rajiv Bhatia, L. John, Jeffrey Stuecheli, J. Griswell, P. Tu, Louis Capps, A. Blanchard, Ravel Thai
The latest high-performance IBM PowerPC microprocessor, the POWER5 chip, poses challenges for performance model validation. The current state of the art is to use simple hand-coded bandwidth and latency testcases, but these are not comprehensive for processors as complex as the POWER5 chip. Applications and benchmark suites such as SPEC CPU are difficult to set up or take too long to execute on functional models, or even on detailed performance models. We present an automatic testcase-synthesis methodology to address these concerns. By basing testcase synthesis on the workload characteristics of an application, source code is created that largely reproduces the performance of the application but executes in a fraction of the runtime. We synthesize representative PowerPC versions of the SPEC2000, STREAM, TPC-C and Java benchmarks, compile and execute them, and obtain an average IPC within 2.4% of the average IPC of the original benchmarks, with many similar average workload characteristics. The synthetic testcases often execute two orders of magnitude faster than the original applications, typically in fewer than 300K instructions, making performance model validation for today's complex processors feasible.
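
The mix-matching skeleton of such synthesis is simple to sketch: given a profiled instruction mix, emit a self-contained loop whose static operation mix matches it. Real synthesis, as described in the paper, also matches dependence distances, branch predictability, and memory access patterns; the generator and its emitted C below are purely illustrative.

```python
def synthesize(mix, body_ops=100):
    """Emit a C kernel whose static operation mix matches `mix`."""
    ops = {
        "int_alu": "a{i} = a{j} + {k};",
        "load":    "a{i} = mem[(a{j} + {k}) & 1023];",
        "store":   "mem[(a{i} + {k}) & 1023] = a{j};",
    }
    lines = ["int mem[1024], a0 = 1, a1 = 2, a2 = 3, a3 = 4;",
             "void kernel(int iters) {",
             "  for (int t = 0; t < iters; t++) {"]
    slot = 0
    for kind, frac in mix.items():
        for _ in range(round(frac * body_ops)):  # mix fraction -> statement count
            lines.append("    " + ops[kind].format(i=slot % 4, j=(slot + 1) % 4, k=slot))
            slot += 1
    lines += ["  }", "}"]
    return "\n".join(lines)

# ~60% integer ALU, 30% loads, 10% stores -- a SPECint-flavored profile.
print(synthesize({"int_alu": 0.6, "load": 0.3, "store": 0.1}, body_ops=10))
```
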
Citations: 19