
IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005): Latest Publications

PowerFITS: Reduce Dynamic and Static I-Cache Power Using Application Specific Instruction Set Synthesis
A. Cheng, G. Tyson, T. Mudge
Power consumption, performance, area, and cost are critical concerns in designing microprocessors for embedded systems such as portable handheld computing and personal telecommunication devices. In previous work [A. Cheng et al., (2004)], we introduced the concept of framework-based instruction-set tuning synthesis (FITS), a new instruction synthesis paradigm that falls between a general-purpose embedded processor and a synthesized application-specific processor (ASP). We address these design constraints through FITS by improving code density. A FITS processor improves code density by tailoring the instruction set to the requirements of a target application to reduce the code size. This is achieved by replacing the fixed instruction and register decoding of a general-purpose embedded processor with programmable decoders, which can achieve ASP performance, low power consumption, and compact chip area, with the fabrication advantages of a mass-produced single-chip solution to amortize the cost. The instruction cache has been recognized as one of the most predominant sources of power dissipation in a microprocessor. For instance, in Intel's StrongARM processor, 27% of total chip power loss goes into the instruction cache [J. Montanaro et al., (1996)]. In this paper, we demonstrate how FITS can be applied to improve instruction cache power efficiency. Experimental results show that our synthesized instruction sets yield significant power reductions in the instruction cache compared to ARM instructions. For 21 benchmarks from the MiBench suite [M. Guthaus et al., (2001)], our simulation results indicate, on average, a 49.4% saving for switching power, a 43.9% saving for internal power, a 14.9% saving for leakage power, and a 46.6% saving for total cache power, with up to a 60.3% saving for peak power.
Citations: 5
Architectural Characterization of Processor Affinity in Network Processing
A. Foong, Jason M. Fung, D. Newell, S. Abraham, Peggy Irelan, Alex A. Lopez-Estrada
Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic operating systems (OS) for SMP. Previous researchers have experimented with affinitizing processes/threads, as well as interrupts from devices, to specific processors in an SMP system. However, general-purpose operating systems give minimal consideration to user-defined affinity in their schedulers. Our goal is to expose the full potential of affinity through an in-depth characterization of the reasons behind the performance gains. We conducted an experimental study of TCP performance under various affinity modes on IA-based servers. Results showed that interrupt affinity alone provided a throughput gain of up to 25%, and combined thread/process and interrupt affinity can achieve gains of 30%. In particular, calling out the impact of affinity on machine clears (in addition to cache misses) is a characterization that has not been done before.
Citations: 38
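The thread/process half of the affinitization described above can be sketched in a few lines on Linux. This is an illustrative sketch using Python's standard `os.sched_setaffinity` wrapper over the Linux syscall, not the authors' tooling; interrupt affinity is configured separately on Linux, via `/proc/irq/<n>/smp_affinity`.

```python
import os

def pin_to_cpu(pid, cpus):
    """Pin a process to the given CPU set (Linux only) and return
    the affinity mask actually in effect afterwards."""
    if hasattr(os, "sched_setaffinity"):  # not available on all platforms
        try:
            os.sched_setaffinity(pid, cpus)
        except OSError:
            pass  # requested CPUs not present/permitted on this machine
        return os.sched_getaffinity(pid)
    return set(cpus)  # no-op fallback where the API is missing

# Pin the current process (pid 0 means "self") to CPU 0, mimicking the
# thread-affinity half of the paper's combined thread+interrupt scheme.
mask = pin_to_cpu(0, {0})
```

A scheduler that honors this mask keeps the connection's protocol processing on one core, which is the cache-locality effect the paper measures.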
Fast, Accurate Microarchitecture Simulation Using Statistical Phase Detection
R. Srinivasan, Jeanine E. Cook, S. Cooper
Simulation-based microarchitecture research is often hindered by the slow speed of simulators. In this work, we propose a novel statistical technique to identify highly representative unique behaviors, or phases, in a benchmark based on its IPC (instructions committed per cycle) trace. By simulating the timing of only the unique phases, the cycle-accurate simulation time for the SPEC suite is reduced from 5 months to 5 days, with a significant retention of the original dynamic behavior. Evaluation across many processor configurations within the same architecture family shows that the algorithm is robust. A cost function is provided that enables users to easily optimize the parameters of the algorithm for either simulation speed or accuracy, depending on preference. A new measure is introduced to quantify the ability of a simulation speedup technique to retain behavior realized in the original workload. Unlike a first-order statistic such as the mean value, the newly introduced measure captures important differences in dynamic behavior between the complete and the sampled simulations.
Citations: 21
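The core idea, simulating in detail only intervals whose behavior has not been seen before, can be illustrated with a toy phase filter over an IPC trace. The fixed-tolerance test below is a simplification assumed for illustration, not the authors' statistical detection technique.

```python
def unique_phases(ipc_trace, tol=0.1):
    """Scan per-interval IPC values and keep only intervals whose IPC
    is not within `tol` of an already-seen phase; those intervals are
    the unique behaviors that must be simulated cycle-accurately."""
    phases = []    # representative IPC value for each phase seen so far
    selected = []  # interval indices chosen for detailed simulation
    for i, ipc in enumerate(ipc_trace):
        if not any(abs(ipc - p) <= tol for p in phases):
            phases.append(ipc)
            selected.append(i)
    return selected

# A trace alternating between two behaviors, plus one high-IPC spike:
trace = [1.0, 1.02, 0.5, 0.48, 1.01, 2.0, 0.52]
picks = unique_phases(trace)  # -> [0, 2, 5]: 3 of 7 intervals simulated
```

Only 3 of the 7 intervals need detailed simulation here; the rest are recurrences of known phases, which is the source of the 5-months-to-5-days reduction reported above.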
Performance Characterization of Java Applications on SMT Processors
Wei Huang, Jiang Lin, Zhao Zhang, J. M. Chang
As Java is emerging as one of the major programming languages in software development, studying how Java applications behave on recent SMT processors is of great interest. This paper characterizes the performance of Java applications on an Intel Pentium 4 hyper-threading processor. Using the performance counters provided by the Pentium 4, we quantitatively evaluate microarchitecture metrics while running various types of Java applications. The experimental results reveal that: (1) Hyper-threading can indeed improve the performance of multithreaded Java programs; (2) The resource contentions within the Pentium 4 are the major reason for pipeline inefficiency, which prevents the better performance promised by SMT; (3) The static partition design of hyper-threading causes considerable performance loss for many single-thread Java programs; (4) Most multiprogrammed Java benchmarks can achieve decent combined speedups on hyper-threading processors.
Citations: 19
Studying Thermal Management for Graphics-Processor Architectures
J. Sheaffer, K. Skadron, D. Luebke
We have previously presented Qsilver, a flexible simulation system for graphics architectures. In this paper we describe our extensions to this system, which we use, instrumented with a power model and HotSpot, to analyze the application of standard CPU static and runtime thermal management techniques on the GPU. We describe experiments implementing clock gating, fetch gating, dynamic voltage scaling, multiple clock domains, and permuted floorplanning on the GPU using our simulation environment, and demonstrate that these techniques are beneficial in the GPU domain. Further, we show that the inherent parallelism of GPU workloads enables significant thermal gains on chips designed employing static floorplan repartitioning.
Citations: 48
On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand
P. Balaji, S. Narravula, K. Vaidyanathan, Hyun-Wook Jin, D. Panda
In the past few years, several researchers have proposed and configured data-centers providing multiple independent services, known as shared data-centers. For example, several ISPs and other Web service providers host multiple unrelated Web sites on their data-centers, allowing potential differentiation in the service provided to each of them. Such differentiation becomes essential in several scenarios in a shared data-center environment. In this paper, we extend our previously proposed scheme on dynamic reconfigurability to allow service differentiation in the shared data-center environment. In particular, we point out the issues associated with the basic dynamic configurability scheme and propose two extensions to it, namely (i) dynamic reconfiguration with prioritization and (ii) dynamic reconfiguration with prioritization and QoS. Our experimental results show that our extensions allow the dynamic reconfigurability scheme to attain a performance improvement of up to five times for high-priority Web sites, irrespective of any background low-priority requests. Also, these extensions are able to significantly improve the performance of low-priority requests when there are minimal or no high-priority requests in the system. Further, they can achieve performance similar to a static scheme with up to 43% fewer nodes in some cases.
Citations: 8
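The combination of strict prioritization with a soft-QoS floor for low-priority traffic can be sketched abstractly. The slot-reservation dispatcher below is a hypothetical illustration of the policy idea only; it is not the paper's InfiniBand-based reconfiguration mechanism, and the 1-in-4 reservation is an assumed parameter.

```python
from collections import deque

def dispatch(high, low):
    """Serve high-priority requests first, but reserve every 4th slot
    for a low-priority request (a 'soft QoS' floor) so that background
    traffic is never completely starved."""
    high, low = deque(high), deque(low)
    order, slot = [], 0
    while high or low:
        give_low = bool(low) and (not high or slot % 4 == 3)
        order.append(low.popleft() if give_low else high.popleft())
        slot += 1
    return order

served = dispatch(["H1", "H2", "H3", "H4"], ["L1", "L2"])
# -> ["H1", "H2", "H3", "L1", "H4", "L2"]
```

High-priority requests dominate while load is present, yet L1 is served before the high-priority queue drains; this is the prioritization-with-QoS behavior the extensions above aim for.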
Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time Through Binary Modification
J. Ringenberg, Chris Pelosi, D. Oehmke, T. Mudge
With the proliferation of benchmarks available today, benchmarking new designs can significantly impact overall development time. In order to fully test and represent a typical workload, a large number of benchmarks must be run, and while current techniques such as SimPoint and SMARTS have had considerable success reducing simulation time, there are still areas for improvement. This paper details a methodology that continues to decrease this simulation time by analyzing and augmenting benchmark binaries to contain intrinsic checkpoints that allow for the rapid execution of important portions of code, thereby removing the need for explicit checkpointing support. In addition, these modified binaries have increased portability across multiple simulation environments and the ability to be run in a highly parallel fashion. Average speedups for SPEC2000 of roughly 60x are seen over a standard SimPoint interval of 100 million instructions, corresponding to a reduction in simulation time from 3.13 hours down to 3 minutes.
Citations: 30
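The "roughly 60x" figure quoted above follows directly from the two reported times:

```python
# Sanity-check the reported speedup: 3.13 hours of detailed simulation
# reduced to 3 minutes by running only the intrinsically checkpointed
# regions of the binary.
full_minutes = 3.13 * 60        # 187.8 minutes per benchmark
checkpointed_minutes = 3.0
speedup = full_minutes / checkpointed_minutes  # ~62.6x, i.e. "roughly 60x"
```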
Balancing Performance and Reliability in the Memory Hierarchy
H. Asadi, Vilas Sridharan, M. Tahoori, D. Kaeli
Cosmic-ray-induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean time to failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache possesses an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache memory. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques to reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
Citations: 115
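The 400-year MTTF bound can be reproduced from the quoted raw error rate, recalling that 1 FIT is one failure per 10^9 device-hours:

```python
# Back-of-the-envelope check of the abstract's MTTF claim for an
# unprotected 16 KB L1 cache at 0.002 FIT/bit.
fit_per_bit = 0.002
bits = 16 * 1024 * 8                  # 16 KB of data bits = 131072 bits
cache_fit = fit_per_bit * bits        # 262.144 failures per 10^9 hours
mttf_hours = 1e9 / cache_fit          # ~3.8 million hours
mttf_years = mttf_hours / (24 * 365)  # ~435 years, i.e. "at least 400"
```

The same arithmetic explains why larger caches and higher raw error rates shrink the MTTF: `cache_fit` grows linearly with both bit count and per-bit FIT rate.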
Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison
M. Ekman, P. Stenström
While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Moreover, simulating multi-threaded applications further exacerbates this problem, as most simulation tools are single-threaded. Recently, statistical sampling techniques such as SMARTS have managed to bring down the simulation time significantly by making it possible to simulate only about 1% of the code with sufficient accuracy. However, thousands of simulation points throughout the benchmark must still be simulated. First, we propose to use the well-established statistical method of matched-pair comparison and motivate why this brings down the number of simulation points needed to achieve a given accuracy. We apply it to single-processor as well as multiprocessor simulation and show that it is capable of reducing the number of needed simulation points by one order of magnitude. Second, since we apply the technique to single- as well as multiprocessors, we study for the first time the efficiency of statistical sampling techniques in multiprocessor systems to establish a baseline to compare with. We show theoretically, and confirm experimentally, that while the instruction throughput varies significantly on each individual processor, the variability of instruction throughput across processors in a multiprocessor system decreases as we increase the number of processors for some important workloads. Thus, when sampling is applied to multiprocessors, a factor of P fewer simulation points is needed to begin with, where P is the number of processors.
Citations: 45
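The variance-reduction argument behind matched-pair comparison can be illustrated with a small simulation. The sketch below is hypothetical (the IPC numbers, phase model, and noise magnitudes are invented for illustration, not taken from the paper): two machine configurations are measured at the same simulation points, so the large program-phase component of variability is common to both and cancels when we difference the paired measurements. Since the number of simulation points required for a given confidence interval scales with the variance of the estimator, the paired design needs far fewer points.

```python
import math
import random

random.seed(0)

# Hypothetical per-simulation-point IPC measurements for two configurations.
# The large shared "phase" term models program-phase variability that affects
# both configurations alike; the small terms model the true design difference.
N = 1000
phase = [random.gauss(1.0, 0.4) for _ in range(N)]          # shared phase behavior
ipc_a = [p + random.gauss(0.00, 0.02) for p in phase]       # baseline config
ipc_b = [p + random.gauss(0.05, 0.02) for p in phase]       # ~0.05 IPC faster config

def mean(xs):
    return sum(xs) / len(xs)

def stdev(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Unpaired analysis: the standard error of the difference of two independent
# means is driven by the full per-sample variance, dominated by phase noise.
s_unpaired = math.sqrt(stdev(ipc_a) ** 2 + stdev(ipc_b) ** 2)

# Matched-pair analysis: differencing at each point cancels the shared
# phase component, leaving only the small measurement noise.
diffs = [b - a for a, b in zip(ipc_a, ipc_b)]
s_paired = stdev(diffs)

print(f"unpaired std: {s_unpaired:.3f}")
print(f"paired   std: {s_paired:.3f}")
# Required simulation points scale with variance, so the paired design
# needs (s_unpaired / s_paired)**2 times fewer points.
print(f"reduction factor in points: {(s_unpaired / s_paired) ** 2:.0f}x")
```

With these (invented) noise magnitudes the reduction factor is in the hundreds; the paper reports roughly an order of magnitude on real workloads, where the shared component cancels less perfectly.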
Reaping the Benefit of Temporal Silence to Improve Communication Performance
Kevin M. Lepak, Mikko H. Lipasti
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques
DOI: 10.1109/ISPASS.2005.1430580 (published 2005-03-20)
Citations: 0
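The store taxonomy this abstract builds on can be sketched in a few lines. The classifier below is a simplified, hypothetical model (the function name, the single-value-per-address memory model, and the lock example are illustrative assumptions, not the paper's mechanism): a store is *update silent* if it writes the value already present, and *temporally silent* if it restores the value that remote sharers last observed before the line went dirty, as when a lock release writes back 0 after an acquire wrote 1.

```python
# Hypothetical classifier distinguishing update-silent, temporally-silent,
# and non-silent stores over a simple store trace. Real MESTI-style
# detection operates on cache lines and coherence state, not a flat dict.

def classify_stores(trace, initial=0):
    """trace: list of (addr, value) stores. Returns a parallel list of
    'update-silent', 'temporally-silent', or 'non-silent' labels."""
    current = {}   # current memory value per address
    observed = {}  # value remote sharers last observed, recorded when a line first goes dirty
    labels = []
    for addr, value in trace:
        cur = current.get(addr, initial)
        obs = observed.get(addr, initial)
        if value == cur:
            labels.append("update-silent")      # writes the value already there
        elif value == obs:
            labels.append("temporally-silent")  # reverts to the globally visible value
        else:
            labels.append("non-silent")
            observed.setdefault(addr, cur)      # remember pre-write value on first dirtying store
        current[addr] = value
    return labels

# Lock acquire/release on address 0x40: 0 -> 1 -> 0 -> 0
print(classify_stores([(0x40, 1), (0x40, 0), (0x40, 0)]))
# ['non-silent', 'temporally-silent', 'update-silent']
```

Under MESTI, detecting the temporally silent release lets the protocol suppress the invalidation, so remote sharers never take a communication miss on the lock variable.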
Journal
IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.