Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430557
A. Cheng, G. Tyson, T. Mudge
Power consumption, performance, area, and cost are critical concerns in designing microprocessors for embedded systems such as portable handheld computing and personal telecommunication devices. In previous work [A. Cheng et al., (2004)], we introduced the concept of framework-based instruction-set tuning synthesis (FITS), a new instruction synthesis paradigm that falls between a general-purpose embedded processor and a synthesized application-specific processor (ASP). We address these design constraints through FITS by improving code density. A FITS processor improves code density by tailoring the instruction set to the requirements of a target application, which reduces code size. This is achieved by replacing the fixed instruction and register decoding of a general-purpose embedded processor with programmable decoders that can achieve ASP performance, low power consumption, and compact chip area, while retaining the fabrication advantages of a mass-produced single-chip solution that amortizes cost. The instruction cache has been recognized as one of the most predominant sources of power dissipation in a microprocessor. For instance, in Intel's StrongARM processor, 27% of total chip power is dissipated in the instruction cache [J. Montanaro et al., (1996)]. In this paper, we demonstrate how FITS can be applied to improve instruction cache power efficiency. Experimental results show that our synthesized instruction sets yield a significant power reduction in the instruction cache compared to ARM instructions. For 21 benchmarks from the MiBench suite [M. Guthaus et al., (2001)], our simulation results indicate, on average, a 49.4% saving in switching power, a 43.9% saving in internal power, a 14.9% saving in leakage power, and a 46.6% saving in total cache power, with up to a 60.3% saving in peak power.
Title: PowerFITS: Reduce Dynamic and Static I-Cache Power Using Application Specific Instruction Set Synthesis
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430575
A. Foong, Jason M. Fung, D. Newell, S. Abraham, Peggy Irelan, Alex A. Lopez-Estrada
Network protocol stacks, in particular TCP/IP software implementations, are known for their inability to scale well in general-purpose monolithic operating systems (OS) on SMP systems. Previous researchers have experimented with affinitizing processes/threads, as well as interrupts from devices, to specific processors in an SMP system. However, general-purpose operating systems give minimal consideration to user-defined affinity in their schedulers. Our goal is to expose the full potential of affinity through in-depth characterization of the reasons behind the performance gains. We conducted an experimental study of TCP performance under various affinity modes on IA-based servers. Results showed that interrupt affinity alone provided a throughput gain of up to 25%, and that combined thread/process and interrupt affinity can achieve gains of 30%. In particular, calling out the impact of affinity on machine clears (in addition to cache misses) is a characterization that has not been done before.
Title: Architectural Characterization of Processor Affinity in Network Processing
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
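The thread/process affinity mode described in the abstract above can be tried from user space. The following is a minimal sketch, assuming a Linux host where Python exposes `os.sched_setaffinity`; the helper name `pin_to_cpu` is hypothetical and is not taken from the paper. Interrupt affinity, the other mode the study measures, is configured per IRQ (on Linux through `/proc/irq/<n>/smp_affinity`) and is not covered here.

```python
import os

def pin_to_cpu(cpu, pid=0):
    """Restrict process `pid` (0 = the caller) to one CPU and return the new mask.

    Hypothetical helper for illustration only.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity calls are not available on this platform")
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)

if hasattr(os, "sched_getaffinity"):
    allowed = sorted(os.sched_getaffinity(0))  # CPUs this process may currently use
    mask = pin_to_cpu(allowed[0])              # pin to the first allowed CPU
    print("now restricted to CPUs:", sorted(mask))
    os.sched_setaffinity(0, allowed)           # restore the original mask
```

Pinning a network-handling process to the same processor that services its device interrupts is the kind of combined configuration the abstract reports as giving the largest gain; whether it helps on a given machine has to be measured.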
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430569
R. Srinivasan, Jeanine E. Cook, S. Cooper
Simulation-based microarchitecture research is often hindered by the slow speed of simulators. In this work, we propose a novel statistical technique to identify highly representative unique behaviors or phases in a benchmark based on its IPC (instructions committed per cycle) trace. By simulating the timing of only the unique phases, the cycle-accurate simulation time for the SPEC suite is reduced from 5 months to 5 days, with a significant retention of the original dynamic behavior. Evaluation across many processor configurations within the same architecture family shows that the algorithm is robust. A cost function is provided that enables users to easily optimize the parameters of the algorithm for either simulation speed or accuracy depending on preference. A new measure is introduced to quantify the ability of a simulation speedup technique to retain behavior realized in the original workload. Unlike a first order statistic such as mean value, the newly introduced measure captures important differences in dynamic behavior between the complete and the sampled simulations
Title: Fast, Accurate Microarchitecture Simulation Using Statistical Phase Detection
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430565
Wei Huang, Jiang Lin, Zhao Zhang, J. M. Chang
As Java emerges as one of the major programming languages in software development, studying how Java applications behave on recent SMT processors is of great interest. This paper characterizes the performance of Java applications on an Intel Pentium 4 hyper-threading processor. Using the performance counters provided by the Pentium 4, we quantitatively evaluate micro-architecture metrics while running various types of Java applications. The experimental results reveal that: (1) Hyper-threading can indeed improve the performance of multithreaded Java programs; (2) Resource contention within the Pentium 4 is the major cause of pipeline inefficiency, which prevents the better performance promised by SMT; (3) The static partitioning design of hyper-threading causes considerable performance loss for many single-threaded Java programs; (4) Most multiprogrammed Java benchmarks can achieve decent combined speedups on hyper-threading processors.
Title: Performance Characterization of Java Applications on SMT Processors
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430559
J. Sheaffer, K. Skadron, D. Luebke
We have previously presented Qsilver, a flexible simulation system for graphics architectures. In this paper we describe our extensions to this system, which we use - instrumented with a power model and HotSpot - to analyze the application of standard CPU static and runtime thermal management techniques on the GPU. We describe experiments implementing clock gating, fetch gating, dynamic voltage scaling, multiple clock domains and permuted floor-planning on the GPU using our simulation environment, and demonstrate that these techniques are beneficial in the GPU domain. Further, we show that the inherent parallelism of GPU workloads enables significant thermal gains on chips designed employing static floorplan repartitioning
Title: Studying Thermal Management for Graphics-Processor Architectures
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430582
P. Balaji, S. Narravula, K. Vaidyanathan, Hyun-Wook Jin, D. Panda
In the past few years, several researchers have proposed and configured data-centers providing multiple independent services, known as shared data-centers. For example, several ISPs and other Web service providers host multiple unrelated Web sites on their data-centers, allowing potential differentiation in the service provided to each of them. Such differentiation becomes essential in several scenarios in a shared data-center environment. In this paper, we extend our previously proposed scheme for dynamic reconfigurability to allow service differentiation in the shared data-center environment. In particular, we point out the issues associated with the basic dynamic reconfigurability scheme and propose two extensions to it, namely (i) dynamic reconfiguration with prioritization and (ii) dynamic reconfiguration with prioritization and QoS. Our experimental results show that our extensions allow the dynamic reconfigurability scheme to attain a performance improvement of up to five times for high-priority Web sites, irrespective of any background low-priority requests. These extensions are also able to significantly improve the performance of low-priority requests when there are minimal or no high-priority requests in the system. Further, they can achieve performance similar to that of a static scheme with up to 43% fewer nodes in some cases.
Title: On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430561
J. Ringenberg, Chris Pelosi, D. Oehmke, T. Mudge
With the proliferation of benchmarks available today, benchmarking new designs can significantly impact overall development time. In order to fully test and represent a typical workload, a large number of benchmarks must be run, and while current techniques such as SimPoint and SMARTS have had considerable success reducing simulation time, there are still areas of improvement. This paper details a methodology that continues to decrease this simulation time by analyzing and augmenting benchmark binaries to contain intrinsic checkpoints that allow for the rapid execution of important portions of code thereby removing the need for explicit checkpointing support. In addition, these modified binaries have increased portability across multiple simulation environments and the ability to be run in a highly parallel fashion. Average speedups for SPEC2000 of roughly 60x are seen over a standard SimPoint interval of 100 million instructions corresponding to a reduction in simulation time from 3.13 hours down to 3 minutes
Title: Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time Through Binary Modification
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
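As a quick consistency check on the figures quoted in the abstract above, the two simulation times can be turned into a speedup directly. This is plain arithmetic on the quoted numbers, not a reproduction of the paper's measurements; the variable names are illustrative.

```python
# Simulation times quoted in the abstract.
full_minutes = 3.13 * 60      # 3.13 hours expressed in minutes (187.8)
reduced_minutes = 3.0         # time with intrinsic checkpoints

speedup = full_minutes / reduced_minutes
print(f"speedup = {speedup:.1f}x")   # prints "speedup = 62.6x"
```

The result, about 62.6x, agrees with the "roughly 60x" average speedup the abstract states.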
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430581
H. Asadi, Vilas Sridharan, M. Tahoori, D. Kaeli
Cosmic-ray-induced soft errors in cache memories are becoming a major threat to the reliability of microprocessor-based systems. In this paper, we present a new method to accurately estimate the reliability of cache memories. We have measured the MTTF (mean time to failure) of unprotected first-level (L1) caches for twenty programs taken from the SPEC2000 benchmark suite. Our results show that a 16 KB first-level cache has an MTTF of at least 400 years (for a raw error rate of 0.002 FIT/bit). However, this MTTF is significantly reduced for higher error rates and larger cache sizes. Our results show that, for selected programs, a 64 KB first-level cache is more than 10 times as vulnerable to soft errors as a 16 KB cache. Our work also illustrates that the reliability of cache memories is highly application-dependent. Finally, we present three different techniques that reduce the susceptibility of first-level caches to soft errors by two orders of magnitude. Our analysis shows how to achieve a balance between performance and reliability.
Title: Balancing Performance and Reliability in the Memory Hierarchy
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
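The "at least 400 years" figure in the abstract above can be related to the raw error rate with a back-of-the-envelope bound. The sketch below is not the paper's estimation method: it assumes every raw bit upset counts as a failure, counts data bits only (no tag or status bits), and uses the standard definition of 1 FIT as one failure per 10^9 device-hours. The function name `raw_mttf_years` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365

def raw_mttf_years(cache_bytes, fit_per_bit):
    """Pessimistic MTTF bound: treat every raw upset in a data bit as a failure."""
    bits = cache_bytes * 8
    total_fit = bits * fit_per_bit          # failures per 1e9 device-hours
    return 1e9 / total_fit / HOURS_PER_YEAR

mttf_16k = raw_mttf_years(16 * 1024, 0.002)
mttf_64k = raw_mttf_years(64 * 1024, 0.002)
print(f"16 KB: {mttf_16k:.0f} years, 64 KB: {mttf_64k:.0f} years")
```

Under these assumptions the 16 KB bound comes out at roughly 435 years, which is in line with the quoted lower limit. Bit count alone makes the 64 KB cache only 4 times as exposed, so the more-than-10x vulnerability reported for selected programs must depend on application behavior that this raw estimate does not model.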
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430562
M. Ekman, P. Stenström
While cycle-level, full-system architecture simulation tools are capable of estimating performance at arbitrary accuracy, the time to simulate an entire application is usually prohibitive. Simulating multi-threaded applications further exacerbates this problem, as most simulation tools are single-threaded. Recently, statistical sampling techniques such as SMARTS have brought simulation time down significantly by making it possible to simulate only about 1% of the code with sufficient accuracy. However, thousands of simulation points throughout the benchmark must still be simulated. First, we propose to use the well-established statistical method of matched-pair comparison and explain why it reduces the number of simulation points needed to achieve a given accuracy. We apply it to single-processor as well as multiprocessor simulation and show that it can reduce the number of needed simulation points by one order of magnitude. Second, since we apply the technique to single-processor as well as multiprocessor systems, we study for the first time the efficiency of statistical sampling techniques in multiprocessor systems to establish a baseline for comparison. We show theoretically, and confirm experimentally, that while instruction throughput varies significantly on each individual processor, the variability of instruction throughput across processors in a multiprocessor system decreases as the number of processors increases for some important workloads. Thus, a factor of P fewer simulation points, where P is the number of processors, is needed to begin with when sampling is applied to multiprocessors.
Title: Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison
Venue: IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005)
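The reason matched-pair comparison needs fewer simulation points, as the abstract above claims, can be illustrated with synthetic numbers. For a fixed confidence level and error bound, the required sample size scales with the variance of the quantity being estimated, so comparing two designs on the same sampling units (and estimating the per-unit difference) helps whenever the two designs' results are strongly correlated. The data below are made up for illustration (hypothetical per-unit CPI values with a 5% improvement plus small noise) and are not from the paper.

```python
import random
import statistics

random.seed(0)

# Synthetic per-sampling-unit CPI for a baseline design (large unit-to-unit spread).
base = [random.uniform(0.5, 3.0) for _ in range(1000)]
# Synthetic CPI for a modified design measured on the SAME units.
new = [cpi * 0.95 + random.gauss(0, 0.02) for cpi in base]

# Independent sampling: the two means are estimated from unrelated units,
# so the variance of their difference is the sum of both variances.
var_independent = statistics.variance(base) + statistics.variance(new)

# Matched pairs: estimate the mean of the per-unit differences directly.
var_paired = statistics.variance([b - n for b, n in zip(base, new)])

ratio = var_independent / var_paired
print(f"variance ratio (independent / paired): {ratio:.0f}")
```

Because sample size is proportional to variance, the ratio approximates how many times fewer sampling units the paired design needs for the same confidence interval on the difference. The size of the reduction here is a property of the synthetic data only; the order-of-magnitude figure in the abstract is the paper's own experimental result.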
Pub Date : 2005-03-20 DOI: 10.1109/ISPASS.2005.1430580
Kevin M. Lepak, Mikko H. Lipasti
Communication misses - those serviced by dirty data in remote caches - are a pressing performance limiter in shared-memory multiprocessors. Recent research has indicated that temporally silent stores can be exploited to substantially reduce such misses, either with coherence protocol enhancements (MESTI); by employing speculation to create atomic silent store-pairs that achieve speculative lock elision (SLE); or by employing load value prediction (LVP). We evaluate all three approaches utilizing full-system, execution-driven simulation, with scientific and commercial workloads, to measure performance. Our studies indicate that accurate detection of elision idioms for SLE is vitally important for delivering robust performance and appears difficult for existing commercial codes. Furthermore, common datapath issues in out-of-order cores cause barriers to speculation and therefore may cause SLE failures unless SLE-specific speculation mechanisms are added to the microarchitecture. We also propose novel prediction and silence detection mechanisms that enable the MESTI protocol to deliver robust performance for all workloads. Finally, we conduct a detailed execution-driven performance evaluation of load value prediction (LVP), another simple method for capturing the benefit of temporally silent stores. We show that while theoretically LVP can capture the greatest fraction of communication misses among all approaches, it is usually not the most effective at delivering performance. This occurs because attempting to hide latency by speculating at the consumer, i.e. predicting load values, is fundamentally less effective than eliminating the latency at the source, by removing the invalidation effect of stores. Applying each method, we observe performance changes in application benchmarks ranging from 1% to 14% for an enhanced version of MESTI, -1.0% to 9% for LVP, -3% to 9% for enhanced SLE, and 2% to 21% for combined techniques
Reaping the Benefit of Temporal Silence to Improve Communication Performance. Kevin M. Lepak, Mikko H. Lipasti. IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005). https://doi.org/10.1109/ISPASS.2005.1430580