Data-triggered threads: Eliminating redundant computation
Hung-Wei Tseng, D. Tullsen
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749727
This paper introduces the concept of data-triggered threads. Unlike threads in conventional parallel programming models, these threads are initiated by a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes and is skipped whenever the data does not change. The C SPEC benchmarks show speedups of up to 5.9X, averaging 46%.
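As a rough software-level illustration of the idea (the paper itself relies on language, compiler, and hardware support that is not shown here), the sketch below re-runs a dependent computation only when the watched field actually changes; the names `update_field`, `recompute`, and `get_sum` are hypothetical.

```c
#include <stdio.h>
#include <stdbool.h>

/* Software analogy for a data-triggered thread: the expensive work
 * re-runs only when the watched memory location actually changes. */
typedef struct {
    int  field;        /* the watched memory location              */
    long cached_sum;   /* result produced by the "triggered" work   */
    bool dirty;        /* set on a real change, cleared after reuse */
} watched_t;

static void recompute(watched_t *w) {
    long sum = 0;
    for (int i = 0; i < 1000; i++)       /* stand-in for real work */
        sum += (long)w->field * i;
    w->cached_sum = sum;
    w->dirty = false;
}

/* Store that triggers the computation only on an actual change. */
static void update_field(watched_t *w, int value) {
    if (w->field != value) {             /* redundant writes are skipped */
        w->field = value;
        w->dirty = true;
    }
}

static long get_sum(watched_t *w) {
    if (w->dirty)
        recompute(w);                    /* runs once per real change */
    return w->cached_sum;
}

int main(void) {
    watched_t w = { .field = 0, .cached_sum = 0, .dirty = true };
    update_field(&w, 7);
    printf("%ld\n", get_sum(&w));        /* triggers recompute       */
    update_field(&w, 7);                 /* same value: no trigger   */
    printf("%ld\n", get_sum(&w));        /* reuses the cached result */
    return 0;
}
```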
{"title":"Data-triggered threads: Eliminating redundant computation","authors":"Hung-Wei Tseng, D. Tullsen","doi":"10.1109/HPCA.2011.5749727","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749727","url":null,"abstract":"This paper introduces the concept of data-triggered threads. Unlike threads in parallel programs in conventional programming models, these threads are initiated on a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This paper focuses primarily on the latter. It is shown that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes, and is skipped whenever the data does not change. The set of C SPEC benchmarks show performance speedup of up to 5.9X, and averaging 46%.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130458732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast thread migration via cache working set prediction
Jeffery A. Brown, Leo Porter, D. Tullsen
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749728
The most significant source of lost performance when a thread migrates between cores is the loss of cache state. A significant boost in post-migration performance is possible if the cache working set can be moved, proactively, with the thread. This work accelerates thread startup performance after migration by predicting and prefetching the working set of the application into the new cache. It shows that simply moving cache state performs poorly, and that moving the instruction working set can be even more critical than moving data. This paper demonstrates a technique that captures the access behavior of a thread, summarizes that behavior into a compact form for transfer between cores, and then prefetches appropriate data into the new caches based on the summary. It presents a detailed study of single-thread migration effects, and then demonstrates its utility on a speculative multithreading architecture. Working set prediction as much as doubles the performance of short-lived threads, and in a full speculative multithreading implementation, the technique is also shown to nearly double the effectiveness of the spawned threads.
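The summarize-and-prefetch flow can be pictured, very loosely, in software: group touched blocks into per-region presence bitmaps, transfer the bitmaps, and replay them as prefetch addresses at the destination. The sizes and names below (`ws_summary_t`, `record_access`, `replay_as_prefetches`) are illustrative assumptions, not the paper's hardware format.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Toy working-set summary: one bitmap of touched 64 B blocks per 4 KB region. */
#define BLOCK_BITS   6            /* 64 B cache blocks        */
#define REGION_BITS  12           /* 4 KB regions             */
#define MAX_REGIONS  16           /* compact summary capacity */

typedef struct {
    uint64_t region_base[MAX_REGIONS];  /* region start addresses    */
    uint64_t block_mask[MAX_REGIONS];   /* which blocks were touched */
    int      count;
} ws_summary_t;

static void record_access(ws_summary_t *s, uint64_t addr) {
    uint64_t base  = addr >> REGION_BITS << REGION_BITS;
    int      block = (int)((addr >> BLOCK_BITS) & 63);
    for (int i = 0; i < s->count; i++)
        if (s->region_base[i] == base) { s->block_mask[i] |= 1ULL << block; return; }
    if (s->count < MAX_REGIONS) {
        s->region_base[s->count] = base;
        s->block_mask[s->count]  = 1ULL << block;
        s->count++;
    }                                   /* else: the summary is lossy by design */
}

static void replay_as_prefetches(const ws_summary_t *s) {
    for (int i = 0; i < s->count; i++)
        for (int b = 0; b < 64; b++)
            if (s->block_mask[i] & (1ULL << b))
                printf("prefetch 0x%llx\n",
                       (unsigned long long)(s->region_base[i] + ((uint64_t)b << BLOCK_BITS)));
}

int main(void) {
    ws_summary_t s; memset(&s, 0, sizeof s);
    record_access(&s, 0x100040);   /* two blocks in the same region */
    record_access(&s, 0x100080);
    record_access(&s, 0x205000);   /* a block in another region     */
    replay_as_prefetches(&s);      /* what the destination core would fetch */
    return 0;
}
```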
{"title":"Fast thread migration via cache working set prediction","authors":"Jeffery A. Brown, Leo Porter, D. Tullsen","doi":"10.1109/HPCA.2011.5749728","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749728","url":null,"abstract":"The most significant source of lost performance when a thread migrates between cores is the loss of cache state. A significant boost in post-migration performance is possible if the cache working set can be moved, proactively, with the thread. This work accelerates thread startup performance after migration by predicting and prefetching the working set of the application into the new cache. It shows that simply moving cache state performs poorly, and that moving the instruction working set can be even more critical than data. This paper demonstrates a technique that captures the access behavior of a thread, summarizes that behavior into a compact form for transfer between cores, and then prefetches appropriate data into the new caches based on the summary. It presents a detailed study of single-thread migration effects, and then demonstrates its utility on a speculative multithreading architecture. Working set prediction as much as doubles the performance of short-lived threads, and in a full speculative multithreading implementation, the technique is also shown to nearly double the effectiveness of the spawned threads.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131537951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FREE-p: Protecting non-volatile memory against both hard and soft errors
D. Yoon, Naveen Muralimanohar, Jichuan Chang, Parthasarathy Ranganathan, N. Jouppi, M. Erez
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749752
Emerging non-volatile memories such as phase-change RAM (PCRAM) offer significant advantages but suffer from write endurance problems. Prior solutions, however, are oblivious to soft errors (recently raised as a potential issue even for PCRAM) and are incompatible with high-level fault tolerance techniques such as chipkill. Addressing such failures on top of techniques that focus solely on wear-out tolerance incurs unnecessarily high cost. In this paper, we propose fine-grained remapping with ECC and embedded pointers (FREE-p). FREE-p remaps fine-grained worn-out NVRAM blocks without requiring large dedicated storage. We discuss how FREE-p protects against both hard and soft errors and how it can be extended to chipkill. Further, FREE-p can be implemented purely in the memory controller, avoiding custom NVRAM devices. In addition to these benefits, FREE-p increases NVRAM lifetime by up to 26% over the state of the art even with severe process variation, while performance degradation is less than 2% for the first 7 years.
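A loose software analogy for the embedded-pointer remapping (not FREE-p's actual block layout or ECC handling, which are far more involved): when a block wears out, its surviving bits store a forwarding pointer to the replacement block, so reads simply follow the pointer instead of consulting a large remap table.

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical sketch of embedded-pointer remapping for worn-out NVRAM
 * blocks. Field layout is illustrative only. */
#define NBLOCKS 8

typedef struct {
    int      dead;        /* block has exceeded its write endurance       */
    uint32_t fwd;         /* if dead: index of the remapped block         */
    uint64_t data;        /* if alive: the stored (ECC-protected) payload */
} nvm_block_t;

static nvm_block_t nvm[NBLOCKS];

static uint64_t nvm_read(uint32_t idx) {
    while (nvm[idx].dead)          /* follow embedded pointer(s);  */
        idx = nvm[idx].fwd;        /* usually a single extra hop   */
    return nvm[idx].data;
}

static void retire_block(uint32_t idx, uint32_t spare) {
    nvm[spare].data = nvm[idx].data;   /* migrate the payload            */
    nvm[idx].dead   = 1;               /* reuse the worn block's good    */
    nvm[idx].fwd    = spare;           /* bits for a forwarding pointer  */
}

int main(void) {
    nvm[2].data = 0xABCD;
    retire_block(2, 6);                /* block 2 wears out, remapped to 6 */
    printf("read -> 0x%llx\n", (unsigned long long)nvm_read(2));
    return 0;
}
```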
{"title":"FREE-p: Protecting non-volatile memory against both hard and soft errors","authors":"D. Yoon, Naveen Muralimanohar, Jichuan Chang, Parthasarathy Ranganathan, N. Jouppi, M. Erez","doi":"10.1109/HPCA.2011.5749752","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749752","url":null,"abstract":"Emerging non-volatile memories such as phase-change RAM (PCRAM) offer significant advantages but suffer from write endurance problems. However, prior solutions are oblivious to soft errors (recently raised as a potential issue even for PCRAM) and are incompatible with high-level fault tolerance techniques such as chipkill. To additionally address such failures requires unnecessarily high costs for techniques that focus singularly on wear-out tolerance. In this paper, we propose fine-grained remapping with ECC and embedded pointers (FREE-p). FREE-p remaps fine-grained worn-out NVRAM blocks without requiring large dedicated storage. We discuss how FREE-p protects against both hard and soft errors and can be extended to chipkill. Further, FREE-p can be implemented purely in the memory controller, avoiding custom NVRAM devices. In addition to these benefits, FREE-p increases NVRAM lifetime by up to 26% over the state-of-the-art even with severe process variation while performance degradation is less than 2% for the initial 7 years.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134426930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CHIPPER: A low-complexity bufferless deflection router
Chris Fallin, Chris Craik, O. Mutlu
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749724
As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multiprogrammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.
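A much-simplified illustration of the deflection-arbitration and token ideas (not CHIPPER's actual partial permutation network): every valid flit leaves on some output each cycle, and a single "golden" flit is always granted its desired port, loosely mirroring the implicit token-passing priority that provides livelock freedom. Port counts and field names below are assumptions for the sketch.

```c
#include <stdio.h>

/* Toy deflection-style port arbitration: no buffering, so flits that
 * lose arbitration are deflected to any free output port. */
#define PORTS 4

typedef struct { int valid, dest_port, golden; } flit_t;

static void arbitrate(flit_t in[PORTS], int out_assign[PORTS]) {
    int taken[PORTS] = {0};
    /* Pass 1: golden flit first, then the rest, claim desired ports. */
    for (int pass = 0; pass < 2; pass++)
        for (int i = 0; i < PORTS; i++) {
            if (!in[i].valid || out_assign[i] >= 0) continue;
            if ((pass == 0) != (in[i].golden != 0)) continue;
            int d = in[i].dest_port;
            if (!taken[d]) { taken[d] = 1; out_assign[i] = d; }
        }
    /* Pass 2: everyone left over is deflected to any free port. */
    for (int i = 0; i < PORTS; i++) {
        if (!in[i].valid || out_assign[i] >= 0) continue;
        for (int d = 0; d < PORTS; d++)
            if (!taken[d]) { taken[d] = 1; out_assign[i] = d; break; }
    }
}

int main(void) {
    flit_t in[PORTS] = {
        {1, 2, 0}, {1, 2, 1}, {1, 0, 0}, {0, 0, 0}   /* two flits want port 2 */
    };
    int out[PORTS] = { -1, -1, -1, -1 };
    arbitrate(in, out);
    for (int i = 0; i < PORTS; i++)
        if (in[i].valid)
            printf("flit %d -> output %d%s\n", i, out[i],
                   out[i] == in[i].dest_port ? "" : " (deflected)");
    return 0;
}
```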
{"title":"CHIPPER: A low-complexity bufferless deflection router","authors":"Chris Fallin, Chris Craik, O. Mutlu","doi":"10.1109/HPCA.2011.5749724","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749724","url":null,"abstract":"As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of buffers makes this choice potentially appealing, especially for low-to-medium network loads. However, current bufferless designs usually add complexity to control logic. Deflection routing introduces a sequential dependence in port allocation, yielding a slow critical path. Explicit mechanisms are required for livelock freedom due to the non-minimal nature of deflection. Finally, deflection routing can fragment packets, and the reassembly buffers require large worst-case sizing to avoid deadlock, due to the lack of network backpressure. The complexity that arises out of these three problems has discouraged practical adoption of bufferless routing. To counter this, we propose CHIPPER (Cheap-Interconnect Partially Permuting Router), a simplified router microarchitecture that eliminates in-router buffers and the crossbar. We introduce three key insights: first, that deflection routing port allocation maps naturally to a permutation network within the router; second, that livelock freedom requires only an implicit token-passing scheme, eliminating expensive age-based priorities; and finally, that flow control can provide correctness in the absence of network backpressure, avoiding deadlock and allowing cache miss buffers (MSHRs) to be used as reassembly buffers. Using multiprogrammed SPEC CPU2006, server, and desktop application workloads and SPLASH-2 multithreaded workloads, we achieve an average 54.9% network power reduction for 13.6% average performance degradation (multipro-grammed) and 73.4% power reduction for 1.9% slowdown (multithreaded), with minimal degradation and large power savings at low-to-medium load. Finally, we show 36.2% router area reduction relative to buffered routing, with comparable timing.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127545125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CloudCache: Expanding and shrinking private caches
Hyunjin Lee, Sangyeun Cho, B. Childers
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749731
The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. These trends expose two problems that make existing L2 cache management techniques less effective: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
{"title":"CloudCache: Expanding and shrinking private caches","authors":"Hyunjin Lee, Sangyeun Cho, B. Childers","doi":"10.1109/HPCA.2011.5749731","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749731","url":null,"abstract":"The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less effective for two problems: increased capacity interference between working cores and longer L2 access latency. We propose a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control. The key architectural components of CloudCache are L2 cache chaining, inter- and intra-bank cache partitioning, and a performance-optimized coherence protocol. Our extensive experimental evaluation demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128157723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Relaxing non-volatility for fast and energy-efficient STT-RAM caches
Clinton Wills Smullen IV, Vidyabhushan Mohan, Anurag Nigam, S. Gurumurthi, M. Stan
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749716
Spin-Transfer Torque RAM (STT-RAM) is an emerging non-volatile memory technology and a potential universal memory that could replace SRAM in processor caches. This paper presents a novel approach to redesigning STT-RAM memory cells to reduce their high dynamic energy and long write latency. We lower the retention time by reducing the planar area of the cell, thereby reducing the write current, and use the resulting cells with CACTI to design caches and memories. We simulate quad-core processor designs using a combination of SRAM- and STT-RAM-based caches. Since ultra-low-retention STT-RAM may lose data, we also provide a preliminary evaluation of a simple, DRAM-style refresh policy. We find that a pure STT-RAM cache hierarchy provides the best energy efficiency, though a hybrid design of SRAM-based L1 caches with reduced-retention STT-RAM L2 and L3 caches eliminates performance loss while still reducing the energy-delay product by more than 70%.
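The mentioned DRAM-style refresh can be pictured as a simple round-robin rewrite of cache lines within the retention budget; the sketch below uses placeholder line counts and retention values, not figures from the paper.

```c
#include <stdio.h>

/* Illustrative round-robin refresh for a reduced-retention STT-RAM cache:
 * every line is rewritten before its (shortened) retention time expires.
 * LINES and RETENTION_US are placeholders, not figures from the paper. */
#define LINES           1024
#define RETENTION_US    1000000                    /* assumed 1 s retention */
#define REFRESH_STEP_US (RETENTION_US / LINES)     /* one line per step     */

static void refresh_line(int line) {
    (void)line;   /* in hardware: read the line and write it back */
}

int main(void) {
    long now_us = 0;
    int  next_line = 0;

    for (int step = 0; step < 3 * LINES; step++) {   /* three full sweeps */
        refresh_line(next_line);
        next_line = (next_line + 1) % LINES;
        now_us += REFRESH_STEP_US;
    }
    printf("simulated %ld us: each of %d lines refreshed every %d us "
           "(retention budget %d us)\n",
           now_us, LINES, REFRESH_STEP_US * LINES, RETENTION_US);
    return 0;
}
```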
{"title":"Relaxing non-volatility for fast and energy-efficient STT-RAM caches","authors":"IV ClintonWillsSmullen, Vidyabhushan Mohan, Anurag Nigam, S. Gurumurthi, M. Stan","doi":"10.1109/HPCA.2011.5749716","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749716","url":null,"abstract":"Spin-Transfer Torque RAM (STT-RAM) is an emerging non-volatile memory technology that is a potential universal memory that could replace SRAM in processor caches. This paper presents a novel approach for redesigning STT-RAM memory cells to reduce the high dynamic energy and slow write latencies. We lower the retention time by reducing the planar area of the cell, thereby reducing the write current, which we then use with CACTI to design caches and memories. We simulate quad-core processor designs using a combination of SRAM- and STT-RAM-based caches. Since ultra-low retention STT-RAM may lose data, we also provide a preliminary evaluation for a simple, DRAMstyle refresh policy. We found that a pure STT-RAM cache hierarchy provides the best energy efficiency, though a hybrid design of SRAM-based L1 caches with reduced-retention STT-RAM L2 and L3 caches eliminates performance loss while still reducing the energy-delay product by more than 70%.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124188110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power shifting in Thrifty Interconnection Network
Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749725
This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose Thrifty Interconnection Network (TIN), where network links are activated and de-activated dynamically with little or no overhead, using inherent system events to trigger link activation or de-activation in a timely manner. Second, we propose Network Power Shifting (NPS), which dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network just in time, before network communication is about to happen, and thriftily puts them into a low-power mode when communication finishes, reducing unnecessary network power consumption. Furthermore, with NPS the compute nodes can absorb the extra power budget shifted from their attached network components and increase their processor frequency for higher performance. Our simulation results on a set of real-world workload traces show that TIN achieves an average 60% network power reduction with support for only one low-power mode. When NPS is enabled, the two together achieve a 12% application performance improvement and a 13% overall system energy reduction. Further performance improvement is possible if, with more aggressive cooling support, the compute nodes can speed up further and fully utilize the extra power budget reinvested from the thrifty network.
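The power-shifting decision can be illustrated with a toy budget calculation: when the attached links sit in their low-power mode, their budget is added to the node's, and the node picks the highest frequency step that still fits. The frequency/power table below is purely illustrative, not data from the paper.

```c
#include <stdio.h>

/* Toy sketch of the Network Power Shifting idea. All numbers are
 * illustrative placeholders. */
#define NSTEPS 4

static const double freq_ghz[NSTEPS]    = { 2.0, 2.4, 2.8, 3.2 };
static const double cpu_power_w[NSTEPS] = { 60, 75, 92, 110 };

static double pick_frequency(double node_budget_w, double link_budget_w, int links_idle) {
    double budget = node_budget_w + (links_idle ? link_budget_w : 0.0);
    double best = freq_ghz[0];                /* lowest step is the fallback */
    for (int i = 0; i < NSTEPS; i++)
        if (cpu_power_w[i] <= budget)
            best = freq_ghz[i];               /* highest step that fits      */
    return best;
}

int main(void) {
    printf("links active: %.1f GHz\n", pick_frequency(95.0, 20.0, 0));
    printf("links idle:   %.1f GHz\n", pick_frequency(95.0, 20.0, 1));
    return 0;
}
```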
{"title":"Power shifting in Thrifty Interconnection Network","authors":"Jian Li, Wei Huang, C. Lefurgy, Lixin Zhang, W. Denzel, Richard R. Treumann, Kun Wang","doi":"10.1109/HPCA.2011.5749725","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749725","url":null,"abstract":"This paper presents two complementary techniques to manage the power consumption of large-scale systems with a packet-switched interconnection network. First, we propose Thrifty Interconnection Network (TIN), where the network links are activated and de-activated dynamically with little or no overhead by using inherent system events to timely trigger link activation or de-activation. Second, we propose Network Power Shifting (NPS) that dynamically shifts the power budget between the compute nodes and their corresponding network components. TIN activates and trains the links in the interconnection network, just-in-time before the network communication is about to happen, and thriftily puts them into a low-power mode when communication is finished, hence reducing unnecessary network power consumption. Furthermore, the compute nodes can absorb the extra power budget shifted from its attached network components and increase their processor frequency for higher performance with NPS. Our simulation results on a set of real-world workload traces show that TIN can achieve on average 60% network power reduction, with the support of only one low-power mode. When NPS is enabled, the two together can achieve 12% application performance improvement and 13% overall system energy reduction. Further performance improvement is possible if the compute nodes can speed up more and fully utilize the extra power budget reinvested from the thrifty network with more aggressive cooling support.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133993379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MOPED: Orchestrating interprocess message data on CMPs
Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749721
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, each core has fewer cache resources, which must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit movement of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2, in which MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.
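The communication/computation overlap that MOPED offloads to hardware can be pictured with standard non-blocking MPI; a minimal sketch (assuming an even number of ranks and a working MPI installation, compiled with mpicc) follows. With MOPED, the synchronization and buffer copying behind these calls would be handled by the cache controllers rather than the cores.

```c
#include <mpi.h>
#include <stdio.h>

/* Overlap of communication and computation with non-blocking MPI.
 * Run with an even number of ranks, e.g. mpirun -np 2 ./a.out */
#define N 1024

int main(int argc, char **argv) {
    int rank, peer;
    double sendbuf[N], recvbuf[N], local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = rank ^ 1;                      /* pair ranks 0-1, 2-3, ... */

    for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < N; i++)           /* useful work overlapped      */
        local += sendbuf[i] * 0.5;        /* with the in-flight messages */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d: local %.1f, first received %.1f\n", rank, local, recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```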
{"title":"MOPED: Orchestrating interprocess message data on CMPs","authors":"Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun","doi":"10.1109/HPCA.2011.5749721","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749721","url":null,"abstract":"Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134314487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keynote address II: How's the parallel computing revolution going?
K. McKinley
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749730
Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous cycle between performance improvement and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.
{"title":"Keynote address II: How's the parallel computing revolution going?","authors":"K. McKinley","doi":"10.1109/HPCA.2011.5749730","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749730","url":null,"abstract":"Two trends changed the computing landscape over the past decade: (1) hardware vendors started delivering chip multiprocessors (CMPs) instead of uniprocessors, and (2) software developers increasingly chose managed languages instead of native languages. Unfortunately, the former change is disrupting the virtuous-cycle between performance improvements and software innovation. Establishing a new parallel performance virtuous cycle for managed languages will require scalable applications executing on scalable Virtual Machine (VM) services, since the VM schedules, monitors, compiles, optimizes, garbage collects, and executes together with the application. This talk describes current progress, opportunities, and challenges for scalable VM services. The parallel computing revolution urgently needs more innovations.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114538613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstraction and microarchitecture scaling in early-stage power modeling
H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer
Pub Date: 2011-02-12 | DOI: 10.1109/HPCA.2011.5749746
Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade or more. Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question in this context is which utilization metrics to monitor, and how many are needed for accuracy. Is there a systematic way to select the “best” markers? We also pose a follow-on question: is it possible to accurately scale an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these questions and examine the results for a range of abstraction levels. We highlight insights that enable intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.
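A utilization-marker power model of the kind the paper abstracts boils down to an idle term plus a weighted sum of per-unit activity counts reported by the performance simulator; the sketch below shows that shape. The marker list and weights are hypothetical, not values from the paper.

```c
#include <stdio.h>

/* Minimal utilization-marker power model: idle power plus a weighted
 * sum of activity counters. Markers and weights are hypothetical. */
#define NMARKERS 4

static const char  *marker_name[NMARKERS] = { "fetch", "issue", "fp_exec", "l2_access" };
static const double weight_mw[NMARKERS]   = { 0.8, 1.2, 2.5, 3.0 };  /* mW per event/cycle */

static double estimate_power_mw(double idle_mw, const double util[NMARKERS]) {
    double p = idle_mw;
    for (int i = 0; i < NMARKERS; i++)
        p += weight_mw[i] * util[i];       /* utilization in events per cycle */
    return p;
}

int main(void) {
    /* Example per-cycle utilizations as a performance model might report. */
    double util[NMARKERS] = { 0.9, 1.6, 0.3, 0.05 };
    printf("estimated core power: %.1f mW\n", estimate_power_mw(500.0, util));
    for (int i = 0; i < NMARKERS; i++)
        printf("  %-10s contributes %.1f mW\n", marker_name[i], weight_mw[i] * util[i]);
    return 0;
}
```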
{"title":"Abstraction and microarchitecture scaling in early-stage power modeling","authors":"H. Jacobson, A. Buyuktosunoglu, P. Bose, E. Acar, R. Eickemeyer","doi":"10.1109/HPCA.2011.5749746","DOIUrl":"https://doi.org/10.1109/HPCA.2011.5749746","url":null,"abstract":"Early-stage, microarchitecture-level power modeling methodologies have been used in industry and academic research for a decade (or more). Such methods use cycle-accurate performance simulators and deduce active power based on utilization markers. A key question faced in this context is: what key utilization metrics to monitor, and how many are needed for accuracy? Is there a systematic way to select the “best” markers? We also pose a key follow-on question: is it possible to perform accurate scaling of an abstracted model to enable exploration of new microarchitecture features? In this paper, we address these particular questions and examine the results for a range of abstraction levels. We highlight innovative insights for intelligent abstraction and microarchitecture scaling, and point out the pitfalls of abstractions that are not based on a systematic methodology or sound theory.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"343 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134158605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}