From sequences of dependent instructions to functions: an approach for improving performance without ILP or speculation
S. Yehia, O. Temam. DOI: 10.1145/1028176.1006721

In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-level combinational circuits. We present an approach that exploits this principle to speed up programs through circuit-level parallelism and redundancy while avoiding the exponential costs. We analyze the potential of this approach, and then propose an implementation that consists of a superscalar processor augmented with a large dedicated functional unit and associated back-end transformations. The performance of the SpecInt2000 benchmarks and selected programs from the Olden and MiBench benchmark suites improves on average by 2.4% to 12%, depending on the latency of the functional units, and by up to 39.6%; more precisely, the performance of optimized code sections improves on average by 3.5% to 19%, and by up to 49%.
Synchroscalar: a multiple clock domain, power-aware, tile-based embedded processor
J. Oliver, Ravishankar Rao, P. Sultana, Jedidiah R. Crandall, E. Czernikowski, Leslie W. Jones IV, D. Franklin, V. Akella, F. Chong. DOI: 10.1109/isca.2004.1310771

We present Synchroscalar, a tile-based architecture for embedded processing that is designed to provide the flexibility of DSPs while approaching the power efficiency of ASICs. We achieve this goal by providing high parallelism and voltage scaling while minimizing control and communication costs. Specifically, Synchroscalar uses columns of processor tiles organized into statically-assigned frequency-voltage domains to minimize power consumption. Furthermore, while columns use SIMD control to minimize overhead, data-dependent computations can be supported by extremely flexible statically-scheduled communication between columns. We provide a detailed evaluation of Synchroscalar including SPICE simulation, wire and device models, synthesis of key components, cycle-level simulation, and compiler- and hand-optimized signal processing applications. We find that the goal of meeting, not exceeding, performance targets with data-parallel applications leads to designs that depart significantly from our intuitions derived from general-purpose microprocessor design. In particular, synchronous design and substantial global interconnect are desirable in the low-frequency, low-power domain. This global interconnect supports parallelization and reduces processor idle time, which are critical to energy-efficient implementations of high-bandwidth signal processing. Overall, Synchroscalar provides programmability while achieving power efficiencies within 8-30× of known ASIC implementations, which is 10-60× better than conventional DSPs. In addition, frequency-voltage scaling in Synchroscalar provides 3-32% power savings in our application suite.
iWatcher: efficient architectural support for software debugging
Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, J. Torrellas. DOI: 10.1145/1028176.1006720

Recent impressive performance improvements in computer architecture have not led to significant gains in ease of debugging. Software debugging often relies on inserting run-time software checks. In many cases, however, it is hard to find the root cause of a bug. Moreover, program execution typically slows down significantly, often by 10-100 times. To address this problem, this paper introduces the Intelligent Watcher (iWatcher), novel architectural support for monitoring dynamic execution automatically, flexibly, and with minimal overhead. iWatcher associates program-specified monitoring functions with memory locations. When any such location is accessed, the monitoring function is automatically triggered with low overhead. To further reduce overhead and support rollback, iWatcher can leverage Thread-Level Speculation (TLS). To test iWatcher, we use applications with various bugs. Our results show that iWatcher detects many more software bugs than Valgrind, a well-known open-source bug detector. Moreover, iWatcher only induces a 4-80% execution overhead, which is orders of magnitude less than Valgrind. Even with 20% of the dynamic loads monitored in a program, iWatcher adds only 66-174% overhead. Finally, TLS is effective at reducing overheads for programs with substantial monitoring.
Control flow modeling in statistical simulation for accurate and efficient processor design studies
L. Eeckhout, R. Bell, Bastiaan Stougie, K. D. Bosschere, L. John. DOI: 10.1145/1028176.1006730

Designing a new microprocessor is extremely time-consuming, in large part because computer designers rely heavily on detailed architectural simulations, which are themselves very slow. Recent work has focused on statistical simulation to address this issue. The basic idea of statistical simulation is to measure characteristics during program execution, generate a synthetic trace with those characteristics and then simulate the synthetic trace. The statistically generated synthetic trace is orders of magnitude smaller than the original program sequence and hence results in significantly faster simulation. This paper makes the following contributions to the statistical simulation methodology. First, we propose the use of a statistical flow graph to characterize the control flow of a program execution. Second, we model delayed update of branch predictors while profiling program execution characteristics. Experimental results show that statistical simulation using this improved control flow modeling attains significantly better accuracy than the previously proposed HLS system. We evaluate both the absolute and the relative accuracy of our approach for power/performance modeling of superscalar microarchitectures. The results show that our statistical simulation framework can be used to efficiently explore processor design spaces.
Physical register inlining
Mikko H. Lipasti, Brian R. Mestan, Erika Gunadi. DOI: 10.1145/1028176.1006728

Physical register access time increases the delay between scheduling and execution in modern out-of-order processors. As the number of physical registers increases, this delay grows, forcing designers to employ register files with multicycle access. This paper advocates more efficient utilization of fewer physical registers in order to reduce the access time of the physical register file. Register values with few significant bits are stored in the rename map using physical register inlining, a scheme analogous to inlining of operand fields in data structures. Specifically, whenever a register value can be expressed with fewer bits than the register map would need to specify a physical register number, the value is stored directly in the map, avoiding the indirection, and saving space in the physical register file. Not surprisingly, we find that a significant portion of all register operands can be stored in the map in this fashion, and describe straightforward microarchitectural extensions that correctly implement physical register inlining. We find that physical register inlining performs well, particularly in processors that are register-constrained.
Low-latency virtual-channel routers for on-chip networks
R. Mullins, A. West, S. Moore. DOI: 10.1145/1028176.1006717

The on-chip communication requirements of many systems are best served through the deployment of a regular chip-wide network. This paper presents the design of a low-latency on-chip network router for such applications. We remove control overheads (routing and arbitration logic) from the critical path in order to minimise cycle-time and latency. Simulations illustrate that dramatic cycle time improvements are possible without compromising router efficiency. Furthermore, these reductions permit flits to be routed in a single cycle, maximising the effectiveness of the router's limited buffering resources.
Use-based register caching with decoupled indexing
J. A. Butts, G. Sohi. DOI: 10.1145/1028176.1006724

Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce this penalty suffer from several problems including poor insertion and replacement decisions and the need for a fully-associative cache for good performance. We present novel mechanisms for managing and indexing register caches that address these problems using knowledge of the number of consumers of each register value. The insertion policy reduces pollution by not caching a register value when all of its predicted consumers are satisfied by the bypass network. The replacement policy selects register cache entries with the fewest remaining uses (often zero), lowering the miss rate. We also introduce a new, general method of mapping physical registers to register cache sets that improves the performance of set-associative cache organizations by reducing conflicts. Our results indicate that a 64-entry, two-way set associative cache using these techniques outperforms multi-cycle monolithic register files and previously proposed hierarchical register files.
Memory ordering: a value-based approach
Harold W. Cain, Mikko H. Lipasti. DOI: 10.1109/MM.2004.81

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
Evaluating the Imagine stream architecture
Jung Ho Ahn, W. Dally, Brucek Khailany, U. Kapasi, Abhishek Das. DOI: 10.1145/1028176.1006734

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine (Kapasi et al., 2002) is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance; of the remainder, on average 36.4% of execution time is lost to load imbalance between arithmetic units in the VLIW clusters and to limited instruction-level parallelism within kernel inner loops, 10.6% to kernel startup and shutdown overhead caused by short stream lengths, 7.6% to memory stalls, and the rest to insufficient host-processor bandwidth. Further analysis in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
Transactional memory coherence and consistency
Lance Hammond, Vicky Wong, Michael K. Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, C. Kozyrakis, K. Olukotun. DOI: 10.1145/1028176.1006711

In this paper, we propose a new shared memory model: transactional memory coherence and consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities. TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth. To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of 4-8 KB. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.