From sequences of dependent instructions to functions: an approach for improving performance without ILP or speculation
S. Yehia, O. Temam. DOI: 10.1145/1028176.1006721

In this article, we present an approach for improving the performance of sequences of dependent instructions. We observe that many sequences of instructions can be interpreted as functions. Unlike sequences of instructions, functions can be translated into very fast but exponentially costly two-level combinational circuits. We present an approach that exploits this principle to speed up programs through circuit-level parallelism and redundancy while avoiding the exponential costs. We analyze the potential of this approach, and then propose an implementation that consists of a superscalar processor augmented with a large dedicated functional unit and associated back-end transformations. The performance of the SpecInt2000 benchmarks and selected programs from the Olden and MiBench benchmark suites improves on average by 2.4% to 12%, depending on the latency of the functional units, and by up to 39.6%; more precisely, the performance of optimized code sections improves on average by 3.5% to 19%, and by up to 49%.
Synchroscalar: a multiple clock domain, power-aware, tile-based embedded processor
J. Oliver, Ravishankar Rao, P. Sultana, Jedidiah R. Crandall, E. Czernikowski, Leslie W. Jones IV, D. Franklin, V. Akella, F. Chong. DOI: 10.1109/isca.2004.1310771

We present Synchroscalar, a tile-based architecture for embedded processing that is designed to provide the flexibility of DSPs while approaching the power efficiency of ASICs. We achieve this goal by providing high parallelism and voltage scaling while minimizing control and communication costs. Specifically, Synchroscalar uses columns of processor tiles organized into statically-assigned frequency-voltage domains to minimize power consumption. Furthermore, while columns use SIMD control to minimize overhead, data-dependent computations can be supported by extremely flexible statically-scheduled communication between columns. We provide a detailed evaluation of Synchroscalar including SPICE simulation, wire and device models, synthesis of key components, cycle-level simulation, and compiler- and hand-optimized signal processing applications. We find that the goal of meeting, not exceeding, performance targets with data-parallel applications leads to designs that depart significantly from our intuitions derived from general-purpose microprocessor design. In particular, synchronous design and substantial global interconnect are desirable in the low-frequency, low-power domain. This global interconnect supports parallelization and reduces processor idle time, which are critical to energy-efficient implementations of high-bandwidth signal processing. Overall, Synchroscalar provides programmability while achieving power efficiencies within 8-30× of known ASIC implementations, which is 10-60× better than conventional DSPs. In addition, frequency-voltage scaling in Synchroscalar provides 3-32% power savings in our application suite.
iWatcher: efficient architectural support for software debugging
Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, J. Torrellas. DOI: 10.1145/1028176.1006720

Recent impressive performance improvements in computer architecture have not led to significant gains in ease of debugging. Software debugging often relies on inserting run-time software checks. In many cases, however, it is hard to find the root cause of a bug. Moreover, program execution typically slows down significantly, often by 10-100 times. To address this problem, this paper introduces the Intelligent Watcher (iWatcher), novel architectural support for monitoring dynamic execution automatically, flexibly, and with minimal overhead. iWatcher associates program-specified monitoring functions with memory locations. When any such location is accessed, the monitoring function is automatically triggered with low overhead. To further reduce overhead and support rollback, iWatcher can leverage Thread-Level Speculation (TLS). To test iWatcher, we use applications with various bugs. Our results show that iWatcher detects many more software bugs than Valgrind, a well-known open-source bug detector. Moreover, iWatcher only induces a 4-80% execution overhead, which is orders of magnitude less than Valgrind. Even with 20% of the dynamic loads monitored in a program, iWatcher adds only 66-174% overhead. Finally, TLS is effective at reducing overheads for programs with substantial monitoring.
Control flow modeling in statistical simulation for accurate and efficient processor design studies
L. Eeckhout, R. Bell, Bastiaan Stougie, K. D. Bosschere, L. John. DOI: 10.1145/1028176.1006730

Designing a new microprocessor is extremely time-consuming, in large part because computer designers rely heavily on detailed architectural simulations, which are themselves very slow. Recent work has focused on statistical simulation to address this issue. The basic idea of statistical simulation is to measure characteristics during program execution, generate a synthetic trace with those characteristics and then simulate the synthetic trace. The statistically generated synthetic trace is orders of magnitude smaller than the original program sequence and hence results in significantly faster simulation. This paper makes the following contributions to the statistical simulation methodology. First, we propose the use of a statistical flow graph to characterize the control flow of a program execution. Second, we model delayed update of branch predictors while profiling program execution characteristics. Experimental results show that statistical simulation using this improved control flow modeling attains significantly better accuracy than the previously proposed HLS system. We evaluate both the absolute and the relative accuracy of our approach for power/performance modeling of superscalar microarchitectures. The results show that our statistical simulation framework can be used to efficiently explore processor design spaces.
Physical register inlining
Mikko H. Lipasti, Brian R. Mestan, Erika Gunadi. DOI: 10.1145/1028176.1006728

Physical register access time increases the delay between scheduling and execution in modern out-of-order processors. As the number of physical registers increases, this delay grows, forcing designers to employ register files with multicycle access. This paper advocates more efficient utilization of fewer physical registers in order to reduce the access time of the physical register file. Register values with few significant bits are stored in the rename map using physical register inlining, a scheme analogous to inlining of operand fields in data structures. Specifically, whenever a register value can be expressed with fewer bits than the register map would need to specify a physical register number, the value is stored directly in the map, avoiding the indirection, and saving space in the physical register file. Not surprisingly, we find that a significant portion of all register operands can be stored in the map in this fashion, and describe straightforward microarchitectural extensions that correctly implement physical register inlining. We find that physical register inlining performs well, particularly in processors that are register-constrained.
Low-latency virtual-channel routers for on-chip networks
R. Mullins, A. West, S. Moore. DOI: 10.1145/1028176.1006717

The on-chip communication requirements of many systems are best served through the deployment of a regular chip-wide network. This paper presents the design of a low-latency on-chip network router for such applications. We remove control overheads (routing and arbitration logic) from the critical path in order to minimise cycle-time and latency. Simulations illustrate that dramatic cycle time improvements are possible without compromising router efficiency. Furthermore, these reductions permit flits to be routed in a single cycle, maximising the effectiveness of the router's limited buffering resources.
Use-based register caching with decoupled indexing
J. A. Butts, G. Sohi. DOI: 10.1145/1028176.1006724

Wide, deep pipelines need many physical registers to hold the results of in-flight instructions. Simultaneously, high clock frequencies prohibit using large register files and bypass networks without a significant performance penalty. Previously proposed techniques using register caching to reduce this penalty suffer from several problems including poor insertion and replacement decisions and the need for a fully-associative cache for good performance. We present novel mechanisms for managing and indexing register caches that address these problems using knowledge of the number of consumers of each register value. The insertion policy reduces pollution by not caching a register value when all of its predicted consumers are satisfied by the bypass network. The replacement policy selects register cache entries with the fewest remaining uses (often zero), lowering the miss rate. We also introduce a new, general method of mapping physical registers to register cache sets that improves the performance of set-associative cache organizations by reducing conflicts. Our results indicate that a 64-entry, two-way set associative cache using these techniques outperforms multi-cycle monolithic register files and previously proposed hierarchical register files.
Memory ordering: a value-based approach
Harold W. Cain, Mikko H. Lipasti. DOI: 10.1109/MM.2004.81

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
Evaluating the Imagine stream architecture
Jung Ho Ahn, W. Dally, Brucek Khailany, U. Kapasi, Abhishek Das. DOI: 10.1145/1028176.1006734

This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine (Kapasi et al., 2002) is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance; of the remainder, on average 36.4% of execution time is lost to load imbalance between arithmetic units in the VLIW clusters and to limited instruction-level parallelism within kernel inner loops, 10.6% to kernel startup and shutdown overhead caused by short stream lengths, 7.6% to memory stalls, and the rest to insufficient host-processor bandwidth. Further analysis in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
Transactional memory coherence and consistency
Lance Hammond, Vicky Wong, Michael K. Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, C. Kozyrakis, K. Olukotun. DOI: 10.1145/1028176.1006711

In this paper, we propose a new shared memory model: transactional memory coherence and consistency (TCC). TCC provides a model in which atomic transactions are always the basic unit of parallel work, communication, memory coherence, and memory reference consistency. TCC greatly simplifies parallel software by eliminating the need for synchronization using conventional locks and semaphores, along with their complexities. TCC hardware must combine all writes from each transaction region in a program into a single packet and broadcast this packet to the permanent shared memory state atomically as a large block. This simplifies the coherence hardware because it reduces the need for small, low-latency messages and completely eliminates the need for conventional snoopy cache coherence protocols, as multiple speculatively written versions of a cache line may safely coexist within the system. Meanwhile, automatic, hardware-controlled rollback of speculative transactions resolves any correctness violations that may occur when several processors attempt to read and write the same data simultaneously. The cost of this simplified scheme is higher interprocessor bandwidth. To explore the costs and benefits of TCC, we study the characteristics of an optimal transaction-based memory system, and examine how different design parameters could affect the performance of real systems. Across a spectrum of applications, the TCC model itself did not limit available parallelism. Most applications are easily divided into transactions requiring only small write buffers, on the order of 4-8 KB. The broadcast requirements of TCC are high, but are well within the capabilities of CMPs and small-scale SMPs with high-speed interconnects.