
[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture: Latest Publications

Race-free interconnection networks and multiprocessor consistency
A. Landin, Erik Hagersten, Seif Haridi
Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be performed in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write misses to shared data.
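The central condition here, that sequential consistency holds whenever all accesses can be ordered in an acyclic graph, lends itself to a small illustration. Below is a minimal, hypothetical Python sketch (the trace and names are invented, not from the paper): program-order and conflict-order edges form a directed graph, and Kahn's algorithm tests it for cycles.

```python
from collections import defaultdict, deque

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True iff the access-ordering graph has no cycle."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    queue = deque(n for n in nodes if indeg[n] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen == len(nodes)

# Illustrative trace: program order within a processor and write->read
# conflict order across processors become edges of the ordering graph.
accesses = ["P1:W(x)", "P1:R(y)", "P2:W(y)", "P2:R(x)"]
edges = [("P1:W(x)", "P1:R(y)"),   # program order on P1
         ("P2:W(y)", "P2:R(x)"),   # program order on P2
         ("P1:W(x)", "P2:R(x)"),   # P2 reads P1's write of x
         ("P2:W(y)", "P1:R(y)")]   # P1 reads P2's write of y
print(is_acyclic(accesses, edges))  # True: an SC-consistent total order exists
```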
Citations: 78
Deadlock-free multicast wormhole routing in multicomputer networks
X. Lin, L. Ni
Efficient routing of messages is the key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Wormhole routing is the most promising switching technique used in new-generation multicomputers. In this paper, we present multicast wormhole routing methods for multicomputers adopting 2D-mesh and hypercube topologies. The dual-path routing algorithm requires fewer system resources, while the multipath routing algorithm creates less traffic. More importantly, both routing algorithms are deadlock-free, which is essential to wormhole networks.
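For a flavor of how dual-path multicast avoids deadlock, here is a minimal sketch assuming the usual snake-like Hamiltonian labeling of a 2D mesh: destinations above the source's label are served by one path in ascending label order, those below by a second path in descending order. Function names and the example are illustrative, not the paper's code.

```python
def snake_label(x, y, width):
    """Label nodes of a width-wide 2D mesh along a snake-like Hamiltonian path."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def dual_path_sets(source, dests, width):
    """Partition destinations into the two ordered paths of dual-path multicast."""
    src = snake_label(*source, width)
    high = sorted((d for d in dests if snake_label(*d, width) > src),
                  key=lambda d: snake_label(*d, width))               # ascending
    low = sorted((d for d in dests if snake_label(*d, width) < src),
                 key=lambda d: snake_label(*d, width), reverse=True)  # descending
    return high, low

# Each path visits its destinations in monotonic label order, so channels are
# always requested in a consistent order -- the essence of deadlock freedom.
high, low = dual_path_sets((1, 1), [(0, 0), (2, 1), (3, 2), (0, 2)], width=4)
print(high, low)
```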
Citations: 170
OHMEGA: a VLSI superscalar processor architecture for numerical applications
M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Nishida, H. Kadota
This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture performs instruction-level scheduling statically in the compiler, and performs out-of-order issuing and execution of instructions to decrease the pipeline stalls that dynamically occur in execution. In this architecture, a pair of instructions is fetched in every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. For ease of instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of instructions, including Store-Store and Load-Store pairs; ii) a simple, low-latency, and easily-paired execution pipeline structure; and iii) high-capacity multi-ported floating-point and integer registers. Enhanced performance through the dynamic decrease of pipeline hazards is achieved by i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer ability through the pipelined data cache and 128-bit-wide bus bandwidth. An efficient data dependency resolution mechanism, realized using the novel DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order instruction issue and execution. The idea of the DTC method is similar to that of dynamic data-flow architectures with tagged tokens. The non-penalty branches are realized by three techniques: delayed branches; a LOOP instruction that executes counter decrement, compare, and branch in one clock cycle; and non-penalty conditional branches with predicted condition codes. These techniques contribute to the decrease of pipeline stalls occurring at run time. The architecture can achieve 80 MFLOPS/80 MIPS peak performance at a 40 MHz clock and sustain 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.
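The DTC method itself is hardware, but the tag-driven wakeup idea it is compared to can be conveyed with a toy out-of-order issue loop. This is a generic sketch of tagged dependency resolution, not the OHMEGA implementation; instruction names and registers are invented.

```python
def issue_out_of_order(instructions):
    """Toy tag-driven issue: an instruction issues once all its source
    registers have been produced; a result 'broadcasts' its destination tag."""
    ready = {"r0", "r1"}                  # registers assumed valid at start
    pending = list(instructions)
    order = []
    while pending:
        for ins in pending:
            name, dst, srcs = ins
            if all(s in ready for s in srcs):
                order.append(name)
                ready.add(dst)            # broadcast: dependents may now issue
                pending.remove(ins)
                break
        else:
            raise RuntimeError("deadlock: unresolved dependency")
    return order

# (name, destination, sources): i1 stalls on r3, so i2 issues ahead of it.
prog = [("i1", "r2", ["r3"]), ("i2", "r3", ["r0"]), ("i3", "r4", ["r1"])]
print(issue_out_of_order(prog))           # ['i2', 'i1', 'i3']
```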
Citations: 7
Performance prediction and tuning on a multiprocessor
R. Dimpsey, R. Iyer
This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. The model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple-job interactions, such as contention for shared resources. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.
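A finite-state, discrete-time Markov model with per-state rewards is easy to sketch: states stand for observed workload conditions, and accumulating reward (work completed per step) until a job's demand is met yields a completion-time distribution. All transition probabilities and rewards below are invented for illustration; the paper derives them from measurements.

```python
import random

# Hypothetical model: two workload states, transition probabilities, and the
# fraction of a work unit a job completes per step in each state.
P = {"light": [("light", 0.8), ("heavy", 0.2)],
     "heavy": [("light", 0.3), ("heavy", 0.7)]}
reward = {"light": 1.0, "heavy": 0.4}

def completion_time(demand, state="light"):
    """Steps until accumulated reward reaches the job's work demand."""
    done, steps = 0.0, 0
    while done < demand:
        done += reward[state]
        steps += 1
        r, acc = random.random(), 0.0
        for nxt, p in P[state]:           # sample the next workload state
            acc += p
            if r < acc:
                state = nxt
                break
    return steps

random.seed(0)
samples = sorted(completion_time(demand=100.0) for _ in range(10000))
print("median:", samples[5000], " 95th percentile:", samples[9500])
```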
Citations: 14
Branch history table prediction of moving target branches due to subroutine returns
D. Kaeli, P. Emma
Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline, and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation. A Branch History Table (BHT) stores past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2% on average.
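The proposed stack mechanism amounts to a return-address stack consulted alongside the BHT: calls push their fall-through address, and a return's predicted target is popped from the stack instead of replayed from the BHT's stale last target. A minimal sketch, with an invented two-call-site example:

```python
class ReturnStack:
    """Tiny return-address stack: predicts subroutine-return targets."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, return_addr):
        if len(self.stack) == self.depth:
            self.stack.pop(0)              # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
bht_last_target = {}                       # BHT alone remembers only the last target

# Calls from two different sites: the BHT mispredicts the second return
# (it replays the first site's address), while the stack gets both right.
for call_pc, ret_pc in [(0x100, 0x104), (0x200, 0x204)]:
    rs.on_call(ret_pc)
    actual = rs.predict_return()           # the subroutine returns immediately here
    bht_pred = bht_last_target.get(0x500)  # 0x500: the return instruction's PC
    bht_last_target[0x500] = actual
    guess = hex(bht_pred) if bht_pred is not None else "none"
    print(f"return to {actual:#x}; BHT guessed {guess}")
```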
Citations: 158
Implementing a cache for a high-performance GaAs microprocessor
K. Olukotun, T. Mudge, Richard B. Brown
In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) packaging technology to reduce chip-crossing delays. In this paper we present the results of a design study for a 250 MHz Gallium Arsenide (GaAs) microprocessor that employs MCM technology to improve performance. The design study for the resulting two-level split cache starts with a baseline cache architecture and then examines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write policy; 3) secondary cache size and organization; 4) primary cache fetch size; 5) concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4KW (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub-block placement without extra valid bits. The second is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associative matching. Keywords: two-level caches, high-performance processors, gallium arsenide, multichip modules, trace-driven cache simulation.
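The methodology, trace-driven simulation of cache alternatives, can be sketched generically. The following is not the paper's simulator; it models one direct-mapped, write-through, no-write-allocate cache over a toy trace, with size and block parameters chosen arbitrarily.

```python
def simulate(trace, size=16384, block=32):
    """Direct-mapped, write-through, no-write-allocate cache; returns miss ratio."""
    nsets = size // block
    tags = [None] * nsets
    misses = 0
    for op, addr in trace:
        blk = addr // block
        idx, tag = blk % nsets, blk // nsets
        if tags[idx] == tag:
            continue                      # hit
        misses += 1
        if op == "R":                     # reads allocate; write misses bypass
            tags[idx] = tag
    return misses / len(trace)

# Toy trace: a strided read loop plus scattered writes.
trace = [("R", 4 * i) for i in range(8192)] + [("W", 64 * i) for i in range(512)]
print(f"miss ratio: {simulate(trace):.3f}")
```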
Citations: 16
The effect on RISC performance of register set size and structure versus code generation strategy
David G. Bradlee, S. Eggers, R. Henry
This paper examines the effect of code generation strategy and register set size and structure on the performance of RISC processors. We vary the number of registers from 16 to 128, in both split and shared organizations, and use three different code generation strategies that differ in the way their instruction schedulers and register allocators cooperate in utilizing registers. The architectures used in the experiments incorporate features of the Motorola 88000 and the MIPS R2000. We observed three things. First, more sophisticated code generation strategies require fewer registers. In our experiments, using more than 32 registers yielded only marginal performance improvement over 32. Using a simpler strategy, the point of diminishing returns appeared after 64 registers. Second, given a small number of registers (e.g. 16), a machine with a shared register organization executes faster than one with a split organization; given a larger number of registers, the write-back bus to the shared register set becomes the bottleneck, and a split organization is better. Third, a machine with a floating point coprocessor does not always execute faster than one with a slower on-chip implementation, if the coprocessor does not perform expensive integer operations as well. The problem can be solved by transferring operands to the floating point unit, doing a multiply or divide there, and then shipping the data back to the CPU.
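One way to see why returns diminish beyond 32 registers is to count spills as the register file grows: once the file exceeds the maximum number of simultaneously live values, extra registers go unused. A toy linear-scan allocation over random live intervals (all parameters invented) makes the flattening visible.

```python
import random

def spills(intervals, nregs):
    """Linear-scan register allocation; returns how many intervals spill."""
    active, spilled = [], 0
    for start, end in sorted(intervals):
        active = [e for e in active if e > start]   # expire finished intervals
        if len(active) < nregs:
            active.append(end)
        else:
            spilled += 1
    return spilled

random.seed(1)
intervals = [(s, s + random.randint(10, 200))
             for s in random.sample(range(2000), 400)]
for n in (16, 32, 64, 128):                         # spill count flattens past 32
    print(f"{n:3d} registers: {spills(intervals, n)} spills")
```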
Citations: 19
Modeling the performance of limited pointers directories for cache coherence
R. Simoni, M. Horowitz
Directory-based protocols have been proposed as an efficient means of implementing cache consistency in large-scale shared-memory multiprocessors. One class of these protocols utilizes a limited pointers directory, which stores the identities of a small number of caches containing a given block of data. However, the performance potential of these directories in large-scale machines has been speculative at best. In this paper we introduce an analytic model that not only explains the behavior seen in small-scale simulation studies, but also allows us to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. Our model shows that miss rates inherent to invalidation-based consistency schemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. We find that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in large-scale machines, even if there are sufficient pointers most of the time. On the other hand, no-broadcast strategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traffic under a wide range of workloads, including those in which data blocks are actively accessed by a large number of processors.
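The two overflow policies compared here can be sketched with a toy directory entry holding a few pointers: a broadcast scheme marks the entry and later invalidates all caches, while a no-broadcast scheme bounds caching by invalidating an existing sharer to make room. A minimal sketch under those assumptions (class and method names are mine):

```python
class LimitedDirEntry:
    """Toy limited-pointer directory entry for one memory block."""
    def __init__(self, nptrs=4, broadcast=True):
        self.ptrs, self.nptrs = set(), nptrs
        self.broadcast = broadcast
        self.overflowed = False

    def on_read(self, cache_id, invalidate):
        if len(self.ptrs) < self.nptrs or cache_id in self.ptrs:
            self.ptrs.add(cache_id)
        elif self.broadcast:
            self.overflowed = True            # broadcast scheme: note overflow
        else:
            victim = next(iter(self.ptrs))    # no-broadcast: evict one sharer
            invalidate([victim])
            self.ptrs.discard(victim)
            self.ptrs.add(cache_id)

    def on_write(self, cache_id, invalidate, all_caches):
        targets = all_caches if self.overflowed else self.ptrs
        invalidate([c for c in targets if c != cache_id])
        self.ptrs, self.overflowed = {cache_id}, False

sent = []
entry = LimitedDirEntry(nptrs=2, broadcast=False)
for c in ("c0", "c1", "c2"):                  # third reader overflows the pointers
    entry.on_read(c, sent.extend)
entry.on_write("c3", sent.extend, ["c0", "c1", "c2", "c3"])
print("invalidations sent:", sent)
```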
Citations: 22
An architecture for software-controlled data prefetching
A. Klaiber, H. Levy
This paper describes an architecture and related compiler support for software-controlled data prefetching, a technique to hide memory latency in high-performance processors. At compile time, FETCH instructions are inserted into the instruction stream by the compiler, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fetch unit, interprets these instructions and initiates appropriate memory reads. Prefetched data is kept in a small, fully-associative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewriteback technique that can reduce the impact of stalls due to replacement writebacks in the cache. A detailed hardware model is presented and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
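The payoff of compiler-inserted FETCH instructions can be illustrated with a toy timing model: a prefetch issued some iterations ahead of each load means a reference stalls only if its data has not yet arrived. The latency, prefetch distance, and loop below are assumptions for illustration, not measurements from the paper.

```python
def loop_cycles(n, latency=20, distance=None):
    """Cycles for a loop of n loads, one cycle of work each; an optional
    prefetch issued `distance` iterations ahead hides the miss latency."""
    arrive = {}                                 # index -> cycle its line arrives
    clock = 0
    for i in range(n):
        if distance is not None and i + distance < n:
            arrive.setdefault(i + distance, clock + latency)  # FETCH-style hint
        ready = arrive.get(i, clock + latency)  # no prefetch: full miss stall
        clock = max(clock, ready) + 1           # stall only if data not arrived
    return clock

print("no prefetch:  ", loop_cycles(1000))                # ~latency per iteration
print("prefetch d=20:", loop_cycles(1000, distance=20))   # latency mostly hidden
```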
Citations: 264
Flexible register management for sequential programs
D. Quammen, D. Miller
Most current architectures have registers organized in one of two ways: single register sets, or register stacks, implemented as either overlapping register windows or register caches. Each has particular strengths and weaknesses. For example, a single register set excels over a stack if a program requires frequent access to globals. However, a register stack performs better if deep recursive chains exist. One drawback of all current systems is that the hardware limits the manner in which the software can use registers. In this paper, a register hardware organization called threaded windows or t-windows, which is being developed by the authors to enhance the performance of concurrent systems, is evaluated for sequential programs. The organization allows the registers to be dynamically restructured in any of the above forms, and any combination of the above forms. This permits the compiler, or the programmer, to capitalize upon each register organization's strong points and avoid their disadvantages.
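The register-stack side of this design space can be sketched as a bounded set of windows: a call past the hardware limit spills the oldest window to memory, and a return to a spilled window triggers a fill. A toy model (the window count and call pattern are invented):

```python
class RegisterWindows:
    """Toy overlapping-window register stack: counts spill/fill traffic."""
    def __init__(self, nwindows=8):
        self.nwindows = nwindows
        self.depth = 0            # current call depth
        self.resident = 0         # windows at the top of the stack held on chip
        self.spills = self.fills = 0

    def call(self):
        self.depth += 1
        if self.resident == self.nwindows:
            self.spills += 1      # overflow: oldest resident window to memory
        else:
            self.resident += 1

    def ret(self):
        self.depth -= 1
        if self.resident > 1 or self.depth == 0:
            self.resident -= 1
        else:
            self.fills += 1       # underflow: caller's window restored from memory

rw = RegisterWindows(nwindows=8)
for _ in range(20):
    rw.call()                     # recurse 20 deep with 8 hardware windows
for _ in range(20):
    rw.ret()                      # unwind all the way back
print(rw.spills, rw.fills)        # -> 12 12: twelve windows spilled, then refilled
```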
Citations: 8