M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Xishida, H. Kadota
Published in: [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture, 1991-04-01. DOI: https://doi.org/10.1145/115952.115969
Citations: 7
Abstract
OHMEGA: a VLSI superscalar processor architecture for numerical applications
This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture relies on the compiler for static instruction-level scheduling, and issues and executes instructions out of order to reduce the pipeline stalls that arise dynamically during execution. In this architecture, a pair of instructions is fetched in every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. To ease instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of instructions, including Store-Store and Load-Store pairs, ii) a simple, low-latency, and easily paired execution pipeline structure, and iii) high-capacity multi-ported floating-point and integer registers. Performance is further enhanced by dynamically reducing pipeline hazards through i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) high data transfer capability provided by the pipelined data cache and a 128-bit-wide bus. The efficient data dependency resolution mechanism, realized with the novel DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order instruction issue and execution. The idea of the DTC method is similar to that of a dynamic data-flow architecture with tagged tokens. The non-penalty branches are realized by three techniques: delayed branch, a LOOP instruction that executes counter decrement, compare, and branch in one clock cycle, and a non-penalty conditional branch with predicted condition codes. These techniques help reduce the pipeline stalls that occur at run time. By taking advantage of these techniques, the architecture can achieve 80 MFLOPS/80 MIPS peak performance at a 40 MHz clock and sustain 1.4 to 3.6 times higher performance than simple Multiple Function Unit (MFU) type RISC processors.
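The peak figures follow from the clock rate and the two-instruction issue width stated in the abstract. The short sketch below only checks that arithmetic; the function name and constants are illustrative, and the assumption that both instructions of a pair can be floating-point operations is an inference from the 80 MFLOPS figure, not something the abstract states directly.

```python
def peak_rate_millions(clock_mhz: float, ops_per_cycle: int) -> float:
    """Peak rate in millions of operations per second."""
    return clock_mhz * ops_per_cycle


CLOCK_MHZ = 40.0   # clock rate given in the abstract
ISSUE_WIDTH = 2    # one instruction pair is fetched and issued per cycle

# Peak instruction rate: two instructions per cycle at 40 MHz.
peak_mips = peak_rate_millions(CLOCK_MHZ, ISSUE_WIDTH)

# Assumption: both instructions of a pair may be floating-point operations,
# which is what the 80 MFLOPS peak figure implies.
peak_mflops = peak_rate_millions(CLOCK_MHZ, ISSUE_WIDTH)

print(f"peak: {peak_mips:.0f} MIPS, {peak_mflops:.0f} MFLOPS")
```

These are upper bounds only; the sustained figures quoted in the abstract depend on how often the compiler can actually fill both issue slots.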