OHMEGA : a VLSI superscalar processor architecture for numerical applications

M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Xishida, H. Kadota
Published in: [1991] Proceedings. The 18th Annual International Symposium on Computer Architecture
Publication date: 1991-04-01
DOI: 10.1145/115952.115969 (https://doi.org/10.1145/115952.115969)
Citations: 7

Abstract

This paper describes a VLSI superscalar processor architecture that can sustain very high performance in numerical applications. The architecture relies on the compiler for static instruction-level scheduling, and it issues and executes instructions out of order to reduce the pipeline stalls that arise dynamically during execution. A pair of instructions is fetched every clock cycle, decoded simultaneously, and issued independently to the corresponding execution pipelines. To ease instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of instructions, including Store-Store and Load-Store pairs, ii) a simple, low-latency, and easily paired execution pipeline structure, and iii) high-capacity multi-ported floating-point and integer registers. Performance is further enhanced by dynamically reducing pipeline hazards through i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) high data transfer capacity from the pipelined data cache and a 128-bit-wide bus. An efficient data dependency resolution mechanism, realized with the DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order instruction issue and execution. The idea behind the DTC method is similar to that of dynamic data-flow architectures with tagged tokens. The non-penalty branches are realized by three techniques: delayed branch, a LOOP instruction that executes counter decrement, compare, and branch in one clock cycle, and a non-penalty conditional branch with predicted condition codes. These techniques help reduce the pipeline stalls that occur at run time. By taking advantage of these techniques, the architecture can achieve a peak performance of 80 MFLOPS/80 MIPS at a 40 MHz clock and sustain 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors.
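As a rough illustration of two points in the abstract, the sketch below checks the peak-rate arithmetic (two instructions per cycle at 40 MHz gives 80 MIPS, and the 80 MFLOPS figure is likewise consistent with two floating-point operations completing per cycle) and models the fused LOOP instruction. The function names and the branch-while-nonzero condition are assumptions made for illustration; the abstract does not specify the exact compare condition or any encoding.

```python
from __future__ import annotations

CLOCK_HZ = 40_000_000   # 40 MHz clock, as stated in the abstract
ISSUE_WIDTH = 2         # a pair of instructions is fetched and issued per cycle


def peak_rate_millions(clock_hz: int, per_cycle: int) -> float:
    """Peak operations per second, in millions, for a given per-cycle rate."""
    return clock_hz * per_cycle / 1e6


def loop_step(counter: int) -> tuple[int, bool]:
    """Hypothetical model of the fused LOOP instruction.

    Decrements the counter, compares it, and decides the branch in one step.
    Branching while the counter is nonzero is an assumption, not a detail
    taken from the paper.
    """
    counter -= 1
    return counter, counter != 0


def iterations(initial: int) -> int:
    """Count how many times a loop body runs when closed by loop_step."""
    count, counter, taken = 0, initial, True
    while taken:
        count += 1                           # the loop body executes once
        counter, taken = loop_step(counter)  # fused decrement/compare/branch
    return count


if __name__ == "__main__":
    print(peak_rate_millions(CLOCK_HZ, ISSUE_WIDTH))  # peak MIPS
    print(iterations(3))                              # body executions
```

In a conventional instruction set the decrement, compare, and branch would typically occupy separate instructions, so folding them into one removes loop-closing overhead from short numerical kernels.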