A High Performance, Energy Efficient GALS Processor Microarchitecture with Reduced Implementation Complexity

Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu
Published in: IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005), 2005-03-20
DOI: https://doi.org/10.1109/ISPASS.2005.1430558
Citations: 20

Abstract

As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, the multiple clock domain (MCD) processor, achieves impressive energy savings at a relatively low performance cost. However, it requires dividing the processor into four domains, including separate integer and memory domains, which complicates load scheduling, and implementing 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, degrades performance significantly for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing implementation complexity. We first determine that the synchronization channels most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit the use of standard power optimizations, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement.
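
To make the trade-off behind "reducing the number of voltage levels" concrete, the following is a minimal sketch, not the paper's algorithm or data. It assumes a hypothetical 250-1000 MHz frequency range, evenly spaced levels, and a deliberately simplified model in which supply voltage scales linearly with frequency, so that dynamic energy per operation scales as (f / f_max)^2. All function names and constants are illustrative. It compares the level a domain would run at under 32 levels versus 4 levels for the same performance demand.

```python
from bisect import bisect_left

def make_levels(f_min_mhz, f_max_mhz, n):
    """Return n evenly spaced frequency levels, lowest to highest (MHz)."""
    step = (f_max_mhz - f_min_mhz) / (n - 1)
    return [f_min_mhz + i * step for i in range(n)]

def pick_level(levels, demand_mhz):
    """Return the lowest level that meets the demand, clamped to the top level."""
    i = bisect_left(levels, demand_mhz)
    return levels[min(i, len(levels) - 1)]

def relative_energy(f_mhz, f_max_mhz):
    """Dynamic energy per operation relative to full speed.

    Simplifying assumption: V scales linearly with f, so E ~ V^2 ~ (f / f_max)^2.
    """
    return (f_mhz / f_max_mhz) ** 2

# Hypothetical frequency range; not taken from the paper.
F_MIN, F_MAX = 250.0, 1000.0
fine = make_levels(F_MIN, F_MAX, 32)    # many levels, as in the original MCD design
coarse = make_levels(F_MIN, F_MAX, 4)   # a reduced set: 250, 500, 750, 1000 MHz

demand = 600.0  # hypothetical frequency a domain needs to sustain its workload
f_fine = pick_level(fine, demand)
f_coarse = pick_level(coarse, demand)

print(f"32 levels: run at {f_fine:.1f} MHz, relative energy {relative_energy(f_fine, F_MAX):.3f}")
print(f" 4 levels: run at {f_coarse:.1f} MHz, relative energy {relative_energy(f_coarse, F_MAX):.3f}")
```

Under these assumptions the coarse set rounds the same demand up to a higher level and so gives up some of the energy saving. That is the cost a reduced-level design has to weigh against its simpler implementation. Whether the cost matters in practice depends on the control algorithm and workload, which this sketch does not model.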