A High Performance, Energy Efficient GALS Processor Microarchitecture with Reduced Implementation Complexity

IEEE International Symposium on Performance Analysis of Systems and Software, 2005 (ISPASS 2005). Pub Date: 2005-03-20. DOI: 10.1109/ISPASS.2005.1430558

Yongkang Zhu, D. Albonesi, A. Buyuktosunoglu


Citations: 20

Abstract

As the costs and challenges of global clock distribution grow with each new microprocessor generation, a globally asynchronous, locally synchronous (GALS) approach becomes an attractive alternative. One proposed GALS approach, called a multiple clock domain (MCD) processor, achieves impressive energy savings for a relatively low performance cost. However, the approach requires separating the processor into four domains, including separating the integer and memory domains, which complicates load scheduling, and implementing 32 voltage and frequency levels in each domain. In addition, the hardware-based control algorithm, though effective overall, produces a significant performance degradation for some applications. In this paper, we devise modifications to the MCD design that retain many of its benefits while greatly reducing the implementation complexity. We first determine that the synchronization channels most responsible for the MCD performance degradation are those involving cache access, and propose merging the integer and memory domains to virtually eliminate this overhead. We further propose significantly reducing the number of voltage levels, separating the reorder buffer into its own domain to permit front-end frequency scaling, separating the L2 cache to permit standard power optimizations to be used, and a new online algorithm that provides consistent results across our benchmark suite. The overall result is a significant reduction in the performance degradation of the original MCD approach and greater energy savings, with a greatly simplified microarchitecture that is much easier to implement.




Source journal

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Self-citation rate

0.00%

Number of publications