Re-NUCA: Boosting CMP Performance Through Block Replication

2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools. Pub Date: 2010-09-01. DOI: 10.1109/DSD.2010.41

P. Foglia, C. Prete, M. Solinas, Giovanna Monni


Citations: 12

Abstract

Chip Multiprocessor (CMP) systems have become the reference architecture for microprocessor design, thanks to improvements in semiconductor nanotechnology that have continuously provided a growing number of faster and smaller transistors per chip. Interest in CMPs has grown because classical techniques for boosting performance, e.g. increasing the clock frequency and the amount of work performed in each clock cycle, can no longer deliver significant improvements due to energy constraints and wire-delay effects. CMP systems generally adopt a large last-level cache (LLC), typically L2 or L3, shared among all cores, together with private L1 caches. Because the miss resolution time for the private caches depends on the response time of the LLC, which is wire-delay dominated, overall performance is affected by wire delay. NUCA caches have been proposed for single- and multi-core systems as a mechanism for tolerating wire-delay effects on overall performance. In this paper, we introduce a novel NUCA architecture, called Re-NUCA, specifically suited for (but not limited to) CMPs in which cores are placed on different sides of the shared cache. The idea is to allow shared blocks to be replicated inside the shared cache, in order to avoid the limits on performance improvement that arise in classical D-NUCA caches due to the conflict hit problem. Our results show that Re-NUCA outperforms D-NUCA by more than 5% on average, while for applications that suffer strongly from the conflict hit problem we observe performance improvements of up to 15%.
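The intuition behind replication can be illustrated with a toy model. The sketch below is not the authors' simulator or the Re-NUCA protocol: it assumes a single row of banks with core A next to bank 0 and core B next to the last bank, charges one hop per bank traversed, and moves a block one bank toward the requester after every hit. The names `NUM_BANKS`, `latency`, `step_toward`, `migration_only`, and `with_replication` are hypothetical, and the model ignores coherence between copies and the capacity that replicas consume.

```python
from typing import Dict, List

# Hypothetical bank count: bank 0 is nearest core A, the last bank is nearest core B.
NUM_BANKS = 8


def latency(core: str, bank: int) -> int:
    """Hops from a core to a bank in this toy layout."""
    return bank + 1 if core == "A" else NUM_BANKS - bank


def step_toward(core: str, bank: int) -> int:
    """Move a block one bank closer to the requesting core, staying within the row."""
    if core == "A":
        return max(bank - 1, 0)
    return min(bank + 1, NUM_BANKS - 1)


def migration_only(accesses: List[str], start_bank: int) -> int:
    """Single copy that migrates toward whichever core hit it last."""
    bank, total = start_bank, 0
    for core in accesses:
        total += latency(core, bank)
        bank = step_toward(core, bank)
    return total


def with_replication(accesses: List[str], start_bank: int) -> int:
    """One copy per core (assumption); each copy migrates only toward its own core."""
    copy: Dict[str, int] = {"A": start_bank, "B": start_bank}
    total = 0
    for core in accesses:
        total += latency(core, copy[core])
        copy[core] = step_toward(core, copy[core])
    return total


if __name__ == "__main__":
    trace = ["A", "B"] * 8  # two cores on opposite sides alternately reading one shared block
    print("migration only  :", migration_only(trace, start_bank=3))
    print("with replication:", with_replication(trace, start_bank=3))
```

With an alternating trace, the single migrating copy oscillates around the middle banks and never settles near either core, whereas each replica drifts to the bank adjacent to its own core. This is only meant to show the direction of the effect under the stated assumptions; it says nothing about the magnitude of the gains reported in the paper.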


