A Domain-Specific On-Chip Network Design for Large Scale Cache Systems

2007 IEEE 13th International Symposium on High Performance Computer Architecture | Pub Date: 2007-02-10 | DOI: 10.1109/HPCA.2007.346209

Yuho Jin, Eun Jung Kim, K. H. Yum


Citations: 40

Abstract

As circuit integration technology advances, the design of efficient interconnects has become critical. On-chip networks have been adopted to overcome the scalability and poor resource-sharing problems of shared buses or dedicated wires. However, using a general on-chip network for a specific domain may cause underutilization of the network resources and huge network delays because the interconnects are not optimized for the domain. Addressing these two issues is challenging because in-depth knowledge of both interconnects and the specific domain is required. Non-uniform cache architectures (NUCAs) use wormhole-routed 2D mesh networks to improve the performance of on-chip L2 caches. We observe that network resources in NUCAs are underutilized and occupy considerable chip area (52% of cache area). The network delay is also large (63% of cache access time). Motivated by these observations, we investigate how to optimize cache operations and design the network in large scale cache systems. We propose a single-cycle router architecture that can efficiently support multicasting in on-chip caches. Next, we present fast-LRU replacement, where cache replacement overlaps with data request delivery. Finally, we propose a deadlock-free XYX routing algorithm and a new halo network topology to minimize the number of links in the network. Simulation results show that our networked cache system improves the average IPC by 38% over the mesh network design with multicast promotion replacement while using only 23% of the interconnection area. Specifically, multicast fast-LRU replacement improves the average IPC by 20% compared with multicast promotion replacement. A halo topology design additionally improves the average IPC by 18% over a mesh topology.
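For orientation, a wormhole-routed 2D mesh like the NUCA baseline described above is commonly paired with dimension-order (XY) routing, in which a packet first travels along the X dimension and then along Y. The sketch below illustrates that conventional baseline only, under the assumption that the mesh uses XY routing; it is not the XYX algorithm or the halo topology proposed in the paper, whose details the abstract does not give. The names `xy_route` and `hop_count` are hypothetical.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst in a 2D mesh.

    Assumed baseline: dimension-order routing, X first and then Y.
    Nodes are (x, y) tuples; this is an illustration, not the paper's XYX.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Move along the X dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Then move along the Y dimension until the row matches.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of links traversed on the XY route (the Manhattan distance)."""
    return len(xy_route(src, dst)) - 1
```

Because the hop count grows with the Manhattan distance between the requester and the cache bank, banks far from the core pay more link traversals. This is one way to see why network delay can make up a large share of cache access time in a mesh-based design.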


