Proximity coherence for chip multiprocessors

2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)

Pub Date: 2010-09-11

DOI: 10.1145/1854273.1854293

Nick Barrow-Williams


Citations: 53

Abstract

Many-core architectures provide an efficient way of harnessing the increasing numbers of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, providing new design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, were developed in the context of multi-node systems. As such, they seldom take advantage of the new possibilities that many-core architectures offer. We propose Proximity Coherence, a scheme in which L1 load misses are optimistically forwarded to nearby caches via new dedicated links rather than always being indirected via a directory structure. Such an optimization is made possible because the cost of accessing a nearby cache is comparable to the cost of using on-chip network resources. Coherency is maintained using lightweight graph structures embedded in the L1 caches. We compare our Proximity Coherence protocol to an existing directory-based MESI protocol using full-system simulations of a 32-core system. Our extension lowers the latency of L1 cache load misses by up to 32% while reducing the bytes transferred on the global on-chip interconnect by up to 19% for a range of parallel benchmarks. Employing Proximity Coherence provides execution time improvements of up to 13%, reduces cache hierarchy energy consumption by up to 30%, and delivers a more efficient solution to the challenge of coherence in chip multiprocessors.
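To make the forwarding idea in the abstract concrete, the following is a minimal toy sketch, not the paper's protocol. It models only the lookup path of an L1 load miss: nearby caches are probed first, and the directory is used only if no neighbour holds the line. The class `L1Cache`, the function `load_miss`, both cycle constants, and the choice to charge the probe cost before falling back to the directory are all assumptions made for illustration. The graph structures that the abstract says maintain coherency are deliberately left out.

```python
from dataclasses import dataclass, field

# Hypothetical cycle counts chosen for illustration only; they are not
# figures from the paper.
NEIGHBOUR_PROBE_CYCLES = 4
DIRECTORY_CYCLES = 20


@dataclass
class L1Cache:
    """Toy L1 cache: a set of resident line addresses plus nearby caches."""
    core_id: int
    lines: set = field(default_factory=set)
    neighbours: list = field(default_factory=list)


def load_miss(requester, address):
    """Resolve a load that already missed in the requester's own L1.

    Returns a (source, cycles) pair. Neighbours are probed first; the
    directory is the fallback. Assumption: a failed probe still costs
    NEIGHBOUR_PROBE_CYCLES before the directory request is issued.
    """
    for neighbour in requester.neighbours:
        if address in neighbour.lines:
            # A nearby cache holds the line: take it over the direct link.
            requester.lines.add(address)
            return (f"neighbour {neighbour.core_id}", NEIGHBOUR_PROBE_CYCLES)
    # No neighbour had the line: indirect via the directory as usual.
    requester.lines.add(address)
    return ("directory", NEIGHBOUR_PROBE_CYCLES + DIRECTORY_CYCLES)


if __name__ == "__main__":
    core0 = L1Cache(core_id=0)
    core1 = L1Cache(core_id=1)
    core0.neighbours = [core1]
    core1.lines.add(0x40)

    print(load_miss(core0, 0x40))  # served by the neighbouring cache
    print(load_miss(core0, 0x80))  # falls back to the directory
```

In this toy model the benefit comes entirely from the gap between the two constants. The latency and traffic reductions reported in the abstract come from full-system simulation, not from a model of this kind.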

Latest articles from this venue

Reducing task creation and termination overhead in explicitly parallel programs

An intra-tile cache set balancing scheme

NUcache: A multicore cache organization based on Next-Use distance

Towards a science of parallel programming

Discovering and understanding performance bottlenecks in transactional applications