Improving coherence protocol reactiveness by trading bandwidth for latency

L. G. Menezo, Valentin Puente, Pablo Abad Fidalgo, J. Gregorio
ACM International Conference on Computing Frontiers, published 2012-05-15
DOI: 10.1145/2212908.2212929 (https://doi.org/10.1145/2212908.2212929)
Citations: 4

Abstract

This paper describes how the particular characteristics of on-chip networks can be used to improve coherence protocol responsiveness. To this end, a new coherence protocol, named LOCKE, is proposed. LOCKE exploits the large bandwidth available on chip to improve the performance and energy efficiency of cache-coherent chip multiprocessors (CMPs). Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization by two means: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols do when shared data is truly contended. LOCKE is developed on top of a Token coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models the on-chip interconnection, an aggressive core architecture, and precise memory hierarchy details, and running a broad spectrum of workloads, we show that our proposal can improve on both directory-based and token-based coherence protocols in terms of energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
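The bandwidth-for-latency trade named in the title can be illustrated with a toy cost model. The sketch below is not taken from the paper and does not model LOCKE itself. It only contrasts a generic directory-style miss, where the request is serialized through a home node, with a generic multicast-style miss, where the request goes directly to a set of candidate caches. The names `MissCost`, `directory_miss`, and `multicast_miss` are hypothetical, and the hop and message counts are simplifying assumptions that ignore acknowledgements, contention, and hardware multicast support in the routers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissCost:
    """Toy cost of one cache-to-cache miss (assumed model, not from the paper)."""
    critical_path_hops: int  # network traversals before the requester gets data
    messages: int            # total messages injected into the network


def directory_miss() -> MissCost:
    # Assumed sequence: requester -> home directory -> owner -> requester.
    # Three serialized traversals, three messages.
    return MissCost(critical_path_hops=3, messages=3)


def multicast_miss(destinations: int) -> MissCost:
    # Assumed sequence: requester -> all destinations in parallel,
    # then the owner replies directly to the requester.
    if destinations < 1:
        raise ValueError("need at least one destination")
    # One request counted per destination, plus one data response.
    return MissCost(critical_path_hops=2, messages=destinations + 1)


if __name__ == "__main__":
    base = directory_miss()
    for n in (1, 4, 15):  # 15 = every other core in a 16-core chip
        mc = multicast_miss(n)
        print(f"destinations={n:2d}  directory={base}  multicast={mc}")
```

In this simplified model the multicast request always has a shorter critical path, while its message count grows with the number of destinations. That growth is the bandwidth cost that the abstract argues a suitably provisioned on-chip network can absorb.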