iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2018-02-01. DOI: 10.1109/HPCA.2018.00012

Y. Yao, Zhonghai Lu


Citations: 8

Abstract

Recent studies show that, on NoC-based many-cores, the serialized competition overhead of entering a critical section limits the performance of multi-threaded shared-variable applications more than the execution of the critical section itself. We show that the invalidation-acknowledgement delay of cache coherence, incurred between the home node storing the critical-section lock and the cores running competing threads, is the leading cause of high competition overhead in lock spinning. Lock spinning is used in various spin-lock primitives (such as the ticket lock, ABQL, and MCS lock) and in the spinning phase of the queue spin-lock (QSL) in advanced operating systems. To reduce this lock coherence overhead, we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers that only transmit packets into active "big" routers that can also generate packets. Instead of performing all coherence maintenance at the home node, big routers deployed nearer to the competing threads generate packets that perform early invalidation-acknowledgement for failing threads before their requests reach the home node. This shortens the protocol round-trip delay and thus significantly reduces competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique for accelerating critical section access, experimental results show that iNPG effectively reduces lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum, and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.
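
The round-trip argument in the abstract can be illustrated with a toy hop-count model. This sketch is not taken from the paper: the 2D mesh topology, minimal XY routing, the one-cycle-per-hop latency, and the node coordinates are all assumptions chosen for illustration, and the function names are hypothetical. It only shows why answering an invalidation-acknowledgement exchange from a router close to the competing core takes fewer hops than answering it from a distant home node.

```python
from typing import Tuple

Node = Tuple[int, int]  # (x, y) coordinate on a 2D mesh (assumed topology)


def hops(a: Node, b: Node) -> int:
    """Hop count between two nodes under minimal XY routing (assumed)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def inv_ack_round_trip(responder: Node, core: Node, cycles_per_hop: int = 1) -> int:
    """Cycles for an invalidation to travel from `responder` to `core`
    and for the acknowledgement to travel back (toy model, no contention)."""
    return 2 * hops(responder, core) * cycles_per_hop


# All coordinates below are hypothetical.
home = (0, 0)        # home node holding the lock variable
core = (7, 7)        # core running a failing (losing) thread
big_router = (6, 6)  # packet-generating router near that core

baseline = inv_ack_round_trip(home, core)     # exchange handled by the home node
early = inv_ack_round_trip(big_router, core)  # exchange handled by the big router

print(f"home node: {baseline} cycles, big router: {early} cycles")
```

The model ignores router pipeline depth, contention, and protocol details, so the numbers it prints say nothing about the speedups reported in the abstract; it only makes the distance argument concrete.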



