Exploiting criticality to reduce bottlenecks in distributed uniprocessors

2011 IEEE 17th International Symposium on High Performance Computer Architecture. Pub Date: 2011-02-12. DOI: 10.1109/HPCA.2011.5749749

Behnam Robatmili, Madhu Saravana Sibi Govindan, D. Burger, S. Keckler

Citations: 13

Abstract

Composable multicore systems merge multiple independent cores to run sequential single-threaded workloads. The performance scalability of these systems, however, is limited by partitioning overheads. This paper addresses two of the key performance scalability limitations of composable multicore systems. We present a critical path analysis revealing that the communication needed for cross-core register value delivery and fetch stalls due to misspeculation are the two worst bottlenecks preventing efficient scaling to a large number of fused cores. To alleviate these bottlenecks, this paper proposes a fully distributed framework that exploits criticality in these architectures at different granularities. A coordinator core exploits different types of block-level communication criticality information to fine-tune critical instructions at the decode and register-forward pipeline stages of their executing cores. The framework exploits fetch criticality information at a coarser granularity by reissuing all instructions in the blocks previously fetched into the merged cores. This general framework reduces competing bottlenecks in a synergistic manner and achieves scalable performance/power efficiency for sequential programs running across a large number of cores.
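
The abstract does not describe how the critical path analysis is carried out. As a rough illustration only, the sketch below shows one common way such an analysis can be framed: events form a dependence graph, each edge carries a latency and a cause label, and the cycles on the longest path are tallied per cause. The function name, the event names, the category labels, and the latencies are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def critical_path_breakdown(edges, source, sink):
    """Longest-path analysis on a dependence DAG (illustrative sketch).

    edges: iterable of (src, dst, latency, category) with latency >= 0.
    Returns (total_cycles, {category: cycles on the critical path}).
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = {source, sink}
    for u, v, lat, cat in edges:
        succ[u].append((v, lat, cat))
        indeg[v] += 1
        nodes.update((u, v))

    # Topological order (Kahn's algorithm).
    order = []
    ready = [n for n in nodes if indeg[n] == 0]
    while ready:
        n = ready.pop()
        order.append(n)
        for v, _, _ in succ[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Longest distance from source, remembering the last-arriving edge.
    dist = {source: 0}
    pred = {}
    for u in order:
        if u not in dist:
            continue  # not reachable from source
        for v, lat, cat in succ[u]:
            if dist[u] + lat > dist.get(v, -1):
                dist[v] = dist[u] + lat
                pred[v] = (u, lat, cat)

    if sink not in dist:
        raise ValueError("sink is not reachable from source")

    # Walk the critical path backwards and attribute cycles to causes.
    breakdown = defaultdict(int)
    node = sink
    while node != source:
        u, lat, cat = pred[node]
        breakdown[cat] += lat
        node = u
    return dist[sink], dict(breakdown)


# Hypothetical toy graph: two blocks, one refetch after a misspeculation,
# and two cross-core register forwards. All latencies are made up.
toy_edges = [
    ("start", "i0", 1, "fetch"),
    ("i0", "i1", 3, "register_comm"),
    ("i1", "i2", 1, "execute"),
    ("start", "fetch_b1", 5, "fetch_stall"),
    ("fetch_b1", "i2", 1, "fetch"),
    ("i2", "i3", 2, "register_comm"),
    ("i3", "commit", 1, "commit"),
]
total, by_cause = critical_path_breakdown(toy_edges, "start", "commit")
print(total, by_cause)
```

In this toy graph the refetch path reaches `i2` later than the register-forward path does, so the fetch stall, and not the first register forward, ends up on the critical path. That kind of competition between bottlenecks is what a per-cause breakdown is meant to expose.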
