Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

ACM Transactions on Computer Systems (TOCS), Pub Date: 2015-09-11, DOI: 10.1145/2754930

Michael Pellauer, A. Parashar, Michael Adler, Bushra Ahsan, R. Allmon, N. Crago, Kermin Fleming, M. Gambhir, A. Jaleel, T. Krishna, Daniel Lustig, S. Maresh, Vladimir Pavlov, Rachid Rayess, Antonia Zhai, J. Emer


Citations: 13

Abstract

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
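The two mechanisms named in the abstract can be illustrated with a toy software model. The sketch below is not taken from the article: the names `Channel`, `EOS`, `make_merge_pe`, `step`, and `run_merge` are hypothetical, the merge workload is chosen only for illustration, and the "first ready trigger wins" scheduler is an assumption. It models one PE whose instructions are (trigger, action) pairs with no program counter, connected to bounded FIFO channels so that the result does not depend on when data arrives.

```python
from collections import deque

EOS = None  # hypothetical end-of-stream tag; data values must not be None


class Channel:
    """Bounded FIFO. Users only observe 'can_get' / 'can_put', never timing."""

    def __init__(self, capacity=2):
        self.q = deque()
        self.capacity = capacity

    def can_get(self):
        return len(self.q) > 0

    def can_put(self):
        return len(self.q) < self.capacity

    def head(self):
        return self.q[0]

    def get(self):
        return self.q.popleft()

    def put(self, value):
        self.q.append(value)


def make_merge_pe(a, b, out):
    """Build the instruction list of a PE that merges two sorted streams.

    Each entry is (trigger, action). There is no PC and no branch: the
    trigger alone decides whether an instruction may fire this cycle.
    """
    def ready():
        return a.can_get() and b.can_get() and out.can_put()

    return [
        # Both inputs are finished: consume both tags, forward one tag.
        (lambda: ready() and a.head() is EOS and b.head() is EOS,
         lambda: (a.get(), b.get(), out.put(EOS))),
        # Forward from a when b is finished or a's head is not larger.
        (lambda: ready() and a.head() is not EOS
                 and (b.head() is EOS or a.head() <= b.head()),
         lambda: out.put(a.get())),
        # Forward from b when a is finished or b's head is smaller.
        (lambda: ready() and b.head() is not EOS
                 and (a.head() is EOS or b.head() < a.head()),
         lambda: out.put(b.get())),
    ]


def step(instructions):
    """One cycle: fire the first instruction whose trigger holds."""
    for trigger, action in instructions:
        if trigger():
            action()
            return True
    return False  # nothing ready; the PE simply stalls this cycle


def run_merge(xs, ys, capacity=2):
    """Drive the PE with a fast producer on a and a slow producer on b."""
    a, b, out = Channel(capacity), Channel(capacity), Channel(capacity)
    pe = make_merge_pe(a, b, out)
    src_a = deque(list(xs) + [EOS])
    src_b = deque(list(ys) + [EOS])
    result = []
    cycle = 0
    while True:
        if src_a and a.can_put():
            a.put(src_a.popleft())
        # b delivers only every third cycle, standing in for variable latency.
        if src_b and b.can_put() and cycle % 3 == 0:
            b.put(src_b.popleft())
        step(pe)
        if out.can_get():
            value = out.get()
            if value is EOS:
                return result
            result.append(value)
        cycle += 1
```

Calling `run_merge([1, 4, 9], [2, 3, 10])` yields the merged sequence even though the second producer is three times slower, because each instruction fires only when its channels report data and space. Changing the producer rates or channel capacity affects how many cycles the run takes, not what it computes, which is the property the latency-insensitive channels are meant to provide.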



Source journal

ACM Transactions on Computer Systems (TOCS)

Self-citation rate

0.00%

Publication count