Efficient Control and Communication Paradigms for Coarse-Grained Spatial Architectures

Michael Pellauer, A. Parashar, Michael Adler, Bushra Ahsan, R. Allmon, N. Crago, Kermin Fleming, M. Gambhir, A. Jaleel, T. Krishna, Daniel Lustig, S. Maresh, Vladimir Pavlov, Rachid Rayess, Antonia Zhai, J. Emer
Published in ACM Transactions on Computer Systems (TOCS), 2015-09-11. DOI: 10.1145/2754930 (https://doi.org/10.1145/2754930)
Citations: 13

Abstract

There has been recent interest in exploring the acceleration of nonvectorizable workloads with spatially programmed architectures that are designed to efficiently exploit pipeline parallelism. Such an architecture faces two main problems: how to efficiently control each processing element (PE) in the system, and how to facilitate inter-PE communication without the overheads of traditional shared-memory coherent memory. In this article, we explore solving these problems using triggered instructions and latency-insensitive channels. Triggered instructions completely eliminate the program counter (PC) and allow programs to transition concisely between states without explicit branch instructions. Latency-insensitive channels allow efficient communication of inter-PE control information while simultaneously enabling flexible code placement and improving tolerance for variable events such as cache accesses. Together, these approaches provide a unified mechanism to avoid overserialized execution, essentially achieving the effect of techniques such as dynamic instruction reordering and multithreading. Our analysis shows that a spatial accelerator using triggered instructions and latency-insensitive channels can achieve 8× greater area-normalized performance than a traditional general-purpose processor. Further analysis shows that triggered control reduces the number of static and dynamic instructions in the critical paths by 62% and 64%, respectively, over a PC-style baseline, increasing the performance of the spatial programming approach by 2.0×.
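The two mechanisms named in the abstract can be illustrated with a toy simulation. The sketch below is not the article's ISA or microarchitecture; it is a minimal Python model under stated assumptions. `Channel` is a hypothetical bounded FIFO standing in for a latency-insensitive channel, `TriggeredPE` is a hypothetical processing element with no program counter that fires the first instruction whose trigger holds, and `EOS` is an assumed end-of-stream sentinel. The example program merges two sorted streams, and the names `cmp`, `sendA`, and `sendB` are illustrative only.

```python
from collections import deque

EOS = float("inf")  # assumed end-of-stream sentinel, not taken from the article


class Channel:
    """Bounded FIFO standing in for a latency-insensitive channel (toy model)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = deque()

    def can_read(self):
        return len(self.items) > 0

    def can_write(self):
        return len(self.items) < self.capacity

    def peek(self):
        return self.items[0]

    def read(self):
        return self.items.popleft()

    def write(self, value):
        assert self.can_write(), "producer must respect backpressure"
        self.items.append(value)


class TriggeredPE:
    """PE without a program counter: each step fires the first ready instruction."""

    def __init__(self, instructions):
        self.instructions = instructions  # list of (name, trigger, action)

    def step(self):
        for name, trigger, action in self.instructions:
            if trigger():
                action()
                return name
        return None  # nothing ready this cycle: the PE simply waits


def make_merge_pe(a, b, out):
    # Predicate state replaces explicit branches between program states.
    p = {"cmp_valid": False, "a_le_b": False, "done": False}

    def compare():
        if a.peek() == EOS and b.peek() == EOS:
            p["done"] = True
        else:
            p["a_le_b"] = a.peek() <= b.peek()
            p["cmp_valid"] = True

    def send(src):
        out.write(src.read())
        p["cmp_valid"] = False

    program = [
        ("cmp", lambda: not p["done"] and not p["cmp_valid"]
            and a.can_read() and b.can_read(), compare),
        ("sendA", lambda: p["cmp_valid"] and p["a_le_b"] and out.can_write(),
            lambda: send(a)),
        ("sendB", lambda: p["cmp_valid"] and not p["a_le_b"] and out.can_write(),
            lambda: send(b)),
    ]
    return TriggeredPE(program), p


def run_merge(xs, ys, delay_a=1, delay_b=1):
    """Merge two sorted finite streams; producers deliver a token every `delay` cycles."""
    a, b, out = Channel(), Channel(), Channel()
    pe, preds = make_merge_pe(a, b, out)
    src_a, src_b = deque(list(xs) + [EOS]), deque(list(ys) + [EOS])
    merged, cycle = [], 0
    while not preds["done"]:
        if src_a and cycle % delay_a == 0 and a.can_write():
            a.write(src_a.popleft())
        if src_b and cycle % delay_b == 0 and b.can_write():
            b.write(src_b.popleft())
        pe.step()
        if out.can_read():
            merged.append(out.read())  # consumer drains the output channel
        cycle += 1
    return merged, cycle


if __name__ == "__main__":
    fast, fast_cycles = run_merge([1, 4, 9], [2, 3, 10])
    slow, slow_cycles = run_merge([1, 4, 9], [2, 3, 10], delay_a=3, delay_b=5)
    print(fast, fast_cycles)
    print(slow, slow_cycles)
```

In this model, slowing the producers with `delay_a` and `delay_b` changes how many cycles the merge takes but not the merged output, because every instruction waits on channel readiness rather than on a fixed schedule. That is the sense in which the sketch is latency-insensitive; it says nothing about the performance figures reported in the abstract.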