Pushing the limits of accelerator efficiency while retaining programmability

2016 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2016-03-12 DOI:10.1109/HPCA.2016.7446051

Tony Nowatzki, Vinay Gangadhar, K. Sankaralingam, G. Wright


Citations: 43

Abstract

The waning benefits of device scaling have caused a push towards domain-specific accelerators (DSAs), which sacrifice programmability for efficiency. While providing huge benefits, DSAs are prone to obsolescence due to domain volatility, have recurring design and verification costs, and have large area footprints when multiple DSAs are required in a single device. Because of the benefits of generality, this work explores how far a programmable architecture can be pushed, and whether it can come close to the performance, energy, and area efficiency of a DSA-based approach. Our insight is that DSAs employ common specialization principles for concurrency, computation, communication, data reuse, and coordination, and that these same principles can be exploited in a programmable architecture using a composition of known microarchitectural mechanisms. Specifically, we propose and study an architecture called LSSD, which is composed of many low-power and tiny cores, each having a configurable spatial architecture, scratchpads, and DMA. Our results show that a programmable, specialized architecture can indeed be competitive with a domain-specific approach. Compared to four prominent and diverse DSAs, LSSD can match the DSAs' 10× to 150× speedup over an OOO core, with only up to 4× more area and power than a single DSA, while retaining programmability.


