A programmable parallel accelerator for learning and classification

S. Cadambi, Abhinandan Majumdar, M. Becchi, S. Chakradhar, H. Graf
{"title":"A programmable parallel accelerator for learning and classification","authors":"S. Cadambi, Abhinandan Majumdar, M. Becchi, S. Chakradhar, H. Graf","doi":"10.1145/1854273.1854309","DOIUrl":null,"url":null,"abstract":"For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. 
We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"106","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1854273.1854309","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 106

Abstract

For learning and classification workloads that operate on large amounts of unstructured data with stringent performance constraints, general purpose processor performance scales poorly with data size. In this paper, we present a programmable accelerator for this workload domain. To architect the accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. The proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses in-memory processing where on-chip memory blocks perform the secondary reduction operations. By doing so, the intermediate data are dynamically processed and never stored or sent off-chip. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features together allow MAPLE to scale its performance with data size. This paper describes the MAPLE architecture, explores its design space with a simulator, and illustrates how to automatically map application kernels to the hardware. We also implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5–10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz.
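The abstract's key formulation (a primary matrix/vector operation whose large intermediate output is consumed on the fly by a secondary reduction such as max/min, so it is never stored or sent off-chip) can be illustrated with a small software sketch. This is a hypothetical model for exposition only, not the MAPLE hardware or its programming interface: all function names are invented, and the "PE groups" are plain Python loops standing in for independent hardware groups, each of which would stream its rows from its own off-chip memory bank.

```python
# Illustrative software model of the MAPLE compute pattern described in
# the abstract (hypothetical names; the real design is a hardware grid
# of PEs, not Python code).
# Primary op: each PE group computes dot products (matrix-vector work).
# Secondary op: the reduction (here, a running max) consumes each
# intermediate result immediately, so the full score array is never
# materialized -- analogous to MAPLE's in-memory reduction.

def pe_group_max_score(weight_rows, x):
    """One PE group: scores its shard of rows against input vector x,
    reducing on the fly instead of storing all intermediate scores."""
    best_idx, best_score = -1, float("-inf")
    for idx, row in weight_rows:  # rows streamed from the group's own bank
        score = sum(w * xi for w, xi in zip(row, x))  # primary vector op
        if score > best_score:                        # secondary reduction
            best_idx, best_score = idx, score
    return best_idx, best_score

def classify(weights, x, num_groups=4):
    """Shard rows across independent PE groups (each modeling a group
    with a private off-chip memory bank), then combine the per-group
    partial reductions into a global argmax."""
    shards = [[] for _ in range(num_groups)]
    for idx, row in enumerate(weights):
        shards[idx % num_groups].append((idx, row))
    partials = [pe_group_max_score(s, x) for s in shards if s]
    return max(partials, key=lambda p: p[1])[0]  # winner across groups
```

Because each shard reduces locally and only one (index, score) pair crosses group boundaries, adding groups (and banks) scales the work without growing the intermediate traffic, which is the scaling argument the abstract makes for banked memory plus in-memory reduction.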