While academic FPGA architecture exploration tools have become sufficiently advanced to enable a wide variety of explorations and optimizations on the soft fabric and routing, support for Block RAM (BRAM) has been very limited. In this paper, we present enhancements to the COFFE transistor sizing tool to facilitate automatic generation and optimization of BRAM for both SRAM and Magnetic Tunnelling Junction technologies. These new capabilities enable investigation of area, delay, and energy trends across BRAM sizes and technologies. We also validate these trends against available commercial FPGA BRAM data. Furthermore, we demonstrate that BRAMs generated by COFFE can be used to carry out system-level architecture explorations using an area-oriented RAM-mapping flow and the Verilog-To-Routing flow.
"Don't Forget the Memory: Automatic Block RAM Modelling, Optimization, and Architecture Exploration" by S. Yazdanshenas, K. Tatsumura, Vaughn Betz. doi:10.1145/3020078.3021731
Interconnect synthesis tools ease the burden on the designer by automatically generating and optimizing communication hardware. In this paper we propose a novel capability for FPGA interconnect synthesis tools that further simplifies the designer's effort: automatic cycle-level synchronization of data delivery. This capability enables the creation of interconnect with significantly reduced hardware cost, provided that communicating modules have fixed latency and do not apply upstream backpressure. To do so, the designer specifies constraints on the lengths, in clock cycles, of multi-hop logical communication paths. The tool then uses an integer programming-based method to insert balancing registers into optimal locations, satisfying the designer's constraints while minimizing register usage. On an example convolutional neural network application, the new approach uses 43% less area than a FIFO-based synchronization scheme.
"Synchronization Constraints for Interconnect Synthesis" by A. Rodionov, Jonathan Rose. doi:10.1145/3020078.3021729
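The balancing step this paper describes amounts to a small integer program: one integer variable per interconnect edge counts the registers inserted on it, each designer-constrained multi-hop path must sum its fixed latencies plus inserted registers to the specified cycle count, and the objective minimizes the registers used. The sketch below illustrates that formulation with the PuLP solver and made-up edges, latencies, and path constraints; it is not the authors' tool.

```python
# Illustrative ILP for balancing-register insertion (hypothetical data, not the
# paper's implementation). Requires PuLP: pip install pulp
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus

edges = ["a", "b", "c", "d"]                      # interconnect edges
base_latency = {"a": 1, "b": 1, "c": 1, "d": 2}   # fixed latency of each edge (cycles)
paths = {                                         # multi-hop paths: (edges, required cycles)
    "p0": (["a", "b"], 4),
    "p1": (["a", "c", "d"], 6),
}

prob = LpProblem("register_balancing", LpMinimize)
regs = {e: LpVariable(f"r_{e}", lowBound=0, cat="Integer") for e in edges}

# Objective: minimize the total number of inserted balancing registers.
prob += lpSum(regs[e] for e in edges)

# Each constrained path must take exactly the designer-specified number of cycles.
for name, (path_edges, cycles) in paths.items():
    prob += lpSum(base_latency[e] + regs[e] for e in path_edges) == cycles, name

prob.solve()
print(LpStatus[prob.status], {e: int(regs[e].value()) for e in edges})
```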
We present a design-time tool, EASTA, that combines the reconfigurability of FPGAs with Dynamic Frequency Scaling to realize an efficient multiprocessing scheduler on a single-FPGA system. Multiple deadlines, reconvergent nodes, flow dependencies, and processor constraints of the multiprocessor scheduling problem on general task graphs are rigorously taken into consideration. EASTA is able to determine the minimum number of processing elements required to create a feasible schedule and dynamically adjust the clock speed of each processing element to reclaim slack. The schedule is represented by an efficient tree-based lookup table. We evaluate the EASTA tool using randomly generated task graphs and demonstrate that our framework produces energy savings of 39.41% and 33% for task graphs of size 9.
"An Energy-Efficient Design-Time Scheduler for FPGAs Leveraging Dynamic Frequency Scaling Emulation (Abstract Only)" by W. Loke, Chin Yang Koay. doi:10.1145/3020078.3021805
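As a toy illustration of the slack-reclamation idea, the snippet below lowers each processing element's clock just enough for its task to finish at its deadline. The task sizes, deadlines, and nominal clock are invented for the example, and the paper's energy model, tree-based schedule lookup, and PE-count minimization are not reproduced here.

```python
# Toy slack-based frequency scaling: clock each PE only as fast as its deadline
# requires. All numbers below are hypothetical; this is not the EASTA tool.

def scaled_frequency(cycles, deadline_s, f_nominal_hz):
    """Lowest clock (capped at nominal) that still finishes `cycles` by the deadline."""
    return min(f_nominal_hz, cycles / deadline_s)

f_nom = 100e6  # assumed 100 MHz nominal PE clock
tasks = [
    {"name": "t0", "cycles": 2_000_000, "deadline_s": 0.040},
    {"name": "t1", "cycles": 1_000_000, "deadline_s": 0.025},
]

for t in tasks:
    f = scaled_frequency(t["cycles"], t["deadline_s"], f_nom)
    slack_ms = (t["deadline_s"] - t["cycles"] / f_nom) * 1e3
    print(f"{t['name']}: {slack_ms:.1f} ms of slack at nominal clock, "
          f"so run at {f / 1e6:.1f} MHz instead")
```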
Dennis D. Weller, Fabian Oboril, D. Lukarski, J. Becker, M. Tahoori
Scientific computing is an indispensable part of modern life, used in large-scale high-performance systems as well as in low-power smart cyber-physical systems. Accelerators for scientific computing therefore need to be fast and energy efficient, and partial differential equations (PDEs), as an integral component of many scientific computing tasks, require efficient implementation. In this regard, FPGAs are well suited for data-parallel computations such as those in PDE solvers. However, including FPGAs in the programming flow is not trivial, as hardware description languages (HDLs) have to be used, which requires detailed knowledge of the underlying hardware. This issue is tackled by OpenCL, which allows developers to write standardized code in a C-like fashion, rendering experience with HDLs unnecessary. Yet, hiding the underlying hardware from the developer makes it challenging to implement solvers that exploit the full FPGA potential. In this work we therefore propose a comprehensive set of generic and specific optimization techniques for PDE solvers using OpenCL that improve FPGA performance and energy efficiency by orders of magnitude. Based on these optimizations, our study shows that, despite the high abstraction level of OpenCL, very energy-efficient PDE accelerators can be designed on the FPGA fabric, making the FPGA an ideal solution for power-constrained applications.
"Energy Efficient Scientific Computing on FPGAs using OpenCL" by Dennis D. Weller, Fabian Oboril, D. Lukarski, J. Becker, M. Tahoori. doi:10.1145/3020078.3021730
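The abstract does not name the specific PDE solvers, so as a stand-in the sketch below runs a Jacobi sweep for the 2-D Laplace equation in NumPy. It shows only the data-parallel stencil arithmetic that such solvers repeat; in the paper's flow the equivalent update would be written as an OpenCL kernel and then tuned with the proposed FPGA-specific optimizations.

```python
# Jacobi iteration for the 2-D Laplace equation: a representative data-parallel
# stencil, not necessarily one of the solvers evaluated in the paper.
import numpy as np

def jacobi_step(u):
    """One sweep: each interior point becomes the mean of its four neighbours."""
    new = u.copy()
    new[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                              u[1:-1, :-2] + u[1:-1, 2:])
    return new

u = np.zeros((64, 64))
u[0, :] = 1.0                       # fixed boundary value on the top edge
for _ in range(500):
    u = jacobi_step(u)
print("max pointwise change in last sweep:", np.abs(jacobi_step(u) - u).max())
```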
"Session details: Interconnect and Routing" by S. Kaptanoglu. doi:10.1145/3257185
Many applications that operate on large graphs can be intuitively parallelized by executing a large number of graph operations concurrently, treating them as transactions to deal with potential conflicts. However, large numbers of operations executing concurrently can incur so many conflicts that the potential benefits of parallelization are negated, which has likely made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programmability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present the challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that, despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) compared to a baseline platform containing two Intel Haswell processors, each with 12 cores.
"FPGA-Accelerated Transactional Execution of Graph Workloads" by Xiaoyu Ma, Dan Zhang, Derek Chiou. doi:10.1145/3020078.3021743
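The execution style this architecture targets can be modelled in software as optimistic transactions over vertex data: each graph operation records what it read and buffers what it wrote, and at commit time a conflict check against previously committed writes decides whether it commits or retries. The sketch below is that conceptual model only; it says nothing about the hardware's actual conflict-detection mechanism.

```python
# Conceptual model of optimistic, transactional graph operations with read/write-set
# conflict detection. Purely illustrative; not the paper's hardware mechanism.

class Transaction:
    def __init__(self):
        self.read_set, self.write_buf = set(), {}

    def read(self, graph, v):
        self.read_set.add(v)
        return self.write_buf.get(v, graph[v])

    def write(self, v, value):
        self.write_buf[v] = value

def try_commit(txn, graph, committed_writes):
    """Commit iff nothing this transaction read was written by an earlier commit."""
    if txn.read_set & committed_writes:
        return False                       # conflict detected: caller should retry
    graph.update(txn.write_buf)
    committed_writes.update(txn.write_buf)
    return True

# Two concurrent relaxations of vertex 2: the second sees a conflict and must retry.
graph = {1: 5, 2: 9, 3: 7}
committed = set()
t1, t2 = Transaction(), Transaction()
t1.write(2, min(t1.read(graph, 2), t1.read(graph, 1) + 1))
t2.write(2, min(t2.read(graph, 2), t2.read(graph, 3) + 1))
print(try_commit(t1, graph, committed), try_commit(t2, graph, committed))  # True False
```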
"Session details: Applications" by M. Leeser. doi:10.1145/3257192
Yixing Li, Zichuan Liu, Kai Xu, Hao Yu, Fengbo Ren
FPGA-based hardware accelerators for convolutional neural networks (CNNs) have attracted great attention due to their higher energy efficiency than GPUs. However, it has been a challenge for FPGA-based solutions to achieve higher throughput than their GPU counterparts. In this paper, we demonstrate that FPGA acceleration can be a superior solution in terms of both throughput and energy efficiency when a CNN is trained with binary constraints on weights and activations. Specifically, we propose an optimized accelerator architecture tailored for bitwise convolution and normalization that features massive spatial parallelism with deep pipeline (temporal parallelism) stages. Experimental results show that the proposed architecture running at 90 MHz on a Xilinx Virtex-7 FPGA achieves a computing throughput of 7.663 TOPS with a power consumption of 8.2 W regardless of the batch size of the input data. This is 8.3x faster and 75x more energy-efficient than a Titan X GPU for processing online individual requests (small batch sizes). For processing static data (large batch sizes), the proposed solution is on a par with a Titan X GPU in terms of throughput while delivering 9.5x higher energy efficiency.
"A 7.663-TOPS 8.2-W Energy-efficient FPGA Accelerator for Binary Convolutional Neural Networks (Abstract Only)" by Yixing Li, Zichuan Liu, Kai Xu, Hao Yu, Fengbo Ren. doi:10.1145/3020078.3021786
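When weights and activations are constrained to +1/-1, the multiply-accumulate at the heart of convolution collapses to an XNOR of packed bit vectors followed by a population count. The snippet below checks that equivalence in plain Python; the bit packing, the mapping of {-1, +1} to {0, 1}, and the vector length are illustrative choices, and the paper's pipelined hardware is not modelled.

```python
# Bitwise (XNOR + popcount) dot product for binarized weights/activations,
# checked against the ordinary +/-1 dot product. Illustrative sketch only.
import random

def pack(bits):
    """Pack a list of {-1,+1} values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, b in enumerate(bits):
        if b == 1:
            word |= 1 << i
    return word

def xnor_popcount_dot(wa, xa, n):
    """Dot product of two {-1,+1} vectors from their packed representations."""
    xnor = ~(wa ^ xa) & ((1 << n) - 1)      # 1 wherever the two signs agree
    matches = bin(xnor).count("1")
    return 2 * matches - n                   # agreements minus disagreements

n = 64
w = [random.choice((-1, 1)) for _ in range(n)]
x = [random.choice((-1, 1)) for _ in range(n)]
assert xnor_popcount_dot(pack(w), pack(x), n) == sum(wi * xi for wi, xi in zip(w, x))
print("bitwise dot product matches the +/-1 dot product")
```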
U. Aydonat, Shane O'Connell, D. Capalija, A. Ling, Gordon R. Chiu
Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL(TM), which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we achieve a performance of 1020 img/s, or 23 img/s/W, on the AlexNet CNN benchmark. This corresponds to 1382 GFLOPS and is 10x faster, with 8.4x more GFLOPS and 5.8x better efficiency, than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.
"An OpenCL™ Deep Learning Accelerator on Arria 10" by U. Aydonat, Shane O'Connell, D. Capalija, A. Ling, Gordon R. Chiu. doi:10.1145/3020078.3021738
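The Winograd transform mentioned in the abstract reduces the multiplications needed per output; for the 1-D F(2,3) case (two outputs of a 3-tap filter computed from a four-element input tile), the standard transform matrices produce the result with 4 element-wise multiplications instead of 6. The NumPy sketch below checks that identity; it illustrates the transform itself, not the DLA's implementation of it.

```python
# Winograd minimal filtering F(2,3): 2 outputs of a 3-tap convolution from a
# 4-element input tile using 4 element-wise multiplies (direct form needs 6).
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1,  1, 0],
                [0, -1, 1, 0],
                [0, 1,  0, -1]], dtype=float)
G   = np.array([[1.0, 0.0, 0.0],
                [0.5, 0.5, 0.5],
                [0.5, -0.5, 0.5],
                [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile (arbitrary values)
g = np.array([0.5, -1.0, 2.0])       # 3-tap filter (arbitrary values)

winograd = A_T @ ((G @ g) * (B_T @ d))           # the 4 multiplies happen here
direct = np.array([d[0:3] @ g, d[1:4] @ g])      # plain sliding-window form
assert np.allclose(winograd, direct)
print("F(2,3) Winograd output matches direct convolution:", winograd)
```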
Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, W. Dally
Long Short-Term Memory (LSTM) is widely used in speech recognition. In order to achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation-intensive and memory-intensive. Deploying such a bulky model results in high power consumption and leads to a high total cost of ownership (TCO) of a data center. To speed up prediction and make it energy efficient, we first propose a load-balance-aware pruning method that can compress the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is friendly for parallel processing. Next, we propose a scheduler that encodes and partitions the compressed model across multiple PEs for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named the Efficient Speech Recognition Engine (ESE), which works directly on the sparse LSTM model. Implemented on a Xilinx KU060 FPGA running at 200 MHz, ESE achieves a performance of 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 Watts. Evaluated on the LSTM for speech recognition benchmark, ESE is 43x and 3x faster than Core i7 5930k CPU and Pascal Titan X GPU implementations, and achieves 40x and 11.5x higher energy efficiency compared with the CPU and GPU, respectively.
"ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA" by Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, W. Dally. doi:10.1145/3020078.3021745
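The load-balance-aware pruning idea can be approximated in a few lines: instead of keeping the largest-magnitude weights globally, the matrix rows are first interleaved across PEs and the magnitude threshold is applied within each PE's share, so every PE ends up with the same number of non-zeros and no PE idles waiting for the others. The sketch below is a simplified software illustration under those assumptions, not the ESE tool flow; the 10x figure corresponds to the pruning portion of the 20x compression reported in the abstract.

```python
# Simplified sketch of load-balance-aware magnitude pruning: prune each PE's
# share of the weight matrix separately so all PEs keep the same nonzero count.
import numpy as np

def load_balanced_prune(W, num_pes, keep_ratio):
    """Zero out small-magnitude weights per PE partition (rows interleaved across PEs)."""
    W = W.copy()
    for pe in range(num_pes):
        part = W[pe::num_pes, :]                     # this PE's rows (a view into W)
        k = int(round(keep_ratio * part.size))       # nonzeros kept per PE
        thresh = np.sort(np.abs(part), axis=None)[-k]
        part[np.abs(part) < thresh] = 0.0            # in-place update through the view
    return W

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
Wp = load_balanced_prune(W, num_pes=4, keep_ratio=0.1)   # roughly 10x from pruning
per_pe = [np.count_nonzero(Wp[pe::4, :]) for pe in range(4)]
print("nonzeros per PE:", per_pe)   # balanced by construction
```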