TABLA: A unified template-based framework for accelerating statistical machine learning

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, A. Yazdanbakhsh, J. Kim, H. Esmaeilzadeh
{"title":"TABLA: A unified template-based framework for accelerating statistical machine learning","authors":"Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, A. Yazdanbakhsh, J. Kim, H. Esmaeilzadeh","doi":"10.1109/HPCA.2016.7446050","DOIUrl":null,"url":null,"abstract":"A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. Tabla then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use Tabla to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"154","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2016.7446050","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 154

Abstract

A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write fewer than 50 lines of code.
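The programming model described in the abstract, a fixed stochastic gradient descent solver paired with a task-specific gradient of the objective function, can be made concrete with a short sketch. The snippet below is not TABLA's high-level language (which this page does not show) nor its generated hardware; it is a plain Python analogy with illustrative names (`sgd_solver`, `linear_regression_gradient`, `logistic_regression_gradient`) showing how the solver stays fixed while only the gradient supplied for each learning task changes.

```python
# Illustrative sketch only: names and structure are assumptions, not TABLA's API.
import numpy as np


def sgd_solver(gradient_fn, data, labels, dim, lr=0.01, epochs=10):
    """Fixed solver: iterates over training samples and applies the
    user-supplied gradient of the objective function."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in zip(data, labels):
            w -= lr * gradient_fn(w, x, y)
    return w


# Per-algorithm part: only the gradient of the objective changes.

def linear_regression_gradient(w, x, y):
    # Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w.
    return (np.dot(w, x) - y) * x


def logistic_regression_gradient(w, x, y):
    # Gradient of the logistic loss for a label y in {0, 1}.
    pred = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (pred - y) * x


# The same fixed solver trains two different models.
X = np.random.randn(100, 4)
y_lin = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(100)
y_log = (X @ np.array([1.0, -1.0, 0.5, 2.0]) > 0).astype(float)

w_linear = sgd_solver(linear_regression_gradient, X, y_lin, dim=4)
w_logistic = sgd_solver(logistic_regression_gradient, X, y_log, dim=4)
```

In TABLA itself, the analogous split is between the hand-optimized accelerator templates, which implement the fixed solver, and the programmer-supplied gradient of the objective function, from which the framework generates the synthesizable FPGA implementation.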
Latest articles from this journal:
A low power software-defined-radio baseband processor for the Internet of Things
Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
MaPU: A novel mathematical computing architecture
A low-power hybrid reconfigurable architecture for resistive random-access memories
PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory