TABLA: A unified template-based framework for accelerating statistical machine learning

Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, A. Yazdanbakhsh, J. Kim, H. Esmaeilzadeh
{"title":"TABLA: A unified template-based framework for accelerating statistical machine learning","authors":"Divya Mahajan, Jongse Park, Emmanuel Amaro, Hardik Sharma, A. Yazdanbakhsh, J. Kim, H. Esmaeilzadeh","doi":"10.1109/HPCA.2016.7446050","DOIUrl":null,"url":null,"abstract":"A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. Tabla then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use Tabla to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti and Tesla, respectively. These benefits are achieved while the programmers write less than 50 lines of code.","PeriodicalId":417994,"journal":{"name":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"154","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2016.7446050","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 154

Abstract

A growing number of commercial and enterprise systems increasingly rely on compute-intensive Machine Learning (ML) algorithms. While the demand for these compute-intensive applications is growing, the performance benefits from general-purpose platforms are diminishing. Field Programmable Gate Arrays (FPGAs) provide a promising path forward to accommodate the needs of machine learning algorithms and represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors. However, acceleration with FPGAs still requires long development cycles and extensive expertise in hardware design. To tackle this challenge, instead of designing an accelerator for a single machine learning algorithm, we present TABLA, a framework that generates accelerators for a class of machine learning algorithms. The key is to identify the commonalities across a wide range of machine learning algorithms and utilize this commonality to provide a high-level abstraction for programmers. TABLA leverages the insight that many learning algorithms can be expressed as a stochastic optimization problem. Therefore, learning becomes solving an optimization problem using stochastic gradient descent that minimizes an objective function over the training data. The gradient descent solver is fixed while the objective function changes for different learning algorithms. TABLA provides a template-based framework to accelerate this class of learning algorithms. Therefore, a developer can specify the learning task by only expressing the gradient of the objective function using our high-level language. TABLA then automatically generates the synthesizable implementation of the accelerator for FPGA realization using a set of hand-optimized templates. We use TABLA to generate accelerators for ten different learning tasks targeted at a Xilinx Zynq FPGA platform. We rigorously compare the benefits of FPGA acceleration to multi-core CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650 Ti, and Tesla K40) using real hardware measurements. TABLA-generated accelerators provide 19.4x and 2.9x average speedup over the ARM and Xeon processors, respectively. These accelerators provide 17.57x, 20.2x, and 33.4x higher Performance-per-Watt in comparison to Tegra, GTX 650 Ti, and Tesla, respectively. These benefits are achieved while the programmers write fewer than 50 lines of code.
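The programming model described in the abstract, a fixed stochastic gradient descent solver paired with a task-specific gradient of the objective function, can be made concrete with a short sketch. The snippet below is not TABLA's high-level language (which this page does not show) nor its generated hardware; it is a plain Python analogy with illustrative names (`sgd_solver`, `linear_regression_gradient`, `logistic_regression_gradient`) showing how the solver stays fixed while only the gradient supplied for each learning task changes.

```python
# Illustrative sketch only: names and structure are assumptions, not TABLA's API.
import numpy as np


def sgd_solver(gradient_fn, data, labels, dim, lr=0.01, epochs=10):
    """Fixed solver: iterates over training samples and applies the
    user-supplied gradient of the objective function."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in zip(data, labels):
            w -= lr * gradient_fn(w, x, y)
    return w


# Per-algorithm part: only the gradient of the objective changes.

def linear_regression_gradient(w, x, y):
    # Gradient of the squared error 0.5 * (w.x - y)^2 with respect to w.
    return (np.dot(w, x) - y) * x


def logistic_regression_gradient(w, x, y):
    # Gradient of the logistic loss for a label y in {0, 1}.
    pred = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (pred - y) * x


# The same fixed solver trains two different models.
X = np.random.randn(100, 4)
y_lin = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * np.random.randn(100)
y_log = (X @ np.array([1.0, -1.0, 0.5, 2.0]) > 0).astype(float)

w_linear = sgd_solver(linear_regression_gradient, X, y_lin, dim=4)
w_logistic = sgd_solver(logistic_regression_gradient, X, y_log, dim=4)
```

In TABLA itself, the analogous split is between the hand-optimized accelerator templates, which implement the fixed solver, and the programmer-supplied gradient of the objective function, from which the framework generates the synthesizable FPGA implementation.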
Latest articles from this journal:
A low power software-defined-radio baseband processor for the Internet of Things
Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing
MaPU: A novel mathematical computing architecture
A low-power hybrid reconfigurable architecture for resistive random-access memories
PleaseTM: Enabling transaction conflict management in requester-wins hardware transactional memory