T-DLA: An Open-source Deep Learning Accelerator for Ternarized DNN Models on Embedded FPGA
Yao Chen, Kaili Zhang, Cheng Gong, Cong Hao, Xiaofan Zhang, Tao Li, Deming Chen
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 13-18, July 2019
DOI: 10.1109/ISVLSI.2019.00012
Citations: 28

Abstract

Deep Neural Networks (DNNs) have become promising solutions for data analysis, especially for processing raw data from sensors. However, DNN-based approaches can easily introduce heavy computation and memory demands, which may make direct deployment onto Internet of Things (IoT) devices infeasible, since such devices have strict constraints on hardware resources, power budget, response latency, and manufacturing cost. To bring DNNs into IoT devices, embedded FPGAs are among the most suitable candidates, offering better energy efficiency than GPU- and CPU-based solutions and higher flexibility than ASICs. In this paper, we propose a systematic solution for deploying DNNs on embedded FPGAs, which includes a ternarized hardware Deep Learning Accelerator (T-DLA) and a framework for ternary neural network (TNN) training. T-DLA is a highly optimized hardware unit in the FPGA specialized for accelerating TNNs, while the proposed training framework compresses the DNN parameters down to two bits with little accuracy drop. Results show that our training framework can compress a DNN by up to 14.14x while maintaining nearly the same accuracy as the floating-point version. With the proposed design techniques, T-DLA delivers up to 0.4 TOPS at 2.576 W power consumption, achieving 873.6x and 5.1x higher energy efficiency (fps/W) than a Xeon E5-2630 CPU and an Nvidia 1080 Ti GPU, respectively, when running ResNet-18 on ImageNet. To the best of our knowledge, this is the first instruction-based, highly efficient ternary DLA design reported in the literature.
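The abstract states that the training framework compresses DNN parameters to two bits, but it does not describe the quantizer itself. The Python sketch below illustrates the general idea of ternarization under stated assumptions: it uses the threshold heuristic from Ternary Weight Networks (threshold = 0.7 · mean|w|, one scale factor per tensor) rather than T-DLA's own training method, and the 2-bit code assignment and the helper names `ternarize`, `pack`, `unpack` and `ternary_dot` are hypothetical. It also shows why ternary weights suit a hardware accelerator: the inner product needs only additions and subtractions, with a single multiplication by the scale factor at the end.

```python
from typing import List, Tuple

# Hypothetical 2-bit encoding (an assumption, not T-DLA's format):
# 0 -> 0b00, +1 -> 0b01, -1 -> 0b10. The pattern 0b11 is never produced.
_ENCODE = {0: 0b00, 1: 0b01, -1: 0b10}
_DECODE = {v: k for k, v in _ENCODE.items()}


def ternarize(weights: List[float]) -> Tuple[List[int], float]:
    """Map float weights to codes in {-1, 0, +1} plus one scale factor alpha.

    Uses the Ternary Weight Networks heuristic (assumed here, not taken from
    the paper): delta = 0.7 * mean|w|, and alpha = mean |w| over the weights
    whose magnitude exceeds delta.
    """
    if not weights:
        return [], 0.0
    delta = 0.7 * sum(abs(w) for w in weights) / len(weights)
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in weights]
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    alpha = sum(kept) / len(kept) if kept else 0.0
    return codes, alpha


def pack(codes: List[int]) -> bytes:
    """Pack four 2-bit codes per byte, lowest bit pair first."""
    out = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        out[i // 4] |= _ENCODE[c] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n: int) -> List[int]:
    """Recover the first n ternary codes from packed bytes."""
    return [_DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


def ternary_dot(codes: List[int], alpha: float, x: List[float]) -> float:
    """Multiplication-free inner product: add inputs where the code is +1,
    subtract where it is -1, skip zeros, then scale once by alpha."""
    acc = 0.0
    for c, xi in zip(codes, x):
        if c == 1:
            acc += xi
        elif c == -1:
            acc -= xi
    return alpha * acc


if __name__ == "__main__":
    w = [0.9, -0.05, -1.1, 0.02, 0.6, -0.7, 0.0, 0.3]
    codes, alpha = ternarize(w)
    packed = pack(codes)
    print("codes:", codes, "alpha:", round(alpha, 3))
    print("float32 bytes:", 4 * len(w), "packed bytes:", len(packed))
    print("dot:", ternary_dot(codes, alpha, [1.0, 2, 3, 4, 5, 6, 7, 8]))
```

Assuming a 32-bit floating-point baseline, storing one 2-bit code per weight bounds the compression at 32 / 2 = 16x, before counting the per-tensor scale factors; the abstract reports up to 14.14x for whole networks without stating how the remaining storage is spent.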