Yao Chen, Kaili Zhang, Cheng Gong, Cong Hao, Xiaofan Zhang, Tao Li, Deming Chen
Published in: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 13-18, 2019-07-01
DOI: 10.1109/ISVLSI.2019.00012 (https://doi.org/10.1109/ISVLSI.2019.00012)
Citations: 28
T-DLA: An Open-source Deep Learning Accelerator for Ternarized DNN Models on Embedded FPGA
Deep Neural Networks (DNNs) have become a promising solution for data analysis, especially for processing raw sensor data. However, DNN-based approaches impose heavy computation and memory demands, which can make direct deployment on Internet of Things (IoT) devices infeasible, since these devices have strict constraints on hardware resources, power budget, response latency, and manufacturing cost. Embedded FPGAs are among the most suitable candidates for bringing DNNs to IoT devices: they offer better energy efficiency than GPU- and CPU-based solutions and higher flexibility than ASICs. In this paper, we propose a systematic solution for deploying DNNs on embedded FPGAs, consisting of a ternarized hardware Deep Learning Accelerator (T-DLA) and a framework for ternary neural network (TNN) training. T-DLA is a highly optimized FPGA hardware unit specialized for accelerating TNNs, while the proposed framework compresses DNN parameters down to two bits with little accuracy drop. Results show that our training framework can compress a DNN by up to 14.14x while maintaining nearly the same accuracy as the floating-point version. With the proposed design techniques, T-DLA delivers up to 0.4 TOPS at 2.576 W power consumption, showing 873.6x and 5.1x higher energy efficiency (fps/W) on ImageNet with the ResNet-18 model compared to a Xeon E5-2630 CPU and an Nvidia 1080 Ti GPU, respectively. To the best of our knowledge, this is the first instruction-based, highly efficient ternary DLA design reported in the literature.
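To make the two-bit compression claim concrete, the following is a minimal Python sketch of weight ternarization and 2-bit packing. It is not the T-DLA training framework, which the abstract does not describe. The threshold rule (0.7 times the mean absolute weight) is a heuristic borrowed from the ternary-weight literature and is an assumption here, as are the function names `ternarize`, `pack_2bit`, and `unpack_2bit`, the bit encoding, and the sample weights.

```python
from typing import List, Tuple


def ternarize(weights: List[float], ratio: float = 0.7) -> Tuple[List[int], float]:
    """Map each weight to -1, 0, or +1 and return a shared scale factor.

    Assumption: threshold = ratio * mean(|w|), a common heuristic and not
    necessarily the rule used by the T-DLA framework.
    """
    if not weights:
        return [], 0.0
    threshold = ratio * sum(abs(w) for w in weights) / len(weights)
    codes = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]
    # The scale is the mean magnitude of the weights that were not zeroed.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


def pack_2bit(codes: List[int]) -> bytes:
    """Pack ternary codes four per byte (hypothetical encoding:
    0 -> 0b00, +1 -> 0b01, -1 -> 0b10)."""
    encoding = {0: 0b00, 1: 0b01, -1: 0b10}
    out = bytearray((len(codes) + 3) // 4)
    for i, c in enumerate(codes):
        out[i // 4] |= encoding[c] << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, count: int) -> List[int]:
    """Inverse of pack_2bit for the first `count` codes."""
    decoding = {0b00: 0, 0b01: 1, 0b10: -1}
    return [decoding[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(count)]


# Made-up example weights, for illustration only.
weights = [0.9, -0.05, 0.4, -0.7, 0.02, -0.3, 0.6, 0.1]
codes, scale = ternarize(weights)
packed = pack_2bit(codes)

# Storage ratio relative to 32-bit floats (4 bytes per weight).
ratio_vs_fp32 = (4 * len(weights)) / len(packed)
print(codes, round(scale, 2), len(packed), ratio_vs_fp32)
```

Going from 32-bit floats to 2-bit codes gives a 16x reduction at most, so this toy example reaches exactly 16x. The 14.14x reported above is below that bound, and the abstract does not explain the gap.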