Parallelized Convolutions for Embedded Ultra Low Power Deep Learning SoC

L. Cunial, Ahmet Erdem, C. Silvano, M. Falchetto, Andrea C. Ornstein, Emanuele Plebani, G. Desoli, D. Pau
{"title":"Parallelized Convolutions for Embedded Ultra Low Power Deep Learning SoC","authors":"L. Cunial, Ahmet Erdem, C. Silvano, M. Falchetto, Andrea C. Ornstein, Emanuele Plebani, G. Desoli, D. Pau","doi":"10.1109/RTSI.2018.8548362","DOIUrl":null,"url":null,"abstract":"Deep Convolutional Neural Networks (DCNNs) achieve state of the art results compared to classic machine learning in many applications that need recognition, identification and classification. An ever-increasing embedded deployment of DCNNs inference engines thus supporting the intelligence close to the sensor paradigm has been observed, overcoming limitations of cloud-based computing as bandwidth requirements, security, privacy, scalability, and responsiveness. However, increasing the robustness and accuracy of DCNNs comes at the price of increased computational cost. As result, implementing CNNs on embedded devices with real-time constraints is a challenge if the lowest power consumption shall be achieved. A solution to the challenge is to take advantage of the intra-device massive fine grain parallelism offered by these systems and benefit from the extensive concurrency exhibited by DCNN processing pipelines. The trick is to divide intensive tasks into smaller, weakly interacting batches subject to parallel processing. Referred to that, this paper has mainly two goals: 1) describe the implementation of a state-of-art technique to map DCNN most intensive tasks (dominated by multiply-and-accumulate ops) onto Orlando SoC, an ultra-low power heterogeneous multi cores developed by STMicroelectronics; 2) integrate the proposed implementation on a toolchain that allows deep learning developers to deploy DCNNs on low-power applications.","PeriodicalId":363896,"journal":{"name":"2018 IEEE 4th International Forum on Research and Technology for Society and Industry (RTSI)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 4th International Forum on Research and Technology for Society and Industry (RTSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/RTSI.2018.8548362","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Deep Convolutional Neural Networks (DCNNs) achieve state-of-the-art results compared to classic machine learning in many applications that require recognition, identification, and classification. An ever-increasing embedded deployment of DCNN inference engines, supporting the intelligence-close-to-the-sensor paradigm, has been observed, overcoming limitations of cloud-based computing such as bandwidth requirements, security, privacy, scalability, and responsiveness. However, increasing the robustness and accuracy of DCNNs comes at the price of increased computational cost. As a result, implementing CNNs on embedded devices under real-time constraints is a challenge if the lowest power consumption is to be achieved. A solution to this challenge is to exploit the massive intra-device fine-grained parallelism offered by these systems and to benefit from the extensive concurrency exhibited by DCNN processing pipelines. The trick is to divide intensive tasks into smaller, weakly interacting batches suitable for parallel processing. To that end, this paper has two main goals: 1) describe the implementation of a state-of-the-art technique to map the most intensive DCNN tasks (dominated by multiply-and-accumulate operations) onto Orlando, an ultra-low-power heterogeneous multi-core SoC developed by STMicroelectronics; 2) integrate the proposed implementation into a toolchain that allows deep learning developers to deploy DCNNs in low-power applications.
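The abstract does not detail the Orlando kernels themselves, so the following is only a minimal C sketch of the batching idea it describes: a MAC-dominated, single-channel convolution is split into weakly interacting batches (here, bands of output rows) that run in parallel, one per worker. POSIX threads stand in for the SoC's actual DSP-cluster runtime, and all identifiers (conv_task_t, conv_band, NUM_WORKERS, and the toy sizes) are hypothetical.

/*
 * Illustrative sketch, not the paper's implementation: bands of output
 * rows are computed concurrently, sharing only read-only inputs, so no
 * synchronization is needed inside a batch.
 */
#include <pthread.h>
#include <stdio.h>

#define IN_H 16   /* input plane height */
#define IN_W 16   /* input plane width  */
#define K    3    /* square kernel size */
#define OUT_H (IN_H - K + 1)
#define OUT_W (IN_W - K + 1)
#define NUM_WORKERS 4

typedef struct {
    const float *in;      /* IN_H x IN_W input plane        */
    const float *kernel;  /* K x K filter                   */
    float *out;           /* OUT_H x OUT_W output plane     */
    int row_begin;        /* first output row of this batch */
    int row_end;          /* one past the last output row   */
} conv_task_t;

/* Worker: compute one private band of output rows. */
static void *conv_band(void *arg)
{
    const conv_task_t *t = (const conv_task_t *)arg;
    for (int oy = t->row_begin; oy < t->row_end; ++oy)
        for (int ox = 0; ox < OUT_W; ++ox) {
            float acc = 0.0f;  /* multiply-and-accumulate loop */
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += t->in[(oy + ky) * IN_W + (ox + kx)]
                         * t->kernel[ky * K + kx];
            t->out[oy * OUT_W + ox] = acc;
        }
    return NULL;
}

int main(void)
{
    float in[IN_H * IN_W], kernel[K * K], out[OUT_H * OUT_W];
    for (int i = 0; i < IN_H * IN_W; ++i) in[i] = 1.0f;
    for (int i = 0; i < K * K; ++i) kernel[i] = 1.0f / (K * K);

    pthread_t threads[NUM_WORKERS];
    conv_task_t tasks[NUM_WORKERS];
    int band = (OUT_H + NUM_WORKERS - 1) / NUM_WORKERS;  /* ceil split */

    for (int w = 0; w < NUM_WORKERS; ++w) {
        tasks[w] = (conv_task_t){ in, kernel, out,
                                  w * band, (w + 1) * band };
        if (tasks[w].row_end > OUT_H) tasks[w].row_end = OUT_H;
        pthread_create(&threads[w], NULL, conv_band, &tasks[w]);
    }
    for (int w = 0; w < NUM_WORKERS; ++w)
        pthread_join(threads[w], NULL);

    /* All-ones input with an averaging kernel: every output is 1.0. */
    printf("out[0] = %f (expect 1.0)\n", out[0]);
    return 0;
}

Compiled with a POSIX toolchain (e.g. gcc conv.c -lpthread), this runs as-is; on a heterogeneous multi-core SoC the same row-band partitioning would instead be dispatched to accelerator cores by the vendor runtime.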