Parallelized Convolutions for Embedded Ultra Low Power Deep Learning SoC

L. Cunial, Ahmet Erdem, C. Silvano, M. Falchetto, Andrea C. Ornstein, Emanuele Plebani, G. Desoli, D. Pau
{"title":"Parallelized Convolutions for Embedded Ultra Low Power Deep Learning SoC","authors":"L. Cunial, Ahmet Erdem, C. Silvano, M. Falchetto, Andrea C. Ornstein, Emanuele Plebani, G. Desoli, D. Pau","doi":"10.1109/RTSI.2018.8548362","DOIUrl":null,"url":null,"abstract":"Deep Convolutional Neural Networks (DCNNs) achieve state of the art results compared to classic machine learning in many applications that need recognition, identification and classification. An ever-increasing embedded deployment of DCNNs inference engines thus supporting the intelligence close to the sensor paradigm has been observed, overcoming limitations of cloud-based computing as bandwidth requirements, security, privacy, scalability, and responsiveness. However, increasing the robustness and accuracy of DCNNs comes at the price of increased computational cost. As result, implementing CNNs on embedded devices with real-time constraints is a challenge if the lowest power consumption shall be achieved. A solution to the challenge is to take advantage of the intra-device massive fine grain parallelism offered by these systems and benefit from the extensive concurrency exhibited by DCNN processing pipelines. The trick is to divide intensive tasks into smaller, weakly interacting batches subject to parallel processing. Referred to that, this paper has mainly two goals: 1) describe the implementation of a state-of-art technique to map DCNN most intensive tasks (dominated by multiply-and-accumulate ops) onto Orlando SoC, an ultra-low power heterogeneous multi cores developed by STMicroelectronics; 2) integrate the proposed implementation on a toolchain that allows deep learning developers to deploy DCNNs on low-power applications.","PeriodicalId":363896,"journal":{"name":"2018 IEEE 4th International Forum on Research and Technology for Society and Industry (RTSI)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 4th International Forum on Research and Technology for Society and Industry (RTSI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/RTSI.2018.8548362","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Deep Convolutional Neural Networks (DCNNs) achieve state-of-the-art results compared to classic machine learning in many applications that require recognition, identification, and classification. An ever-increasing embedded deployment of DCNN inference engines, supporting the intelligence-close-to-the-sensor paradigm, has been observed, overcoming limitations of cloud-based computing such as bandwidth requirements, security, privacy, scalability, and responsiveness. However, increasing the robustness and accuracy of DCNNs comes at the price of increased computational cost. As a result, implementing CNNs on embedded devices under real-time constraints is a challenge if the lowest power consumption is to be achieved. A solution to this challenge is to exploit the massive intra-device fine-grained parallelism offered by these systems and to benefit from the extensive concurrency exhibited by DCNN processing pipelines. The trick is to divide intensive tasks into smaller, weakly interacting batches suitable for parallel processing. To that end, this paper has two main goals: 1) describe the implementation of a state-of-the-art technique to map the most intensive DCNN tasks (dominated by multiply-and-accumulate operations) onto Orlando, an ultra-low-power heterogeneous multi-core SoC developed by STMicroelectronics; 2) integrate the proposed implementation into a toolchain that allows deep learning developers to deploy DCNNs in low-power applications.
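The abstract does not detail the Orlando kernels themselves, so the following is only a minimal C sketch of the batching idea it describes: a MAC-dominated, single-channel convolution is split into weakly interacting batches (here, bands of output rows) that run in parallel, one per worker. POSIX threads stand in for the SoC's actual DSP-cluster runtime, and all identifiers (conv_task_t, conv_band, NUM_WORKERS, and the toy sizes) are hypothetical.

/*
 * Illustrative sketch, not the paper's implementation: bands of output
 * rows are computed concurrently, sharing only read-only inputs, so no
 * synchronization is needed inside a batch.
 */
#include <pthread.h>
#include <stdio.h>

#define IN_H 16   /* input plane height */
#define IN_W 16   /* input plane width  */
#define K    3    /* square kernel size */
#define OUT_H (IN_H - K + 1)
#define OUT_W (IN_W - K + 1)
#define NUM_WORKERS 4

typedef struct {
    const float *in;      /* IN_H x IN_W input plane        */
    const float *kernel;  /* K x K filter                   */
    float *out;           /* OUT_H x OUT_W output plane     */
    int row_begin;        /* first output row of this batch */
    int row_end;          /* one past the last output row   */
} conv_task_t;

/* Worker: compute one private band of output rows. */
static void *conv_band(void *arg)
{
    const conv_task_t *t = (const conv_task_t *)arg;
    for (int oy = t->row_begin; oy < t->row_end; ++oy)
        for (int ox = 0; ox < OUT_W; ++ox) {
            float acc = 0.0f;  /* multiply-and-accumulate loop */
            for (int ky = 0; ky < K; ++ky)
                for (int kx = 0; kx < K; ++kx)
                    acc += t->in[(oy + ky) * IN_W + (ox + kx)]
                         * t->kernel[ky * K + kx];
            t->out[oy * OUT_W + ox] = acc;
        }
    return NULL;
}

int main(void)
{
    float in[IN_H * IN_W], kernel[K * K], out[OUT_H * OUT_W];
    for (int i = 0; i < IN_H * IN_W; ++i) in[i] = 1.0f;
    for (int i = 0; i < K * K; ++i) kernel[i] = 1.0f / (K * K);

    pthread_t threads[NUM_WORKERS];
    conv_task_t tasks[NUM_WORKERS];
    int band = (OUT_H + NUM_WORKERS - 1) / NUM_WORKERS;  /* ceil split */

    for (int w = 0; w < NUM_WORKERS; ++w) {
        tasks[w] = (conv_task_t){ in, kernel, out,
                                  w * band, (w + 1) * band };
        if (tasks[w].row_end > OUT_H) tasks[w].row_end = OUT_H;
        pthread_create(&threads[w], NULL, conv_band, &tasks[w]);
    }
    for (int w = 0; w < NUM_WORKERS; ++w)
        pthread_join(threads[w], NULL);

    /* All-ones input with an averaging kernel: every output is 1.0. */
    printf("out[0] = %f (expect 1.0)\n", out[0]);
    return 0;
}

Compiled with a POSIX toolchain (e.g. gcc conv.c -lpthread), this runs as-is; on a heterogeneous multi-core SoC the same row-band partitioning would instead be dispatched to accelerator cores by the vendor runtime.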