Processor Pipelining Method for Efficient Deep Neural Network Inference on Embedded Devices
Akshay Parashar, Arun Abraham, Deepak Chaudhary, V. N. Rajendiran
2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2020. DOI: 10.1109/HiPC50609.2020.00022
Abstract: Myriad applications of Deep Neural Networks (DNNs) and the race for better accuracy have paved the way for the development of more computationally intensive network architectures. Executing these heavy networks on embedded devices requires highly efficient real-time DNN inference frameworks. However, the sequential architecture of popular DNNs makes it difficult to parallelize their operations across different processors. We propose a novel pipelining method that is pluggable on top of conventional inference frameworks and capable of parallelizing DNN inference on heterogeneous processors without impacting accuracy. We partition the network into subnets by estimating the optimal split points and pipeline these subnets across multiple processors. The results show that the proposed method achieves up to 68% improvement in the frames per second (FPS) rate of popular network architectures such as VGG19, DenseNet-121, and ResNet-152. Moreover, we show that our method can extract even more performance out of high-performance chipsets by better utilizing the capabilities of their AI processor ecosystem. We also show that our method can easily be extended to lower-performance chipsets, where this additional performance gain is crucial for deploying real-time AI applications. Our results show a performance improvement of up to 47% in the FPS rate on these chipsets without the need for specialized AI hardware.
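The abstract only sketches the method at a high level. The following is a minimal, hypothetical illustration of the core idea rather than the authors' implementation: a network is split into subnets at a point that balances the estimated per-stage latency on each processor, and the subnets are then run as a two-stage producer/consumer pipeline. All names (choose_split, pipeline, the per-layer cost lists) and the simple cost model are invented for illustration; the paper's actual split-point estimation and scheduling may differ.

```python
# Hypothetical sketch of pipelined DNN inference across two processors.
# NOT the paper's implementation; names and the cost model are assumptions.
import queue
import threading

def choose_split(layer_costs_proc1, layer_costs_proc2):
    """Pick the split index that best balances the two pipeline stages.

    layer_costs_proc1[i] / layer_costs_proc2[i]: estimated latency of
    layer i on each processor. Stage 1 runs layers [0, k) on processor 1,
    stage 2 runs layers [k, N) on processor 2. Pipeline throughput is
    limited by the slower stage, so we minimize the maximum stage latency.
    """
    n = len(layer_costs_proc1)
    best_k, best_bottleneck = 1, float("inf")
    for k in range(1, n):
        stage1 = sum(layer_costs_proc1[:k])
        stage2 = sum(layer_costs_proc2[k:])
        bottleneck = max(stage1, stage2)
        if bottleneck < best_bottleneck:
            best_k, best_bottleneck = k, bottleneck
    return best_k

def pipeline(frames, subnet1, subnet2):
    """Run two subnets as a producer/consumer pipeline on separate threads."""
    q = queue.Queue(maxsize=2)   # bounded queue limits in-flight intermediates
    results = []

    def stage1():
        for frame in frames:
            q.put(subnet1(frame))      # e.g. the subnet dispatched to processor 1
        q.put(None)                    # sentinel: no more frames

    def stage2():
        while True:
            activations = q.get()
            if activations is None:
                break
            results.append(subnet2(activations))  # e.g. the subnet on processor 2

    t1 = threading.Thread(target=stage1)
    t2 = threading.Thread(target=stage2)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

if __name__ == "__main__":
    # Toy per-layer latency estimates (ms) for a 6-layer network.
    proc1 = [4, 4, 3, 2, 2, 1]
    proc2 = [2, 2, 1, 1, 1, 1]
    print("split after layer", choose_split(proc1, proc2))
```

Under this reading of the method, per-frame latency is roughly unchanged (each frame still traverses every layer), but throughput improves because successive frames occupy both processors concurrently, which is consistent with the FPS gains reported in the abstract.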