Neural network pruning and hardware acceleration

Taehee Jeong, Ehsam Ghasemi, Jorn Tuyls, Elliott Delaye, Ashish Sirasao

2020 IEEE/ACM 13th International Conference on Utility and Cloud Computing (UCC), December 2020
DOI: 10.1109/UCC48980.2020.00069
Neural network pruning is a critical technique for efficiently deploying neural network models on edge devices with limited computing resources. Although many neural network pruning methods have been published, such algorithms are difficult to implement due to their inherent complexity. In this work, we propose a functional pruning tool for neural network models. Our pruning procedure is simple, easy to implement, and efficient to deploy. The tool automatically detects redundancy inside a neural network model and prunes the redundant channels, reducing the total number of model parameters and hence compressing the model. This significantly reduces the number of FLOPs required to execute the model and shortens the inference runtime. To further improve the inference runtime of the pruned model, we leveraged Apache TVM to deploy it on the DPU, an FPGA-based hardware accelerator. To demonstrate our approach, we pruned a VGG-16 model on the Flower dataset and achieved a 53-fold reduction in model size with only a 7% drop in validation accuracy. Compared with the base model running on a CPU, the pruned model's inference latency is reduced 4-fold on the CPU and 16-fold on the FPGA.
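The abstract describes channel-level pruning: redundant channels are detected and removed, which shrinks both the parameter count and the FLOPs. The paper does not spell out its redundancy criterion, so the sketch below illustrates the general idea with a common stand-in, L1-norm (magnitude) ranking of convolution channels in PyTorch; the function name prune_conv_channels and the keep ratio are hypothetical.

    # Minimal sketch of L1-norm channel pruning in PyTorch -- an illustrative
    # stand-in, not the paper's actual redundancy-detection criterion.
    import torch
    import torch.nn as nn

    def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float = 0.5) -> nn.Conv2d:
        """Return a thinner Conv2d keeping the channels with the largest L1 norm."""
        # Score each output channel by the L1 norm of its filter weights.
        scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))
        n_keep = max(1, int(conv.out_channels * keep_ratio))
        keep = torch.argsort(scores, descending=True)[:n_keep]

        # Copy the surviving filters (and biases) into a smaller layer.
        pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                           stride=conv.stride, padding=conv.padding,
                           bias=conv.bias is not None)
        pruned.weight.data = conv.weight.data[keep].clone()
        if conv.bias is not None:
            pruned.bias.data = conv.bias.data[keep].clone()
        return pruned

In a full network the following layer's input channels (and any BatchNorm statistics) must be sliced to match the surviving channels; that bookkeeping across layers is precisely what the paper's automated tool handles.

For the deployment step, the abstract mentions compiling the pruned model with Apache TVM. The DPU integration flow is not described in the abstract, so the sketch below shows only the generic TVM flow for a traced PyTorch model on a plain CPU ("llvm") target; the DPU would use a different TVM target and codegen, which is left as an assumption here.

    # Hedged sketch: importing a (pruned) PyTorch model into Apache TVM and
    # running it through the graph executor. A CPU target stands in for the
    # paper's DPU flow.
    import torch
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor
    from torchvision.models import vgg16

    model = vgg16(weights=None).eval()        # stand-in for the pruned VGG-16
    example = torch.randn(1, 3, 224, 224)
    scripted = torch.jit.trace(model, example)

    # Convert TorchScript to Relay, then compile for the chosen target.
    mod, params = relay.frontend.from_pytorch(scripted, [("input0", example.shape)])
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm", params=params)

    # Execute the compiled module and fetch the classification output.
    dev = tvm.cpu()
    rt = graph_executor.GraphModule(lib["default"](dev))
    rt.set_input("input0", tvm.nd.array(example.numpy()))
    rt.run()
    logits = rt.get_output(0).numpy()

Because pruning physically removes channels rather than merely zeroing weights, the compiled module executes fewer operations outright, which is why the latency gains carry over to both the CPU and the FPGA backend.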