{"title":"WidePipe:基于神经处理单元集群的高吞吐量深度学习推理系统","authors":"Lixian Ma, En Shao, Yueyuan Zhou, Guangming Tan","doi":"10.1109/ICCD53106.2021.00091","DOIUrl":null,"url":null,"abstract":"The wide application of machine learning technology promotes the generation of ML-as-a-Service(MLaaS), which is a serverless computing paradigm for rapidly deploying a trained model as a serving. However, it is a challenge to design an inference system that is capable of coping with large traffic for low latency and heterogeneous neural networks. It is difficult to adaptively configure multilevel parallelism in existing cloud inference systems for machine learning servings, particularly if the cluster has accelerators, such as GPUs, NPUs, FPGAs, etc. These issues lead to poor resource utilization and limit the system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which WidePipe leverages reinforcement learning to co-adapt resource allocation and batch size of request according to device status. We evaluated the performance of WidePipe for a large cluster with 1000 neural processing units in 250 nodes. Our experimental results show that WidePipe has a 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning servings, meeting the service-level objectives for the response time.","PeriodicalId":154014,"journal":{"name":"2021 IEEE 39th International Conference on Computer Design (ICCD)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"WidePipe: High-Throughput Deep Learning Inference System on a Cluster of Neural Processing Units\",\"authors\":\"Lixian Ma, En Shao, Yueyuan Zhou, Guangming Tan\",\"doi\":\"10.1109/ICCD53106.2021.00091\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The wide application of machine learning technology promotes the generation of ML-as-a-Service(MLaaS), which is a serverless computing paradigm for rapidly deploying a trained model as a serving. However, it is a challenge to design an inference system that is capable of coping with large traffic for low latency and heterogeneous neural networks. It is difficult to adaptively configure multilevel parallelism in existing cloud inference systems for machine learning servings, particularly if the cluster has accelerators, such as GPUs, NPUs, FPGAs, etc. These issues lead to poor resource utilization and limit the system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which WidePipe leverages reinforcement learning to co-adapt resource allocation and batch size of request according to device status. We evaluated the performance of WidePipe for a large cluster with 1000 neural processing units in 250 nodes. 
Our experimental results show that WidePipe has a 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning servings, meeting the service-level objectives for the response time.\",\"PeriodicalId\":154014,\"journal\":{\"name\":\"2021 IEEE 39th International Conference on Computer Design (ICCD)\",\"volume\":\"34 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE 39th International Conference on Computer Design (ICCD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCD53106.2021.00091\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE 39th International Conference on Computer Design (ICCD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCD53106.2021.00091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The wide application of machine learning has promoted the emergence of ML-as-a-Service (MLaaS), a serverless computing paradigm for rapidly deploying a trained model as a service. However, designing an inference system that can handle heavy traffic at low latency for heterogeneous neural networks is challenging. Existing cloud inference systems struggle to adaptively configure multilevel parallelism for machine learning services, particularly when the cluster contains accelerators such as GPUs, NPUs, and FPGAs. These issues lead to poor resource utilization and limit system throughput. In this paper, we propose and implement a high-throughput inference system called WidePipe, which leverages reinforcement learning to co-adapt resource allocation and request batch size according to device status. We evaluated WidePipe on a large cluster with 1000 neural processing units across 250 nodes. Our experimental results show that WidePipe achieves 2.11× higher throughput than current inference systems when deploying heterogeneous machine learning services, while meeting the service-level objectives for response time.
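
The abstract describes a reinforcement-learning controller that jointly tunes request batch size and resource allocation based on device status. The sketch below is only a minimal illustration of that idea under assumed definitions, not WidePipe's actual controller: the state (queue-depth and utilization buckets), the action set of (batch size, NPU count) pairs, the throughput-minus-SLO-penalty reward, and the 100 ms latency objective are all hypothetical choices made for this example.

```python
# Illustrative sketch only: state, actions, reward, and SLO below are assumptions,
# not the controller published in the WidePipe paper.
import random
from collections import defaultdict

BATCH_SIZES = [1, 4, 8, 16, 32]          # candidate request batch sizes (assumed)
NPU_COUNTS = [1, 2, 4, 8]                # candidate NPUs allocated to one serving (assumed)
ACTIONS = [(b, n) for b in BATCH_SIZES for n in NPU_COUNTS]

LATENCY_SLO_MS = 100.0                   # hypothetical response-time service-level objective


def observe_state(queue_depth, npu_util):
    """Discretize device status into a small state: (queue bucket, utilization bucket)."""
    return (min(queue_depth // 16, 4), int(npu_util * 10) // 2)


def reward(throughput, p99_latency_ms):
    """Reward throughput, but penalize heavily when the latency SLO is violated."""
    penalty = 10.0 * max(0.0, p99_latency_ms - LATENCY_SLO_MS)
    return throughput - penalty


class QLearningScheduler:
    """Tabular Q-learning agent that co-adapts batch size and NPU allocation."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)      # Q[(state, action)] -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:                       # explore
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])    # exploit

    def update(self, state, action, r, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = r + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])
```

In such a setup, a dispatcher would call choose() once per scheduling interval, apply the selected (batch size, NPU count), measure throughput and tail latency over that interval, and feed the result back through update(); the actual WidePipe scheduling policy and reward formulation are described in the paper itself.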