{"title":"Layercake:基于云和移动资源的高效推理服务","authors":"Samuel S. Ogden, Tian Guo","doi":"10.1109/CCGrid57682.2023.00027","DOIUrl":null,"url":null,"abstract":"Many mobile applications are now integrating deep learning models into their core functionality. These functionalities have diverse latency requirements while demanding high-accuracy results. Currently, mobile applications statically decide to use either in-cloud inference, relying on a fast and consistent network, or on-device execution, relying on sufficient local resources. However, neither mobile networks nor computation resources deliver consistent performance in practice. Consequently, mobile inference often experiences variable performance or struggles to meet performance goals, when inference execution decisions are not made dynamically. In this paper, we introduce Layer Cake, a deep-learning inference framework that dynamically selects the best model and location for executing inferences. Layercake accomplishes this by tracking model state and availability, both locally and remotely, as well as the network bandwidth, allowing for accurate estimations of model response time. By doing so, Layercake achieves latency targets in up to 96.4% of cases, which is an improvement of 16.7% over similar systems, while decreasing the cost of cloud-based resources by over 68.33% than in-cloud inference.","PeriodicalId":363806,"journal":{"name":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Layercake: Efficient Inference Serving with Cloud and Mobile Resources\",\"authors\":\"Samuel S. Ogden, Tian Guo\",\"doi\":\"10.1109/CCGrid57682.2023.00027\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Many mobile applications are now integrating deep learning models into their core functionality. These functionalities have diverse latency requirements while demanding high-accuracy results. Currently, mobile applications statically decide to use either in-cloud inference, relying on a fast and consistent network, or on-device execution, relying on sufficient local resources. However, neither mobile networks nor computation resources deliver consistent performance in practice. Consequently, mobile inference often experiences variable performance or struggles to meet performance goals, when inference execution decisions are not made dynamically. In this paper, we introduce Layer Cake, a deep-learning inference framework that dynamically selects the best model and location for executing inferences. Layercake accomplishes this by tracking model state and availability, both locally and remotely, as well as the network bandwidth, allowing for accurate estimations of model response time. 
By doing so, Layercake achieves latency targets in up to 96.4% of cases, which is an improvement of 16.7% over similar systems, while decreasing the cost of cloud-based resources by over 68.33% than in-cloud inference.\",\"PeriodicalId\":363806,\"journal\":{\"name\":\"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)\",\"volume\":\"85 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGrid57682.2023.00027\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid57682.2023.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Layercake: Efficient Inference Serving with Cloud and Mobile Resources
Many mobile applications now integrate deep learning models into their core functionality. These functionalities have diverse latency requirements while demanding high-accuracy results. Currently, mobile applications statically decide between in-cloud inference, which relies on a fast and consistent network, and on-device execution, which relies on sufficient local resources. In practice, however, neither mobile networks nor on-device computation resources deliver consistent performance. Consequently, when inference execution decisions are not made dynamically, mobile inference often experiences variable performance or fails to meet its performance goals. In this paper, we introduce Layercake, a deep-learning inference framework that dynamically selects the best model and execution location for each inference. Layercake accomplishes this by tracking model state and availability, both locally and remotely, as well as network bandwidth, allowing it to accurately estimate model response times. By doing so, Layercake achieves latency targets in up to 96.4% of cases, an improvement of 16.7% over similar systems, while reducing the cost of cloud-based resources by over 68.33% compared to in-cloud inference.
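The abstract describes Layercake's core mechanism: estimate each candidate model's response time from its load state, its execution location, and the measured network bandwidth, then pick an option that meets the latency target. The sketch below illustrates one way such selection logic could work; all names, fields, and the linear transfer-time formula are illustrative assumptions, not the paper's actual API or cost model.

```python
# Hypothetical sketch of Layercake-style model/location selection.
# Every identifier and formula here is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class ModelOption:
    name: str
    location: str      # "device" or "cloud"
    accuracy: float    # expected accuracy of this model
    compute_ms: float  # estimated inference time once the model is loaded
    load_ms: float     # remaining model-load time (0 if already resident)
    input_bytes: int   # request payload size for remote execution

def estimated_latency_ms(opt: ModelOption, bandwidth_bps: float) -> float:
    """Estimate end-to-end response time for one option.

    Cloud options pay a network-transfer cost derived from the measured
    bandwidth; every option pays any outstanding model-load time plus
    its compute time.
    """
    transfer_ms = 0.0
    if opt.location == "cloud":
        transfer_ms = opt.input_bytes * 8 / bandwidth_bps * 1000.0
    return transfer_ms + opt.load_ms + opt.compute_ms

def select_option(options, bandwidth_bps, latency_target_ms):
    """Pick the most accurate option whose estimate meets the latency
    target; fall back to the fastest option if none does."""
    feasible = [o for o in options
                if estimated_latency_ms(o, bandwidth_bps) <= latency_target_ms]
    if feasible:
        return max(feasible, key=lambda o: o.accuracy)
    return min(options, key=lambda o: estimated_latency_ms(o, bandwidth_bps))

if __name__ == "__main__":
    options = [
        ModelOption("small-on-device", "device", 0.71, 40.0, 0.0, 0),
        ModelOption("large-in-cloud", "cloud", 0.76, 15.0, 0.0, 150_000),
    ]
    # At ~10 Mbps the cloud transfer alone costs ~120 ms, so a 100 ms
    # target falls back to the on-device model; at higher bandwidth the
    # more accurate cloud model would be chosen instead.
    choice = select_option(options, bandwidth_bps=10e6, latency_target_ms=100.0)
    print(choice.name)  # -> small-on-device
```

This captures the trade-off the abstract highlights: a static policy would commit to one row of `options` up front, while re-running the estimate as bandwidth and model availability change lets the system keep hitting latency targets without paying for the cloud when it is not needed.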