GPU-enabled Function-as-a-Service for Machine Learning Inference

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2023-03-09. DOI: 10.1109/IPDPS54959.2023.00096

Ming Zhao, Kritshekhar Jha, Sungho Hong

Citations: 1

Abstract

Function-as-a-Service (FaaS) is emerging as an important cloud computing service model because it can improve the scalability and usability of a wide range of applications, especially machine learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, existing FaaS solutions currently lack GPU support. The event-triggered, short-lived nature of functions poses new challenges to enabling GPUs on FaaS: any solution must account for the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that lets ML inference functions use GPUs efficiently to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it caches ML models in GPU memory to improve the performance of model inference functions, and manages GPU memory globally to improve cache utilization. Third, it co-designs GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and that the proposed locality-aware scheduler achieves a speedup of 48x over the default load-balancing-only scheduler.
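To make the contrast between load-balancing-only and locality-aware scheduling concrete, the following is a minimal toy sketch in Python. It is not the paper's implementation: the `Gpu` class, the LRU eviction policy, the `slack` threshold, the model names and sizes, and the request trace are all assumptions introduced purely for illustration. It only shows why routing a request to a GPU that already caches the model can avoid repeated host-to-GPU model transfers.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical GPU with a memory budget and an LRU cache of models."""
    name: str
    capacity_mb: int
    assigned: int = 0  # requests routed here so far (a crude load proxy)
    cache: OrderedDict = field(default_factory=OrderedDict)  # model -> size

    def used_mb(self) -> int:
        return sum(self.cache.values())

    def load(self, model: str, size_mb: int) -> bool:
        """Make `model` resident in GPU memory; return True on a cache hit."""
        if model in self.cache:
            self.cache.move_to_end(model)  # mark as most recently used
            return True
        # Miss: evict least recently used models until the new one fits.
        while self.cache and self.used_mb() + size_mb > self.capacity_mb:
            self.cache.popitem(last=False)
        self.cache[model] = size_mb
        return False


def pick_load_balanced(gpus, model):
    """Baseline: ignore cache contents and pick the least-loaded GPU."""
    return min(gpus, key=lambda g: g.assigned)


def pick_locality_aware(gpus, model, slack=2):
    """Prefer a GPU that already caches `model`, unless it is more than
    `slack` requests ahead of the least-loaded GPU (assumed heuristic)."""
    least = min(g.assigned for g in gpus)
    holders = [g for g in gpus
               if model in g.cache and g.assigned - least <= slack]
    return min(holders or gpus, key=lambda g: g.assigned)


def run(policy, requests, sizes):
    """Replay `requests` under `policy`; return the number of cache hits."""
    gpus = [Gpu("gpu0", 1000), Gpu("gpu1", 1000)]
    hits = 0
    for model in requests:
        gpu = policy(gpus, model)
        hits += int(gpu.load(model, sizes[model]))
        gpu.assigned += 1
    return hits


# Illustrative values only: each GPU can hold just one of the two models.
SIZES = {"resnet": 600, "bert": 600}
REQUESTS = ["resnet", "bert", "bert", "resnet"] * 3

if __name__ == "__main__":
    print("load-balanced hits: ", run(pick_load_balanced, REQUESTS, SIZES))
    print("locality-aware hits:", run(pick_locality_aware, REQUESTS, SIZES))
```

On this toy trace the baseline alternates between the two GPUs and evicts a model on every request, while the locality-aware policy settles each model on one GPU and serves 10 of the 12 requests from cache. The `slack` parameter stands in for the trade-off the abstract describes between cache hits and keeping GPU cores busy; a real scheduler would use measured load rather than a request counter.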
