GPU-enabled Function-as-a-Service for Machine Learning Inference

Ming Zhao, Kritshekhar Jha, Sungho Hong
{"title":"GPU-enabled Function-as-a-Service for Machine Learning Inference","authors":"Ming Zhao, Kritshekhar Jha, Sungho Hong","doi":"10.1109/IPDPS54959.2023.00096","DOIUrl":null,"url":null,"abstract":"Function-as-a-Service (FaaS) is emerging as an important cloud computing service model as it can improve the scalability and usability of a wide range of applications, especially Machine-Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks heavily rely on GPUs to achieve high performance; however, support for GPUs is currently lacking in the existing FaaS solutions. The unique event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must consider the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it provides caching of ML models in GPU memory to improve the performance of model inference functions and global management of GPU memories to improve cache utilization. Third, it offers co-designed GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a speedup of 48x compared to the default, load balancing only schedulers.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00096","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Function-as-a-Service (FaaS) is emerging as an important cloud computing service model because it can improve the scalability and usability of a wide range of applications, especially Machine Learning (ML) inference tasks that require scalable resources and complex software configurations. These inference tasks rely heavily on GPUs to achieve high performance; however, GPU support is currently lacking in existing FaaS solutions. The uniquely event-triggered and short-lived nature of functions poses new challenges to enabling GPUs on FaaS, which must account for the overhead of transferring data (e.g., ML model parameters and inputs/outputs) between GPU and host memory. This paper proposes a novel GPU-enabled FaaS solution that enables ML inference functions to efficiently utilize GPUs to accelerate their computations. First, it extends existing FaaS frameworks such as OpenFaaS to support the scheduling and execution of functions across GPUs in a FaaS cluster. Second, it caches ML models in GPU memory to improve the performance of model inference functions and globally manages GPU memory to improve cache utilization. Third, it co-designs GPU function scheduling and cache management to optimize the performance of ML inference functions. Specifically, the paper proposes locality-aware scheduling, which maximizes the utilization of both GPU memory for cache hits and GPU cores for parallel processing. A thorough evaluation based on real-world traces and ML models shows that the proposed GPU-enabled FaaS works well for ML inference tasks, and the proposed locality-aware scheduler achieves a speedup of 48x over the default, load-balancing-only scheduler.
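The core idea described in the abstract, routing an inference function to a GPU that already holds its model in memory (a cache hit, which avoids transferring model parameters from host to GPU memory) while still spreading load across GPU cores, can be illustrated with a small sketch. This is not the paper's implementation: the GPUWorker and LocalityAwareScheduler classes, the queued_requests load proxy, the max_queue_per_gpu threshold, and the memory sizes below are all hypothetical stand-ins chosen only to make the scheduling logic concrete.

```python
class GPUWorker:
    """One GPU in the FaaS cluster, with a simple model cache (hypothetical structure)."""

    def __init__(self, worker_id, mem_capacity_mb):
        self.worker_id = worker_id
        self.mem_capacity_mb = mem_capacity_mb
        self.cached_models = {}   # model name -> size in MB currently resident in GPU memory
        self.queued_requests = 0  # crude proxy for how busy this GPU's cores are

    def cache_free_mb(self):
        return self.mem_capacity_mb - sum(self.cached_models.values())


class LocalityAwareScheduler:
    """Sketch of locality-aware scheduling: prefer a GPU that already caches the
    requested model, but fall back to plain load balancing when no cached copy
    exists or every GPU holding the model is overloaded."""

    def __init__(self, workers, max_queue_per_gpu=8):
        self.workers = workers
        self.max_queue_per_gpu = max_queue_per_gpu  # assumed overload threshold

    def schedule(self, model_name, model_size_mb):
        # 1. Cache hit: pick the least-loaded GPU that already holds the model.
        hits = [w for w in self.workers
                if model_name in w.cached_models
                and w.queued_requests < self.max_queue_per_gpu]
        if hits:
            target = min(hits, key=lambda w: w.queued_requests)
        else:
            # 2. Cache miss (or all hits overloaded): fall back to the least-loaded
            #    GPU and install the model in its cache if space allows
            #    (eviction of cold models is omitted in this sketch).
            target = min(self.workers, key=lambda w: w.queued_requests)
            if target.cache_free_mb() >= model_size_mb:
                target.cached_models[model_name] = model_size_mb
        target.queued_requests += 1
        return target.worker_id


workers = [GPUWorker(f"gpu-{i}", mem_capacity_mb=16000) for i in range(4)]
sched = LocalityAwareScheduler(workers)
print(sched.schedule("resnet50", 100))  # miss: load-balanced choice, model then cached
print(sched.schedule("resnet50", 100))  # hit: routed back to the GPU caching the model
```

The sketch only captures the trade-off the abstract names: favoring GPU-memory locality (cache hits) while keeping GPU cores busy via a load threshold. The paper's actual scheduler, cache-eviction policy, and global GPU-memory management are more involved.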