Enabling scalable and adaptive machine learning training via serverless computing on public cloud

Authors: Ahsan Ali, Xiaolong Ma, Syed Zawad, Paarijaat Aditya, Istemi Ekin Akkus, Ruichuan Chen, Lei Yang, Feng Yan
Journal: Performance Evaluation, Volume 167, Article 102451 (published 2024-11-06)
DOI: 10.1016/j.peva.2024.102451
URL: https://www.sciencedirect.com/science/article/pii/S0166531624000567
Abstract
In today’s production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of various tasks with dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms poses non-trivial challenges due to their intrinsic design limitations, such as their stateless nature, limited communication support across function instances, and limited function execution duration. These limitations result in a lack of an overarching view and adaptation mechanism for training dynamics, and an amplification of existing problems in ML workflows.
To address the above challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework on public cloud to enable efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling of ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting user-specified training deadlines and budget limits. In addition, by providing an end-to-end design, SMLT solves intrinsic problems of public cloud serverless platforms such as communication overhead, limited function execution duration, and the need for repeated initialization, and also provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms the state-of-the-art VM-based systems and existing public cloud serverless ML training frameworks in both training speed (up to 8×) and monetary cost (up to 3×).
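To make the abstract's notion of deadline- and budget-aware adaptive scheduling concrete, the sketch below shows one way such a decision could be made: pick the number of serverless workers whose estimated runtime meets the user's deadline while the total cost stays within budget. This is a minimal, hypothetical illustration written for this summary; it is not SMLT's actual API, and the class, function, and parameter names (JobSpec, pick_workers, comm_overhead, cost_per_worker_s) as well as the simple linear overhead model are assumptions.

```python
# Hypothetical sketch (not SMLT's actual API): choose a worker count for
# serverless training under a user-specified deadline and budget, in the
# spirit of the adaptive scheduling described in the abstract.
from dataclasses import dataclass


@dataclass
class JobSpec:
    total_work_s: float       # estimated single-worker training time (seconds)
    deadline_s: float         # user-specified wall-clock deadline (seconds)
    budget_usd: float         # user-specified monetary budget (USD)
    cost_per_worker_s: float  # price of one serverless worker per second (USD)
    comm_overhead: float      # per-additional-worker synchronization overhead factor


def pick_workers(job: JobSpec, max_workers: int = 256) -> int:
    """Return the smallest worker count meeting the deadline within budget,
    or 0 if no feasible configuration exists under this simple model."""
    for n in range(1, max_workers + 1):
        # Ideal speedup of n, discounted by communication overhead that
        # grows with the number of workers (assumed linear model).
        runtime = job.total_work_s / n * (1 + job.comm_overhead * (n - 1))
        cost = runtime * n * job.cost_per_worker_s
        if runtime <= job.deadline_s and cost <= job.budget_usd:
            return n
    return 0


if __name__ == "__main__":
    spec = JobSpec(total_work_s=36_000, deadline_s=3_600,
                   budget_usd=50.0, cost_per_worker_s=0.0001,
                   comm_overhead=0.02)
    print("workers:", pick_workers(spec))  # illustrative numbers only
```

In a real system such a decision would be revisited continuously as training dynamics change, which is the adaptive aspect the paper emphasizes; the sketch only captures a single static decision.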
About the journal:
Performance Evaluation functions as a leading journal in the area of modeling, measurement, and evaluation of performance aspects of computing and communication systems. As such, it aims to present a balanced and complete view of the entire Performance Evaluation profession. Hence, the journal is interested in papers that focus on one or more of the following dimensions:
-Define new performance evaluation tools, including measurement and monitoring tools as well as modeling and analytic techniques
-Provide new insights into the performance of computing and communication systems
-Introduce new application areas where performance evaluation tools can play an important role and creative new uses for performance evaluation tools.
More specifically, common application areas of interest include the performance of:
-Resource allocation and control methods and algorithms (e.g. routing and flow control in networks, bandwidth allocation, processor scheduling, memory management)
-System architecture, design and implementation
-Cognitive radio
-VANETs
-Social networks and media
-Energy efficient ICT
-Energy harvesting
-Data centers
-Data centric networks
-System reliability
-System tuning and capacity planning
-Wireless and sensor networks
-Autonomic and self-organizing systems
-Embedded systems
-Network science