Opportunities and Challenges for Resource Management and Machine Learning Clusters

L. Chen
{"title":"Opportunities and Challenges for Resource Management and Machine Learning Clusters","authors":"L. Chen","doi":"10.1145/3368235.3369376","DOIUrl":null,"url":null,"abstract":"The practice of collecting big performance data has changed how infrastructure providers model and manage the system in the past decade. There is a methodology shift from domain-knowledge based white-box models, e.g., queueing [1] and simulation[2], to black-box data-driven models, e.g., machine learning. Such a game change for resource management from workload characterization[3], dependability prediction [4,5] to sprinting policy[6], can be seen from major IT infastructure providers, e.g., IBM and Google. While applying higher order deep neural networks show promises in predicting performance [4,5], the scalability of such an approach is often limited. A plethoral of prior work focus on deriving complex and highly accurate models, such as deep neural networks, overlooking the constraints of computation efficiency and the scalability. Their applicability on resource management problems of the production systems is thus hindered. A crucial aspect to derive accurate and scalable predictive performance models lies on leveraging the domain expertise, white-box models, and black-box models. Examples of scalable ticket management services from IBM [4] and predicting job failures [5] at Google. Model driven computation sprinting [6] dynamically scales the frequency and the allocation of computing cores based on grey box models which outperforms deep neural networks. Aforementioned case studies strongly argue for the importance of combing domain-driven and data-driven models At the same time, various of acceleration techniques are developed to reduce the computation overhead of (deep) machine learning models in small scale and isolated testbed. Managing the performance of clusters that are dominated by machine learning workloads remains challenging and calls for novel solutions. 
SlimML [9] accelerates the ML modeli training time by only processing critical data set at a slight cost of accuracy, whereas Dias [7] simultaneously explores the data dropping and frequency sprinting for ML clusters that support multiple priorities of different training workloads. Aforementioned studies point out the complexity of managing the accuracy-efficiency tradeoff of ML jobs in a cluster-like environment where jobs interfere each other via sharing the underlying resources and common data sets.","PeriodicalId":166357,"journal":{"name":"Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing Companion","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3368235.3369376","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The practice of collecting big performance data has changed how infrastructure providers model and manage their systems over the past decade. There has been a methodology shift from domain-knowledge-based white-box models, e.g., queueing [1] and simulation [2], to black-box data-driven models, e.g., machine learning. This game change in resource management, spanning workload characterization [3], dependability prediction [4,5], and sprinting policies [6], can be seen at major IT infrastructure providers, e.g., IBM and Google.

While applying higher-order deep neural networks shows promise in predicting performance [4,5], the scalability of such an approach is often limited. A plethora of prior work focuses on deriving complex and highly accurate models, such as deep neural networks, overlooking the constraints of computational efficiency and scalability. Their applicability to resource management problems in production systems is thus hindered. A crucial aspect of deriving accurate and scalable predictive performance models lies in leveraging domain expertise, white-box models, and black-box models together. Examples include scalable ticket management services at IBM [4] and job failure prediction at Google [5]. Model-driven computation sprinting [6] dynamically scales the frequency and the allocation of computing cores based on grey-box models, which outperform deep neural networks. The aforementioned case studies strongly argue for the importance of combining domain-driven and data-driven models. At the same time, various acceleration techniques have been developed to reduce the computational overhead of (deep) machine learning models in small-scale and isolated testbeds. Managing the performance of clusters dominated by machine learning workloads remains challenging and calls for novel solutions.
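The grey-box idea above, combining a white-box model with a data-driven correction, can be sketched as a queueing estimate plus a learned residual. Everything below (the M/M/1 latency formula as the white-box part, the running-mean residual as the black-box part, and all class and parameter names) is an illustrative assumption for this sketch, not the actual model from [6]:

```python
class GreyBoxLatencyModel:
    """Grey-box latency predictor: M/M/1 queueing estimate plus a
    residual correction learned from observed latencies."""

    def __init__(self, service_rate):
        self.service_rate = service_rate  # mu: jobs/sec one server handles
        self.residual = 0.0               # learned correction term
        self.n_obs = 0                    # number of observations so far

    def white_box(self, arrival_rate):
        """M/M/1 mean response time: 1 / (mu - lambda)."""
        if arrival_rate >= self.service_rate:
            return float("inf")           # queue is unstable
        return 1.0 / (self.service_rate - arrival_rate)

    def predict(self, arrival_rate):
        """Grey-box estimate = queueing model + learned residual."""
        return self.white_box(arrival_rate) + self.residual

    def observe(self, arrival_rate, measured_latency):
        """Fold one measured (arrival rate, latency) pair into the
        residual as a running mean of white-box prediction errors."""
        err = measured_latency - self.white_box(arrival_rate)
        self.n_obs += 1
        self.residual += (err - self.residual) / self.n_obs


model = GreyBoxLatencyModel(service_rate=10.0)
model.observe(arrival_rate=8.0, measured_latency=0.6)  # white-box alone: 0.5
print(model.predict(8.0))  # grey-box estimate now carries a +0.1 residual
```

The appeal of this shape, as the case studies above suggest, is that the white-box part extrapolates to unseen load levels while the cheap black-box part absorbs effects the queueing model misses, without the training cost of a deep network.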
SlimML [9] accelerates ML model training by processing only the critical subset of the data, at a slight cost in accuracy, whereas Dias [7] simultaneously explores data dropping and frequency sprinting for ML clusters that support training workloads of multiple priorities. The aforementioned studies point out the complexity of managing the accuracy-efficiency tradeoff of ML jobs in a cluster-like environment, where jobs interfere with each other by sharing the underlying resources and common data sets.
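A rough illustration of critical-data filtering in the spirit of SlimML [9]: each step updates the model only on the samples with the largest current loss, trading a little accuracy for fewer gradient computations. The linear model, squared-error loss, and top-k selection rule here are assumptions made for the sketch, not SlimML's actual mechanism:

```python
def critical_subset(xs, ys, w, k):
    """Return indices of the k samples with the largest squared error
    under the current weight w (the 'critical' data)."""
    losses = [(y - w * x) ** 2 for x, y in zip(xs, ys)]
    return sorted(range(len(xs)), key=lambda i: losses[i], reverse=True)[:k]

def train_step(xs, ys, w, k, lr=0.01):
    """One gradient step for a 1-D linear model y ~ w*x, computed on the
    critical subset only instead of the full data set."""
    idx = critical_subset(xs, ys, w, k)
    grad = sum(-2 * xs[i] * (ys[i] - w * xs[i]) for i in idx) / len(idx)
    return w - lr * grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]   # roughly y = 2x
w = 0.0
for _ in range(200):
    w = train_step(xs, ys, w, k=2)  # each step touches only half the data
print(w)  # settles near w = 2, despite never using the full batch
```

Even this toy version shows the tradeoff the abstract describes: the subset changes as the model improves, so which data is "critical" is itself workload-dependent, and in a shared cluster that selection competes with co-located jobs for the same data and resources.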