Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure

Q3 Computer Science Operating Systems Review (ACM) Pub Date : 2022-01-01 DOI:10.1145/3544497.3544501

Yuting Jiang, Yifan Xiong, L. Qu, Cheng Luo, Chen Tian, Peng Cheng, Y. Xiong

Operating Systems Review (ACM), volume 56, issue 1, pages 18-25. Publication type: Journal Article. https://doi.org/10.1145/3544497.3544501

Citations: 2

Abstract



Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. Real-time monitoring of the infrastructure and workload profiling have proved empirically to be effective ways to improve its efficiency and serviceability. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so previous instrumentation-based monitoring approaches cannot be applied, nor can workload traces be collected. In this paper, we propose Moneo, a non-intrusive, cloud-friendly monitoring system for AI infrastructure. Moneo intelligently collects key architecture-level metrics at finer granularity in real time without instrumenting or tracing the workloads, and it has been deployed in a real production cloud, Azure. We analyze the results reported by Moneo for typical large-scale distributed AI workloads from real deployment. The results demonstrate that Moneo can effectively help service providers understand the real resource usage patterns and networking requirements of various AI workloads. These findings help improve the efficiency of cloud infrastructure and optimize the software stack in light of the characteristic resource usage requirements of different AI workloads. This is a revised version of the symposium paper [23] originally presented at IEEE ICC 2022.
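The abstract does not say which counters Moneo reads or how it reads them, so the following is only a minimal sketch of the general idea of non-intrusive monitoring: sampling host-level counters from outside the tenant workload rather than instrumenting it. It assumes a Linux host and uses the kernel's `/proc/net/dev` interface as a stand-in metric source. The function names `parse_net_dev`, `rates`, and `sample` are hypothetical and are not part of Moneo.

```python
from typing import Dict, Tuple

# interface name -> (received bytes, transmitted bytes)
Counters = Dict[str, Tuple[int, int]]


def parse_net_dev(text: str) -> Counters:
    """Parse the contents of /proc/net/dev into per-interface byte counters."""
    counters: Counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, fields = line.split(":", 1)
        values = fields.split()
        if len(values) < 16:
            continue  # malformed line, skip it
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(values[0]), int(values[8]))
    return counters


def rates(prev: Counters, curr: Counters, interval_s: float) -> Dict[str, Tuple[float, float]]:
    """Turn two cumulative snapshots into (rx, tx) bytes per second."""
    out: Dict[str, Tuple[float, float]] = {}
    for name, (rx, tx) in curr.items():
        if name in prev:
            prev_rx, prev_tx = prev[name]
            out[name] = ((rx - prev_rx) / interval_s, (tx - prev_tx) / interval_s)
    return out


def sample(path: str = "/proc/net/dev") -> Counters:
    """Read one snapshot from the host; the workload process is never touched."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

A monitoring agent built this way would call `sample()` at a fixed interval and pass consecutive snapshots to `rates()`. Because the counters are maintained by the kernel, no tenant code or user data is read.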


Source journal

Operating Systems Review (ACM) Computer Science-Computer Networks and Communications

CiteScore: 2.80

Self-citation rate: 0.00%

Journal description: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.

Latest articles from this journal

Disaggregated GPU Acceleration for Serverless Applications

Navigating Performance-Efficiency Tradeoffs in Serverless Computing: Deduplication to the Rescue!

Using Local Cache Coherence for Disaggregated Memory Systems

Make It Real: An End-to-End Implementation of A Physically Disaggregated Data Center

Memory disaggregation: why now and what are the challenges