Exploration of GPU sharing policies under GEMM workloads

Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems. Pub Date: 2020-05-25. DOI: 10.1145/3378678.3391887

Ioannis Oroutzoglou, Dimosthenis Masouros, Konstantina Koliogeorgi, S. Xydis, D. Soudris


Citations: 1

Abstract



Cloud computing has recently seen explosive growth, owing to the flexibility and scalability it offers. Ever-increasing computational demands, especially from the machine learning domain, have forced cloud operators to enhance their infrastructure with acceleration devices such as general-purpose GPUs (GPGPUs) or FPGAs. Although multi-tenancy has been widely examined for conventional CPUs, this is not the case for accelerators. Current solutions support "one accelerator per user" schemes, which can lead to both under-utilization and starvation of available resources. In this work, we analyze the potential of GPU sharing inside data-center environments. We investigate how several architectural features affect GPU performance under different multi-tenant stress scenarios. We compare CUDA MPS with the native, default CUDA scheduler and with Vinetalk, a research framework providing GPU sharing capabilities. Experimental results show that NVIDIA's MPS achieves the best performance in multi-application scenarios, performing up to 4.5x and 11.2x better than the native CUDA scheduler and Vinetalk, respectively.
