{"title":"Poise: Balancing Thread-Level Parallelism and Memory System Performance in GPUs Using Machine Learning","authors":"Saumay Dublish, V. Nagarajan, N. Topham","doi":"10.1109/HPCA.2019.00061","DOIUrl":null,"url":null,"abstract":"GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6%, over the baseline greedythen-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-ofthe-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency. Keywords-warp scheduling; caches; machine learning","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2019.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 13
Abstract
GPUs employ a high degree of thread-level parallelism (TLP) to hide the long latency of memory operations. However, the consequent increase in demand on the memory system causes pathological effects such as cache thrashing and bandwidth bottlenecks. As a result, high degrees of TLP can adversely affect system throughput. In this paper, we present Poise, a novel approach for balancing TLP and memory system performance in GPUs. Poise has two major components: a machine learning framework and a hardware inference engine. The machine learning framework comprises a regression model that is trained offline on a set of profiled kernels to learn the best warp scheduling decisions. At runtime, the hardware inference engine uses the previously learned model to dynamically predict the best warp scheduling decisions for unseen applications. Therefore, Poise helps in optimizing entirely new applications without posing any profiling, training, or programming burden on the end-user. Across a set of benchmarks that were unseen during training, Poise achieves a speedup of up to 2.94× and a harmonic mean speedup of 46.6% over the baseline greedy-then-oldest warp scheduler. Poise is extremely lightweight and incurs a minimal hardware overhead of around 41 bytes per SM. It also reduces the overall energy consumption by an average of 51.6%. Furthermore, Poise outperforms the prior state-of-the-art warp scheduler by an average of 15.1%. In effect, Poise solves a complex hardware optimization problem with considerable accuracy and efficiency.

Keywords: warp scheduling; caches; machine learning
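
To make the offline-training/online-inference split concrete, the minimal sketch below fits a regression model on hypothetical per-kernel profiling data and then reuses the learned weights for a lightweight runtime prediction of a warp limit. The feature set, the linear model form, the warp-limit target, and all numeric values are illustrative assumptions; the abstract does not specify Poise's actual features, model, or training data.

```python
# Sketch of an offline-trained regression model used for cheap online inference.
# All features, targets, and values here are hypothetical placeholders, not
# Poise's actual design.

import numpy as np

# --- Offline: fit a regression model on profiled kernels --------------------
# Each row holds hypothetical per-kernel profile features; the target is the
# best warp limit found by profiling (made-up numbers for illustration).
X_train = np.array([
    [0.80, 0.60],   # [L1 miss rate, DRAM bandwidth utilization]
    [0.20, 0.30],
    [0.55, 0.90],
])
y_train = np.array([8.0, 32.0, 12.0])  # best number of schedulable warps

# Ordinary least squares with a bias term (offline, so training cost is a non-issue).
A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
weights, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# --- Online: lightweight inference, analogous to a per-SM hardware engine ---
def predict_warp_limit(features: np.ndarray) -> int:
    """Dot product plus bias; cheap enough to imagine in a small hardware unit."""
    raw = float(features @ weights[:-1] + weights[-1])
    return int(np.clip(round(raw), 1, 64))  # clamp to a plausible warp range

if __name__ == "__main__":
    unseen_kernel = np.array([0.70, 0.50])  # runtime-observed counters
    print("Predicted warp limit:", predict_warp_limit(unseen_kernel))
```

The point of the split is that all expensive work (profiling and fitting) happens offline, while the runtime path reduces to a few multiply-accumulates, which is consistent with the small per-SM hardware overhead the abstract reports.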