Lost in Abstraction: Pitfalls of Analyzing GPUs at the Intermediate Language Level

Anthony Gutierrez, Bradford M. Beckmann, A. Duțu, Joseph Gross, Michael LeBeane, J. Kalamatianos, Onur Kayiran, Matthew Poremba, Brandon Potter, Sooraj Puthoor, Matthew D. Sinclair, Mark Wyse, Jieming Yin, Xianwei Zhang, Akshay Jain, Timothy G. Rogers

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), February 2018. DOI: 10.1109/HPCA.2018.00058 (https://doi.org/10.1109/HPCA.2018.00058)
Modern GPU frameworks use a two-phase compilation approach. Kernels written in a high-level language are initially compiled to an implementation-agnostic intermediate language (IL), then finalized to the machine ISA only when the target GPU hardware is known. Most GPU microarchitecture simulators available to academics execute IL instructions because there is substantially less functional state associated with the instructions, and in some situations, the machine ISA's intellectual property may not be publicly disclosed. In this paper, we demonstrate the pitfalls of evaluating GPUs using this higher-level abstraction, and make the case that several important microarchitecture interactions are only visible when executing lower-level instructions. Our analysis shows that given identical application source code and GPU microarchitecture models, execution behavior will differ significantly depending on the instruction set abstraction. For example, our analysis shows the dynamic instruction count of the machine ISA is nearly 2× that of the IL on average, but contention for vector registers is reduced by 3× due to the optimized resource utilization. In addition, our analysis highlights the deficiencies of using IL to model instruction fetching, control divergence, and value similarity. Finally, we show that simulating IL instructions adds 33% error as compared to the machine ISA when comparing absolute runtimes to real hardware.
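The two-phase flow the abstract describes can be made concrete with the NVIDIA toolchain, used here only as one familiar instance of a high-level language, an IL, and a machine ISA; this sketch is not drawn from the paper, and the kernel, file names, and compiler flags are illustrative assumptions.

```cuda
// saxpy.cu - a minimal, hypothetical kernel used only to illustrate the
// two-phase compilation flow described above; it is not taken from the paper.
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // per-thread bounds check: the kind of simple
        y[i] = a * x[i] + y[i];   // control flow a divergence model must handle
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));   // unified memory keeps the example short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    printf("y[0] = %f\n", y[0]);                // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}

// Phase 1 (high-level language -> IL):     nvcc -ptx saxpy.cu -o saxpy.ptx
// Phase 2 (IL -> machine ISA, finalizer):  ptxas -arch=sm_70 saxpy.ptx -o saxpy.cubin
```

A simulator that executes the IL produced by phase 1 observes different dynamic instruction counts, register allocation, and branch structure than one that executes the finalized machine code produced by phase 2, which is the gap the paper quantifies.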