Exposing Hidden Performance Opportunities in High Performance GPU Applications
Benjamin Welton, B. Miller
2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), published 2018-05-01
DOI: 10.1109/CCGRID.2018.00045 (https://doi.org/10.1109/CCGRID.2018.00045)
Leadership-class systems with nodes containing many-core accelerators, such as GPUs, have the potential to increase the performance of applications. Effectively exploiting the parallelism provided by many-core accelerators requires developers to identify where accelerator parallelization would provide benefit and to ensure efficient interaction between the CPU and the accelerator. In the abstract, these issues appear straightforward and well understood. However, we have found that significant untapped performance opportunities exist in these areas even in well-known, heavily optimized, real-world applications created by experienced GPU developers. These opportunities exist for three reasons: accelerated libraries can create unexpected synchronization delays and memory transfer requests, accelerated libraries can interact inefficiently when combined, and vectorization opportunities can be hidden by the structure of the program. In the applications we have studied (Qball, QBox, HOOMD-blue, LAMMPS, and cuIBM), exploiting these opportunities reduced execution time by 18% to 87%. In this work, we provide concrete evidence that these performance issues exist and show their impact on real-world applications today. We characterize the missed performance opportunities we have identified by their underlying cause and describe a preliminary design of detection methods that performance tools can use to identify them.
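To put the reported range in more familiar terms, a reduction in execution time can be converted into a speedup factor with speedup = 1 / (1 - reduction). The following is a minimal sketch of that arithmetic only; the function name `speedup_from_reduction` is a hypothetical helper introduced here for illustration and does not come from the paper.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional reduction in execution time into a speedup factor.

    Hypothetical helper for illustration: a run that takes (1 - reduction)
    of the original time is 1 / (1 - reduction) times faster.
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # The two endpoints of the range quoted in the abstract.
    for r in (0.18, 0.87):
        print(f"{r:.0%} less time -> {speedup_from_reduction(r):.2f}x speedup")
```

Under this conversion, the 18% to 87% range corresponds to speedups of roughly 1.22x to 7.69x.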