{"title":"异构CPU-GPU处理器中的GPU计算管道效率低下和优化机会","authors":"Joel Hestness, S. Keckler, D. Wood","doi":"10.1109/IISWC.2015.15","DOIUrl":null,"url":null,"abstract":"Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors. However, understanding potential application optimizations and adaptations is critical for directing heterogeneous processor programming model and architectural development. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported and simulated a broad set of benchmarks originally developed for discrete GPUs to remove memory copies, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and support for efficient producer-consumer data handling and synchronization within caches.","PeriodicalId":142698,"journal":{"name":"2015 IEEE International Symposium on Workload Characterization","volume":"447 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-10-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"35","resultStr":"{\"title\":\"GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors\",\"authors\":\"Joel Hestness, S. Keckler, D. Wood\",\"doi\":\"10.1109/IISWC.2015.15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing the application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the available, finer-grained communication in these processors. However, understanding potential application optimizations and adaptations is critical for directing heterogeneous processor programming model and architectural development. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported and simulated a broad set of benchmarks originally developed for discrete GPUs to remove memory copies, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. 
For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and support for efficient producer-consumer data handling and synchronization within caches.\",\"PeriodicalId\":142698,\"journal\":{\"name\":\"2015 IEEE International Symposium on Workload Characterization\",\"volume\":\"447 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-10-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"35\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE International Symposium on Workload Characterization\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IISWC.2015.15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Symposium on Workload Characterization","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2015.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors
Emerging heterogeneous CPU-GPU processors have introduced unified memory spaces and cache coherence. CPU and GPU cores will be able to concurrently access the same memories, eliminating memory copy overheads and potentially changing application-level optimization targets. To date, little is known about how developers may organize new applications to leverage the finer-grained communication available in these processors. However, understanding potential application optimizations and adaptations is critical for directing the development of heterogeneous processor programming models and architectures. This paper quantifies opportunities for applications and architectures to evolve to leverage the new capabilities of heterogeneous processors. To identify these opportunities, we ported a broad set of benchmarks originally developed for discrete GPUs to remove their memory copies, simulated them, and applied analytical models to quantify their application-level pipeline inefficiencies. For existing benchmarks, GPU bulk-synchronous software pipelines result in considerable core and cache utilization inefficiency. For heterogeneous processors, the results indicate increased opportunity for techniques that provide flexible compute and data granularities, and that support efficient producer-consumer data handling and synchronization within caches.
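To make the porting methodology concrete: the abstract describes removing the explicit host-device memory copies that discrete-GPU benchmarks use for staging data. The paper itself does not include code; the sketch below is a minimal, illustrative example of that general pattern, using CUDA managed memory as a stand-in for the coherent unified memory the paper simulates. The kernel name `scale` and all variable names are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// A trivial kernel standing in for one stage of a GPU software pipeline.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // --- Discrete-GPU pattern: explicit staging copies around the kernel ---
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // copy-in overhead
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy-out overhead
    cudaFree(d);
    free(h);

    // --- Unified-memory port: the explicit copies disappear ---
    float *u;
    cudaMallocManaged(&u, bytes);             // one allocation visible to CPU and GPU
    for (int i = 0; i < n; ++i) u[i] = 1.0f;  // CPU writes the data directly
    scale<<<(n + 255) / 256, 256>>>(u, 2.0f, n);
    cudaDeviceSynchronize();                  // after the kernel, CPU reads u directly
    printf("u[0] = %f\n", u[0]);
    cudaFree(u);
    return 0;
}
```

In the second version the CPU and GPU touch the same allocation, which is the property that lets the paper re-examine application-level pipeline organization: with no copy boundary, compute and data granularities and producer-consumer handoffs between cores become the relevant optimization targets.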