A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2015-06-13. DOI: 10.1145/2749469.2750399

Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, A. Bhowmick, Rachata Ausavarungnirun, C. Das, M. Kandemir, T. Mowry, O. Mutlu

Pages: 41-53

Citations: 106

Abstract




Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in the utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression so that less data is transferred from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
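The trade described in the abstract, spending otherwise idle compute to move fewer bytes across the memory interface, can be illustrated with a small sketch. The abstract does not name a compression algorithm, so the base-plus-delta scheme below is an assumption chosen only for illustration. The function names, the 4-byte word size, and the 1-byte delta width are likewise hypothetical, and the code runs on the host in Python rather than in an assist warp.

```python
# Illustrative sketch only: a simple base-plus-delta scheme (an assumption,
# not taken from the abstract) showing how extra computation can reduce the
# number of bytes transferred for one cache line.

WORD_BYTES = 4   # assumed size of each word in the line
DELTA_BYTES = 1  # assumed size of each stored delta


def compress_line(words):
    """Compress a non-empty list of integer words.

    Returns ("bd", base, deltas) if every delta from the first word fits in
    one signed byte, otherwise ("raw", words) to signal no compression.
    """
    base = words[0]
    deltas = [w - base for w in words]
    if all(-128 <= d <= 127 for d in deltas):
        return ("bd", base, deltas)
    return ("raw", list(words))


def decompress_line(packed):
    """Rebuild the original words from the output of compress_line."""
    if packed[0] == "bd":
        _, base, deltas = packed
        return [base + d for d in deltas]
    return list(packed[1])


def transferred_bytes(packed):
    """Bytes that would cross the memory interface for this line."""
    if packed[0] == "bd":
        return WORD_BYTES + DELTA_BYTES * len(packed[2])
    return WORD_BYTES * len(packed[1])


if __name__ == "__main__":
    line = [1000, 1004, 1008, 1012, 1016, 1020, 1024, 1028]
    packed = compress_line(line)
    assert decompress_line(packed) == line
    print(transferred_bytes(packed), "bytes instead of", WORD_BYTES * len(line))
```

In this toy model, a line whose values cluster near a common base moves 12 bytes instead of 32, while a line with widely spread values falls back to the raw form. The extra work in `decompress_line` stands in for the kind of task the abstract assigns to assist warps.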
