Mobile GPU shader processor based on non-blocking Coarse Grained Reconfigurable Arrays architecture

2013 International Conference on Field-Programmable Technology (FPT). Pub Date: 2013-12-01. DOI: 10.1109/FPT.2013.6718353

Kwon-Taek Kwon, Sungjin Son, Jeongae Park, Jeongae Park, Sangoak Woo, Seokyoon Jung, Soojung Ryu

Citations: 6

Abstract

Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency, as well as programmability, through the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedule must be fixed at compile time. This restriction makes CGRAs inefficient, particularly when accessing external memories or caches whose access time varies greatly. It is therefore challenging to build a high-performance, energy-efficient CGRA-based mobile GPU, because GPU shader execution usually involves massive texture memory accesses, consisting of accesses to the texture cache and to external texture memory. In this paper, we present a Non-blocking Coarse Grained Reconfigurable Arrays (NBCGRA) architecture that can handle varying-latency operations efficiently. We also propose an improved CGRA-based GPU shader processor architecture built on it. A retry buffer enables threads to re-execute later, when the required memory access completes. With a non-blocking texture cache, the shader core can execute without stalls even in the case of cache misses. Together, these components greatly improve CGRA core throughput despite longer memory access latencies. Evaluation results show that our NBCGRA-based shader processor performs efficiently despite extreme variation in texture cache access latencies and reduces shader execution cycles by up to 68% with minimal hardware cost overhead.
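The retry-buffer idea described in the abstract can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions, not the authors' simulator or hardware design: the function names, the one-thread-per-cycle issue model, the single fetch per thread, and the example latencies are all hypothetical. A blocking core stalls for the full latency of every texture fetch, while the non-blocking core parks a missing thread in a retry buffer and keeps issuing other threads until the data arrives.

```python
import heapq
from collections import deque


def blocking_cycles(latencies):
    """Blocking core: stalls for the full latency of each fetch in turn."""
    return sum(latencies)


def nonblocking_cycles(latencies):
    """Non-blocking core with a retry buffer (hypothetical toy model).

    latencies[i] is the fetch latency of thread i in cycles (>= 1);
    a latency of 1 models a cache hit that completes in the issue slot.
    Each cycle the core does exactly one of:
      1. re-execute a retry-buffer thread whose data has arrived,
      2. issue a new thread (parking it in the retry buffer on a miss),
      3. idle, if neither is possible.
    """
    pending = deque(range(len(latencies)))  # threads not yet issued
    retry = []  # min-heap of (arrival_cycle, thread_id)
    cycle = 0
    done = 0
    while done < len(latencies):
        cycle += 1
        if retry and retry[0][0] <= cycle:
            # Data for the oldest outstanding miss is ready: re-execute it.
            heapq.heappop(retry)
            done += 1
        elif pending:
            thread = pending.popleft()
            arrival = cycle + latencies[thread] - 1
            if arrival <= cycle:
                done += 1  # hit: thread finishes in its issue slot
            else:
                heapq.heappush(retry, (arrival, thread))  # miss: retry later
        # Otherwise the core idles this cycle, waiting on outstanding misses.
    return cycle


if __name__ == "__main__":
    # Assumed example: six hits (1 cycle) and two misses (20 cycles).
    example = [1, 1, 20, 1, 20, 1, 1, 1]
    print("blocking:", blocking_cycles(example))
    print("non-blocking:", nonblocking_cycles(example))
```

In this model the miss latencies overlap with the execution of other threads instead of adding up, which is the effect the abstract attributes to the retry buffer and non-blocking texture cache. The numbers it produces are illustrative and are unrelated to the 68% figure reported in the paper.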
