基于GPU的可扩展LDPC解码器

2011 24th Internatioal Conference on VLSI Design Pub Date : 2011-01-02 DOI:10.1109/VLSID.2011.44

K. Abburi

{"title":"基于GPU的可扩展LDPC解码器","authors":"K. Abburi","doi":"10.1109/VLSID.2011.44","DOIUrl":null,"url":null,"abstract":"A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.","PeriodicalId":371062,"journal":{"name":"2011 24th Internatioal Conference on VLSI Design","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"A Scalable LDPC Decoder on GPU\",\"authors\":\"K. Abburi\",\"doi\":\"10.1109/VLSID.2011.44\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.\",\"PeriodicalId\":371062,\"journal\":{\"name\":\"2011 24th Internatioal Conference on VLSI Design\",\"volume\":\"11 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-01-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 24th Internatioal Conference on VLSI Design\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/VLSID.2011.44\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 24th Internatioal Conference on VLSI Design","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/VLSID.2011.44","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 26

摘要

本文提出了一种在基于CUDA的图形处理单元(GPU)上进行LDPC解码的灵活且可扩展的方法。分层译码是LDPC译码的一种常用方法，具有快速收敛的特点。然而，由于该算法中可用的数据并行性有限，因此在GPU上有效实现分层解码算法具有挑战性。为了解决这一问题，开发了一种可以在GPU上同时解码多个码字的内核执行配置。本文提出了一种紧凑的数据打包方案，以减少全局内存访问次数，并提出了奇偶校验矩阵表示，以减少恒定的内存延迟。全局内存带宽效率是通过将半曲线程的并发内存访问合并到单个内存事务中来提高的。异步数据传输通过在CPU和GPU之间的数据传输重叠内核执行来隐藏主机内存延迟。提出的LDPC解码器在GPU上的实现比CPU上的LDPC解码器快两个数量级，比以前报道的GPU上的LDPC解码器快4倍。该实现实现了160Mbps的吞吐量，与专用硬件解决方案相当。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

A Scalable LDPC Decoder on GPU

A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 24th Internatioal Conference on VLSI Design

自引率

0.00%

发文量