A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.
{"title":"A Scalable LDPC Decoder on GPU","authors":"K. Abburi","doi":"10.1109/VLSID.2011.44","DOIUrl":"https://doi.org/10.1109/VLSID.2011.44","url":null,"abstract":"A flexible and scalable approach for LDPC decodingon CUDA based Graphics Processing Unit (GPU) is presented in this paper. Layered decoding is a popular method for LDPC decoding and is known for its fast convergence. However, efficient implementation of the layered decoding algorithm on GPU is challenging due to the limited amount of data-parallelism available in this algorithm. To overcome this problem, a kernel execution configuration that can decode multiple codewords simultaneously on GPU is developed. This paper proposes a compact data packing scheme to reduce the number of global memory accesses and parity-check matrix representation to reduce constant memory latency. Global memory bandwidth efficiency is improved by coalescing simultaneous memory accesses of threads in a half-warp into a single memory transaction. Asynchronous data transfers are used to hide host memory latency by overlapping kernel execution with data transfers between CPU and GPU. The proposed implementation of LDPC decoder on GPU performs two orders of magnitude faster than the LDPC decoder on a CPU and four times faster than the previously reported LDPC decoder on GPU. This implementation achieves a throughput of 160Mbps, which is comparable to dedicated hardware solutions.","PeriodicalId":371062,"journal":{"name":"2011 24th Internatioal Conference on VLSI Design","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127500804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Model Checking is an effective method for design verification, useful for proving temporal properties of the underlying system. In model checking, computing the pre-image (or image) space of a given temporal property plays a critical role. In this paper, we propose a novel learning framework for efficient state space exploration based on search state extensibility relation. This allows for the identification and pruning of several non-trivial redundant search spaces, thereby reducing the computational cost. We also propose a probability-based heuristic to guide our learning method. Experimental evidence is given to show the practicality of the proposed method.
{"title":"A Novel Learning Framework for State Space Exploration Based on Search State Extensibility Relation","authors":"M. Chandrasekar, M. Hsiao","doi":"10.1109/VLSID.2011.57","DOIUrl":"https://doi.org/10.1109/VLSID.2011.57","url":null,"abstract":"Model Checking is an effective method for design verification, useful for proving temporal properties of the underlying system. In model checking, computing the pre-image (or image) space of a given temporal property plays a critical role. In this paper, we propose a novel learning framework for efficient state space exploration based on search state extensibility relation. This allows for the identification and pruning of several non-trivial redundant search spaces, thereby reducing the computational cost. We also propose a probability-based heuristic to guide our learning method. Experimental evidence is given to show the practicality of the proposed method.","PeriodicalId":371062,"journal":{"name":"2011 24th Internatioal Conference on VLSI Design","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125653769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jeyavijayan Rajendran, H. Manem, R. Karri, G. Rose
Memristors have been proposed to be used in a wide variety of applications ranging from neural networks to memory to digital logic. Like other electronic devices, memristors are also prone to process variations. We show that the effect of process induced variations in the thickness of the oxide layer of a memristor has a non-linear relationship with memristance. We analyze the effects of process variation on memristor-based threshold gates. We propose two algorithms to tolerate variations on memristance based on two different constraints. One is used to determine the memristance values for a given list of Boolean functions to tolerate a maximum amount of variation. The other is used to determine the list of Boolean functions that can tolerate a maximum amount of variation for given memristance values. Finally, we analyze the performance of memristor-based threshold gates to tolerate variations.
{"title":"An Approach to Tolerate Process Related Variations in Memristor-Based Applications","authors":"Jeyavijayan Rajendran, H. Manem, R. Karri, G. Rose","doi":"10.1109/VLSID.2011.49","DOIUrl":"https://doi.org/10.1109/VLSID.2011.49","url":null,"abstract":"Memristors have been proposed to be used in a wide variety of applications ranging from neural networks to memory to digital logic. Like other electronic devices, memristors are also prone to process variations. We show that the effect of process induced variations in the thickness of the oxide layer of a memristor has a non-linear relationship with memristance. We analyze the effects of process variation on memristor-based threshold gates. We propose two algorithms to tolerate variations on memristance based on two different constraints. One is used to determine the memristance values for a given list of Boolean functions to tolerate a maximum amount of variation. The other is used to determine the list of Boolean functions that can tolerate a maximum amount of variation for given memristance values. Finally, we analyze the performance of memristor-based threshold gates to tolerate variations.","PeriodicalId":371062,"journal":{"name":"2011 24th Internatioal Conference on VLSI Design","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124519139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
he heterogeneous nature of the modern day applications has resulted in widespread use of Multicore SoC architectures. The emerging Network-On-Chip (NoC) interconnect architecture provides an energy-efficient and scalable communication solution for multiple cores, serving as a powerful replacement for traditional bus architectures. The key to the successful realization of such architectures is a flexible, fast and robust emulation platform. This paper presents the design, implementation and evaluation of AcENoCs, a flexible and cycle-accurate FPGA emulation platform for validating synchronous and GALS-based NoC architectures. The emulation platform is built around a HWSW framework consisting of reconfigurable network components, traffic generators and ejectors, statistics collection and analysis modules. We also address the unique features of our platform in terms of reconfigurability and co-design of the hardware and software components, and assess the performance improvements and tradeoffs over existing PGA emulators and software simulators. Our experimental analysis indicate speedup improvements in the order of 10000-12000X over HDL simulators and 14-47X over software simulators, without sacrificing cycle accuracy.
{"title":"AcENoCs: A Configurable HW/SW Platform for FPGA Accelerated NoC Emulation","authors":"Swapnil Lotlikar, Vinay S. Pai, Paul V. Gratz","doi":"10.1109/VLSID.2011.46","DOIUrl":"https://doi.org/10.1109/VLSID.2011.46","url":null,"abstract":"he heterogeneous nature of the modern day applications has resulted in widespread use of Multicore SoC architectures. The emerging Network-On-Chip (NoC) interconnect architecture provides an energy-efficient and scalable communication solution for multiple cores, serving as a powerful replacement for traditional bus architectures. The key to the successful realization of such architectures is a flexible, fast and robust emulation platform. This paper presents the design, implementation and evaluation of AcENoCs, a flexible and cycle-accurate FPGA emulation platform for validating synchronous and GALS-based NoC architectures. The emulation platform is built around a HWSW framework consisting of reconfigurable network components, traffic generators and ejectors, statistics collection and analysis modules. We also address the unique features of our platform in terms of reconfigurability and co-design of the hardware and software components, and assess the performance improvements and tradeoffs over existing PGA emulators and software simulators. Our experimental analysis indicate speedup improvements in the order of 10000-12000X over HDL simulators and 14-47X over software simulators, without sacrificing cycle accuracy.","PeriodicalId":371062,"journal":{"name":"2011 24th Internatioal Conference on VLSI Design","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126441673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}