N. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton J. Smith, John Manferdelli
{"title":"图形处理器上的高性能离散傅里叶变换","authors":"N. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton J. Smith, John Manferdelli","doi":"10.1109/SC.2008.5213922","DOIUrl":null,"url":null,"abstract":"We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4times over CUFFT and 8-40times improvement over MKL for large sizes.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"315","resultStr":"{\"title\":\"High performance discrete Fourier transforms on graphics processors\",\"authors\":\"N. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton J. Smith, John Manferdelli\",\"doi\":\"10.1109/SC.2008.5213922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4times over CUFFT and 8-40times improvement over MKL for large sizes.\",\"PeriodicalId\":230761,\"journal\":{\"name\":\"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-11-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"315\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC.2008.5213922\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2008.5213922","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 315
摘要
我们提出了在gpu上计算高性能离散傅里叶变换的新算法。我们提出了分层的、混合基数的FFT算法,用于2的幂和非2的幂的大小。我们的分层FFT算法使用Stockham公式有效地利用gpu上的共享内存。我们通过将转置组合成基于块的多fft算法来减少分层算法中的内存转置开销。对于非2次幂大小,我们使用小素数的混合基数fft和Bluestein算法的组合。我们在Bluestein算法中使用模算法来提高准确率。我们使用NVIDIA CUDA API实现算法,并将其性能与NVIDIA的CUFFT库和高端四核CPU上的优化CPU实现(英特尔的MKL)进行比较。在NVIDIA GPU上,我们获得了高达300 GFlops的性能,在大尺寸情况下,通常性能比CUFFT提高2-4倍,比MKL提高8-40倍。
High performance discrete Fourier transforms on graphics processors
We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4times over CUFFT and 8-40times improvement over MKL for large sizes.