{"title":"Twinned buffering: A simple and highly effective scheme for parallelization of Successive Over-Relaxation on GPUs and other accelerators","authors":"W. Vanderbauwhede, T. Takemi","doi":"10.1109/HPCSim.2015.7237073","DOIUrl":null,"url":null,"abstract":"In this paper we present a new scheme for parallelization of the Successive Over-Relaxation method for solving the Poisson equation over a 3-D volume. Our new scheme is both simple and effective, outperforming the conventional Red-Black scheme by a factor of 16 on an NVIDIA GeForce GTX 590 GPU, a factor of 11 on an NVIDIA GeForce TITAN Black GPU and a factor of 5 on an Intel Xeon Phi. The speed-up compared to the fully optimised reference implementation running on an Intel Xeon CPU is 16 times on the GTX 590, 22 times on the TITAN and 5 times on the Xeon Phi. We explain the rationale and the implementation in OpenCL and present the performance evaluation results.","PeriodicalId":134009,"journal":{"name":"2015 International Conference on High Performance Computing & Simulation (HPCS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCSim.2015.7237073","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5
Abstract
In this paper we present a new scheme for parallelization of the Successive Over-Relaxation method for solving the Poisson equation over a 3-D volume. Our new scheme is both simple and effective, outperforming the conventional Red-Black scheme by a factor of 16 on an NVIDIA GeForce GTX 590 GPU, a factor of 11 on an NVIDIA GeForce TITAN Black GPU and a factor of 5 on an Intel Xeon Phi. The speed-up compared to the fully optimised reference implementation running on an Intel Xeon CPU is 16 times on the GTX 590, 22 times on the TITAN and 5 times on the Xeon Phi. We explain the rationale and the implementation in OpenCL and present the performance evaluation results.