Parallel Thomas approach development for solving tridiagonal systems in GPU programming: steady and unsteady flow simulation
M. Souri, P. Akbarzadeh, H. M. Darian
Mechanics & Industry, 2020. DOI: 10.1051/meca/2020013 (https://doi.org/10.1051/meca/2020013)
The solution of tridiagonal systems of equations on graphics processing units (GPUs) is assessed. A parallel Thomas algorithm (PTA) is developed and compared with two known parallel algorithms: cyclic reduction (CR) and parallel cyclic reduction (PCR). The lid-driven cavity problem is used to assess these parallel approaches. The same problem is also simulated with the classic Thomas algorithm running on a central processing unit (CPU). Runtimes and physical parameters from the GPU and CPU algorithms are compared. The results show that CR, PCR, and PTA achieve speedups of 4.4x, 5.2x, and 38.5x, respectively, over the CPU runtime. Furthermore, the effect of coalesced versus uncoalesced access to GPU global memory is examined for PTA, and coalesced access yields a 2x speedup. Additionally, PTA performance is assessed on a time-dependent problem, the unsteady flow over a square, where a 9x speedup over the CPU is obtained.
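For readers unfamiliar with the baseline, the following is a minimal Python sketch of the classic Thomas algorithm (forward elimination followed by back substitution), plus a batched variant. The abstract does not describe how PTA is organized, so the batched function is only an assumption about one common approach: each GPU thread solves one independent system (for example, one grid line), and the systems are stored interleaved so that neighbouring threads touch neighbouring memory addresses, which is what coalesced access requires. The function names, the argument layout, and the interleaving formula `i * nsys + s` are illustrative choices, not taken from the paper.

```python
def thomas(a, b, c, d):
    """Solve one tridiagonal system with the classic Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    No pivoting is done, so the matrix is assumed to be
    well conditioned (for example, diagonally dominant).
    """
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward elimination: remove the sub-diagonal row by row.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Back substitution: recover the unknowns from last to first.
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def thomas_batched_interleaved(a, b, c, d, nsys, n):
    """Solve nsys independent systems of size n stored interleaved.

    Assumed layout: element i of system s lives at index i * nsys + s.
    The outer loop over s stands in for GPU threads; at a fixed step i,
    consecutive "threads" read consecutive addresses.
    """
    cp = [0.0] * (nsys * n)
    dp = [0.0] * (nsys * n)
    x = [0.0] * (nsys * n)
    for s in range(nsys):  # on a GPU, each s would be one thread
        cp[s] = c[s] / b[s]
        dp[s] = d[s] / b[s]
        for i in range(1, n):
            k = i * nsys + s   # current row of system s
            km = k - nsys      # previous row of the same system
            denom = b[k] - a[k] * cp[km]
            cp[k] = c[k] / denom
            dp[k] = (d[k] - a[k] * dp[km]) / denom
        k = (n - 1) * nsys + s
        x[k] = dp[k]
        for i in range(n - 2, -1, -1):
            k = i * nsys + s
            x[k] = dp[k] - cp[k] * x[k + nsys]
    return x
```

In the non-interleaved alternative (system s occupying indices `s * n` to `s * n + n - 1`), threads working on the same row index would be `n` elements apart in memory. That is the kind of uncoalesced pattern the abstract contrasts with the coalesced one.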
About the journal:
An International Journal on Mechanical Sciences and Engineering Applications
With papers from industry, Research and Development departments and academic institutions, this journal acts as an interface between research and industry, coordinating and disseminating scientific and technical mechanical research in relation to industrial activities.
Targeted readers are technicians, engineers, executives, researchers, and teachers who are working in industrial companies as managers or in Research and Development departments, technical centres, laboratories, universities, technical and engineering schools. The journal is an AFM (Association Française de Mécanique) publication.