swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. Pub Date: 2018-02-10. DOI: 10.1145/3178487.3178513

Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu


Citations: 66

Abstract

Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Much current research on parallel SpTRSV focuses on level-set construction to reduce the number of inter-level synchronizations. However, existing work has largely neglected two costs: uncontrolled data reuse, and the expensive global memory or shared cache accesses that inter-level synchronization requires. In this paper, we propose a novel data layout called Sparse Level Tile that brings all data reuse under control, and we design a Producer-Consumer pairing method so that every inter-level synchronization happens only through very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, the main building block of Sunway TaihuLight, currently the world's fastest supercomputer. Experiments on all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9x and a best speedup of 38.5x over the parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor for 1856 matrices, and the latest methods on a K80 GPU for 1672 matrices.
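For readers unfamiliar with the level-set approach that the abstract uses as its baseline, the following is a minimal sequential sketch of the general idea, not the authors' Sparse Level Tile layout or their Producer-Consumer pairing. It assumes a lower-triangular matrix stored in CSR form with a nonzero diagonal; the function names `build_levels` and `sptrsv_level_set` are hypothetical and chosen only for illustration.

```python
def build_levels(n, row_ptr, col_idx):
    """Group the rows of a lower-triangular CSR matrix into level sets.

    A row's level is one more than the highest level among the rows it
    depends on, so rows that share a level have no mutual dependencies.
    """
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # off-diagonal entry: row i needs x[j] first
                level[i] = max(level[i], level[j] + 1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def sptrsv_level_set(n, row_ptr, col_idx, vals, b):
    """Solve L x = b by processing one level set at a time."""
    x = [0.0] * n
    for rows in build_levels(n, row_ptr, col_idx):
        # Rows within one level are independent. A parallel version would
        # distribute them across cores and synchronize after each level,
        # which is the inter-level synchronization the abstract refers to.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = vals[k]
                else:
                    acc -= vals[k] * x[j]
            x[i] = acc / diag  # assumes a stored, nonzero diagonal entry
    return x
```

In this sketch the number of levels equals the number of synchronization points a parallel implementation would need, which is why level-set methods try to keep that count low.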





Latest papers from this venue

Graph partitioning applied to DAG scheduling to reduce NUMA effects
Juggler: a dependence-aware task-based execution framework for GPUs
Performance modeling for GPUs using abstract kernel emulation
Automated code acceleration targeting heterogeneous OpenCL devices
Layrub: layer-centric GPU memory reuse and data migration in extreme-scale deep learning systems