Multi-level parallel multi-layer block reproducible summation algorithm

Impact Factor: 2.1 | Zone 4, Computer Science | Q2, Computer Science, Theory & Methods | Journal: Parallel Computing | Pub Date: 2023-02-01 | DOI: 10.1016/j.parco.2023.102996

Kuan Li, Kang He, Stef Graillat, Hao Jiang, Tongxiang Gu, Jie Liu

Parallel Computing, Volume 115, Article 102996. Journal Article. https://www.sciencedirect.com/science/article/pii/S0167819123000029

Citations: 1

Abstract

Reproducibility means obtaining bitwise identical floating-point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). In parallel computing environments, however, the combination of dynamic scheduling of parallel computing resources and floating-point nonassociativity leads to non-reproducible results. Demmel and Nguyen proposed a floating-point summation algorithm that is accurate and reproducible independent of the order of summation (Demmel and Nguyen, 2013; 2015), using the 1-Reduction technique. Our work combines their algorithm with the multi-layer block technique proposed by Castaldo et al. (2009) to design the multi-level parallel multi-layer block reproducible summation algorithm (MLP_rsum), which applies SIMD, OpenMP, and MPI parallelism based on each layer of blocks and attains reproducible results of the expected accuracy with high performance. Numerical experiments show that our algorithm is more efficient than the reproducible summation function in ReproBLAS (2018). With SIMD optimization, our algorithm is 2.41, 2.85, and 3.44 times faster than ReproBLAS on the three ARM platforms. With OpenMP optimization, our algorithm obtains linear speedup, showing that our method applies to multi-core processors. Finally, with reproducible MPI reduction, our algorithm's parallel efficiency is 76% at 32 nodes with 4 threads and 32 processes.
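To make the nonassociativity problem concrete, the following is a minimal Python sketch of order-independent summation by pre-rounding. It is not MLP_rsum, nor the ReproBLAS implementation: it takes a separate first pass to find the maximum magnitude (which the 1-Reduction technique is designed to avoid) and keeps only a single rounding grid, so it is less accurate than a multi-fold scheme. The function names `preround_constant` and `prerounded_sum` are hypothetical, and the sketch assumes finite IEEE binary64 inputs with round-to-nearest, magnitudes well away from overflow and underflow, and far fewer than 2**50 elements.

```python
import math

def preround_constant(xs):
    """Return S = 1.5 * 2**k; adding S forces every x onto a common grid of ulp(S)."""
    max_abs = max((abs(x) for x in xs), default=0.0)
    if max_abs == 0.0:
        return 0.0
    _, e = math.frexp(max_abs)        # max_abs < 2**e
    k = e + len(xs).bit_length()      # hence len(xs) * max_abs < 2**k
    return math.ldexp(1.5, k)

def prerounded_sum(xs, S=None):
    """Sum xs after rounding each element to a multiple of ulp(S).

    Every rounded element is a multiple of 2**(k-52) and every partial sum
    stays below 2**k in magnitude, so each addition is exact and the result
    does not depend on the order of the additions.
    """
    if S is None:
        S = preround_constant(xs)
    total = 0.0
    for x in xs:
        total += (x + S) - S          # (x + S) - S is x rounded to the grid
    return total

# Plain floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# The pre-rounded sum is identical for any ordering of the same data:
data = [0.1, 0.2, 0.3, 1e10, -1e10, 1e-5]
print(prerounded_sum(data) == prerounded_sum(data[::-1]))
```

Because all blocks share the same constant `S`, per-block partial sums (for example one per thread or process) can themselves be added in any order without changing the result, which is the property a multi-level parallel reduction relies on. The price in this single-grid sketch is accuracy: elements much smaller than the largest one are rounded away.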




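Source journal

Parallel Computing (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 3.50

Self-citation rate: 7.10%

Articles published: (not listed)

Review time: 4.5 months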

Journal description: Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results.

Particular technical areas of interest include, but are not limited to:

- System software for parallel computer systems, including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
- Enabling software, including debuggers, performance tools, and system and numeric libraries.
- General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
- Software engineering and productivity as it relates to parallel computing.
- Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
- Performance measurement results on state-of-the-art systems.
- Approaches to effectively utilize large-scale parallel computing, including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
- Parallel I/O systems, both hardware and software.
- Networking technology for support of high-speed computing, demonstrating the impact of high-speed computation on parallel applications.