Multi-level parallel multi-layer block reproducible summation algorithm

IF 2 4区 计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS Parallel Computing Pub Date : 2023-02-01 DOI:10.1016/j.parco.2023.102996
Kuan Li , Kang He , Stef Graillat , Hao Jiang , Tongxiang Gu , Jie Liu
{"title":"Multi-level parallel multi-layer block reproducible summation algorithm","authors":"Kuan Li ,&nbsp;Kang He ,&nbsp;Stef Graillat ,&nbsp;Hao Jiang ,&nbsp;Tongxiang Gu ,&nbsp;Jie Liu","doi":"10.1016/j.parco.2023.102996","DOIUrl":null,"url":null,"abstract":"<div><p>Reproducibility means getting the bitwise identical floating point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). However, in parallel computing environments, the combination of dynamic scheduling of parallel computing resources. Moreover, floating point nonassociativity leads to non-reproducible results. Demmel and Nguyen proposed a floating-point summation algorithm that is reproducible independent of the order of summation (Demmel and Nguye, 2013; 2015) and accurate by using the 1-Reduction technique. Our work combines their work with the multi-layer block technology proposed by Castaldo et al. (2009), designs the multi-level parallel multi-layer block reproducible summation algorithm (MLP_rsum), including SIMD, OpenMP, and MPI based on each layer of blocks, and then attains reproducible and expected accurate results with high performance. Numerical experiments show that our algorithm is more efficient than the reproducible summation function in ReproBLAS (2018). With SIMD optimization, our algorithm is 2.41, 2.85, and 3.44 times faster than ReproBLAS on the three ARM platforms. With OpenMP optimization, our algorithm obtains linear speedup, showing that our method applies to multi-core processors. Finally, with reproducible MPI reduction, our algorithm’s parallel efficiency is 76% at 32 nodes with 4 threads and 32 processes.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"115 ","pages":"Article 102996"},"PeriodicalIF":2.0000,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819123000029","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 1

Abstract

Reproducibility means getting the bitwise identical floating point results from multiple runs of the same program, which plays an essential role in debugging and correctness checking in many codes (Villa et al., 2009). However, in parallel computing environments, the combination of dynamic scheduling of parallel computing resources. Moreover, floating point nonassociativity leads to non-reproducible results. Demmel and Nguyen proposed a floating-point summation algorithm that is reproducible independent of the order of summation (Demmel and Nguye, 2013; 2015) and accurate by using the 1-Reduction technique. Our work combines their work with the multi-layer block technology proposed by Castaldo et al. (2009), designs the multi-level parallel multi-layer block reproducible summation algorithm (MLP_rsum), including SIMD, OpenMP, and MPI based on each layer of blocks, and then attains reproducible and expected accurate results with high performance. Numerical experiments show that our algorithm is more efficient than the reproducible summation function in ReproBLAS (2018). With SIMD optimization, our algorithm is 2.41, 2.85, and 3.44 times faster than ReproBLAS on the three ARM platforms. With OpenMP optimization, our algorithm obtains linear speedup, showing that our method applies to multi-core processors. Finally, with reproducible MPI reduction, our algorithm’s parallel efficiency is 76% at 32 nodes with 4 threads and 32 processes.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
多级并行多层块可重复求和算法
再现性意味着从同一程序的多次运行中获得逐位相同的浮点结果,这在许多代码的调试和正确性检查中起着至关重要的作用(Villa等人,2009)。然而,在并行计算环境中,并行计算资源的动态调度组合。此外,浮点非关联性导致不可重现的结果。Demmel和Nguyen提出了一种浮点求和算法,该算法是可重复的,与求和的顺序无关(Demmel and Nguye,2013;2015),并通过使用1-归约技术进行精确计算。我们的工作将他们的工作与Castaldo等人提出的多层块技术相结合。(2009),设计了多级并行多层块可重复求和算法(MLP_rsum),包括基于每层块的SIMD、OpenMP和MPI,然后以高性能获得可重复和预期的精确结果。数值实验表明,我们的算法比ReproBLAS(2018)中的可重复求和函数更有效。通过SIMD优化,我们的算法在三个ARM平台上分别比ReproBLAS快2.41、2.85和3.44倍。通过OpenMP优化,我们的算法获得了线性加速,表明我们的方法适用于多核处理器。最后,通过可重复的MPI减少,我们的算法在具有4个线程和32个进程的32个节点上的并行效率为76%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Parallel Computing
Parallel Computing 工程技术-计算机:理论方法
CiteScore
3.50
自引率
7.10%
发文量
49
审稿时长
4.5 months
期刊介绍: Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogenous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to: -System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing). -Enabling software including debuggers, performance tools, and system and numeric libraries. -General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems -Software engineering and productivity as it relates to parallel computing -Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism -Performance measurement results on state-of-the-art systems -Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures. -Parallel I/O systems both hardware and software -Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications
期刊最新文献
Towards resilient and energy efficient scalable Krylov solvers Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions Editorial Board FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1