Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing

IF 6.2 · CAS Tier 2 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Future Generation Computer Systems-The International Journal of Escience · Pub Date: 2025-01-21 · DOI: 10.1016/j.future.2024.107698
Pedro J. Martinez-Ferrer , Albert-Jan Yzelman , Vicenç Beltran
{"title":"Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing","authors":"Pedro J. Martinez-Ferrer ,&nbsp;Albert-Jan Yzelman ,&nbsp;Vicenç Beltran","doi":"10.1016/j.future.2024.107698","DOIUrl":null,"url":null,"abstract":"<div><div>The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that overall remains oblivious to contraction mode, tensor splitting, and tensor order. Similarly, we propose a novel distributed HOPM, namely dHOPM<span><math><msub><mrow></mrow><mrow><mn>3</mn></mrow></msub></math></span>, that can save up to one order of magnitude of streamed memory and is about twice as costly in terms of data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multicore and accelerators confirm that the performances of dTVC and dHOPM<span><math><msub><mrow></mrow><mrow><mn>3</mn></mrow></msub></math></span> remain relatively close to the peak system memory bandwidth (50%–80%, depending on the architecture) and on par with STREAM benchmark figures. On strong scalability scenarios, our native multicore implementations of these two algorithms can achieve similar and sometimes even greater performance figures than those based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed precision arithmetic also in cases where the hardware does not support low precision data types natively.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"166 ","pages":"Article 107698"},"PeriodicalIF":6.2000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X24006629","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0

Abstract

The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that overall remains oblivious to contraction mode, tensor splitting, and tensor order. Similarly, we propose a novel distributed HOPM, namely dHOPM3, that can save up to one order of magnitude of streamed memory and is about twice as costly in terms of data movement as a distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multicore processors and accelerators confirm that the performance of dTVC and dHOPM3 remains relatively close to the peak system memory bandwidth (50%–80%, depending on the architecture) and on par with STREAM benchmark figures. In strong scalability scenarios, our native multicore implementations of these two algorithms can achieve performance figures similar to, and sometimes even greater than, those based upon state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed-precision arithmetic, even in cases where the hardware does not support low-precision data types natively.
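As a point of reference for the operations named in the abstract, the sketch below shows a plain, single-node mode-k tensor–vector contraction (Y = T ×_k v, i.e., summing the tensor over mode k against the vector) and a textbook higher-order power iteration built from chains of such contractions, written in NumPy. It is only an illustrative serial baseline under assumed function names (`tvc`, `hopm`); it does not reproduce the distributed, mode-oblivious dTVC or the task-parallel dHOPM3 algorithms proposed in the paper.

```python
import numpy as np

def tvc(tensor, vector, mode):
    """Mode-`mode` tensor-vector contraction (TVC): contracts `vector`
    against the chosen mode of `tensor`, returning a tensor of order N-1.
    Serial NumPy reference only, not the paper's distributed dTVC."""
    return np.tensordot(tensor, vector, axes=([mode], [0]))

def hopm(tensor, iterations=20, seed=0):
    """Naive higher-order power method (HOPM) for a cubical order-N tensor:
    for every mode k, contract the tensor with all factor vectors except the
    k-th one (a chain of TVCs) and normalize the result. Illustrative only;
    the paper's dHOPM3 reorganizes and distributes these contractions."""
    order, n = tensor.ndim, tensor.shape[0]
    rng = np.random.default_rng(seed)
    vectors = [rng.standard_normal(n) for _ in range(order)]
    vectors = [v / np.linalg.norm(v) for v in vectors]
    for _ in range(iterations):
        for k in range(order):
            partial = tensor
            # Contract modes from highest to lowest so the axis indices of
            # not-yet-contracted modes stay valid after each TVC.
            for m in reversed(range(order)):
                if m != k:
                    partial = tvc(partial, vectors[m], m)
            vectors[k] = partial / np.linalg.norm(partial)
    # Rank-1 weight (generalized singular value) for the converged vectors.
    sigma = tensor
    for m in reversed(range(order)):
        sigma = tvc(sigma, vectors[m], m)
    return float(sigma), vectors

# Example: best rank-1 approximation of a random third-order tensor.
T = np.random.default_rng(1).standard_normal((32, 32, 32))
sigma, (u, v, w) = hopm(T)
print(sigma)
```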
Source Journal
CiteScore: 19.90
Self-citation rate: 2.70%
Articles published: 376
Average review time: 10.6 months
About the Journal: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.
Latest Articles from This Journal
A Reversible Debugger for MPI Applications with Flexible Backends
MTD in Depth: Multi-phased Moving Target Defense Techniques against Cyber-Attacks based on Cyber Kill Chain
A Three-Stage Adaptive Task Scheduling Strategy for Load Balancing in Cloud Computing
MD-CGM: Malicious Traffic Detection Model Based on CycleGAN and Multi-Head Self-Attention Mechanism
FedAPE: Heterogeneous Federated Learning with Attention-guided Aggregation and Prototype Enhancement