Large Scale Kronecker Product on Supercomputers

C. Tadonki
2011 Second Workshop on Architecture and Multi-Core Applications (WAMCA 2011)
Published: 2011-10-26
DOI: 10.1109/WAMCA.2011.10 (https://doi.org/10.1109/WAMCA.2011.10)
Citations: 3

Abstract

The Kronecker product, also called the tensor product, is a fundamental matrix algebra operation that is widely used as a natural formalism to express a convolution of many interactions or representations. Given a set of matrices, we need to multiply their Kronecker product by a vector. This operation is a critical kernel for iterative algorithms and therefore needs to be computed efficiently. In previous work, we proposed a cost-optimal parallel algorithm for this problem, in terms of both floating-point computation time and interprocessor communication steps. However, the lower bound on data transfers can only be achieved if the (local) broadcasts are truly logarithmic. In practice, each broadcast is implemented as a communication loop instead, so the real cost of each broadcast matters. Because this local broadcast is performed simultaneously by every processor, the situation gets worse on a large number of processors (supercomputers). We address the problem in this paper in two ways. On the one hand, we propose a way to build a virtual topology that has the smallest gap to the theoretical lower bound. On the other hand, we consider a hybrid implementation, which has the advantage of reducing the number of communicating nodes. We illustrate our work with benchmarks on a large SMP 8-core supercomputer.
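
To make the kernel concrete, the following is a minimal sequential sketch of multiplying a Kronecker product of matrices by a vector without ever forming the product explicitly. It is not the parallel algorithm of the paper; it only illustrates the operation being parallelized. It assumes square factors stored as lists of rows, and the function names (`kron`, `matvec`, `kron_matvec`) are hypothetical, chosen for this illustration. It relies on the standard identity that the Kronecker product of N factors equals the product of the terms I ⊗ A_i ⊗ I, so each factor can be applied along its own index in turn.

```python
def kron(a, b):
    """Explicit Kronecker product of two dense matrices (lists of rows)."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]


def matvec(m, v):
    """Dense matrix-vector product."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def kron_matvec(factors, x):
    """Compute (A_1 (x) ... (x) A_N) x without building the big matrix.

    Assumption: every factor is square. Factor i is applied along its own
    index, i.e. we multiply by (I_left (x) A_i (x) I_right) for each i.
    """
    sizes = [len(a) for a in factors]
    total = 1
    for n in sizes:
        total *= n
    if len(x) != total:
        raise ValueError("vector length must equal the product of factor sizes")

    y = list(x)
    left, right = 1, total
    for a, n in zip(factors, sizes):
        right //= n                      # product of the sizes after factor i
        out = [0] * total
        for l in range(left):            # index over the factors before i
            for r in range(right):       # index over the factors after i
                base = l * n * right + r
                # Gather the n entries that factor i acts on (stride = right).
                v = [y[base + j * right] for j in range(n)]
                w = matvec(a, v)
                for j in range(n):
                    out[base + j * right] = w[j]
        y = out
        left *= n                        # product of the sizes up to factor i
    return y


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[0, 1], [1, 0]]
    print(kron_matvec([A, B], [1, 2, 3, 4]))  # -> [10, 7, 22, 15]
```

With factor sizes n_1, ..., n_N, this sketch performs on the order of (n_1 n_2 ... n_N)(n_1 + ... + n_N) multiplications, whereas forming the full product first would require (n_1 n_2 ... n_N)^2. The strided gather in the inner loop is the access pattern that, in a distributed setting, turns into the local broadcasts discussed in the abstract.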