Distributed Matrix Computations With Low-Weight Encodings

IEEE Journal on Selected Areas in Information Theory, vol. 4, pp. 363-378. Pub Date: 2023-08-30. DOI: 10.1109/JSAIT.2023.3308768. URL: https://ieeexplore.ieee.org/document/10234626/

Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton


Citations: 3

Abstract

Straggler nodes are a well-known bottleneck in distributed matrix computations: they reduce computation and communication speeds. A common strategy for mitigating stragglers is to incorporate Reed-Solomon based MDS (maximum distance separable) codes into the framework, which achieves resilience against an optimal number of stragglers. However, these codes assign dense linear combinations of submatrices to the worker nodes. When the input matrices are sparse, such approaches increase the number of non-zero entries in the encoded matrices, which in turn increases the worker computation time. In this work, we develop a distributed matrix computation approach in which each assigned encoded submatrix is a random linear combination of a small number of submatrices. In addition to being well suited to sparse input matrices, our approach retains optimal straggler resilience in a certain range of problem parameters. Moreover, compared to recent sparse matrix computation approaches, the search for a "good" set of random coefficients that promotes numerical stability is much more computationally efficient in our method. We show that our approach can efficiently utilize partial computations done by slower worker nodes in a heterogeneous system, which can enhance the overall computation speed. Numerical experiments conducted on Amazon Web Services (AWS) demonstrate up to a 30% reduction in per-worker-node computation time and $100\times$ faster encoding compared to the available methods.
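To make the sparsity argument concrete, the following is a minimal sketch of low-weight encoding, not the authors' actual scheme. It assumes a matrix split into `k` block-columns, and it assumes that worker `i` combines the `weight` cyclically consecutive blocks starting at block `i`; that assignment rule, the helper names, and all parameter values are hypothetical choices made for illustration. It compares the number of non-zero entries a worker receives when `weight` is small against the dense case `weight = k`.

```python
import random


def random_sparse_matrix(rows, cols, density, rng):
    """Return a rows x cols list-of-lists matrix with roughly `density` non-zeros."""
    return [[rng.uniform(1.0, 2.0) if rng.random() < density else 0.0
             for _ in range(cols)] for _ in range(rows)]


def split_block_columns(matrix, k):
    """Split a matrix into k block-columns of equal width."""
    width = len(matrix[0]) // k
    return [[row[j * width:(j + 1) * width] for row in matrix] for j in range(k)]


def encode(blocks, n_workers, weight, rng):
    """Give each worker a random linear combination of `weight` blocks.

    Assumption (illustrative only): worker i combines blocks
    i, i+1, ..., i+weight-1 (mod k). Requires weight <= k.
    """
    k = len(blocks)
    rows, cols = len(blocks[0]), len(blocks[0][0])
    coeffs, encoded = [], []
    for i in range(n_workers):
        support = [(i + t) % k for t in range(weight)]
        coeff_row = [0.0] * k
        for j in support:
            coeff_row[j] = rng.uniform(0.5, 1.5)  # non-zero random coefficient
        coeffs.append(coeff_row)
        encoded.append([[sum(coeff_row[j] * blocks[j][r][c] for j in support)
                         for c in range(cols)] for r in range(rows)])
    return coeffs, encoded


def nnz(matrix):
    """Count the non-zero entries of a list-of-lists matrix."""
    return sum(1 for row in matrix for v in row if v != 0.0)


rng = random.Random(0)
A = random_sparse_matrix(rows=40, cols=60, density=0.02, rng=rng)
blocks = split_block_columns(A, k=6)

coeffs_low, low = encode(blocks, n_workers=8, weight=2, rng=rng)      # low-weight
coeffs_dense, dense = encode(blocks, n_workers=8, weight=6, rng=rng)  # dense

avg_low = sum(nnz(m) for m in low) / len(low)
avg_dense = sum(nnz(m) for m in dense) / len(dense)
print(f"average non-zeros per worker: weight 2 -> {avg_low:.1f}, "
      f"weight 6 -> {avg_dense:.1f}")
```

Because every non-zero entry and every coefficient in this sketch is positive, no cancellation occurs, so each encoded block's non-zero pattern is the union of the patterns of the blocks it combines. The sketch covers encoding only; it does not decode or check straggler resilience, which the abstract says holds in a certain range of problem parameters.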


