$O(N)$ distributed direct factorization of structured dense matrices using runtime systems

Sameer Deshmukh, Qinxiang Ma, Rio Yokota, George Bosilca
Journal: arXiv - CS - Mathematical Software
DOI: arxiv-2311.00921
Published: 2023-11-02
Citations: 0

Abstract

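Structured dense matrices arise from boundary integral problems in electrostatics and geostatistics, and also as Schur complements in sparse preconditioners such as multifrontal methods. Exploiting the structure of such matrices can reduce the time for dense direct factorization from $O(N^3)$ to $O(N)$. The Hierarchically Semi-Separable (HSS) matrix is one such low-rank matrix format; it can be factorized using a Cholesky-like algorithm called ULV factorization. The HSS-ULV algorithm is highly parallel because it removes the dependency on trailing sub-matrices at each HSS level. However, the key merge step that links two successive HSS levels remains a challenge for efficient parallelization. In this paper, we combine the HSS-ULV algorithm with the asynchronous runtime system PaRSEC. We compare our work with STRUMPACK and LORAPO, both state-of-the-art implementations of dense direct low-rank factorization, and achieve up to 2x faster factorization for matrices arising from a diverse set of applications on up to 128 nodes of Fugaku, with similar or better accuracy for all the problems that we survey.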
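The cost reduction described above relies on the off-diagonal blocks of such matrices being numerically low rank. The following is a minimal sketch, not taken from the paper, that checks this property on a hypothetical example: an exponential covariance kernel $\exp(-|x_i - x_j|/\ell)$ on sorted 1D points, a simple model used in geostatistics. For sorted points this kernel factors as $\exp(x_i/\ell)\exp(-x_j/\ell)$ above the diagonal, so every off-diagonal block in a recursive bisection is exactly rank 1. The helper names (`exponential_covariance`, `numerical_rank`, `off_diagonal_ranks`), the tolerance, and the leaf size are illustrative assumptions. The code only inspects the hierarchical partition. It does not build the nested bases of a true HSS matrix or perform the ULV factorization.

```python
import numpy as np


def exponential_covariance(x: np.ndarray, length_scale: float) -> np.ndarray:
    """Dense matrix K[i, j] = exp(-|x_i - x_j| / length_scale).

    Hypothetical stand-in for a geostatistics covariance matrix; it is not
    one of the paper's test matrices.
    """
    return np.exp(-np.abs(x[:, None] - x[None, :]) / length_scale)


def numerical_rank(block: np.ndarray, rel_tol: float = 1e-10) -> int:
    """Count singular values larger than rel_tol times the largest one."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))


def off_diagonal_ranks(K: np.ndarray, leaf_size: int) -> list[tuple[int, int, int]]:
    """Recursively bisect K and record (level, block_size, rank) for the
    upper off-diagonal block at each node, mimicking the binary partition
    used by HSS-type formats (without nested bases)."""
    ranks: list[tuple[int, int, int]] = []

    def recurse(lo: int, hi: int, level: int) -> None:
        n = hi - lo
        if n <= leaf_size:
            return  # leaf: its diagonal block would be stored densely
        mid = lo + n // 2
        # Block coupling the left child's rows to the right child's columns.
        ranks.append((level, mid - lo, numerical_rank(K[lo:mid, mid:hi])))
        recurse(lo, mid, level + 1)
        recurse(mid, hi, level + 1)

    recurse(0, K.shape[0], 0)
    return ranks


if __name__ == "__main__":
    x = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, 512))
    K = exponential_covariance(x, length_scale=0.1)
    for level, size, rank in off_diagonal_ranks(K, leaf_size=32):
        print(f"level {level}: {size}x{size} block, numerical rank {rank}")
```

With 512 points and a leaf size of 32, the bisection visits 15 off-diagonal blocks (sizes 256 down to 32), and each reports rank 1. Kernels from boundary integral problems, such as $1/r$ in electrostatics, generally do not give exactly rank-1 blocks; their off-diagonal ranks depend on the geometry and the chosen tolerance.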