稀疏矩阵向量乘法的性能优化与边界

R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, Benjamin C. Lee
{"title":"稀疏矩阵向量乘法的性能优化与边界","authors":"R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, Benjamin C. Lee","doi":"10.1109/SC.2002.10025","DOIUrl":null,"url":null,"abstract":"We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AT A by a vector).","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"152","resultStr":"{\"title\":\"Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply\",\"authors\":\"R. Vuduc, J. Demmel, K. Yelick, S. Kamil, R. Nishtala, Benjamin C. Lee\",\"doi\":\"10.1109/SC.2002.10025\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AT A by a vector).\",\"PeriodicalId\":302800,\"journal\":{\"name\":\"ACM/IEEE SC 2002 Conference (SC'02)\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2002-11-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"152\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM/IEEE SC 2002 Conference (SC'02)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC.2002.10025\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IEEE SC 2002 Conference (SC'02)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.2002.10025","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 152

摘要

我们考虑通过代码和数据结构重组的稀疏矩阵向量乘法(SpM×V)的性能调优,这是科学应用中最重要的计算内核之一。本文讨论的基本问题是这种性能调优存在哪些限制,以及调优代码如何接近这些限制。具体来说,我们在使用之前提出的寄存器阻塞优化进行调优时,开发了SpM×V性能(Mflop/s)的上限和下限。这些边界基于矩阵中的非零模式和基本内存操作的成本,例如缓存命中和未命中。我们使用4个不同平台上的硬件计数器数据和44个稀疏矩阵的测试集来评估我们针对这些边界的优化实现。我们发现,我们经常可以得到上界的20%以内,特别是在一类从有限元建模(FEM)问题的矩阵;在非fem矩阵上,性能提高2倍仍然是可能的。最后,我们提出了一种新的启发式算法,它比以前的启发式算法更准确地选择最优或接近最优的寄存器块大小(关键调优参数)。使用新的启发式算法,我们发现SpM×V性能(Mflop/s)比未调优的实现提高了2.5倍。总的来说,我们的结果表明,未来的性能改进,除了我们已经为SpM×V展示的那些,将来自两个来源:(1)考虑更高级别的矩阵结构(例如利用对称性,矩阵重排序,多寄存器块大小),以及(2)优化内核,提供更多的数据重用机会(例如稀疏矩阵-多向量乘法,AT A乘以向量)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply
We consider performance tuning, by code and data structure reorganization, of sparse matrix-vector multiply (SpM×V), one of the most important computational kernels in scientific applications. This paper addresses the fundamental questions of what limits exist on such performance tuning, and how closely tuned code approaches these limits. Specifically, we develop upper and lower bounds on the performance (Mflop/s) of SpM×V when tuned using our previously proposed register blocking optimization. These bounds are based on the non-zero pattern in the matrix and the cost of basic memory operations, such as cache hits and misses. We evaluate our tuned implementations with respect to these bounds using hardware counter data on 4 different platforms and on test set of 44 sparse matrices. We find that we can often get within 20% of the upper bound, particularly on class of matrices from finite element modeling (FEM) problems; on non-FEM matrices, performance improvements of 2× are still possible. Lastly, we present new heuristic that selects optimal or near-optimal register block sizes (the key tuning parameters) more accurately than our previous heuristic. Using the new heuristic, we show improvements in SpM×V performance (Mflop/s) by as much as 2.5× over an untuned implementation. Collectively, our results suggest that future performance improvements, beyond those that we have already demonstrated for SpM×V, will come from two sources: (1) consideration of higher-level matrix structures (e.g. exploiting symmetry, matrix reordering, multiple register block sizes), and (2) optimizing kernels with more opportunity for data reuse (e.g. sparse matrix-multiple vector multiply, multiplication of AT A by a vector).
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Interoperable Web Services for Computational Portals Advanced Visualization Technology for Terascale Particle Accelerator Simulations Library Support for Hierarchical Multi-Processor Tasks Utilization of Departmental Computing GRID System for Development of an Artificial Intelligent Tapping Inspection Method, Tapping Sound Analysis 16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1