FPGA implementation and performance evaluation of a simultaneous multithreaded matrix processor

M. Soliman, E. Elsayed
Published in: 2014 9th International Conference on Computer Engineering & Systems (ICCES), 2014-12-01
DOI: 10.1109/ICCES.2014.7030959 (https://doi.org/10.1109/ICCES.2014.7030959)
Citations: 0

Abstract

This paper proposes a simultaneous multithreaded matrix processor called SMMP to improve the performance of data-parallel applications by exploiting instruction-level, data-level, and thread-level parallelism (ILP, DLP, and TLP). In SMMP, the well-known 5-stage pipeline (the baseline scalar processor) is extended to execute multi-scalar/vector/matrix instructions on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle or four vector/matrix operations from one thread, where the vector/matrix instructions of the threads are executed in round-robin fashion. Moreover, this paper presents the implementation of our proposed SMMP in VHDL targeting a Virtex-6 FPGA. In addition, the performance of SMMP is evaluated on some kernels from the basic linear algebra subprograms (BLAS). Our results show that the hardware complexity of SMMP is 5.68 times that of the baseline scalar processor. However, speedups of 4.9, 6.09, 6.98, 8.2, 8.25, 8.72, 9.36, 11.84, and 21.57 are achieved on the BLAS kernels for applying a Givens rotation, scalar times vector plus another vector, vector addition, vector scaling, setting up a Givens rotation, dot product, matrix-vector multiplication, Euclidean length, and matrix-matrix multiplication, respectively. In conclusion, the average speedup over the baseline is 9.55, and the ratio of average speedup to hardware complexity is 1.68.
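The two summary figures at the end of the abstract can be checked from the per-kernel numbers it reports. The sketch below assumes the average is a plain arithmetic mean and that "speedup over complexity" means the average speedup divided by the 5.68 complexity factor; the abstract states neither explicitly, and the names `SPEEDUPS`, `HARDWARE_COMPLEXITY`, and `summarize` are illustrative only.

```python
# Per-kernel speedups over the baseline scalar processor, in the order
# the abstract lists them. Kernel labels follow the abstract's wording.
SPEEDUPS = {
    "applying Givens rotation": 4.9,
    "scalar times vector plus another": 6.09,
    "vector addition": 6.98,
    "vector scaling": 8.2,
    "setting up Givens rotation": 8.25,
    "dot-product": 8.72,
    "matrix-vector multiplication": 9.36,
    "Euclidean length": 11.84,
    "matrix-matrix multiplication": 21.57,
}

# Hardware complexity of SMMP relative to the baseline, from the abstract.
HARDWARE_COMPLEXITY = 5.68


def summarize(speedups, complexity):
    """Return (arithmetic-mean speedup, mean speedup per unit of complexity).

    Assumption: the abstract's averages are arithmetic means.
    """
    mean_speedup = sum(speedups.values()) / len(speedups)
    return mean_speedup, mean_speedup / complexity


mean_speedup, speedup_per_complexity = summarize(SPEEDUPS, HARDWARE_COMPLEXITY)
print(f"average speedup: {mean_speedup:.2f}")
print(f"speedup / complexity: {speedup_per_complexity:.2f}")
```

Under these assumptions the results round to 9.55 and 1.68, which agrees with the abstract. Matrix-matrix multiplication (21.57) is the only kernel whose speedup exceeds twice the mean, so it pulls the arithmetic mean up noticeably.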