BSArc: blacksmith streaming architecture for HPC accelerators

M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé
{"title":"BSArc: blacksmith streaming architecture for HPC accelerators","authors":"M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé","doi":"10.1145/2212908.2212914","DOIUrl":null,"url":null,"abstract":"The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general purpose multi-core processors and possibly multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by the limited external bandwidth or by badly matching data access patterns. The latter reduces the size of useful data during memory transactions. A change in the application algorithm can improve the memory accesses but a hardware support mechanism for an application specific data arrangement in the memory hierarchy can significantly boost the performance for many application domains.\n In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable application specific front-end. These design space explorations are carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages for the BSArc design against a standard L2 cache in a GPU-like device. In our evaluations we use three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that employing an application specific arrangement of data on these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.","PeriodicalId":430420,"journal":{"name":"ACM International Conference on Computing Frontiers","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM International Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2212908.2212914","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The current trend in high performance computing (HPC) systems is to deploy parallel computers equipped with general purpose multi-core processors and possibly multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by the limited external bandwidth or by badly matching data access patterns. The latter reduces the size of useful data during memory transactions. A change in the application algorithm can improve the memory accesses but a hardware support mechanism for an application specific data arrangement in the memory hierarchy can significantly boost the performance for many application domains. In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end to efficiently distribute data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable application specific front-end. These design space explorations are carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages for the BSArc design against a standard L2 cache in a GPU-like device. In our evaluations we use three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that employing an application specific arrangement of data on these kernels achieves an average speedup of 2.3× compared to a standard cache for a GPU-like streaming device.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
BSArc: HPC加速器的铁匠流架构
当前高性能计算(HPC)系统的趋势是部署配备通用多核处理器和可能的多核流加速器的并行计算机。然而,这些多核的性能经常受到有限的外部带宽或不匹配的数据访问模式的限制。后者减少了内存事务期间有用数据的大小。应用程序算法的更改可以改善内存访问,但是在内存层次结构中为应用程序特定的数据安排提供硬件支持机制可以显著提高许多应用程序域的性能。在这项工作中,我们提出了一个名为BSArc (Blacksmith Streaming architecture)的概念计算架构。BSArc引入了一个锻造前端,以有效地将数据分发到后端大量简单的流处理器。我们将此概念应用于SIMT执行模型,并在具有可重构应用程序特定前端的类gpu流架构的上下文中提出了设计空间探索。这些设计空间探索是在模拟BSArc的流架构模拟器上进行的。我们针对类似gpu的设备中的标准L2缓存评估了BSArc设计的性能优势。在我们的评估中,我们使用三个应用程序内核:2D-FFT,矩阵-矩阵乘法和3D-Stencil。结果表明,与类似gpu的流媒体设备的标准缓存相比,在这些内核上使用特定于应用程序的数据安排可以实现2.3倍的平均加速。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Strategies for improving performance and energy efficiency on a many-core Cost-effective soft-error protection for SRAM-based structures in GPGPUs Kinship: efficient resource management for performance and functionally asymmetric platforms An algorithm for parallel calculation of trigonometric functions DCNSim: a unified and cross-layer computer architecture simulation framework for data center network research
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1