BSArc: blacksmith streaming architecture for HPC accelerators

ACM International Conference on Computing Frontiers, Pub Date: 2012-05-15, DOI: 10.1145/2212908.2212914

M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé


Citations: 1

Abstract

The current trend in high-performance computing (HPC) systems is to deploy parallel computers equipped with general-purpose multi-core processors and, possibly, multi-core streaming accelerators. However, the performance of these multi-cores is often constrained by limited external bandwidth or by poorly matched data access patterns. The latter reduce the amount of useful data transferred per memory transaction. A change in the application algorithm can improve memory accesses, but a hardware support mechanism for application-specific data arrangement in the memory hierarchy can significantly boost performance in many application domains.

In this work, we present a conceptual computing architecture named BSArc (Blacksmith Streaming Architecture). BSArc introduces a forging front-end that efficiently distributes data to a large set of simple streaming processors in the back-end. We apply this concept to a SIMT execution model and present a design space exploration in the context of a GPU-like streaming architecture with a reconfigurable, application-specific front-end. The exploration is carried out on a streaming architectural simulator that models BSArc. We evaluate the performance advantages of the BSArc design against a standard L2 cache in a GPU-like device, using three application kernels: 2D-FFT, Matrix-Matrix Multiplication and 3D-Stencil. The results show that an application-specific arrangement of data for these kernels achieves an average speedup of 2.3× over a standard cache in a GPU-like streaming device.


