Architecting Waferscale Processors - A GPU Case Study

2019 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2019-02-01 DOI:10.1109/HPCA.2019.00042

Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar

{"title":"Architecting Waferscale Processors - A GPU Case Study","authors":"Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, S. Iyer, Rakesh Kumar","doi":"10.1109/HPCA.2019.00042","DOIUrl":null,"url":null,"abstract":"Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have been historically deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded on to a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscalar architectures need to be revisited. In this paper, we study if it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPM), only a much scaled down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against equivalent MCM-GPU based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for 24 GPM and 40 GPM cases respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture. Keywords—Waferscale Processors, GPU, Silicon Interconnect Fabric","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"14 5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2019.00042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 26

Abstract

Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have been historically deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded on to a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscalar architectures need to be revisited. In this paper, we study if it is feasible and useful to build today’s architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPM), only a much scaled down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against equivalent MCM-GPU based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for 24 GPM and 40 GPM cases respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture. Keywords—Waferscale Processors, GPU, Silicon Interconnect Fabric

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

架构晶圆级处理器- GPU案例研究

不断增加的通信开销已经威胁到计算机系统的扩展。大幅度降低通信开销的一种方法是晶圆级处理。然而，由于传统集成技术固有的良率问题[1]，[4]，晶圆级处理器[1]，[2]，[3]一直被认为是不切实际的。新兴的集成技术，如硅互连结构(Si-IF)[5]，[6]，[7]，其中预先制造的模具直接粘合在硅片上，可以使人们建立一个晶圆级系统，而没有相应的良率问题。因此，需要重新审视晶圆标量架构。在本文中，我们研究了在晶圆规模上构建今天的架构是否可行和有用。使用晶圆级GPU作为案例研究，我们表明，虽然300毫米晶圆可以容纳大约100个GPU模块(GPM)，但当考虑到物理问题时，只能构建具有大约40个GPM的大幅缩小的GPU架构。我们还研究了晶圆级架构的性能和能源影响。我们表明，在不改变编程模型的情况下，晶圆级gpu可以提供显着的性能和能效优势(与基于PCB的等效MCM-GPU实现相比，高达18.9倍的加速和143倍的EDP优势)。我们还为晶圆级GPU架构开发线程调度和数据放置策略。在24 GPM和40 GPM情况下，我们的策略分别比最先进的调度和数据放置策略高出2.88倍(平均1.4倍)和1.62倍(平均1.11倍)。最后，我们建立了第一个Si-IF原型与互连的模具。在我们的原型中，我们观察到100%的内部芯片互连成功连接。再加上之前报道的Si-IF上的高成品率，这表明构建晶圆级GPU架构的技术准备就绪。关键词:晶圆级处理器，GPU，硅互连结构

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)

自引率

0.00%

发文量

期刊最新文献

Machine Learning at Facebook: Understanding Inference at the Edge Understanding the Future of Energy Efficiency in Multi-Module GPUs POWERT Channels: A Novel Class of Covert CommunicationExploiting Power Management Vulnerabilities The Accelerator Wall: Limits of Chip Specialization Featherlight Reuse-Distance Measurement