The Mozart reuse exposed dataflow processor for AI and beyond: industrial product

Proceedings of the 49th Annual International Symposium on Computer Architecture. Pub Date: 2022-06-11. DOI: 10.1145/3470496.3533040

K. Sankaralingam, Tony Nowatzki, Vinay Gangadhar, Preyas Shah, Michael Davies, Will Galliher, Ziliang Guo, Jitu Khare, Deepa Vijay, Poly Palamuttam, Maghawan Punde, A. Tan, Vijayraghavan Thiruvengadam, Rongyi Wang, Shunmiao Xu


Citations: 5

Abstract

In this paper we introduce the Mozart Processor, which implements a new processing paradigm called Reuse Exposed Dataflow (RED). RED is a counterpart to the existing execution models of von Neumann, SIMT, dataflow, and FPGA. Dataflow and data reuse are the fundamental architecture primitives in RED, implemented with mechanisms for inter-worker communication and synchronization. The paper defines the processor architecture, the details of the microarchitecture, chip implementation, software stack development, and performance results. The architecture's goal is to achieve near-CPU-like flexibility with ASIC-like efficiency for a large class of data-intensive workloads. An additional goal was software maturity: to cover a large range of applications immediately, avoiding the need for a long-drawn-out, hand-tuned software development phase. The architecture was defined with this software maturity and compiler friendliness in mind. In short, the goal was to do to GPUs what GPUs did to CPUs, i.e., be a better solution for a large range of workloads while preserving flexibility and programmability. The chip was implemented with HBM and PCIe interfaces and taken to production on a 16nm TSMC FFC process. For ML inference tasks with batch-size=4, Mozart is integer factors better than state-of-the-art GPUs even while being nearly 2 technology nodes behind. We conclude with a set of lessons learned, the unique challenges of a clean-slate architecture in a commercial setting, and pointers for uncovered research problems.



