Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture

Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
{"title":"Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture","authors":"Mohammed Elbtity, Peyton Chandarana, Ramtin Zand","doi":"arxiv-2407.08700","DOIUrl":null,"url":null,"abstract":"Tensor processing units (TPUs) are one of the most well-known machine\nlearning (ML) accelerators utilized at large scale in data centers as well as\nin tiny ML applications. TPUs offer several improvements and advantages over\nconventional ML accelerators, like graphical processing units (GPUs), being\ndesigned specifically to perform the multiply-accumulate (MAC) operations\nrequired in the matrix-matrix and matrix-vector multiplies extensively present\nthroughout the execution of deep neural networks (DNNs). Such improvements\ninclude maximizing data reuse and minimizing data transfer by leveraging the\ntemporal dataflow paradigms provided by the systolic array architecture. While\nthis design provides a significant performance benefit, the current\nimplementations are restricted to a single dataflow consisting of either input,\noutput, or weight stationary architectures. This can limit the achievable\nperformance of DNN inference and reduce the utilization of compute units.\nTherefore, the work herein consists of developing a reconfigurable dataflow\nTPU, called the Flex-TPU, which can dynamically change the dataflow per layer\nduring run-time. Our experiments thoroughly test the viability of the Flex-TPU\ncomparing it to conventional TPU designs across multiple well-known ML\nworkloads. The results show that our Flex-TPU design achieves a significant\nperformance increase of up to 2.75x compared to conventional TPU, with only\nminor area and power overheads.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"157 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.08700","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphical processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, the current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to conventional TPU, with only minor area and power overheads.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Flex-TPU:具有运行时可重构数据流架构的灵活 TPU
张量处理单元(TPU)是最著名的机器学习(ML)加速器之一,在数据中心和小型 ML 应用中得到了大规模应用。与图形处理器(GPU)等传统 ML 加速器相比,TPU 具有多项改进和优势,其设计专门用于执行深度神经网络(DNN)执行过程中广泛存在的矩阵-矩阵和矩阵-矢量乘法所需的乘积(MAC)运算。这种改进包括通过利用系统阵列架构提供的时态数据流范例,最大限度地提高数据重用率,并最大限度地减少数据传输。虽然这种设计具有显著的性能优势,但目前的实现方式仅限于由输入、输出或权重固定架构组成的单一数据流。因此,本文的工作包括开发一种名为 Flex-TPU 的可重新配置数据流处理单元,它可以在运行时动态改变每层的数据流。我们的实验对 Flex-TPU 的可行性进行了全面测试,并将其与传统的 TPU 设计在多个著名的 ML 工作负载中进行了比较。结果表明,与传统 TPU 相比,我们的 Flex-TPU 设计实现了高达 2.75 倍的性能大幅提升,而面积和功耗开销却很小。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms Temporal Load Imbalance on Ondes3D Seismic Simulator for Different Multicore Architectures Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study The Landscape of GPU-Centric Communication A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1