3.2 The A100 Datacenter GPU and Ampere Architecture

Jack Choquette, Ming-Ju Edward Lee, R. Krashinsky, V. Balan, Brucek Khailany
Published in: 2021 IEEE International Solid-State Circuits Conference (ISSCC)
Publication date: 2021-02-13
DOI: 10.1109/ISSCC42613.2021.9365803 (https://doi.org/10.1109/ISSCC42613.2021.9365803)
Citations: 19

Abstract

The diversity of compute-intensive applications in modern cloud data centers has driven the explosion of GPU-accelerated cloud computing. Such applications include AI deep learning training and inference, data analytics, scientific computing, genomics, edge video analytics and 5G services, graphics rendering, and cloud gaming. The A100 GPU introduces several features targeting these workloads: a 3rd-generation Tensor Core with support for fine-grained sparsity, new BFloat16 (BF16), TensorFloat-32 (TF32), and FP64 datatypes, scale-out support with multi-instance GPU (MIG) virtualization, and scale-up support with a 3rd-generation 50Gbps NVLink I/O interface (NVLink3) and NVSwitch inter-GPU communication. As shown in Fig. 3.2.1, A100 contains 108 Streaming Multiprocessors (SMs) and 6912 CUDA cores. The SMs are fed by a 40MB L2 cache and 1.56TB/s of HBM2 memory bandwidth (BW). At 1.41GHz, A100 provides an effective peak of 1248TOPS (8b integers), 624TFLOPS (FP16), and 312TFLOPS (TF32) when including sparsity optimizations. Implemented in a TSMC 7nm N7 process, the A100 die (Fig. 3.2.7) contains 54B transistors and measures 826mm².
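The headline figures in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and is not taken from the paper. It relies on two assumptions the abstract does not state: that the sparsity optimization doubles Tensor Core throughput (so the dense peak is half the quoted figure), and that each CUDA core retires one fused multiply-add (2 FLOPs) per clock. All function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
NUM_SMS = 108
CUDA_CORES = 6912
CLOCK_HZ = 1.41e9
HBM2_BW_BYTES_PER_S = 1.56e12
# Effective peak throughput including sparsity, in operations per second.
SPARSE_PEAK = {"INT8": 1248e12, "FP16": 624e12, "TF32": 312e12}

# Assumption (not in the abstract): sparsity doubles Tensor Core throughput.
SPARSITY_SPEEDUP = 2.0
# Assumption (not in the abstract): one FMA = 2 FLOPs per CUDA core per clock.
FLOPS_PER_CORE_PER_CLOCK = 2


def cores_per_sm():
    """CUDA cores in each Streaming Multiprocessor."""
    return CUDA_CORES // NUM_SMS


def dense_peak(dtype):
    """Peak throughput without sparsity, under the assumed 2x speedup."""
    return SPARSE_PEAK[dtype] / SPARSITY_SPEEDUP


def ops_per_sm_per_clock(dtype):
    """Dense Tensor Core operations each SM must sustain per clock cycle."""
    return dense_peak(dtype) / (NUM_SMS * CLOCK_HZ)


def fp32_cuda_core_peak():
    """FP32 peak of the CUDA cores alone, under the one-FMA-per-clock assumption."""
    return CUDA_CORES * FLOPS_PER_CORE_PER_CLOCK * CLOCK_HZ


def flops_per_byte(dtype):
    """Dense operations available per byte of HBM2 traffic (roofline ridge point)."""
    return dense_peak(dtype) / HBM2_BW_BYTES_PER_S


if __name__ == "__main__":
    print("CUDA cores per SM:", cores_per_sm())
    for dtype in SPARSE_PEAK:
        print(f"{dtype}: dense peak {dense_peak(dtype) / 1e12:.0f} T(FL)OPS, "
              f"{ops_per_sm_per_clock(dtype):.0f} ops/SM/clock, "
              f"{flops_per_byte(dtype):.0f} ops per HBM2 byte")
    print(f"FP32 CUDA-core peak: {fp32_cuda_core_peak() / 1e12:.2f} TFLOPS")
```

Under these assumptions the quoted numbers are mutually consistent: 6912 cores over 108 SMs gives 64 cores per SM, and the dense FP16 figure works out to roughly 2048 operations per SM per clock. The ops-per-byte ratio gives a rough sense of how much data reuse a kernel needs before it stops being limited by HBM2 bandwidth.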