When implementations of the Transformer's self-attention layer use SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively uses this large SRAM by combining matrix multiplication, attention-score scaling, and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime when computing attention weights from queries and keys on Grayskull. The dedicated Softmax kernel is up to $10\times$ faster than the CPU implementation, and the Softmax implementation inside the fused kernel is approximately $1.8\times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30\times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5\times$ more SRAM.
{"title":"Attention in SRAM on Tenstorrent Grayskull","authors":"Moritz Thüning","doi":"arxiv-2407.13885","DOIUrl":"https://doi.org/arxiv-2407.13885","url":null,"abstract":"When implementations of the Transformer's self-attention layer utilize SRAM\u0000instead of DRAM, they can achieve significant speedups. The Tenstorrent\u0000Grayskull architecture provides a large SRAM, distributed across a grid of\u0000cores. This work presents a fused kernel for Grayskull, that exclusively\u0000utilizes its large SRAM by combining matrix multiplication, attention score\u0000scaling and Softmax operations. Additionally, a dedicated Softmax kernel\u0000utilizing the SRAM and a CPU implementation serving as a baseline are\u0000presented. The Softmax operation consumes most of the runtime in the\u0000computation of attention weights from queries and keys on Grayskull. The\u0000speedup of the dedicated Softmax kernel compared to the CPU implementation is\u0000up to $10 times$, and the Softmax implementation inside the fused kernel is\u0000approximately $1.8 times$ faster than the dedicated Softmax kernel. The time\u0000and memory complexity of all implementations is quadratic in sequence length.\u0000Currently, the Grayskull e150 is approximately $30 times$ cheaper for the\u0000general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers\u0000approximately $1.5 times$ more SRAM.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Qiang Wang, Laiyi Li, Weile Luo, Yijia Zhang, Bingqiang Wang
Increased reliance on graphics processing units (GPUs) for high-intensity computing tasks raises challenges regarding energy consumption. To address this issue, dynamic voltage and frequency scaling (DVFS) has emerged as a promising technique for conserving energy while maintaining the quality of service (QoS) of GPU applications. However, existing DVFS solutions are hindered by inefficiency or inaccuracy because they rely on either dynamic or static information alone, which prevents them from being adopted in practical power management schemes. To this end, we propose DSO, a novel, lightweight energy efficiency optimizer that leverages both dynamic and static information to model and optimize GPU energy efficiency. DSO first introduces a theoretical energy efficiency model that reflects the DVFS roofline phenomenon and considers the trade-off between performance and energy. It then applies machine learning techniques to predict the parameters of this model from both GPU kernel runtime metrics and static code features. Experiments on modern DVFS-enabled GPUs indicate that DSO can enhance energy efficiency by 19% while keeping performance within a 5% loss margin.
{"title":"DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information","authors":"Qiang Wang, Laiyi Li, Weile Luo, Yijia Zhang, Bingqiang Wang","doi":"arxiv-2407.13096","DOIUrl":"https://doi.org/arxiv-2407.13096","url":null,"abstract":"Increased reliance on graphics processing units (GPUs) for high-intensity\u0000computing tasks raises challenges regarding energy consumption. To address this\u0000issue, dynamic voltage and frequency scaling (DVFS) has emerged as a promising\u0000technique for conserving energy while maintaining the quality of service (QoS)\u0000of GPU applications. However, existing solutions using DVFS are hindered by\u0000inefficiency or inaccuracy as they depend either on dynamic or static\u0000information respectively, which prevents them from being adopted to practical\u0000power management schemes. To this end, we propose a novel energy efficiency\u0000optimizer, called DSO, to explore a light weight solution that leverages both\u0000dynamic and static information to model and optimize the GPU energy efficiency.\u0000DSO firstly proposes a novel theoretical energy efficiency model which reflects\u0000the DVFS roofline phenomenon and considers the tradeoff between performance and\u0000energy. Then it applies machine learning techniques to predict the parameters\u0000of the above model with both GPU kernel runtime metrics and static code\u0000features. Experiments on modern DVFS-enabled GPUs indicate that DSO can enhance\u0000energy efficiency by 19% whilst maintaining performance within a 5% loss\u0000margin.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. However, FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to run 2-6 orders of magnitude slower than their unencrypted counterparts. To mitigate this overhead, we propose Cheddar, an FHE library for CUDA GPUs that demonstrates significantly faster performance than prior GPU implementations. We develop optimized functionalities at various implementation levels, ranging from efficient low-level primitives to streamlined high-level operational sequences. In particular, we improve major FHE operations, including the number-theoretic transform and base conversion, with efficient kernel designs that use a small 32-bit word size. By these means, Cheddar achieves 2.9 to 25.6 times higher performance on representative FHE workloads compared to prior GPU implementations.
{"title":"Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs","authors":"Jongmin Kim, Wonseok Choi, Jung Ho Ahn","doi":"arxiv-2407.13055","DOIUrl":"https://doi.org/arxiv-2407.13055","url":null,"abstract":"Fully homomorphic encryption (FHE) is a cryptographic technology capable of\u0000resolving security and privacy problems in cloud computing by encrypting data\u0000in use. However, FHE introduces tremendous computational overhead for\u0000processing encrypted data, causing FHE workloads to become 2-6 orders of\u0000magnitude slower than their unencrypted counterparts. To mitigate the overhead,\u0000we propose Cheddar, an FHE library for CUDA GPUs, which demonstrates\u0000significantly faster performance compared to prior GPU implementations. We\u0000develop optimized functionalities at various implementation levels ranging from\u0000efficient low-level primitives to streamlined high-level operational sequences.\u0000Especially, we improve major FHE operations, including number-theoretic\u0000transform and base conversion, based on efficient kernel designs using a small\u0000word size of 32 bits. By these means, Cheddar demonstrates 2.9 to 25.6 times\u0000higher performance for representative FHE workloads compared to prior GPU\u0000implementations.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan, Ninghui Sun
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Before they can be applied in practice, however, identifying the optimal HGNN model parameters for a specific task through extensive training is a time-consuming and costly process. To improve the efficiency of HGNN training, it is essential to characterize and analyze the execution semantics and patterns within the training process and to identify its performance bottlenecks. In this study, we conduct an in-depth quantification and analysis of two mainstream HGNN training scenarios: single-GPU and multi-GPU distributed training. Based on the characterization results, we disclose the performance bottlenecks and their underlying causes in each scenario and provide optimization guidelines from both software and hardware perspectives.
{"title":"Characterizing and Understanding HGNN Training on GPUs","authors":"Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan, Ninghui Sun","doi":"arxiv-2407.11790","DOIUrl":"https://doi.org/arxiv-2407.11790","url":null,"abstract":"Owing to their remarkable representation capabilities for heterogeneous graph\u0000data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in\u0000many critical real-world domains such as recommendation systems and medical\u0000analysis. Prior to their practical application, identifying the optimal HGNN\u0000model parameters tailored to specific tasks through extensive training is a\u0000time-consuming and costly process. To enhance the efficiency of HGNN training,\u0000it is essential to characterize and analyze the execution semantics and\u0000patterns within the training process to identify performance bottlenecks. In\u0000this study, we conduct an in-depth quantification and analysis of two\u0000mainstream HGNN training scenarios, including single-GPU and multi-GPU\u0000distributed training. Based on the characterization results, we disclose the\u0000performance bottlenecks and their underlying causes in different HGNN training\u0000scenarios and provide optimization guidelines from both software and hardware\u0000perspectives.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Application tail latency is a key metric for many services, with high latencies linked directly to loss of revenue. Modern deeply nested micro-service architectures exacerbate tail latencies, increasing the likelihood that users experience them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when running under a CPU quota, and ignorance of neighbour applications and their CPU usage. We discuss how different languages determine the number of CPUs available, evaluate the impact, and discuss opportunities for a more unified, language-independent interface for obtaining this number. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
{"title":"Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management","authors":"Andrew Jeffery, Chris Jensen, Richard Mortier","doi":"arxiv-2407.11582","DOIUrl":"https://doi.org/arxiv-2407.11582","url":null,"abstract":"Application tail latency is a key metric for many services, with high\u0000latencies being linked directly to loss of revenue. Modern deeply-nested\u0000micro-service architectures exacerbate tail latencies, increasing the\u0000likelihood of users experiencing them. In this work, we show how CPU\u0000overcommitment by OS threads leads to high tail latencies when applications are\u0000under heavy load. CPU overcommitment can arise from two operational factors:\u0000incorrectly determining the number of CPUs available when under a CPU quota,\u0000and the ignorance of neighbour applications and their CPU usage. We discuss\u0000different languages' solutions to obtaining the CPUs available, evaluating the\u0000impact, and discuss opportunities for a more unified language-independent\u0000interface to obtain the number of CPUs available. We then evaluate the impact\u0000of neighbour usage on tail latency and introduce a new neighbour-aware\u0000threadpool, the friendlypool, that dynamically avoids overcommitment. In our\u0000evaluation, the friendlypool reduces maximum worker latency by up to\u0000$6.7times$ at the cost of decreasing throughput by up to $1.4times$.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2012 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Milo Lurati, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD's HIP. We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty for four highly tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD than on Nvidia (10x vs. 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on these GPUs.
{"title":"Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs","authors":"Milo Lurati, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven","doi":"arxiv-2407.11488","DOIUrl":"https://doi.org/arxiv-2407.11488","url":null,"abstract":"Many studies have focused on developing and improving auto-tuning algorithms\u0000for Nvidia Graphics Processing Units (GPUs), but the effectiveness and\u0000efficiency of these approaches on AMD devices have hardly been studied. This\u0000paper aims to address this gap by introducing an auto-tuner for AMD's HIP. We\u0000do so by extending Kernel Tuner, an open-source Python library for auto-tuning\u0000GPU programs. We analyze the performance impact and tuning difficulty for four\u0000highly-tunable benchmark kernels on four different GPUs: two from Nvidia and\u0000two from AMD. Our results demonstrate that auto-tuning has a significantly\u0000higher impact on performance on AMD compared to Nvidia (10x vs 2x).\u0000Additionally, we show that applications tuned for Nvidia do not perform\u0000optimally on AMD, underscoring the importance of auto-tuning specifically for\u0000AMD to achieve high performance on these GPUs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"60 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo
Convolution is a compute-intensive operation at the heart of Convolutional Neural Networks (CNNs). Its cost has led to the development of many high-performance algorithms, such as Im2col-GEMM, Winograd, and Direct-Convolution. However, comparing different convolution algorithms is an error-prone task, as it requires specific data layouts and system resources; failure to address these requirements can lead to unwanted time penalties. Thus, considering all processing steps within convolution algorithms is essential to comprehensively evaluate and fairly compare their performance. Furthermore, most known convolution benchmarking adopts ad-hoc testing suites with limited coverage and handmade operations. This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms. It assesses 9243 convolution operations derived from 1097 real-world deep learning models, producing performance and execution-breakdown graphs for detailed evaluation. ConvBench's capability is demonstrated on the Sliced Convolution (SConv) algorithm. The experiments showed SConv to be faster than Im2col-GEMM in 93.6% of the convolutions. However, ConvBench also allowed delving into the remaining 6.4% of underperforming convolutions, uncovering a critical average slowdown of 79.5% in SConv's packing step. This analysis underscores a potential source of optimization for SConv, opening up new paths for convolution designers to improve their algorithms.
{"title":"ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation","authors":"Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo","doi":"arxiv-2407.10730","DOIUrl":"https://doi.org/arxiv-2407.10730","url":null,"abstract":"Convolution is a compute-intensive operation placed at the heart of\u0000Convolution Neural Networks (CNNs). It has led to the development of many\u0000high-performance algorithms, such as Im2col-GEMM, Winograd, and\u0000Direct-Convolution. However, the comparison of different convolution algorithms\u0000is an error-prone task as it requires specific data layouts and system\u0000resources. Failure to address these requirements might lead to unwanted time\u0000penalties. Thus, considering all processing steps within convolution algorithms\u0000is essential to comprehensively evaluate and fairly compare their performance.\u0000Furthermore, most known convolution benchmarking adopts ad-hoc testing suites\u0000with limited coverage and handmade operations. This paper proposes ConvBench, a\u0000primitive-level benchmark for the evaluation and comparison of convolution\u0000algorithms. It assesses 9243 convolution operations derived from 1097\u0000real-world deep learning models, resulting in performance and execution\u0000breakdown graphs for a detailed evaluation. ConvBench capability is evaluated\u0000across the Sliced Convolution (SConv) algorithm. The experiments showed results\u0000faster than Im2col-GEMM in 93.6% of the convolutions. However, the use of\u0000ConvBench allowed the delving into the remaining 6.4% underperforming\u0000convolutions, uncovering a critical slowdown of 79.5% on average of SConv's\u0000packing step. This analysis underscores a potential source of optimization for\u0000SConv, opening up new paths for convolution designers to improve their\u0000algorithms.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"36 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metaverse virtual reality (VR) applications enable users to socialise, work, entertain, and study online with immersive experiences beyond classic PC-based interactions. While 360-degree immersion enables users to be fully engaged in a virtual scenario, suboptimal Quality-of-Experience (QoE), such as poorly displayed 3D graphics, disruptive loading times, or motion lagging caused by degraded network Quality-of-Service (QoS), is perceived by users far more severely (e.g., as dizziness) than on a monitor. This paper empirically measures how network QoS affects user QoE in metaverse VR. Specifically, focusing on both public social hubs and private user-created events in three popular metaverse VR applications (Rec Room, VRChat and MultiverseVR), we first identify three metrics that describe metaverse user experience: environment freeze level, peripheral content loading time, and control response time. By tuning three network QoS parameters (bandwidth, latency, and packet loss), we benchmark each QoE metric's level from excellent to unplayable. Key insights are revealed, such as that freezing of the metaverse virtual environment is resilient to latency but sensitive to packet loss, and that private user-created events demand better network conditions than public social hubs, providing a reference for ISPs to optimise their network QoS for a superlative metaverse user experience.
{"title":"Assessing the Impact of Network Quality-of-Service on Metaverse Virtual Reality User Experience","authors":"Rahul Dev Tripathi, Minzhao Lyu, Vijay Sivaraman","doi":"arxiv-2407.10423","DOIUrl":"https://doi.org/arxiv-2407.10423","url":null,"abstract":"Metaverse virtual reality (VR) applications enable users to socialise, work,\u0000entertain, and study online with immersive experiences beyond the classic\u0000PC-based interactions. While the 360-degree immersion enables users to be fully\u0000engaged in a virtual scenario, suboptimal Quality-of-Experience (QoE) like\u0000poorly displayed 3D graphics, disruptive loading time, or motion lagging caused\u0000by degraded network Quality-of-Service (QoS) can be perceived by users much\u0000worse (such as dizziness) than a monitor visualisation. This paper empirically\u0000measures user QoE of metaverse VR caused by network QoS. Specifically, by\u0000focusing on both public social hubs and private user-created events in three\u0000popular metaverse VR applications (Rec Room, VRChat and MultiverseVR), we first\u0000identify three metrics, including environment freeze level, peripheral content\u0000loading time, and control response time, that describe metaverse user\u0000experience. By tuning three network QoS parameters (bandwidth, latency, and\u0000packet loss), we benchmark each QoE metric's level from excellent to\u0000unplayable. Key insights are revealed, such as freeze of metaverse virtual\u0000environment is resilient to latency but sensitive to packet loss, and private\u0000user-created events demand better network conditions than public social hubs,\u0000providing a reference for ISPs to optimise their network QoS for superlative\u0000metaverse user experience.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiechen Zhao, Ran Shu, Katie Lim, Zewen Fan, Thomas Anderson, Mingyu Gao, Natalie Enright Jerger
I/O devices in public clouds have integrated increasing numbers of hardware accelerators, e.g., AWS Nitro, Azure FPGA, and Nvidia BlueField. However, unlike general-purpose compute (e.g., CPUs), such specialized compute (1) is not explicitly accessible to cloud users with performance guarantees and (2) cannot be leveraged simultaneously by both providers and users. Through ten observations, we show that the fundamental difficulty in democratizing accelerators is insufficient performance isolation support. The key obstacles to enforcing accelerator isolation are (1) too many unknown traffic patterns in public clouds and (2) too many possible contention sources in the datapath. In this work, instead of scheduling such complex traffic on the fly and augmenting isolation support on each system component, we propose to model traffic as network flows and proactively re-shape the traffic to avoid unpredictable contention. We discuss the implications of our findings for the design of future I/O management stacks and device interfaces.
{"title":"Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild","authors":"Jiechen Zhao, Ran Shu, Katie Lim, Zewen Fan, Thomas Anderson, Mingyu Gao, Natalie Enright Jerger","doi":"arxiv-2407.10098","DOIUrl":"https://doi.org/arxiv-2407.10098","url":null,"abstract":"I/O devices in public clouds have integrated increasing numbers of hardware\u0000accelerators, e.g., AWS Nitro, Azure FPGA and Nvidia BlueField. However, such\u0000specialized compute (1) is not explicitly accessible to cloud users with\u0000performance guarantee, (2) cannot be leveraged simultaneously by both providers\u0000and users, unlike general-purpose compute (e.g., CPUs). Through ten\u0000observations, we present that the fundamental difficulty of democratizing\u0000accelerators is insufficient performance isolation support. The key obstacles\u0000to enforcing accelerator isolation are (1) too many unknown traffic patterns in\u0000public clouds and (2) too many possible contention sources in the datapath. In\u0000this work, instead of scheduling such complex traffic on-the-fly and augmenting\u0000isolation support on each system component, we propose to model traffic as\u0000network flows and proactively re-shape the traffic to avoid unpredictable\u0000contention. We discuss the implications of our findings on the design of future\u0000I/O management stacks and device interfaces.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators, utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators such as graphics processing units (GPUs), as they are designed specifically to perform the multiply-accumulate (MAC) operations that dominate the matrix-matrix and matrix-vector multiplies found throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow, i.e., an input-, output-, or weight-stationary architecture. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, this work develops a reconfigurable-dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU by comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to a conventional TPU, with only minor area and power overheads.
{"title":"Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture","authors":"Mohammed Elbtity, Peyton Chandarana, Ramtin Zand","doi":"arxiv-2407.08700","DOIUrl":"https://doi.org/arxiv-2407.08700","url":null,"abstract":"Tensor processing units (TPUs) are one of the most well-known machine\u0000learning (ML) accelerators utilized at large scale in data centers as well as\u0000in tiny ML applications. TPUs offer several improvements and advantages over\u0000conventional ML accelerators, like graphical processing units (GPUs), being\u0000designed specifically to perform the multiply-accumulate (MAC) operations\u0000required in the matrix-matrix and matrix-vector multiplies extensively present\u0000throughout the execution of deep neural networks (DNNs). Such improvements\u0000include maximizing data reuse and minimizing data transfer by leveraging the\u0000temporal dataflow paradigms provided by the systolic array architecture. While\u0000this design provides a significant performance benefit, the current\u0000implementations are restricted to a single dataflow consisting of either input,\u0000output, or weight stationary architectures. This can limit the achievable\u0000performance of DNN inference and reduce the utilization of compute units.\u0000Therefore, the work herein consists of developing a reconfigurable dataflow\u0000TPU, called the Flex-TPU, which can dynamically change the dataflow per layer\u0000during run-time. Our experiments thoroughly test the viability of the Flex-TPU\u0000comparing it to conventional TPU designs across multiple well-known ML\u0000workloads. The results show that our Flex-TPU design achieves a significant\u0000performance increase of up to 2.75x compared to conventional TPU, with only\u0000minor area and power overheads.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"157 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141613815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}