
arXiv - CS - Performance: Latest Publications

Attention in SRAM on Tenstorrent Grayskull
Pub Date: 2024-07-18 DOI: arxiv-2407.13885
Moritz Thüning
When implementations of the Transformer's self-attention layer utilize SRAM instead of DRAM, they can achieve significant speedups. The Tenstorrent Grayskull architecture provides a large SRAM, distributed across a grid of cores. This work presents a fused kernel for Grayskull that exclusively utilizes its large SRAM by combining matrix multiplication, attention score scaling and Softmax operations. Additionally, a dedicated Softmax kernel utilizing the SRAM and a CPU implementation serving as a baseline are presented. The Softmax operation consumes most of the runtime in the computation of attention weights from queries and keys on Grayskull. The speedup of the dedicated Softmax kernel compared to the CPU implementation is up to $10\times$, and the Softmax implementation inside the fused kernel is approximately $1.8\times$ faster than the dedicated Softmax kernel. The time and memory complexity of all implementations is quadratic in sequence length. Currently, the Grayskull e150 is approximately $30\times$ cheaper for the general public than an Nvidia H100 PCIe (a state-of-the-art GPU) and offers approximately $1.5\times$ more SRAM.
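For reference, the attention-weight computation that the fused kernel implements (matrix multiplication, score scaling, and a row-wise Softmax) can be sketched in a few lines of NumPy; this is only a high-level restatement of the math, not the Grayskull kernel, and the function name is illustrative:

import numpy as np

def attention_weights(Q, K):
    # Q, K: (seq_len, d) arrays; time and memory are quadratic in seq_len.
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])      # matrix multiply + attention score scaling
    scores -= scores.max(axis=-1, keepdims=True)   # shift for numerical stability
    w = np.exp(scores)                             # Softmax numerator
    return w / w.sum(axis=-1, keepdims=True)       # row-wise normalisation

# Example: 128 tokens with head dimension 64.
W = attention_weights(np.random.rand(128, 64), np.random.rand(128, 64))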
Citations: 0
DSO: A GPU Energy Efficiency Optimizer by Fusing Dynamic and Static Information
Pub Date: 2024-07-18 DOI: arxiv-2407.13096
Qiang Wang, Laiyi Li, Weile Luo, Yijia Zhang, Bingqiang Wang
Increased reliance on graphics processing units (GPUs) for high-intensity computing tasks raises challenges regarding energy consumption. To address this issue, dynamic voltage and frequency scaling (DVFS) has emerged as a promising technique for conserving energy while maintaining the quality of service (QoS) of GPU applications. However, existing solutions using DVFS are hindered by inefficiency or inaccuracy as they depend either on dynamic or static information respectively, which prevents them from being adopted in practical power management schemes. To this end, we propose a novel energy efficiency optimizer, called DSO, to explore a lightweight solution that leverages both dynamic and static information to model and optimize GPU energy efficiency. DSO first proposes a novel theoretical energy efficiency model which reflects the DVFS roofline phenomenon and considers the tradeoff between performance and energy. Then it applies machine learning techniques to predict the parameters of the above model with both GPU kernel runtime metrics and static code features. Experiments on modern DVFS-enabled GPUs indicate that DSO can enhance energy efficiency by 19% whilst maintaining performance within a 5% loss margin.
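The abstract does not spell out the model itself, so the toy sketch below only illustrates the kind of tradeoff a DVFS roofline-style energy-efficiency model captures: performance saturates beyond a memory-bound knee while power keeps rising with frequency, so performance per watt peaks at an intermediate frequency. All constants here are made up for illustration and are not the paper's parameters:

import numpy as np

def perf_per_watt(freq_ghz, f_knee=1.2, p_static=30.0, c_dyn=25.0):
    # Roofline-style performance: scales with core frequency until memory-bound.
    perf = np.minimum(freq_ghz, f_knee)
    # Crude dynamic power: P = P_static + c * f * V^2, with voltage tracking frequency.
    volt = 0.7 + 0.3 * freq_ghz
    power = p_static + c_dyn * freq_ghz * volt ** 2
    return perf / power

freqs = np.linspace(0.6, 2.0, 29)
best = freqs[np.argmax(perf_per_watt(freqs))]
print(f"most efficient frequency in this toy model: {best:.2f} GHz")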
Citations: 0
Cheddar: A Swift Fully Homomorphic Encryption Library for CUDA GPUs
Pub Date: 2024-07-17 DOI: arxiv-2407.13055
Jongmin Kim, Wonseok Choi, Jung Ho Ahn
Fully homomorphic encryption (FHE) is a cryptographic technology capable of resolving security and privacy problems in cloud computing by encrypting data in use. However, FHE introduces tremendous computational overhead for processing encrypted data, causing FHE workloads to become 2-6 orders of magnitude slower than their unencrypted counterparts. To mitigate the overhead, we propose Cheddar, an FHE library for CUDA GPUs, which demonstrates significantly faster performance compared to prior GPU implementations. We develop optimized functionalities at various implementation levels ranging from efficient low-level primitives to streamlined high-level operational sequences. In particular, we improve major FHE operations, including number-theoretic transform and base conversion, based on efficient kernel designs using a small word size of 32 bits. By these means, Cheddar demonstrates 2.9 to 25.6 times higher performance for representative FHE workloads compared to prior GPU implementations.
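The number-theoretic transform (NTT) named above is the modular analogue of the FFT that dominates polynomial arithmetic in FHE. Cheddar's CUDA kernels are not shown here; as a plain-Python reference for the operation itself, a textbook iterative radix-2 NTT over the NTT-friendly prime 998244353 (which fits in 32 bits, primitive root 3) looks like this:

def ntt(a, p=998244353, g=3):
    # Iterative radix-2 NTT over Z_p; len(a) must be a power of two dividing p - 1.
    a = list(a)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages.
    length = 2
    while length <= n:
        w_len = pow(g, (p - 1) // length, p)   # primitive length-th root of unity
        for start in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * w % p
                a[start + k] = (u + v) % p
                a[start + k + length // 2] = (u - v) % p
                w = w * w_len % p
        length <<= 1
    return a

print(ntt([1, 2, 3, 4]))  # forward transform of a length-4 polynomial

A GPU library such as Cheddar keeps coefficients in 32-bit words (as the abstract states) and parallelises the butterflies across threads; the modular arithmetic above is the part being accelerated.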
Citations: 0
Characterizing and Understanding HGNN Training on GPUs
Pub Date: 2024-07-16 DOI: arxiv-2407.11790
Dengke Han, Mingyu Yan, Xiaochun Ye, Dongrui Fan, Ninghui Sun
Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to their practical application, identifying the optimal HGNN model parameters tailored to specific tasks through extensive training is a time-consuming and costly process. To enhance the efficiency of HGNN training, it is essential to characterize and analyze the execution semantics and patterns within the training process to identify performance bottlenecks. In this study, we conduct an in-depth quantification and analysis of two mainstream HGNN training scenarios, including single-GPU and multi-GPU distributed training. Based on the characterization results, we disclose the performance bottlenecks and their underlying causes in different HGNN training scenarios and provide optimization guidelines from both software and hardware perspectives.
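The abstract does not name the profiling toolchain, so the snippet below is only a generic illustration of the kind of execution breakdown such a characterization produces: wall-clock time is accumulated per (hypothetical) training phase and reported as a percentage, which is how a bottleneck stage would surface:

import time
from collections import defaultdict
from contextlib import contextmanager

phase_times = defaultdict(float)

@contextmanager
def phase(name):
    # Accumulate wall-clock time per training phase.
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_times[name] += time.perf_counter() - start

# Hypothetical phases of one HGNN training step; sleeps stand in for real work.
for step in range(10):
    with phase("graph sampling"):
        time.sleep(0.002)
    with phase("feature projection"):
        time.sleep(0.001)
    with phase("neighbor aggregation"):
        time.sleep(0.003)
    with phase("backward + optimizer"):
        time.sleep(0.002)

total = sum(phase_times.values())
for name, t in sorted(phase_times.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {100 * t / total:5.1f}%")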
Citations: 0
Reducing Tail Latencies Through Environment- and Neighbour-aware Thread Management
Pub Date: 2024-07-16 DOI: arxiv-2407.11582
Andrew Jeffery, Chris Jensen, Richard Mortier
Application tail latency is a key metric for many services, with high latencies being linked directly to loss of revenue. Modern deeply-nested micro-service architectures exacerbate tail latencies, increasing the likelihood of users experiencing them. In this work, we show how CPU overcommitment by OS threads leads to high tail latencies when applications are under heavy load. CPU overcommitment can arise from two operational factors: incorrectly determining the number of CPUs available when under a CPU quota, and the ignorance of neighbour applications and their CPU usage. We discuss different languages' solutions to obtaining the CPUs available, evaluating the impact, and discuss opportunities for a more unified language-independent interface to obtain the number of CPUs available. We then evaluate the impact of neighbour usage on tail latency and introduce a new neighbour-aware threadpool, the friendlypool, that dynamically avoids overcommitment. In our evaluation, the friendlypool reduces maximum worker latency by up to $6.7\times$ at the cost of decreasing throughput by up to $1.4\times$.
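The first factor, mis-counting the CPUs available under a quota, is easy to reproduce. The Linux sketch below compares the host CPU count with the affinity mask and with a cgroup v2 quota (the cgroup path and the fallbacks are assumptions; container runtimes may mount it elsewhere). Sizing a thread pool from os.cpu_count() alone is what leads to the overcommitment described above:

import os

def available_cpus():
    host_cpus = os.cpu_count() or 1                     # all CPUs on the host, ignores quotas
    try:
        affinity_cpus = len(os.sched_getaffinity(0))    # cpuset/affinity mask (Linux only)
    except AttributeError:
        affinity_cpus = host_cpus

    quota_cpus = affinity_cpus
    try:
        # cgroup v2: "cpu.max" holds "<quota> <period>" in microseconds, or "max".
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            quota_cpus = max(1, int(quota) // int(period))
    except (OSError, ValueError):
        pass

    return min(host_cpus, affinity_cpus, quota_cpus)

print(os.cpu_count(), available_cpus())  # these can differ wildly inside a container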
Citations: 0
Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs
Pub Date: 2024-07-16 DOI: arxiv-2407.11488
Milo Lurati, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD's HIP. We do so by extending Kernel Tuner, an open-source Python library for auto-tuning GPU programs. We analyze the performance impact and tuning difficulty for four highly-tunable benchmark kernels on four different GPUs: two from Nvidia and two from AMD. Our results demonstrate that auto-tuning has a significantly higher impact on performance on AMD compared to Nvidia (10x vs 2x). Additionally, we show that applications tuned for Nvidia do not perform optimally on AMD, underscoring the importance of auto-tuning specifically for AMD to achieve high performance on these GPUs.
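For readers unfamiliar with Kernel Tuner, the sketch below shows what tuning a trivial HIP kernel looks like with its Python API. The vector-add kernel and the tunable block_size_x follow Kernel Tuner's standard getting-started example, and selecting the HIP backend via lang="HIP" assumes the extension described in this paper; running it requires an AMD GPU and the HIP Python bindings:

import numpy as np
from kernel_tuner import tune_kernel

kernel_string = """
__global__ void vector_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;  // block_size_x is injected by the tuner
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

n = np.int32(10_000_000)
a = np.random.randn(n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)
c = np.zeros_like(a)

# The tuner benchmarks every listed thread-block size and reports the best one.
tune_params = {"block_size_x": [64, 128, 256, 512, 1024]}

results, env = tune_kernel("vector_add", kernel_string, n, [c, a, b, n], tune_params,
                           lang="HIP")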
Citations: 0
ConvBench: A Comprehensive Benchmark for 2D Convolution Primitive Evaluation
Pub Date: 2024-07-15 DOI: arxiv-2407.10730
Lucas Alvarenga, Victor Ferrari, Rafael Souza, Marcio Pereira, Guido Araujo
Convolution is a compute-intensive operation placed at the heart of Convolution Neural Networks (CNNs). It has led to the development of many high-performance algorithms, such as Im2col-GEMM, Winograd, and Direct-Convolution. However, the comparison of different convolution algorithms is an error-prone task as it requires specific data layouts and system resources. Failure to address these requirements might lead to unwanted time penalties. Thus, considering all processing steps within convolution algorithms is essential to comprehensively evaluate and fairly compare their performance. Furthermore, most known convolution benchmarking adopts ad-hoc testing suites with limited coverage and handmade operations. This paper proposes ConvBench, a primitive-level benchmark for the evaluation and comparison of convolution algorithms. It assesses 9243 convolution operations derived from 1097 real-world deep learning models, resulting in performance and execution breakdown graphs for a detailed evaluation. ConvBench capability is evaluated across the Sliced Convolution (SConv) algorithm. The experiments showed results faster than Im2col-GEMM in 93.6% of the convolutions. However, the use of ConvBench allowed delving into the remaining 6.4% underperforming convolutions, uncovering a critical slowdown of 79.5% on average in SConv's packing step. This analysis underscores a potential source of optimization for SConv, opening up new paths for convolution designers to improve their algorithms.
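To make the Im2col-GEMM baseline concrete, here is a minimal NumPy sketch of the two steps it consists of: a packing step that unfolds every receptive field into a column (the step analogous to SConv's packing step discussed above), followed by a single GEMM. Single image, stride 1, no padding; ConvBench itself is not reproduced here:

import numpy as np

def conv2d_im2col(x, w):
    # x: (C, H, W) input; w: (K, C, R, S) filters.
    C, H, W = x.shape
    K, _, R, S = w.shape
    OH, OW = H - R + 1, W - S + 1

    # Packing step: every receptive field becomes one column of the im2col buffer.
    cols = np.empty((C * R * S, OH * OW), dtype=x.dtype)
    col = 0
    for i in range(OH):
        for j in range(OW):
            cols[:, col] = x[:, i:i + R, j:j + S].ravel()
            col += 1

    # The convolution itself collapses into one GEMM.
    return (w.reshape(K, -1) @ cols).reshape(K, OH, OW)

x = np.random.rand(3, 32, 32).astype(np.float32)
w = np.random.rand(8, 3, 3, 3).astype(np.float32)
print(conv2d_im2col(x, w).shape)  # (8, 30, 30)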
Citations: 0
Assessing the Impact of Network Quality-of-Service on Metaverse Virtual Reality User Experience
Pub Date: 2024-07-15 DOI: arxiv-2407.10423
Rahul Dev Tripathi, Minzhao Lyu, Vijay Sivaraman
Metaverse virtual reality (VR) applications enable users to socialise, work, entertain, and study online with immersive experiences beyond the classic PC-based interactions. While the 360-degree immersion enables users to be fully engaged in a virtual scenario, suboptimal Quality-of-Experience (QoE) like poorly displayed 3D graphics, disruptive loading time, or motion lagging caused by degraded network Quality-of-Service (QoS) can be perceived by users much worse (such as dizziness) than a monitor visualisation. This paper empirically measures user QoE of metaverse VR caused by network QoS. Specifically, by focusing on both public social hubs and private user-created events in three popular metaverse VR applications (Rec Room, VRChat and MultiverseVR), we first identify three metrics, including environment freeze level, peripheral content loading time, and control response time, that describe metaverse user experience. By tuning three network QoS parameters (bandwidth, latency, and packet loss), we benchmark each QoE metric's level from excellent to unplayable. Key insights are revealed, such as freeze of metaverse virtual environment is resilient to latency but sensitive to packet loss, and private user-created events demand better network conditions than public social hubs, providing a reference for ISPs to optimise their network QoS for superlative metaverse user experience.
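The paper's own testbed tooling is not described in this abstract; one common way to impose the three QoS parameters it varies (bandwidth, latency, packet loss) is Linux netem. A minimal wrapper is sketched below, with the interface name and values as placeholders (root privileges required):

import subprocess

def degrade_link(dev="eth0", delay_ms=100, loss_pct=1.0, rate_mbit=5):
    # Apply latency, random loss, and a bandwidth cap to one interface via netem.
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root", "netem",
                    "delay", f"{delay_ms}ms",
                    "loss", f"{loss_pct}%",
                    "rate", f"{rate_mbit}mbit"], check=True)

def restore_link(dev="eth0"):
    # Remove the emulated impairment.
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)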
Citations: 0
Accelerator-as-a-Service in Public Clouds: An Intra-Host Traffic Management View for Performance Isolation in the Wild
Pub Date: 2024-07-14 DOI: arxiv-2407.10098
Jiechen Zhao, Ran Shu, Katie Lim, Zewen Fan, Thomas Anderson, Mingyu Gao, Natalie Enright Jerger
I/O devices in public clouds have integrated increasing numbers of hardware accelerators, e.g., AWS Nitro, Azure FPGA and Nvidia BlueField. However, such specialized compute (1) is not explicitly accessible to cloud users with performance guarantee, (2) cannot be leveraged simultaneously by both providers and users, unlike general-purpose compute (e.g., CPUs). Through ten observations, we present that the fundamental difficulty of democratizing accelerators is insufficient performance isolation support. The key obstacles to enforcing accelerator isolation are (1) too many unknown traffic patterns in public clouds and (2) too many possible contention sources in the datapath. In this work, instead of scheduling such complex traffic on-the-fly and augmenting isolation support on each system component, we propose to model traffic as network flows and proactively re-shape the traffic to avoid unpredictable contention. We discuss the implications of our findings on the design of future I/O management stacks and device interfaces.
Citations: 0
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Pub Date: 2024-07-11 DOI: arxiv-2407.08700
Mohammed Elbtity, Peyton Chandarana, Ramtin Zand
Tensor processing units (TPUs) are one of the most well-known machine learning (ML) accelerators utilized at large scale in data centers as well as in tiny ML applications. TPUs offer several improvements and advantages over conventional ML accelerators, like graphical processing units (GPUs), being designed specifically to perform the multiply-accumulate (MAC) operations required in the matrix-matrix and matrix-vector multiplies extensively present throughout the execution of deep neural networks (DNNs). Such improvements include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, the current implementations are restricted to a single dataflow consisting of either input, output, or weight stationary architectures. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein consists of developing a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer during run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a significant performance increase of up to 2.75x compared to conventional TPU, with only minor area and power overheads.
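The dataflows named above differ only in which operand is held next to the MAC units while the others stream past. As a loop-level illustration in plain Python (no systolic timing, and not Flex-TPU's actual microarchitecture), weight-stationary and output-stationary orderings of the same matrix multiply look like this:

import numpy as np

def matmul_weight_stationary(A, W):
    # Each weight W[k, j] stays "stationary" while all activations stream past it.
    M, K = A.shape
    _, N = W.shape
    out = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):
        for j in range(N):
            w = W[k, j]                 # loaded once, reused for every row of A
            for i in range(M):
                out[i, j] += A[i, k] * w
    return out

def matmul_output_stationary(A, W):
    # Each partial sum out[i, j] stays in place until fully accumulated.
    M, K = A.shape
    _, N = W.shape
    out = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            acc = 0.0                   # accumulator never leaves the MAC unit
            for k in range(K):
                acc += A[i, k] * W[k, j]
            out[i, j] = acc
    return out

A = np.random.rand(4, 3)
W = np.random.rand(3, 5)
assert np.allclose(matmul_weight_stationary(A, W), A @ W)
assert np.allclose(matmul_output_stationary(A, W), A @ W)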
Citations: 0