
Latest publications from ACM Transactions on Architecture and Code Optimization

PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on heterogeneous systems
CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-09-15 DOI: 10.1145/3624569
Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong
Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by two significant optimization challenges: data transfer bottlenecks, and problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that the current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed using current scheduler-based approaches: the determination of which devices should be used for a certain routine invocation. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA in an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7X and energy efficiency by 2.5X over the state-of-the-art on a large and diverse dataset and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
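To make the model-based device-selection idea concrete, the sketch below scores every subset of GPUs with a predicted transfer-plus-compute time for a row-split GEMM and picks the fastest one. It is illustrative only; the function names, the device table, and the linear cost model are assumptions, not PARALiA's actual API or performance model.

```python
from itertools import combinations

def predict_time(flops, bytes_moved, gflops_per_s, gbytes_per_s):
    """Crude per-device estimate: compute time plus host-to-device transfer time."""
    return flops / (gflops_per_s * 1e9) + bytes_moved / (gbytes_per_s * 1e9)

def choose_devices(m, n, k, devices):
    """devices maps a name to (peak GFLOP/s, host-to-device GB/s); returns the
    subset with the lowest predicted time for a row-split double-precision GEMM."""
    total_flops = 2.0 * m * n * k
    best = None
    for r in range(1, len(devices) + 1):
        for subset in combinations(devices, r):
            # Each device gets m/r rows of A and C plus all of B (8 bytes per element).
            per_dev_bytes = 8.0 * ((m / r) * k + k * n + (m / r) * n)
            t = max(predict_time(total_flops / r, per_dev_bytes, *devices[d])
                    for d in subset)
            if best is None or t < best[1]:
                best = (subset, t)
    return best

if __name__ == "__main__":
    gpus = {"gpu0": (7000, 12), "gpu1": (7000, 12),
            "gpu2": (7000, 6), "gpu3": (7000, 6)}      # hypothetical device table
    subset, t = choose_devices(8192, 8192, 8192, gpus)
    print(subset, f"predicted {t * 1e3:.1f} ms")
```

A model like this naturally excludes devices whose slower interconnect would drag the whole decomposition down, which is the kind of routine-level device selection the abstract argues scheduler-only approaches cannot make.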
Citations: 0
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-09-08 DOI: 10.1145/3622787
Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy
The microservice architecture style has gained popularity due to its fault isolation, ease of scaling applications, and developer agility. However, writing applications in the microservice design style has its challenges. Due to their loosely coupled nature, services communicate with one another through standard communication APIs. This incurs significant overhead in the application due to communication protocols and data transformations. Inefficient service communication in the microservice application logic can further overwhelm the application. We perform a grey literature review showing that unnecessary data transfer is a real challenge in the industry. To the best of our knowledge, no effective tool is currently available to accurately identify the origins of unnecessary microservice communications that lead to significant performance overhead and provide guidance for optimization. To bridge the knowledge gap, we propose MicroProf, a dynamic program analysis tool to detect unnecessary data transfer in Java-based microservice applications. At the implementation level, MicroProf proposes novel techniques such as remote object sampling and hardware debug registers to monitor remote object usage. MicroProf reports the unnecessary data transfer at the application source code level. Furthermore, MicroProf pinpoints the opportunities for communication API optimization. MicroProf is evaluated on four well-known applications involving two real-world applications and two benchmarks, identifying five inefficient remote invocations. Guided by MicroProf, API optimization achieves an 87.5% reduction in the number of fields within REST API responses. The empirical evaluation further reveals that the optimized services experience a speedup of up to 4.59×.
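The core idea, attributing transferred-but-unused response data back to the calling code, can be illustrated with a small Python sketch. MicroProf itself targets Java services and relies on remote-object sampling and hardware debug registers; the wrapper class and field names below are hypothetical stand-ins for illustration only.

```python
import json

class TrackedResponse(dict):
    """Wraps a deserialized JSON response and records which fields the caller reads."""
    def __init__(self, payload):
        super().__init__(payload)
        self.accessed = set()

    def __getitem__(self, key):
        self.accessed.add(key)
        return super().__getitem__(key)

def report_unused(resp):
    """Fields that were transferred over the wire but never read by the caller."""
    unused = sorted(set(resp.keys()) - resp.accessed)
    # Use dict.__getitem__ so the report itself does not mark fields as accessed.
    wasted_bytes = sum(len(json.dumps(dict.__getitem__(resp, k))) for k in unused)
    return unused, wasted_bytes

if __name__ == "__main__":
    payload = json.loads('{"id": 7, "name": "svc", "avatar": "...", "history": [1, 2, 3]}')
    resp = TrackedResponse(payload)
    print(resp["id"], resp["name"])    # the caller only needs two fields
    print(report_unused(resp))         # -> (['avatar', 'history'], <wasted bytes>)
```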
Citations: 0
A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-09-05 DOI: 10.1145/3617686
Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang
Computing-In-Memory (CIM) architectures using Non-Volatile Memories (NVMs) have emerged as a promising way to address the “memory wall” problem in traditional von Neumann architectures. CIM accelerators can perform arithmetic or Boolean logic operations in NVMs by fully exploiting their high parallelism for bit-wise operations. These accelerators are often used in cooperation with general-purpose processors to speed up a wide variety of artificial neural network applications. In such a heterogeneous computing architecture, legacy software must be redesigned and re-engineered to utilize the new CIM accelerators. In this paper, we propose a compilation tool, based on the LLVM compiler infrastructure, that automatically migrates legacy programs to such heterogeneous architectures. To accelerate computations such as vector-matrix multiplication in CIM accelerators, we identify several typical computing patterns from LLVM intermediate representations (IRs), which are oblivious to high-level programming paradigms. Our compilation tool can modify acceleratable LLVM IRs to offload them to CIM accelerators automatically, without re-engineering legacy software. Experimental results show that our compilation tool can translate many legacy programs to CIM-supported binary executables effectively, and improve application performance and energy efficiency by up to 51× and 309×, respectively, compared with general-purpose x86 processors.
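As a toy illustration of the pattern-identification step: the sketch below scans a made-up three-address "IR" for the multiply-accumulate reduction that characterizes a vector-matrix product and marks that loop for a CIM accelerator, leaving other loops on the CPU. This is plain Python over an invented instruction encoding, not an LLVM pass, and the names are assumptions for illustration.

```python
# Each instruction is (opcode, destination, operands...).
LOOP = [
    ("load", "a", "A[i]"),
    ("load", "b", "B[i][j]"),
    ("mul",  "t", "a", "b"),
    ("add",  "acc", "acc", "t"),   # reduction: acc += a * b
]

def is_mac_reduction(body):
    """True if the loop body multiplies two loaded values and accumulates the product."""
    muls = [ins for ins in body if ins[0] == "mul"]
    adds = [ins for ins in body if ins[0] == "add" and ins[1] == ins[2]]
    return any(add[3] == mul[1] for mul in muls for add in adds)

def place(loops):
    """Assign each loop to the CIM accelerator or the CPU based on its body."""
    return {name: ("CIM" if is_mac_reduction(body) else "CPU")
            for name, body in loops.items()}

if __name__ == "__main__":
    print(place({"vecmat_inner": LOOP,
                 "pointer_chase": [("load", "p", "p.next")]}))
    # -> {'vecmat_inner': 'CIM', 'pointer_chase': 'CPU'}
```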
Citations: 0
Smart-DNN+: a Memory-Efficient Neural Networks Compression Framework for the Model Inference
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-08-30 DOI: 10.1145/3617688
Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang
Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires a memory footprint of hundreds of megabytes, making it challenging to deploy on resource-constrained platforms such as mobile and IoT devices. Although mainstream DNN compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) low model compression ratios for lightweight DNN structures with little redundancy; (ii) potential degradation in model inference accuracy; and (iii) inadequate memory compression ratios, because the layer-by-layer nature of DNN inference is ignored. To address these issues, we propose a lightweight memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory costs of DNN inference without degrading the model quality. Specifically, ① Smart-DNN+ applies a layer-wise binary quantizer with a remapping mechanism, greatly reducing the model size by quantizing the typical 32-bit floating-point DNN weights to 1-bit signs layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket encoder that keeps the quantization error compressed by encoding multiple similar floating-point residuals into the same integer bucket IDs. When running the compressed DNN on the user’s device, ③ Smart-DNN+ utilizes a partial decompression strategy that greatly reduces the required memory overhead by first loading the compressed DNN into memory and then dynamically decompressing the materials required for model inference layer by layer. Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves 0.17%–0.92% lower memory costs at lower runtime overheads compared with the state of the art, without degrading inference accuracy. Moreover, Smart-DNN+ can reduce the inference runtime by up to 2.04× compared with the conventional DNN inference workflow.
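A rough NumPy sketch of the two compression ideas described above (1-bit sign quantization with a per-layer scale, plus bucket-encoded residuals) is shown below. The bucket layout, scale choice, and bucket count are simplifications assumed for illustration, not Smart-DNN+'s actual encoder.

```python
import numpy as np

def compress_layer(w, n_buckets=16):
    """1-bit signs + per-layer scale, with residuals quantized to a few bucket IDs."""
    scale = np.mean(np.abs(w))                      # per-layer scale
    signs = np.signbit(w)                           # 1 bit per weight
    residual = w - np.where(signs, -scale, scale)   # error left by the 1-bit code
    lo, hi = residual.min(), residual.max()
    ids = ((residual - lo) / (hi - lo + 1e-12) * n_buckets).astype(np.int32)
    ids = np.clip(ids, 0, n_buckets - 1)            # similar residuals share a bucket ID
    centers = np.array([residual[ids == b].mean() if np.any(ids == b) else 0.0
                        for b in range(n_buckets)])
    return signs, scale, ids, centers

def decompress_layer(signs, scale, ids, centers):
    """Reconstruct the layer from signs, scale, and the bucketed residuals."""
    return np.where(signs, -scale, scale) + centers[ids]

if __name__ == "__main__":
    w = (np.random.randn(256, 256) * 0.05).astype(np.float32)
    signs, scale, ids, centers = compress_layer(w)
    w_hat = decompress_layer(signs, scale, ids, centers)
    print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```

The layer-by-layer decompression the abstract describes follows naturally from this layout: only the signs, scale, IDs, and bucket centers of the layer currently being executed need to be expanded in memory.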
Citations: 0
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-08-30 DOI: 10.1145/3617685
Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bin He, Jianhui Yue
Dynamic Graph Neural Network (DGNN) has recently attracted a significant amount of research attention from various domains, because most real-world graphs are inherently dynamic. Despite many research efforts, existing hardware/software solutions for DGNNs still suffer significantly from redundant computation and memory access overhead, because they need to irregularly access and recompute all graph data of each graph snapshot. To address these issues, we propose an efficient redundancy-aware accelerator, RACE, which enables energy-efficient execution of DGNN models. Specifically, we integrate a redundancy-aware incremental execution approach into the accelerator design, which obtains the output features of the latest graph snapshot by correctly and incrementally refining the output features of the previous snapshot, and which also enables regular accesses to vertices’ input features. By traversing the graph on the fly, RACE identifies the vertices that are not affected by graph updates between successive snapshots and reuses these vertices’ states (i.e., their output features) from the previous snapshot when processing the latest snapshot. The vertices affected by graph updates are also tracked, and their new states are incrementally recomputed using their neighbors’ input features from the latest snapshot for correctness. In this way, the processing and accessing of graph data that is not affected by graph updates can be correctly eliminated, reducing redundant computation and memory access overhead. In addition, the more frequently accessed input features are dynamically identified according to the graph topology and preferentially kept resident in on-chip memory to reduce off-chip communication. Experimental results show that RACE achieves on average 1139× and 84.7× speedups for DGNN inference, with average 2242× and 234.2× energy savings, in comparison with the state-of-the-art software DGNN implementations running on an Intel Xeon CPU and an NVIDIA A100 GPU, respectively. Moreover, for DGNN inference, RACE obtains average speedups of 13.1×, 11.7×, 10.4×, and 7.9× and average energy savings of 14.8×, 12.9×, 11.5×, and 8.9× over the state-of-the-art GNN accelerators AWB-GCN, GCNAX, ReGNN, and I-GCN, respectively.
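The reuse idea can be illustrated with a one-layer toy in NumPy: only vertices whose in-neighborhood changed between two snapshots are recomputed, and every other vertex keeps its previous output. RACE is a hardware accelerator and a real DGNN propagates changes across layers and time steps; the mean aggregation and data structures here are assumptions used purely for illustration.

```python
import numpy as np

def aggregate(v, in_nbrs, feat):
    """Mean of in-neighbor input features (stand-in for one GNN layer)."""
    return feat[in_nbrs[v]].mean(axis=0) if in_nbrs[v] else np.zeros(feat.shape[1])

def incremental_step(in_nbrs, feat, prev_out, updated_edges):
    """Recompute only vertices whose inputs changed; reuse everything else."""
    affected = {dst for _, dst in updated_edges}   # destinations of inserted/deleted edges
    out = dict(prev_out)                           # reuse of the previous snapshot's outputs
    for v in affected:
        out[v] = aggregate(v, in_nbrs, feat)
    return out, affected

if __name__ == "__main__":
    feat = np.random.rand(5, 8).astype(np.float32)
    in_nbrs = {0: [1, 2], 1: [0], 2: [3], 3: [4], 4: []}
    snap0 = {v: aggregate(v, in_nbrs, feat) for v in in_nbrs}   # full first snapshot
    in_nbrs[2].append(0)                                        # one edge insertion: 0 -> 2
    snap1, affected = incremental_step(in_nbrs, feat, snap0, [(0, 2)])
    print("recomputed:", affected)                              # -> {2}; vertices 0, 1, 3, 4 reused
```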
Citations: 0
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-08-26 DOI: 10.1145/3617689
Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan-E Huang, Yutong Lu
Transformer models have emerged as a leading approach in the field of natural language processing (NLP) and are increasingly being deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on the batch processing technique to ensure high hardware performance. Nonetheless, the current practice for transformer inference encounters computational and memory redundancy due to the heavy-tailed distribution of sequence lengths in NLP scenarios, resulting in low practical performance. In this paper, we propose a unified solution for improving both computation and memory efficiency of real-world transformer inference on GPUs. The solution eliminates redundant computation and memory footprint across the transformer model. First, a GPU-oriented computation approach processes the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module uses a word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, the data layout of the self-attention module is organized at block granularity. Since the aforementioned approaches greatly reduce the required memory size and cause it to fluctuate constantly, we propose a chunk-based approach to better balance memory footprint against allocation/free efficiency. Our experimental results show that, compared with prevailing frameworks, our unified solution decreases average latency by 28% on the entire transformer model and 63.8% on the self-attention module, and reduces the memory footprint of intermediate results by 7.8×.
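The redundancy being targeted is easy to see in a small NumPy sketch: padding a batch to its longest sequence makes every kernel process max-length tokens per sequence, while a packed, word-accumulation-style layout touches only the real tokens. The packing helper and the length distribution below are illustrative assumptions, not the paper's GPU kernels.

```python
import numpy as np

def padded_tokens(lengths):
    """Tokens processed when the batch is padded to its longest sequence."""
    return len(lengths) * max(lengths)

def packed_tokens(lengths):
    """Tokens processed when sequences are packed back to back."""
    return sum(lengths)

def pack(batch):
    """Concatenate variable-length sequences and keep per-sequence offsets."""
    offsets = np.cumsum([0] + [len(s) for s in batch])
    return np.concatenate(batch), offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Heavy-tailed sequence lengths, as in real NLP serving traffic.
    lengths = np.minimum(rng.geometric(0.02, size=32), 512)
    batch = [rng.standard_normal((L, 8)).astype(np.float32) for L in lengths]
    flat, offsets = pack(batch)
    wasted = 1 - packed_tokens(lengths) / padded_tokens(lengths)
    print("packed shape:", flat.shape,
          "padded tokens:", padded_tokens(lengths),
          "packed tokens:", packed_tokens(lengths),
          "wasted fraction: %.0f%%" % (100 * wasted))
```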
Citations: 0
Approx-RM: Reducing Energy on Heterogeneous Multicore Processors under Accuracy and Timing Constraints
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-07-22 DOI: https://dl.acm.org/doi/10.1145/3605214
Muhammad Waqar Azhar, Madhavan Manivannan, Per Stenström

Reducing energy consumption while providing performance and quality guarantees is crucial for computing systems ranging from battery-powered embedded systems to data centers. This article considers approximate iterative applications executing on heterogeneous multi-core platforms under user-specified performance and quality targets. We note that allowing a slight yet bounded relaxation in solution quality can considerably reduce the required iteration count and thereby can save significant amounts of energy. To this end, this article proposes Approx-RM, a resource management scheme that reduces energy expenditure while guaranteeing a specified performance target as well as an accuracy target. Approx-RM predicts the number of iterations required to meet the relaxed accuracy target at runtime. The time saved generates execution-time slack, which allows Approx-RM to allocate fewer resources on a heterogeneous multi-core platform in terms of DVFS, core type, and core count to save energy while meeting the performance target. Approx-RM contributes lightweight methods for predicting the iteration count needed to meet the accuracy target and the resources needed to meet the performance target. Approx-RM uses the aforementioned predictions to allocate just enough resources to comply with quality-of-service constraints while saving energy. Our evaluation shows energy savings of 31.6%, on average, compared to Race-to-idle when the accuracy is only relaxed by 1%. Approx-RM incurs timing and energy overheads of less than 0.1%.
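The iteration-count leverage behind this scheme can be illustrated with a small Jacobi solver: loosening the convergence tolerance removes sweeps, and that slack is what a resource manager can trade for lower DVFS states or fewer cores. The solver, the matrix, and the tolerance values below are illustrative assumptions; how a user-level accuracy target maps onto a residual tolerance is application-specific, and the resource-allocation side of Approx-RM is not modeled here.

```python
import numpy as np

def jacobi_iters(A, b, tol):
    """Number of Jacobi sweeps needed to reach a relative residual below tol."""
    x = np.zeros_like(b)
    D = np.diag(A)
    R = A - np.diagflat(D)
    for it in range(1, 100000):
        x = (b - R @ x) / D
        if np.linalg.norm(A @ x - b) / np.linalg.norm(b) < tol:
            return it
    return it

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200
    A = rng.standard_normal((n, n)) * 0.1
    A += np.diag(np.abs(A).sum(axis=1) + 1.0)      # make the system diagonally dominant
    b = rng.standard_normal(n)
    for tol in (1e-8, 1e-7, 1e-6):                 # tighter target -> more sweeps -> more energy
        print(f"tol={tol:.0e}: {jacobi_iters(A, b, tol)} iterations")
```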

Citations: 0
MFFT: A GPU Accelerated Highly Efficient Mixed-Precision Large-Scale FFT Framework
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-07-22 DOI: https://dl.acm.org/doi/10.1145/3605148
Yuwen Zhao, Fangfang Liu, Wenjing Ma, Huiyuan Li, Yuanchi Peng, Cui Wang

The fast Fourier transform (FFT) is widely used in large-scale parallel programs, and data communication is the main performance bottleneck of FFT, seriously affecting its parallel efficiency. To tackle this problem, we propose a new large-scale FFT framework, MFFT, which optimizes parallel FFT with a new mixed-precision optimization technique, adopting a “high-precision computation, low-precision communication” strategy. To enable low-precision communication, we propose a shared-exponent floating-point compression technique, which reduces the volume of data communication while maintaining high accuracy. In addition, we apply a two-phase normalization technique to further reduce the round-off error. Based on the mixed-precision MFFT framework, we apply several optimization techniques to improve performance, such as streaming of GPU kernels, MPI message combination, kernel optimization, and memory optimization. We evaluate MFFT on a system with 4,096 GPUs. The results show that shared-exponent MFFT is on average 1.23× faster than double-precision MFFT, and double-precision MFFT achieves on average 3.53× and 9.48× higher performance than the open-source libraries 2Decomp&FFT (CPU-based version) and heFFTe (AMD GPU-based version), respectively. Compared with 2Decomp&FFT, double-precision MFFT increases parallel efficiency from 53.2% to 78.1%, and shared-exponent MFFT further increases it to 83.8%.
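The shared-exponent idea can be sketched in a few lines of NumPy: each block of values stores one common exponent and low-bit fixed-point mantissas, cutting the bytes communicated per element at a bounded precision loss. The 10-bit mantissa width, the block handling, and the packing assumption below are illustrative, not MFFT's actual wire format or GPU kernels.

```python
import numpy as np

def compress_block(x, mant_bits=10):
    """One shared exponent for the whole block, signed fixed-point mantissas per element."""
    exp = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-300)))
    scale = 2.0 ** (mant_bits - 1 - exp)
    mant = np.clip(np.round(x * scale),
                   -(2 ** (mant_bits - 1)), 2 ** (mant_bits - 1) - 1).astype(np.int16)
    return exp, mant

def decompress_block(exp, mant, mant_bits=10):
    """Rescale the mantissas back to floating point using the shared exponent."""
    return mant.astype(np.float64) / 2.0 ** (mant_bits - 1 - exp)

if __name__ == "__main__":
    mant_bits = 10
    block = np.random.randn(1024) * 3.7
    exp, mant = compress_block(block, mant_bits)
    err = np.abs(block - decompress_block(exp, mant, mant_bits)).max()
    bits_before = block.size * 64
    bits_after = block.size * mant_bits + 32     # assumes the 10-bit mantissas are bit-packed
    print(f"max abs error {err:.2e}, compression {bits_before / bits_after:.1f}x")
```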

Citations: 0
Accelerating Convolutional Neural Network by Exploiting Sparsity on GPUs
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3600092
Weizhi Xu, Yintai Sun, Shengyu Fan, Hui Yu, Xin Fu

The convolutional neural network (CNN) is an important deep learning method that is widely used in many fields. However, running a CNN is very time consuming, and convolution usually takes most of the time. There are many zero values in feature maps and filters, which leads to redundant calculations and memory accesses if dense methods are used to compute convolution. Many recent works have made use of sparsity to skip the calculations for zero values and reduce CNN inference time. On graphics processing unit platforms, current works cannot fully exploit the sparsity of the feature map to achieve satisfactory performance. Therefore, we design a new parallel strategy to transform the feature map into a new storage format to avoid the redundant computation of zero values on graphics processing units. Also considering the sparsity of the feature map, we propose a fused storage format that combines the convolution operation with the following pooling operation to further improve performance. We carry out experiments with mainstream CNN models and achieve better performance than cuDNN and cuSPARSE. For VGG-19, ResNet-50, DenseNet-121, and RegNetX-16GF, speedups of 1.97×, 2.23×, 2.74×, and 1.58×, respectively, are obtained over cuDNN. The speedups over cuSPARSE are 2.10×, 1.83×, 2.35×, and 1.35×, respectively, when only the first method is used.
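A CPU-side NumPy toy of the "skip the zeros" idea is shown below: only the nonzero entries of the feature map are visited, and each one scatter-accumulates its contributions into the output, matching the dense result. The sparse storage format and GPU kernels of the paper are not reproduced here; this just demonstrates why work proportional to the nonzeros suffices.

```python
import numpy as np

def sparse_conv2d(x, k):
    """Stride-1 'valid' correlation that touches only the nonzero inputs of x."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i, j in zip(*np.nonzero(x)):               # only nonzero inputs contribute
        v = x[i, j]
        for di in range(kh):
            for dj in range(kw):
                oi, oj = i - di, j - dj            # output cells this input feeds
                if 0 <= oi < out.shape[0] and 0 <= oj < out.shape[1]:
                    out[oi, oj] += v * k[di, dj]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((32, 32)) * (rng.random((32, 32)) > 0.8)   # ~80% zeros, like post-ReLU maps
    k = rng.random((3, 3))
    dense = np.array([[np.sum(x[i:i + 3, j:j + 3] * k) for j in range(30)] for i in range(30)])
    print(np.allclose(sparse_conv2d(x, k), dense))            # -> True
```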

Citations: 0
rNdN: Fast Query Compilation for NVIDIA GPUs
IF 1.6 CAS Tier 3 (Computer Science) Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2023-07-19 DOI: https://dl.acm.org/doi/10.1145/3603503
Alexander Krolik, Clark Verbrugge, Laurie Hendren

GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.
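For flavor, the sketch below generates and compiles straight-line Python for a filter-plus-aggregate query, which is the general shape of query compilation; rNdN itself generates NVIDIA GPU code, and the paper's point is that the compile step must stay cheap enough to pay off even for short-running queries. All names and the query shape here are illustrative assumptions.

```python
import time

def compile_query(column, predicate):
    """Generate, compile, and return a specialized function for one filter + sum query."""
    src = (
        "def run(rows):\n"
        "    acc = 0\n"
        "    for r in rows:\n"
        f"        if r['{column}'] {predicate}:\n"
        f"            acc += r['{column}']\n"
        "    return acc\n"
    )
    ns = {}
    exec(compile(src, "<query>", "exec"), ns)
    return ns["run"]

if __name__ == "__main__":
    rows = [{"price": i % 100} for i in range(100_000)]
    t0 = time.perf_counter()
    q = compile_query("price", "> 90")       # compilation cost, paid per query
    t1 = time.perf_counter()
    result = q(rows)                          # execution cost
    t2 = time.perf_counter()
    print(result, f"compile {1e3 * (t1 - t0):.2f} ms, execute {1e3 * (t2 - t1):.2f} ms")
```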

Citations: 0