LE-GEMM: A lightweight emulation-based GEMM with precision refinement on GPU
Yu Zhang, Lu Lu, Zhanyu Yang, Zhihong Liang, Siliang Suo
Journal of Systems Architecture, Volume 160, Article 103336 (published 2025-01-17)
DOI: 10.1016/j.sysarc.2025.103336
URL: https://www.sciencedirect.com/science/article/pii/S1383762125000086
Citations: 0
Abstract
Many specialized hardware units, such as Matrix Cores and Tensor Cores, have recently been designed and applied in various scientific computing scenarios. These units support tensor-level computation at different precisions on GPUs. Previous studies have proposed methods for computing single-precision GEneral Matrix Multiplication (GEMM) with half-precision matrices. However, this routine often incurs a loss of accuracy, which limits its application. This paper proposes a Lightweight Emulation-based GEMM (LE-GEMM) on GPU that includes a lightweight emulation algorithm, a thread parallelism analytic model, and an efficient multi-level pipeline implementation to accelerate the computation without compromising accuracy requirements. First, we propose a lightweight emulation algorithm that combines a precision transformation process with GEMM emulation to achieve better computational accuracy and performance. Second, a thread parallelism analytic model is designed to analyze and guide the selection of the optimal tiling scheme for various computing scenarios and hardware. Third, an efficient multi-level pipeline is implemented, which maximizes instruction-level parallelism and latency hiding. Several comparison experiments were conducted on two commonly used GPU platforms, one from AMD and one from NVIDIA. The experimental results show that the proposed method outperforms previous approaches in both computational accuracy and speed.
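To make the emulation idea concrete, the minimal NumPy sketch below illustrates the general precision-splitting approach used in prior half-precision GEMM emulation work: each FP32 operand is split into a high and a low FP16 part, and three FP16-input products are accumulated in FP32. This is only an illustration of the underlying principle under those assumptions, not the paper's LE-GEMM algorithm; the function names `split_fp32_to_fp16` and `emulated_sgemm` are hypothetical.

```python
import numpy as np

def split_fp32_to_fp16(x):
    """Split an FP32 matrix into high and low FP16 parts so that x ~= hi + lo."""
    hi = x.astype(np.float16)
    lo = (x - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

def emulated_sgemm(a, b):
    """Approximate an FP32 GEMM with three FP16-input products accumulated in FP32.

    Sketch of the common error-compensation scheme behind half-precision GEMM
    emulation; on a GPU each product would map to a Tensor/Matrix Core GEMM
    with FP32 accumulation. Not the paper's LE-GEMM implementation.
    """
    a_hi, a_lo = split_fp32_to_fp16(a)
    b_hi, b_lo = split_fp32_to_fp16(b)
    # Accumulate in FP32; the lo*lo term is dropped as negligible.
    acc  = a_hi.astype(np.float32) @ b_hi.astype(np.float32)
    acc += a_hi.astype(np.float32) @ b_lo.astype(np.float32)
    acc += a_lo.astype(np.float32) @ b_hi.astype(np.float32)
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256), dtype=np.float32)
    b = rng.standard_normal((256, 256), dtype=np.float32)
    ref = a @ b
    naive = a.astype(np.float16) @ b.astype(np.float16)  # baseline: round inputs to FP16
    emu = emulated_sgemm(a, b)
    print("naive FP16 rel. error:", np.linalg.norm(naive.astype(np.float32) - ref) / np.linalg.norm(ref))
    print("split-emulated rel. error:", np.linalg.norm(emu - ref) / np.linalg.norm(ref))
```

On a GPU, each of the three partial products would be issued to the half-precision matrix units with FP32 accumulation; the paper's contributions lie in the lightweight precision transformation, the analytic tiling model, and the multi-level pipelining built around such calls.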
Journal introduction:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.