Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors

Nitish Satya Murthy, Francky Catthoor, Marian Verhelst

Journal of Systems Architecture, Volume 154, Article 103236
Published: 2024-07-08 · DOI: 10.1016/j.sysarc.2024.103236
https://www.sciencedirect.com/science/article/pii/S1383762124001735
Citations: 0
Abstract
A continuing rise in DNN usage in distributed and embedded use cases has demanded more efficient hardware execution in the field. Low-precision GeMMs with optimized data formats have played a key role in making networks more memory- and compute-efficient. A recent trend is block-scaled representations, born of tight HW-SW co-optimization, which compress network size by sharing an exponent per data block. Prior work mostly deploys such block-scaled GeMM operations on domain-specific accelerators for optimum efficiency, at the cost of flexibility and ease of deployment. In this work, we exploit and optimize the deployment of block-scaled GeMMs on fully-programmable in-order vector processors using ARM SVE. We define a systematic methodology for design space exploration that optimally matches workload specifications with processor vector lengths, different microkernels, and block sizes and shapes. We introduce efficient intrinsics-based microkernels with effective loop unrolling, and data-transfer-efficient fused requantization strategies that maximize kernel performance while supporting several deployment configurations. We enable generalized block-scaled kernel deployments through tunable block sizes and shapes, accommodating different accuracy-speed trade-off requirements. Utilizing 2D activation blocks instead of conventional 1D blocks, the static and dynamic BS-INT8 configurations yielded on average 3.8x and 2.9x speedups over FP32 models respectively, at no accuracy loss for CNN classification tasks on the CIFAR10/100 datasets.
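To make the block-scaled idea concrete, here is a minimal sketch of a BS-INT8-style representation: each 2D block of a matrix stores INT8 mantissas plus one shared power-of-two exponent. All names, the block shape, and the exponent-selection rule are illustrative assumptions, not the paper's actual implementation or its SVE microkernels.

```python
import numpy as np

def quantize_block_scaled(x, block_shape=(4, 4)):
    """Quantize a float matrix to INT8 with one shared exponent per 2D block.

    Hypothetical sketch: the per-block scale is 2**e, chosen so that the
    largest magnitude in the block fits in [-127, 127] after scaling.
    """
    rows, cols = x.shape
    br, bc = block_shape
    assert rows % br == 0 and cols % bc == 0, "matrix must tile evenly"
    q = np.empty_like(x, dtype=np.int8)
    exps = np.empty((rows // br, cols // bc), dtype=np.int8)
    for i in range(0, rows, br):
        for j in range(0, cols, bc):
            blk = x[i:i + br, j:j + bc]
            amax = np.abs(blk).max()
            # smallest exponent e with amax / 2**e <= 127
            e = int(np.ceil(np.log2(amax / 127.0))) if amax > 0 else 0
            q[i:i + br, j:j + bc] = np.clip(np.round(blk / 2.0 ** e), -127, 127)
            exps[i // br, j // bc] = e
    return q, exps

def dequantize_block_scaled(q, exps, block_shape=(4, 4)):
    """Reconstruct floats by applying each block's shared scale 2**e."""
    br, bc = block_shape
    out = q.astype(np.float32)
    for i in range(exps.shape[0]):
        for j in range(exps.shape[1]):
            out[i * br:(i + 1) * br, j * bc:(j + 1) * bc] *= 2.0 ** exps[i, j]
    return out
```

Because the scale is shared per block rather than per tensor, a 2D block adapts to local dynamic range in both matrix dimensions, which is the intuition behind the paper's 2D-activation-block configurations; an integer GeMM would operate on `q` directly and fold the exponents into a fused requantization step.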
Journal introduction:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.