TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference

Seokhyeon Choi, Kyuhong Shim, Jungwook Choi, Wonyong Sung, B. Shim
{"title":"TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference","authors":"Seokhyeon Choi, Kyuhong Shim, Jungwook Choi, Wonyong Sung, B. Shim","doi":"10.1109/SiPS52927.2021.00028","DOIUrl":null,"url":null,"abstract":"Efficient implementation of deep neural networks on CPU-based systems is very critical because applications proliferate to embedded and Internet of Things (IoT) systems. Many CPUs for personal computers and embedded systems equip Single Instruction Multiple Data (SIMD) instructions, which can be utilized to implement an efficient GEneral Matrix Multiply (GEMM) library that is very necessary for efficient deep neural network implementation. While many deep neural networks show quite good performance even at 1-bit or 2-bit precision, the current CPU instruction and library do not efficiently support arithmetic operations below 8-bit. We propose TernGEMM, a special GEMM library using SIMD instructions for Deep Neural Network (DNN) inference with ternary weights and activations under 8-bit. TernGEMM improves the speed by replacing slow multiply-add with logical operations and also accumulating a number of multiplications without bit expansion operations. We compared the speedup of TernGEMM with tiling optimization and GEMMLowp, an 8-bit precision GEMM library. For Intel CPU, the speedup of ×2.052, ×2.973, and ×2.986 is achieved on ResNet-50, MobileNet-V2, EfficientNet-B0, respectively. For ARM CPU, TernGEMM’s speedup is ×2.143, ×1.765, and ×1.856, respectively.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SiPS52927.2021.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Efficient implementation of deep neural networks on CPU-based systems is critical as applications proliferate to embedded and Internet of Things (IoT) devices. Many CPUs for personal computers and embedded systems provide Single Instruction Multiple Data (SIMD) instructions, which can be used to implement the efficient GEneral Matrix Multiply (GEMM) libraries that deep neural network inference depends on. However, while many deep neural networks perform well even at 1-bit or 2-bit precision, current CPU instructions and libraries do not efficiently support arithmetic below 8 bits. We propose TernGEMM, a GEMM library built on SIMD instructions for Deep Neural Network (DNN) inference with ternary weights and sub-8-bit activations. TernGEMM improves speed by replacing slow multiply-add instructions with logical operations and by accumulating many products without intermediate bit-expansion operations. We compare TernGEMM, with tiling optimization, against GEMMLowp, an 8-bit precision GEMM library. On an Intel CPU, TernGEMM achieves speedups of ×2.052, ×2.973, and ×2.986 on ResNet-50, MobileNet-V2, and EfficientNet-B0, respectively; on an ARM CPU, the speedups are ×2.143, ×1.765, and ×1.856.
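To make the core idea concrete, below is a minimal scalar sketch of how a ternary dot product can be computed with only logical operations and popcount instead of multiply-adds. This is not the paper's actual SIMD kernel: the `TernaryWord` bit-plane layout and the `ternary_dot64` function are illustrative assumptions, and the real library would operate on wide SIMD registers via intrinsics.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// Assumed bit-plane encoding of 64 ternary elements in {-1, 0, +1}:
// bit i of `pos` is set when element i == +1, bit i of `neg` when it == -1.
struct TernaryWord {
    uint64_t pos;
    uint64_t neg;
};

// Dot product of two 64-element ternary vectors using only AND/OR and
// popcount -- no multiply-add. (+1)(+1) and (-1)(-1) contribute +1,
// (+1)(-1) and (-1)(+1) contribute -1, and any factor of 0 contributes 0.
int ternary_dot64(TernaryWord a, TernaryWord b) {
    uint64_t plus  = (a.pos & b.pos) | (a.neg & b.neg);  // products equal to +1
    uint64_t minus = (a.pos & b.neg) | (a.neg & b.pos);  // products equal to -1
    return std::popcount(plus) - std::popcount(minus);
}

int main() {
    // a = (+1, -1, 0, +1, 0, ...), b = (+1, +1, -1, -1, 0, ...)
    TernaryWord a{0b1001, 0b0010};
    TernaryWord b{0b0011, 0b1100};
    // Expected: (+1)(+1) + (-1)(+1) + (0)(-1) + (+1)(-1) = -1
    printf("%d\n", ternary_dot64(a, b));
    return 0;
}
```

In a full kernel, many such per-word partial sums would presumably be accumulated in narrow integer SIMD lanes across iterations before being widened, which is the saving the abstract refers to as accumulating multiplications without bit-expansion operations.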