TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference

Seokhyeon Choi, Kyuhong Shim, Jungwook Choi, Wonyong Sung, B. Shim
{"title":"TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference","authors":"Seokhyeon Choi, Kyuhong Shim, Jungwook Choi, Wonyong Sung, B. Shim","doi":"10.1109/SiPS52927.2021.00028","DOIUrl":null,"url":null,"abstract":"Efficient implementation of deep neural networks on CPU-based systems is very critical because applications proliferate to embedded and Internet of Things (IoT) systems. Many CPUs for personal computers and embedded systems equip Single Instruction Multiple Data (SIMD) instructions, which can be utilized to implement an efficient GEneral Matrix Multiply (GEMM) library that is very necessary for efficient deep neural network implementation. While many deep neural networks show quite good performance even at 1-bit or 2-bit precision, the current CPU instruction and library do not efficiently support arithmetic operations below 8-bit. We propose TernGEMM, a special GEMM library using SIMD instructions for Deep Neural Network (DNN) inference with ternary weights and activations under 8-bit. TernGEMM improves the speed by replacing slow multiply-add with logical operations and also accumulating a number of multiplications without bit expansion operations. We compared the speedup of TernGEMM with tiling optimization and GEMMLowp, an 8-bit precision GEMM library. For Intel CPU, the speedup of ×2.052, ×2.973, and ×2.986 is achieved on ResNet-50, MobileNet-V2, EfficientNet-B0, respectively. For ARM CPU, TernGEMM’s speedup is ×2.143, ×1.765, and ×1.856, respectively.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SiPS52927.2021.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Efficient implementation of deep neural networks on CPU-based systems is critical as applications proliferate to embedded and Internet of Things (IoT) devices. Many CPUs for personal computers and embedded systems provide Single Instruction Multiple Data (SIMD) instructions, which can be used to implement the efficient GEneral Matrix Multiply (GEMM) libraries that deep neural network inference depends on. However, while many deep neural networks perform well even at 1-bit or 2-bit precision, current CPU instructions and libraries do not efficiently support arithmetic below 8 bits. We propose TernGEMM, a GEMM library built on SIMD instructions for Deep Neural Network (DNN) inference with ternary weights and sub-8-bit activations. TernGEMM improves speed by replacing slow multiply-add instructions with logical operations and by accumulating many products without intermediate bit-expansion operations. We compare TernGEMM, with tiling optimization, against GEMMLowp, an 8-bit precision GEMM library. On an Intel CPU, TernGEMM achieves speedups of ×2.052, ×2.973, and ×2.986 on ResNet-50, MobileNet-V2, and EfficientNet-B0, respectively; on an ARM CPU, the speedups are ×2.143, ×1.765, and ×1.856.
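To make the core idea concrete, below is a minimal scalar sketch of how a ternary dot product can be computed with only logical operations and popcount instead of multiply-adds. This is not the paper's actual SIMD kernel: the `TernaryWord` bit-plane layout and the `ternary_dot64` function are illustrative assumptions, and the real library would operate on wide SIMD registers via intrinsics.

```cpp
#include <bit>      // std::popcount (C++20)
#include <cstdint>
#include <cstdio>

// Assumed bit-plane encoding of 64 ternary elements in {-1, 0, +1}:
// bit i of `pos` is set when element i == +1, bit i of `neg` when it == -1.
struct TernaryWord {
    uint64_t pos;
    uint64_t neg;
};

// Dot product of two 64-element ternary vectors using only AND/OR and
// popcount -- no multiply-add. (+1)(+1) and (-1)(-1) contribute +1,
// (+1)(-1) and (-1)(+1) contribute -1, and any factor of 0 contributes 0.
int ternary_dot64(TernaryWord a, TernaryWord b) {
    uint64_t plus  = (a.pos & b.pos) | (a.neg & b.neg);  // products equal to +1
    uint64_t minus = (a.pos & b.neg) | (a.neg & b.pos);  // products equal to -1
    return std::popcount(plus) - std::popcount(minus);
}

int main() {
    // a = (+1, -1, 0, +1, 0, ...), b = (+1, +1, -1, -1, 0, ...)
    TernaryWord a{0b1001, 0b0010};
    TernaryWord b{0b0011, 0b1100};
    // Expected: (+1)(+1) + (-1)(+1) + (0)(-1) + (+1)(-1) = -1
    printf("%d\n", ternary_dot64(a, b));
    return 0;
}
```

In a full kernel, many such per-word partial sums would presumably be accumulated in narrow integer SIMD lanes across iterations before being widened, which is the saving the abstract refers to as accumulating multiplications without bit-expansion operations.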