BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs

Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
{"title":"BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs","authors":"Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee","doi":"10.1109/SC41405.2020.00099","DOIUrl":null,"url":null,"abstract":"The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized (by a non-uniform quantization scheme) into a few bits, CPUs and GPUs may not access multiple quantized weights without memory bandwidth waste. Success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication that is a basic computation engine in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization leads to limited available computation space. Since pre-computed values are stored in lookup tables and reused, BiQGEMM achieves lower amount of overall computations. Our extensive experimental results show that BiQGEMM presents higher performance than conventional schemes when DNNs are quantized.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00099","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computation and the required memory footprint increase as well. Quantization is an efficient way to address these concerns: it compresses DNNs so that computations are simplified while the required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed-width data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized into a few bits (by a non-uniform quantization scheme), CPUs and GPUs cannot access multiple quantized weights without wasting memory bandwidth. The success of quantization in practice therefore relies on an efficient computation engine, especially for matrix multiplication, which is the basic computational kernel in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization restricts the space of possible operand values. Since the pre-computed values are stored in lookup tables and reused, BiQGEMM lowers the overall amount of computation. Our extensive experimental results show that BiQGEMM outperforms conventional schemes when DNNs are quantized.
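The mechanism the abstract describes relies on binary coding: a weight matrix is approximated as W ≈ Σ_i α_i B_i with B_i ∈ {−1, +1}, so for any group of μ activations there are only 2^μ possible partial dot products, which can be pre-computed once into a lookup table and reused across output rows and bit planes. The NumPy sketch below illustrates that idea in a simplified matrix-vector setting; the function name, group size μ, and data layout are illustrative assumptions, not the paper's actual kernel, which operates on packed bit vectors with vectorized table lookups.

```python
import numpy as np

def biqgemm_like_lut_matvec(weight_bits, alpha, x, mu=8):
    """Lookup-table matrix-vector product for binary-coded weights (sketch).

    weight_bits: (num_bases, M, K) array of {-1, +1} binary bases B_i
    alpha:       (num_bases, M) per-row scales, so W ~= sum_i alpha_i * B_i
    x:           (K,) input activation vector
    mu:          number of weights grouped per lookup (table has 2**mu entries)
    """
    num_bases, M, K = weight_bits.shape
    assert K % mu == 0, "K must be divisible by the group size in this sketch"
    num_groups = K // mu

    # Pre-compute, for each group of mu activations, the partial sum for every
    # possible {-1,+1} pattern. These values are shared by all output rows and
    # all binary bases, which is where the redundancy savings come from.
    patterns = np.array(
        [[1 if (idx >> b) & 1 else -1 for b in range(mu)] for idx in range(2 ** mu)]
    )                                        # (2**mu, mu)
    x_groups = x.reshape(num_groups, mu)     # (num_groups, mu)
    lut = patterns @ x_groups.T              # (2**mu, num_groups)

    # Encode each mu-long slice of a binary weight row as an integer index,
    # then accumulate table lookups instead of multiply-adds.
    y = np.zeros(M)
    bit_weights = 1 << np.arange(mu)
    for i in range(num_bases):
        b01 = (weight_bits[i] > 0).astype(np.int64).reshape(M, num_groups, mu)
        idx = (b01 * bit_weights).sum(axis=2)                   # (M, num_groups)
        partial = lut[idx, np.arange(num_groups)].sum(axis=1)   # (M,)
        y += alpha[i] * partial
    return y

if __name__ == "__main__":
    # Self-check against the dense reference product W @ x.
    rng = np.random.default_rng(0)
    num_bases, M, K = 2, 4, 16
    B = rng.choice([-1, 1], size=(num_bases, M, K))
    alpha = rng.random((num_bases, M))
    x = rng.random(K)
    W = np.einsum("im,imk->mk", alpha, B)
    assert np.allclose(biqgemm_like_lut_matvec(B, alpha, x), W @ x)
```

With this structure, each group of μ weights costs one table lookup and one add instead of μ multiply-accumulates, and the 2^μ-entry table is amortized over every output row, which is the source of the computation savings the abstract claims.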