BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs

Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee
{"title":"BiQGEMM: Matrix Multiplication with Lookup Table for Binary-Coding-Based Quantized DNNs","authors":"Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee","doi":"10.1109/SC41405.2020.00099","DOIUrl":null,"url":null,"abstract":"The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computations and required memory footprint increase as well. Quantization is an efficient method to address such concerns by compressing DNNs such that computations can be simplified while required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized (by a non-uniform quantization scheme) into a few bits, CPUs and GPUs may not access multiple quantized weights without memory bandwidth waste. Success of quantization in practice, hence, relies on an efficient computation engine design, especially for matrix multiplication that is a basic computation engine in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization leads to limited available computation space. Since pre-computed values are stored in lookup tables and reused, BiQGEMM achieves lower amount of overall computations. Our extensive experimental results show that BiQGEMM presents higher performance than conventional schemes when DNNs are quantized.","PeriodicalId":424429,"journal":{"name":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SC20: International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC41405.2020.00099","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 21

Abstract

The number of parameters in deep neural networks (DNNs) is rapidly increasing to support complicated tasks and to improve model accuracy. Correspondingly, the amount of computation and the required memory footprint increase as well. Quantization is an efficient way to address these concerns: it compresses DNNs so that computations are simplified while the required storage footprint is significantly reduced. Unfortunately, commercial CPUs and GPUs do not fully support quantization because only fixed-width data transfers (such as 32 bits) are allowed. As a result, even if weights are quantized into a few bits (by a non-uniform quantization scheme), CPUs and GPUs cannot access multiple quantized weights without wasting memory bandwidth. The success of quantization in practice therefore relies on an efficient computation engine, especially for matrix multiplication, which is the basic computational kernel in most DNNs. In this paper, we propose a novel matrix multiplication method, called BiQGEMM, dedicated to quantized DNNs. BiQGEMM can access multiple quantized weights simultaneously in one instruction. In addition, BiQGEMM pre-computes intermediate results that are highly redundant when quantization restricts the space of possible operand values. Since the pre-computed values are stored in lookup tables and reused, BiQGEMM lowers the overall amount of computation. Our extensive experimental results show that BiQGEMM outperforms conventional schemes when DNNs are quantized.
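The mechanism the abstract describes relies on binary coding: a weight matrix is approximated as W ≈ Σ_i α_i B_i with B_i ∈ {−1, +1}, so for any group of μ activations there are only 2^μ possible partial dot products, which can be pre-computed once into a lookup table and reused across output rows and bit planes. The NumPy sketch below illustrates that idea in a simplified matrix-vector setting; the function name, group size μ, and data layout are illustrative assumptions, not the paper's actual kernel, which operates on packed bit vectors with vectorized table lookups.

```python
import numpy as np

def biqgemm_like_lut_matvec(weight_bits, alpha, x, mu=8):
    """Lookup-table matrix-vector product for binary-coded weights (sketch).

    weight_bits: (num_bases, M, K) array of {-1, +1} binary bases B_i
    alpha:       (num_bases, M) per-row scales, so W ~= sum_i alpha_i * B_i
    x:           (K,) input activation vector
    mu:          number of weights grouped per lookup (table has 2**mu entries)
    """
    num_bases, M, K = weight_bits.shape
    assert K % mu == 0, "K must be divisible by the group size in this sketch"
    num_groups = K // mu

    # Pre-compute, for each group of mu activations, the partial sum for every
    # possible {-1,+1} pattern. These values are shared by all output rows and
    # all binary bases, which is where the redundancy savings come from.
    patterns = np.array(
        [[1 if (idx >> b) & 1 else -1 for b in range(mu)] for idx in range(2 ** mu)]
    )                                        # (2**mu, mu)
    x_groups = x.reshape(num_groups, mu)     # (num_groups, mu)
    lut = patterns @ x_groups.T              # (2**mu, num_groups)

    # Encode each mu-long slice of a binary weight row as an integer index,
    # then accumulate table lookups instead of multiply-adds.
    y = np.zeros(M)
    bit_weights = 1 << np.arange(mu)
    for i in range(num_bases):
        b01 = (weight_bits[i] > 0).astype(np.int64).reshape(M, num_groups, mu)
        idx = (b01 * bit_weights).sum(axis=2)                   # (M, num_groups)
        partial = lut[idx, np.arange(num_groups)].sum(axis=1)   # (M,)
        y += alpha[i] * partial
    return y

if __name__ == "__main__":
    # Self-check against the dense reference product W @ x.
    rng = np.random.default_rng(0)
    num_bases, M, K = 2, 4, 16
    B = rng.choice([-1, 1], size=(num_bases, M, K))
    alpha = rng.random((num_bases, M))
    x = rng.random(K)
    W = np.einsum("im,imk->mk", alpha, B)
    assert np.allclose(biqgemm_like_lut_matvec(B, alpha, x), W @ x)
```

With this structure, each group of μ weights costs one table lookup and one add instead of μ multiply-accumulates, and the 2^μ-entry table is amortized over every output row, which is the source of the computation savings the abstract claims.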