Xiaotian Zhao;Ruge Xu;Yimin Gao;Vaibhav Verma;Mircea R. Stan;Xinfei Guo
As one of the prevailing deep neural network compression techniques, layer-wise mixed-precision quantization (MPQ) strikes a better balance between accuracy and efficiency than uniform quantization schemes. However, existing MPQ strategies either lack hardware awareness or incur substantial computation costs, limiting their deployment at the edge. Additionally, researchers usually make a one-time decision between post-training quantization (PTQ) and quantization-aware training (QAT) based on the quantized bit-width or hardware requirements. In this paper, we propose tightly integrating versatile MPQ inference units, which support INT2-INT8 and INT16 precisions and feature a hierarchical multiplier architecture, into a RISC-V processor pipeline through micro-architecture and Instruction Set Architecture (ISA) co-design. Synthesized in a 14nm technology, the design delivers a speedup of $15.50\times$
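To illustrate the layer-wise MPQ idea the abstract refers to, below is a minimal Python sketch of per-layer symmetric integer quantization with different bit-widths per layer. The layer names, bit-width assignments, and quantization scheme here are illustrative assumptions, not the paper's hardware units or its MPQ search strategy.

```python
# Minimal sketch of layer-wise mixed-precision quantization (illustrative only;
# layer names and bit-width choices are hypothetical, not the paper's method).
import numpy as np

def quantize_symmetric(weights, bits):
    """Symmetric per-tensor integer quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for INT8, 1 for INT2
    scale = np.max(np.abs(weights)) / qmax     # map the largest magnitude to qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Hypothetical per-layer bit-width assignment within the INT2-INT8 range.
layer_bits = {"conv1": 8, "conv2": 4, "fc": 2}
layers = {name: np.random.randn(64, 64).astype(np.float32) for name in layer_bits}

for name, w in layers.items():
    q, scale = quantize_symmetric(w, layer_bits[name])
    w_hat = q.astype(np.float32) * scale       # dequantize to estimate error
    mse = np.mean((w - w_hat) ** 2)
    print(f"{name}: INT{layer_bits[name]}, scale={scale:.4f}, MSE={mse:.6f}")
```

The sketch shows only the software-level effect of assigning lower bit-widths to some layers (larger quantization error, smaller storage); the paper's contribution lies in executing such mixed-precision layers efficiently on the proposed RISC-V pipeline.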