Cambricon: An Instruction Set Architecture for Neural Networks

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). Pub Date: 2016-06-18. DOI: 10.1145/3007787.3001179

Shaoli Liu, Zidong Du, Jinhua Tao, D. Han, Tao Luo, Yuan Xie, Yunji Chen, Tianshi Chen

Pages: 393-405

Citations: 271

Abstract

Neural Networks (NN) are a family of models for a broad range of emerging machine learning and pattern recognition applications. NN techniques are conventionally executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy-efficient since they invest excessive hardware resources to flexibly support various workloads. Consequently, application-specific hardware accelerators for neural networks have been proposed recently to improve energy efficiency. However, such accelerators were designed for a small set of NN techniques sharing similar computational patterns, and they adopt complex and informative instructions (control signals) that correspond directly to high-level functional blocks of an NN (such as layers), or even to an NN as a whole. Although this approach is straightforward and easy to implement for a limited set of similar NN techniques, the lack of agility in the instruction set prevents such accelerator designs from supporting a variety of different NN techniques with sufficient flexibility and efficiency. In this paper, we propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture that integrates scalar, vector, matrix, logical, data transfer, and control instructions, based on a comprehensive analysis of existing NN techniques. Our evaluation over a total of ten representative yet distinct NN techniques has demonstrated that Cambricon exhibits strong descriptive capacity over a broad range of NN techniques, and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU. Compared to the latest state-of-the-art NN accelerator design DaDianNao [5] (which can only accommodate 3 types of NN techniques), our Cambricon-based accelerator prototype implemented in TSMC 65nm technology incurs only negligible latency/power/area overheads, with a versatile coverage of 10 different NN benchmarks.
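
To make the load-store idea in the abstract concrete, the following is a minimal sketch of a toy interpreter in which compute instructions operate only on an on-chip scratchpad, and only data transfer instructions touch main memory. The opcode names (`LOAD`, `STORE`, `MMV`, `VADD`, `VSIG`), the scratchpad model, and the program encoding are hypothetical choices made for illustration; they are not taken from the abstract and should not be read as the actual Cambricon instruction set.

```python
import math


def run(program, memory):
    """Interpret a toy load-store program and return the final main memory.

    All opcodes here are hypothetical, chosen only to illustrate the
    instruction categories named in the abstract.
    """
    scratch = {}  # on-chip scratchpad: name -> vector or matrix
    for op, *args in program:
        if op == "LOAD":      # data transfer: main memory -> scratchpad
            dst, src = args
            scratch[dst] = memory[src]
        elif op == "STORE":   # data transfer: scratchpad -> main memory
            dst, src = args
            memory[dst] = scratch[src]
        elif op == "MMV":     # matrix instruction: dst = matrix * vector
            dst, mat, vec = args
            scratch[dst] = [
                sum(w * x for w, x in zip(row, scratch[vec]))
                for row in scratch[mat]
            ]
        elif op == "VADD":    # vector instruction: elementwise addition
            dst, a, b = args
            scratch[dst] = [x + y for x, y in zip(scratch[a], scratch[b])]
        elif op == "VSIG":    # vector instruction: elementwise sigmoid
            dst, a = args
            scratch[dst] = [1.0 / (1.0 + math.exp(-x)) for x in scratch[a]]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


# A fully connected layer, y = sigmoid(W x + b), written as a sequence of
# small instructions rather than as one layer-level instruction.
FC_LAYER = [
    ("LOAD", "w", "W"),
    ("LOAD", "x", "x"),
    ("LOAD", "b", "b"),
    ("MMV", "t", "w", "x"),
    ("VADD", "t", "t", "b"),
    ("VSIG", "y", "t"),
    ("STORE", "y", "y"),
]

result = run(FC_LAYER, {"W": [[1.0, 0.0], [0.0, 1.0]],
                        "x": [0.0, 0.0],
                        "b": [0.0, 0.0]})
print(result["y"])
```

Under these assumptions, a different NN technique would be expressed by recombining the same small instructions into a different program, which is the kind of flexibility the abstract contrasts with instructions that encode a whole layer or a whole network.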


