
Latest publications: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

Area Efficient Box Filter Acceleration by Parallelizing with Optimized Adder Tree
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00019
Xinzhe Liu, Fupeng Chen, Y. Ha
Box filters are widely used in image and video processing applications. To achieve real-time performance in these applications, designers may need to parallelize the box filters. However, implementing a parallel box filter on a modern programmable system-on-chip (SoC) is very challenging. On one hand, the dependencies between the operations of a box filter are too strong to allow parallelism. On the other hand, more adder trees are required as the degree of parallelism increases. In this paper, we propose a performance- and area-efficient box filter. It uses the partial sum difference, which requires far fewer resources, to compute the box filter effectively. We make full use of this reusable partial sum to optimize the adder trees for parallel processing. We also present two case studies that apply the box filter to the guided filter and to a stereo matching algorithm on a programmable SoC using a C-based design flow. Our method removes the dependencies between the parallel operations of the box filter. Compared to the state of the art, results show that the computational complexity of the adder tree for a single pixel is reduced from O(R^2) to O((R+N)lgN/N) on average. Resource usage drops by orders of magnitude for large filter size R and parallelization degree N. The throughput can be increased by N times, where N is up to 72 in the case of the Xilinx FPGA board XCZU9EG.
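The partial-sum reuse at the heart of this paper can be illustrated with the classic summed-area-table form of a box filter. This is a software sketch of the general idea only, not the authors' adder-tree hardware: the naive version does O(R^2) additions per pixel, while the partial-sum version reuses precomputed sums so each output needs just a few additions.

```python
def box_filter_naive(img, r):
    """Naive box filter: O(R^2) additions per pixel, where R = 2*r + 1."""
    h, w = len(img), len(img[0])
    return [[sum(img[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]

def box_filter_partial_sums(img, r):
    """Box filter via reusable partial sums (a summed-area table):
    each output pixel costs one subtraction-style combination of
    four precomputed sums, independent of the window size."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y][x] = ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
    return out
```

Both functions produce identical sums over clamped windows; only the amount of redundant addition differs, which is exactly the redundancy an optimized adder tree removes.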
Citations: 1
Automated Communication and Floorplan-Aware Hardware/Software Co-Design for SoC
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00032
Jong Bin Lim, Deming Chen
The main objective of modern SoC (System-on-Chip) design is to achieve high performance while maintaining low power consumption and resource usage. However, achieving such a goal is a difficult and time-consuming engineering task due to the vast design space of hardware accelerators and HW/SW task partitioning. Depending on the partitioning decision, communication between parts of the SoC must also be optimized so that the overall runtime, including both computation and communication, is fast. In this paper, we propose an automated approach that iteratively searches for a near-optimal SoC design with minimum latency within the targeted power and resource budget. Our approach consists of the following main components: (1) polyhedral-model-based hardware accelerator design space exploration, (2) modeling of various communication types and their integration into LLVM-based integer linear programming for HW/SW task partitioning, (3) a fast and efficient search algorithm to extract the maximum operating frequency using a floorplanner, and (4) back-annotation of the extracted information to the system level for iterative partitioning. Using FPGAs as the target platform, we demonstrate that our approach consistently outperforms the previous state-of-the-art solutions for automated HW/SW co-design by 37.8% on average and by up to 75.2% for certain designs.
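The partitioning objective described above, computation latency plus cross-boundary communication cost under a hardware resource budget, can be sketched with a tiny exhaustive search. The paper itself solves this with LLVM-based integer linear programming over real profiles; the task names and costs below are illustrative assumptions only.

```python
from itertools import product

def partition(tasks, edges, hw_budget):
    """Exhaustively pick a HW/SW assignment minimizing total latency.
    tasks: {name: (sw_time, hw_time, hw_area)}.
    edges: {(a, b): comm_time}, charged only when a and b end up on
    different sides of the HW/SW boundary."""
    names = list(tasks)
    best = (float("inf"), None)
    for assign in product((0, 1), repeat=len(names)):  # 1 = mapped to HW
        side = dict(zip(names, assign))
        area = sum(tasks[n][2] for n in names if side[n])
        if area > hw_budget:          # respect the resource budget
            continue
        lat = sum(tasks[n][1] if side[n] else tasks[n][0] for n in names)
        lat += sum(c for (a, b), c in edges.items() if side[a] != side[b])
        if lat < best[0]:
            best = (lat, side)
    return best
```

With three hypothetical tasks, moving a tightly coupled pair to hardware together avoids paying their communication cost across the boundary, which is the effect the co-design loop exploits.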
Citations: 2
Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00099
Weiguang Chen, Z. Wang, Shanliao Li, Zhibin Yu, Huijuan Li
Recent advances in convolutional neural networks (CNNs) reveal a trend towards compact structures such as MobileNet, which adopts variations of traditional computing kernels such as pointwise and depthwise convolution. Such modified operations significantly reduce model size with only a slight degradation in inference accuracy. State-of-the-art neural accelerators have not yet fully exploited the algorithmic parallelism of such computing kernels in compact CNNs. In this work, we propose a multithreaded data streaming architecture for fast and highly parallel execution of pointwise and depthwise convolution, which can also be dynamically reconfigured to process conventional convolution, pooling, and fully connected network layers. The architecture achieves efficient memory bandwidth utilization by exploiting two modes of data alignment. We profile MobileNet on the proposed architecture and demonstrate a 9.36x speed-up compared to a single-threaded architecture.
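The arithmetic saving from replacing a standard convolution with depthwise plus pointwise stages is easy to quantify. A minimal MAC-count sketch follows; the layer dimensions used in the test are hypothetical, not taken from the paper.

```python
def conv_macs(h, w, cin, cout, k):
    """MACs for a standard k x k convolution (stride 1, 'same' padding)."""
    return h * w * cin * cout * k * k

def separable_macs(h, w, cin, cout, k):
    """MACs for a depthwise k x k convolution (one filter per channel)
    followed by a pointwise 1 x 1 convolution across channels."""
    depthwise = h * w * cin * k * k
    pointwise = h * w * cin * cout
    return depthwise + pointwise
```

For a hypothetical 56x56 feature map with 128 input and 128 output channels and a 3x3 kernel, the separable form needs roughly 8.4x fewer MACs, which is the kind of kernel the multithreaded streaming datapath is built to keep busy.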
Citations: 4
A 1.8mW Perception Chip with Near-Sensor Processing Scheme for Low-Power AIoT Applications
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00087
Zheyu Liu, Erxiang Ren, Li Luo, Qi Wei, Xing Wu, Xueqing Li, F. Qiao, Xinjun Liu, Huazhong Yang
In the past few years, the demand for intelligence in IoT front-end devices has dramatically increased. However, such devices face the challenges of limited on-chip resources and strict power or energy constraints. Recent progress in binarized neural networks has provided promising solutions for front-end processing systems to conduct simple detection and classification tasks by trading off processing quality against computation complexity. In this paper, we propose a mixed-signal perception chip in which an ADC-free 32x32 image sensor and a BNN processing array are directly integrated in a 180nm standard CMOS process. Taking advantage of the ADC-free processing architecture, the whole processing system consumes only 1.8mW of power while providing up to 545.4 GOPS/W energy efficiency. The implementation performance and energy efficiency are comparable with state-of-the-art designs in much more advanced CMOS technologies. This work provides a promising alternative for low-power IoT intelligent applications.
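Binarized networks reduce multiply-accumulates to bitwise logic, which is what makes such low-power datapaths feasible. A minimal sketch of the standard XNOR/popcount dot-product trick (the paper's mixed-signal datapath is of course more involved than this software model):

```python
def pack(v):
    """Pack a +/-1 vector into a bit mask: +1 -> 1, -1 -> 0 (LSB = element 0)."""
    bits = 0
    for i, x in enumerate(v):
        if x == 1:
            bits |= 1 << i
    return bits

def bnn_dot(a_bits, w_bits, n):
    """Binarized dot product over n packed +/-1 values.
    Matching bit positions contribute +1 and mismatches -1, so
    dot(a, w) = n - 2 * popcount(a XOR w)."""
    return n - 2 * bin(a_bits ^ w_bits).count("1")
```

A whole n-element dot product collapses to one XOR and one population count, which hardware implements with a compact logic array instead of multipliers.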
Citations: 7
Formal Verification of Integer Dividers: Division by a Constant
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00022
Atif Yasin, Tiankai Su, S. Pillement, M. Ciesielski
Division is one of the most complex and hardest to verify arithmetic operations. While verification of major arithmetic operators, such as adders and multipliers, has progressed significantly in recent years, less attention has been devoted to the formal verification of dividers. A type of divider often used in embedded systems is division by a constant. This paper presents a formal verification method, based on the computer algebra approach, for different divide-by-constant architectures and for generic restoring dividers. Our experiments on different divider architectures, and a comparison with exhaustive simulation, demonstrate the effectiveness and scalability of the method.
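The exhaustive-simulation baseline the paper compares against can be sketched for a toy case: an 8-bit divide-by-3 in the multiply-and-shift form that divide-by-constant hardware typically takes. The magic constant below is chosen for this toy width and is an assumption for illustration, not the paper's algebraic method.

```python
def div3(x):
    """8-bit unsigned divide-by-3 as a multiply plus shift.
    m = ceil(2**10 / 3) = 342 makes (x * m) >> 10 exact for 0 <= x < 256:
    the accumulated error x * 2 / 3072 stays below the gap to the next
    integer for every representable input."""
    return (x * 342) >> 10
```

Exhaustive simulation checks every input of the bit-width, which is feasible at 8 bits but scales exponentially; the paper's computer-algebra approach avoids that blow-up.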
Citations: 6
Security in Many-Core SoCs Leveraged by Opaque Secure Zones
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00091
L. L. Caimi, F. Moraes
This paper presents an original approach to protect the execution of applications with security constraints in many-core systems. The proposed method includes three defense mechanisms. The first is application admission into the many-core system using ECDH and MAC techniques. The second is the spatial reservation of computation and communication resources, resulting in an Opaque Secure Zone (OSZ). The key feature enabling the runtime creation of OSZs is a rerouting mechanism responsible for diverting any traffic traversing an OSZ. The last mechanism is access to peripherals using a secure protocol to open access points at the OSZ border, together with lightweight encryption mechanisms.
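The admission step rests on a key agreement followed by a MAC. As a hedged sketch of that structure, classic finite-field Diffie-Hellman is shown below; the paper uses the elliptic-curve variant (ECDH), and the toy parameters here are for illustration only, not secure choices.

```python
import hashlib
import secrets

# Toy finite-field Diffie-Hellman illustrating the admission handshake's
# key agreement. NOT secure parameters: a real system would use ECDH or
# a standardized group.
P = 2**127 - 1   # a Mersenne prime, used here as a toy modulus
G = 3

a = secrets.randbelow(P - 2) + 1      # many-core side: private key
b = secrets.randbelow(P - 2) + 1      # application side: private key
A = pow(G, a, P)                      # public keys, exchanged in the clear
B = pow(G, b, P)

k_sys = pow(B, a, P)                  # each side combines its private key
k_app = pow(A, b, P)                  # with the other's public key
assert k_sys == k_app                 # both arrive at the same shared secret

# The shared secret can then key a MAC that authenticates admission messages.
mac_key = hashlib.sha256(k_sys.to_bytes(16, "big")).digest()
```

An eavesdropper sees only G, P, A, and B; recovering the shared secret requires solving a discrete logarithm, which is what makes the subsequent MAC binding to the admitted application meaningful.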
Citations: 11
Title Page iii
Pub Date : 2019-07-01 DOI: 10.1109/isvlsi.2019.00002
Citations: 0
Traffic Driven Automated Synthesis of Network-on-Chip from Physically Aware Behavioral Specification
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00031
Anup Gangwar, Zheng Xu, N. Agarwal, Ravishankar Sreedharan, Ambica Prasad
The process of laying out the various interconnect components and configuring them is termed interconnect synthesis. A Network-on-Chip (NoC) has various building blocks such as routers, resizers, power and clock domain converters (PCDCs), and pipeline elements. A software tool is needed to connect these components (the topology) and then configure them (including routing) so that the user constraints are met and the overall logic and wiring cost, along with power, is kept low. In this paper we present a tool that generates Power, Performance and Area (PPA) optimized NoCs. The input is a behavioral specification consisting of a rough floorplan, bridge parameters, available clock, power and voltage domains, address spaces, stochastic traffic (including classes and latency criticality), traffic dependencies, and any partial topology for the locked-down portions of the NoC. The output is an optimized NoC, with instantiation and placement of components (routers, resizers, etc.), virtual channel (VC) assignments, clock-domain assignments, routing, bridge parameter tuning, FIFO sizes, and so on. Using this flow, we are able to generate NoCs that come within 15% of hand-tuned designs (optimized over several months) on various metrics and exceed critical metrics by as much as 30%.
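Evaluating a candidate topology against a traffic specification needs at least per-flow hop counts and latencies. A minimal sketch under the common XY (dimension-ordered) routing assumption follows; the tool's actual traffic-driven cost model is far richer, so the uniform per-hop delays here are illustrative assumptions.

```python
def xy_route(src, dst):
    """XY (dimension-ordered) routing on a 2D mesh: travel along the X
    dimension first, then along Y. Returns the routers traversed."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

def flow_latency(src, dst, router_delay=1, link_delay=1):
    """Zero-load latency estimate: hops times per-hop router + link delay."""
    hops = len(xy_route(src, dst)) - 1
    return hops * (router_delay + link_delay)
```

Summing such per-flow estimates, weighted by the stochastic traffic classes in the specification, gives the kind of objective an automated synthesis loop can iterate on.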
Citations: 0
A Low-Complexity RS Decoder for Triple-Error-Correcting RS Codes
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00094
Zengchao Yan, Jun Lin, Zhongfeng Wang
Reed-Solomon (RS) codes have been widely used in digital communication and storage systems. Commonly used decoding algorithms include the Berlekamp-Massey (BM) algorithm and its variants, such as the inversionless BM (iBM) and the reformulated inversionless BM (RiBM). All these algorithms require computation-intensive procedures, including the key equation solver (KES) and the Chien search and Forney algorithm (CS&F). For RS codes with error correction ability t ≤ 2, it is known that error locations and magnitudes can be found through a direct equation solver. However, for RS codes with t = 3, no such work has been reported yet. In this paper, a low-complexity algorithm for triple-error-correcting RS codes is proposed, and an optimized architecture for it is developed. For a (255, 239) RS code over GF(2^8), synthesis results show that the area efficiency of the proposed decoder is 217% higher than that of the conventional RiBM-based RS decoder in a 4-parallel configuration. As the degree of parallelism increases, the gain rises to 364% in the 16-parallel architecture. The synthesis results show that the proposed decoder for the given example RS code can achieve a throughput as high as 124 Gb/s.
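Any RS decoder, whether BM-based or built on a direct equation solver, starts from syndrome computation over GF(2^8). A minimal sketch follows; the primitive polynomial 0x11D and the narrow-sense roots α^1..α^2t are common conventions assumed here for illustration, not details taken from the paper.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply in GF(2^8), primitive polynomial x^8 + x^4 + x^3 + x^2 + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:       # reduce modulo the field polynomial
            a ^= poly
        b >>= 1
    return r

def poly_eval(p, x):
    """Horner evaluation in GF(2^8); p lists coefficients, highest degree first."""
    y = 0
    for c in p:
        y = gf_mul(y, x) ^ c
    return y

def syndromes(received, twot):
    """S_j = r(alpha^j) for j = 1..2t with alpha = 2.
    All-zero syndromes mean no error was detected; the (non)zero pattern
    feeds whatever solver recovers error locations and magnitudes."""
    s, apow = [], 1
    for _ in range(twot):
        apow = gf_mul(apow, 2)
        s.append(poly_eval(received, apow))
    return s
```

For t = 3 the six syndrome values S_1..S_6 are exactly the inputs a direct equation solver works from, replacing the iterative KES step.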
Citations: 2
Fast-ABC: A Fast Architecture for Bottleneck-Like Based Convolutional Neural Networks
Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00010
Xiaoru Xie, Fangxuan Sun, Jun Lin, Zhongfeng Wang
In recent years, efficient inference of neural networks has become one of the most popular research fields. To reduce the required number of computations and weights, many efforts have been made to construct lightweight networks (LWNs), in which bottleneck-like operations (BLOs) are widely adopted. However, most current hardware accelerators are unable to utilize the optimization space for BLOs. This paper first shows, via both theoretical analysis and experimental results, that the conventional computational flows employed by most existing accelerators incur extremely low resource utilization ratios due to the extremely high DRAM bandwidth requirements of these LWNs. To address this issue, a partial fusion strategy that drastically reduces the bandwidth requirement is proposed. Additionally, the Winograd algorithm is employed to further reduce the computational complexity. Based on these, an efficient accelerator for BLO-based networks, called Fast Architecture for Bottleneck-like based Convolutional neural networks (Fast-ABC), is proposed. Fast-ABC is implemented on an Altera Stratix V GSMD8 and achieves a very high throughput of up to 137 fps on ResNet-18 and 264 fps on MobileNetV2. Implementation results show that the proposed architecture significantly improves throughput on LWNs compared with prior art, at a much lower resource cost.
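The Winograd saving mentioned above is visible in the smallest case, F(2,3), which produces two outputs of a 3-tap 1D convolution with 4 multiplies instead of the direct method's 6. A sketch of the standard transform (the paper applies the same family of transforms in 2D hardware):

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap correlation from four
    inputs d[0..3] and filter g[0..2], using only 4 multiplications."""
    # Input transform (B^T d): 4 additions/subtractions
    t0 = d[0] - d[2]
    t1 = d[1] + d[2]
    t2 = d[2] - d[1]
    t3 = d[1] - d[3]
    # Filter transform (G g): precomputable per filter
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The 4 multiplies (a direct computation would need 6)
    m0, m1, m2, m3 = t0 * u0, t1 * u1, t2 * u2, t3 * u3
    # Output transform (A^T m)
    return [m0 + m1 + m2, m1 - m2 - m3]
```

Since the filter transform is computed once per trained filter, the steady-state cost is the 4 multiplies plus a handful of additions, which is where the hardware complexity reduction comes from.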
Cited: 4
Journal: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)