Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00019
Xinzhe Liu, Fupeng Chen, Y. Ha
Box filters are widely used in image and video processing applications. To achieve the real-time performance for these applications, designers may need to parallelize these box filters. However, it is very challenging to implement a parallel box filter on modern programmable system-on-chip (SoC). On one hand, the dependency between the operations of a box filter is too strong to achieve parallelism. On the other hand, more adder trees are required as the degree of parallelism increases. In this paper, we propose a performance and area efficient boxfilter. It uses the partial sum difference, which needs much less resources, to effectively calculate the box filter. We make the full use of this reusable partial sum to optimize the adder trees for parallel processing. We also make two case studies of the box filter by applying it to the guided filter and the stereo matching algorithm on a programmable SoC using a C-based design flow. Our method removes the dependencies between the parallel operations of the box filter. Compare to the state-of-the-art, results show that the computational complexity of the adder tree for a single pixel has been reduced from O(R^2) to O((R+N)lgN/N ) on average. There are orders of magnitude reduction in resource usage with large filter size R and parallelization degree N. The throughput can be increased by N times, where N is up to 72 in the case of Xilinx FPGA board XCZU9EG.
{"title":"Area Efficient Box Filter Acceleration by Parallelizing with Optimized Adder Tree","authors":"Xinzhe Liu, Fupeng Chen, Y. Ha","doi":"10.1109/ISVLSI.2019.00019","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00019","url":null,"abstract":"Box filters are widely used in image and video processing applications. To achieve the real-time performance for these applications, designers may need to parallelize these box filters. However, it is very challenging to implement a parallel box filter on modern programmable system-on-chip (SoC). On one hand, the dependency between the operations of a box filter is too strong to achieve parallelism. On the other hand, more adder trees are required as the degree of parallelism increases. In this paper, we propose a performance and area efficient boxfilter. It uses the partial sum difference, which needs much less resources, to effectively calculate the box filter. We make the full use of this reusable partial sum to optimize the adder trees for parallel processing. We also make two case studies of the box filter by applying it to the guided filter and the stereo matching algorithm on a programmable SoC using a C-based design flow. Our method removes the dependencies between the parallel operations of the box filter. Compare to the state-of-the-art, results show that the computational complexity of the adder tree for a single pixel has been reduced from O(R^2) to O((R+N)lgN/N ) on average. There are orders of magnitude reduction in resource usage with large filter size R and parallelization degree N. The throughput can be increased by N times, where N is up to 72 in the case of Xilinx FPGA board XCZU9EG.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"30 1","pages":"55-60"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81320133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00032
Jong Bin Lim, Deming Chen
The main objective of modern SoC (System-on-Chip) designs is to achieve high-performance while maintaining low power consumption and resource usage. However, achieving such a goal is a difficult and time-consuming engineering task due to the vast design space of hardware accelerators and HW/SW task partitioning. Depending on the partitioning decision, communication between parts of the SoC must be also optimized such that the overall runtime including both computation and communication would be fast. In this paper, we propose an automated approach to iteratively search for a near-optimal SoC design with minimum latency within the targeted power and resource budget. Our approach consists of the following main components: (1) polyhedral-model-based hardware accelerator design space exploration, (2) modeling of various communication types and integration into LLVM-based integer linear programming for HW/SW task partitioning, (3) fast and efficient search algorithm to extract maximum operating frequency using floorplanner, and (4) back-annotation of extracted information to system level for iterative partitioning. Using FPGA as the target platform, we demonstrate that our approach consistently outperforms the previous state-of-the-art solutions for automated HW/SW co-design by 37.8% on average and up to 75.2% for certain designs.
{"title":"Automated Communication and Floorplan-Aware Hardware/Software Co-Design for SoC","authors":"Jong Bin Lim, Deming Chen","doi":"10.1109/ISVLSI.2019.00032","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00032","url":null,"abstract":"The main objective of modern SoC (System-on-Chip) designs is to achieve high-performance while maintaining low power consumption and resource usage. However, achieving such a goal is a difficult and time-consuming engineering task due to the vast design space of hardware accelerators and HW/SW task partitioning. Depending on the partitioning decision, communication between parts of the SoC must be also optimized such that the overall runtime including both computation and communication would be fast. In this paper, we propose an automated approach to iteratively search for a near-optimal SoC design with minimum latency within the targeted power and resource budget. Our approach consists of the following main components: (1) polyhedral-model-based hardware accelerator design space exploration, (2) modeling of various communication types and integration into LLVM-based integer linear programming for HW/SW task partitioning, (3) fast and efficient search algorithm to extract maximum operating frequency using floorplanner, and (4) back-annotation of extracted information to system level for iterative partitioning. Using FPGA as the target platform, we demonstrate that our approach consistently outperforms the previous state-of-the-art solutions for automated HW/SW co-design by 37.8% on average and up to 75.2% for certain designs.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"20 1","pages":"128-133"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81865380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00099
Weiguang Chen, Z. Wang, Shanliao Li, Zhibin Yu, Huijuan Li
Recent advances in convolutional neural networks (CNNs) reveal the trend towards designing compact structures such as MobileNet, which adopts variations of traditional computing kernels such as pointwise and depthwise convolution. Such modified operations significantly reduce model size with an only slight degradation in inference accuracy. State-of-the-art neural accelerators have not yet fully exploit algorithmic parallelism for such computing kernels in compact CNNs. In this work, we propose a multithreaded data streaming architecture for fast and highly parallel execution of pointwise and depthwise convolution, which can be also dynamically reconfigured to process conventional convolution, pooling, and fully connected network layers. The architecture achieves efficient memory bandwidth utilization by exploiting two modes of data alignment. We profile MobileNet on the proposed architecture and demonstrate a 9:36x speed-up compared to single-threaded architecture.
{"title":"Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming","authors":"Weiguang Chen, Z. Wang, Shanliao Li, Zhibin Yu, Huijuan Li","doi":"10.1109/ISVLSI.2019.00099","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00099","url":null,"abstract":"Recent advances in convolutional neural networks (CNNs) reveal the trend towards designing compact structures such as MobileNet, which adopts variations of traditional computing kernels such as pointwise and depthwise convolution. Such modified operations significantly reduce model size with an only slight degradation in inference accuracy. State-of-the-art neural accelerators have not yet fully exploit algorithmic parallelism for such computing kernels in compact CNNs. In this work, we propose a multithreaded data streaming architecture for fast and highly parallel execution of pointwise and depthwise convolution, which can be also dynamically reconfigured to process conventional convolution, pooling, and fully connected network layers. The architecture achieves efficient memory bandwidth utilization by exploiting two modes of data alignment. We profile MobileNet on the proposed architecture and demonstrate a 9:36x speed-up compared to single-threaded architecture.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"14 1","pages":"519-522"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84611013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00087
Zheyu Liu, Erxiang Ren, Li Luo, Qi Wei, Xing Wu, Xueqing Li, F. Qiao, Xinjun Liu, Huazhong Yang
In the past few years, the demand for intelligence of IoT front-end devices has dramatically increased. However, such devices face challenges of limited on-chip resources and strict power or energy constraints. Recent progress in binarized neural networks has provided promising solutions for front-end processing system to conduct simple detection and classification tasks by making trade-offs between the processing quality and the computation complexity. In this paper, we propose a mixed-signal perception chip, in which an ADC-free 32x32 image sensor and a BNN processing array are directly integrated with a 180nm standard CMOS process. Taking advantage of the ADC-free processing architecture, the whole processing system only consumes 1.8mW power, while providing up to 545.4 GOPS/W energy efficiency. The implementation performance and energy efficiency are comparable with the state-of-the-art designs in much more advanced CMOS technologies. This work provides a promising alternative for low-power IoT intelligent applications.
{"title":"A 1.8mW Perception Chip with Near-Sensor Processing Scheme for Low-Power AIoT Applications","authors":"Zheyu Liu, Erxiang Ren, Li Luo, Qi Wei, Xing Wu, Xueqing Li, F. Qiao, Xinjun Liu, Huazhong Yang","doi":"10.1109/ISVLSI.2019.00087","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00087","url":null,"abstract":"In the past few years, the demand for intelligence of IoT front-end devices has dramatically increased. However, such devices face challenges of limited on-chip resources and strict power or energy constraints. Recent progress in binarized neural networks has provided promising solutions for front-end processing system to conduct simple detection and classification tasks by making trade-offs between the processing quality and the computation complexity. In this paper, we propose a mixed-signal perception chip, in which an ADC-free 32x32 image sensor and a BNN processing array are directly integrated with a 180nm standard CMOS process. Taking advantage of the ADC-free processing architecture, the whole processing system only consumes 1.8mW power, while providing up to 545.4 GOPS/W energy efficiency. The implementation performance and energy efficiency are comparable with the state-of-the-art designs in much more advanced CMOS technologies. This work provides a promising alternative for low-power IoT intelligent applications.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"42 1","pages":"447-452"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78194231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00022
Atif Yasin, Tiankai Su, S. Pillement, M. Ciesielski
Division is one of the most complex and hard to verify arithmetic operations. While verification of major arithmetic operators, such as adders and multipliers, has significantly progressed in recent years, less attention has been devoted to formal verification of dividers. A type of divider that is often used in embedded systems is divide by a constant. This paper presents a formal verification method for different divide-by-constant architectures and the generic restoring dividers based on the computer algebra approach. Our experiments for different divider architectures and comparison with exhaustive simulation demonstrates the effectiveness and scalability of the method.
{"title":"Formal Verification of Integer Dividers:Division by a Constant","authors":"Atif Yasin, Tiankai Su, S. Pillement, M. Ciesielski","doi":"10.1109/ISVLSI.2019.00022","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00022","url":null,"abstract":"Division is one of the most complex and hard to verify arithmetic operations. While verification of major arithmetic operators, such as adders and multipliers, has significantly progressed in recent years, less attention has been devoted to formal verification of dividers. A type of divider that is often used in embedded systems is divide by a constant. This paper presents a formal verification method for different divide-by-constant architectures and the generic restoring dividers based on the computer algebra approach. Our experiments for different divider architectures and comparison with exhaustive simulation demonstrates the effectiveness and scalability of the method.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"23 1","pages":"76-81"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78586232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00091
L. L. Caimi, F. Moraes
This paper presents an original approach to protect the execution of applications with security constraints in many-core systems. The proposed method includes three defense mechanisms. The first one is the application admission into the many-core using ECDH and MAC techniques. The second is the spatial reservation of computation and communication resources, resulting in an Opaque Secure Zone (OSZ). The key feature enabling the runtime creation of OSZs is a rerouting mechanism responsible for deviating any traffic traversing an OSZ. The last mechanism is the access to peripherals using a secure protocol to open access points in the OSZ border, and lightweight encryption mechanisms.
{"title":"Security in Many-Core SoCs Leveraged by Opaque Secure Zones","authors":"L. L. Caimi, F. Moraes","doi":"10.1109/ISVLSI.2019.00091","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00091","url":null,"abstract":"This paper presents an original approach to protect the execution of applications with security constraints in many-core systems. The proposed method includes three defense mechanisms. The first one is the application admission into the many-core using ECDH and MAC techniques. The second is the spatial reservation of computation and communication resources, resulting in an Opaque Secure Zone (OSZ). The key feature enabling the runtime creation of OSZs is a rerouting mechanism responsible for deviating any traffic traversing an OSZ. The last mechanism is the access to peripherals using a secure protocol to open access points in the OSZ border, and lightweight encryption mechanisms.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"23 1","pages":"471-476"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74035486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00031
Anup Gangwar, Zheng Xu, N. Agarwal, Ravishankar Sreedharan, Ambica Prasad
The process of laying out the various interconnect components and configuring them, is termed as interconnect synthesis. A Network-on-Chip (NoC), has various building blocks such as Routers, Resizers, Power and Clock domain converters (PCDCs), Pipeline elements etc. A software tool is needed to connect these various components (topology) and then configure them (including routing) so that the user constraints are met and the overall logic and wiring cost along with power is kept low. In this paper we present a tool which generates Power, Performance and Area (PPA) optimized NoCs. The input is a behavioral specification which consists of a rough floor-plan, bridge parameters, available clock, power and voltage domains, address spaces, stochastic traffic (including classes and latency criticality), traffic dependency and any partial topology for the locked down portions of the NoC. The output is an optimized NoC, with instantiation and placement of components (routers, Resizers etc.), Virtual Channel (VC) assignments, clockdomain assignments, routing, bridge parameter tuning, FIFO sizes etc. Using this flow, we are able to generate NoCs which are within 15% of the hand-tuned designs (optimized over several months), for various metrics and exceed critical metrics by as much as 30%.
{"title":"Traffic Driven Automated Synthesis of Network-on-Chip from Physically Aware Behavioral Specification","authors":"Anup Gangwar, Zheng Xu, N. Agarwal, Ravishankar Sreedharan, Ambica Prasad","doi":"10.1109/ISVLSI.2019.00031","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00031","url":null,"abstract":"The process of laying out the various interconnect components and configuring them, is termed as interconnect synthesis. A Network-on-Chip (NoC), has various building blocks such as Routers, Resizers, Power and Clock domain converters (PCDCs), Pipeline elements etc. A software tool is needed to connect these various components (topology) and then configure them (including routing) so that the user constraints are met and the overall logic and wiring cost along with power is kept low. In this paper we present a tool which generates Power, Performance and Area (PPA) optimized NoCs. The input is a behavioral specification which consists of a rough floor-plan, bridge parameters, available clock, power and voltage domains, address spaces, stochastic traffic (including classes and latency criticality), traffic dependency and any partial topology for the locked down portions of the NoC. The output is an optimized NoC, with instantiation and placement of components (routers, Resizers etc.), Virtual Channel (VC) assignments, clockdomain assignments, routing, bridge parameter tuning, FIFO sizes etc. Using this flow, we are able to generate NoCs which are within 15% of the hand-tuned designs (optimized over several months), for various metrics and exceed critical metrics by as much as 30%.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"33 1","pages":"122-127"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79742167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00094
Zengchao Yan, Jun Lin, Zhongfeng Wang
Reed-Solomon (RS) codes have been widely used in digital communication and storage systems. The commonly used decoding algorithms include Berlekamp-Massey (BM) algorithm and its variants such as the inversionless BM (iBM) and the Reformulated inversionless BM (RiBM). All these algorithms require the computation-intensive procedures including key equation solver (KES), and Chien Search & Forney algorithm (CS&F). For RS codes with the error correction ability t≤ 2, it is known that error locations and magnitudes can be found through direct equation solver. However, for RS codes with t=3, no such work has been reported yet. In this paper, a low-complexity algorithm for triple-error-correcting RS codes is proposed. Moreover, an optimized architecture for the proposed algorithm is developed. For a (255, 239) RS code over GF(2^8), the synthesis results show that the area-efficiency of the proposed decoder is 217% higher than that of the conventional RiBM-based RS decoder in 4-parallel. As the degree of parallelism increases, the area-efficiency is increased to 364% in the 16-parallel architecture. The synthesis results show that the proposed decoder for the given example RS code can achieve a throughput as large as 124 Gb/s.
{"title":"A Low-Complexity RS Decoder for Triple-Error-Correcting RS Codes","authors":"Zengchao Yan, Jun Lin, Zhongfeng Wang","doi":"10.1109/ISVLSI.2019.00094","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00094","url":null,"abstract":"Reed-Solomon (RS) codes have been widely used in digital communication and storage systems. The commonly used decoding algorithms include Berlekamp-Massey (BM) algorithm and its variants such as the inversionless BM (iBM) and the Reformulated inversionless BM (RiBM). All these algorithms require the computation-intensive procedures including key equation solver (KES), and Chien Search & Forney algorithm (CS&F). For RS codes with the error correction ability t≤ 2, it is known that error locations and magnitudes can be found through direct equation solver. However, for RS codes with t=3, no such work has been reported yet. In this paper, a low-complexity algorithm for triple-error-correcting RS codes is proposed. Moreover, an optimized architecture for the proposed algorithm is developed. For a (255, 239) RS code over GF(2^8), the synthesis results show that the area-efficiency of the proposed decoder is 217% higher than that of the conventional RiBM-based RS decoder in 4-parallel. As the degree of parallelism increases, the area-efficiency is increased to 364% in the 16-parallel architecture. The synthesis results show that the proposed decoder for the given example RS code can achieve a throughput as large as 124 Gb/s.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"22 1","pages":"489-494"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82270989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2019-07-01DOI: 10.1109/ISVLSI.2019.00010
Xiaoru Xie, Fangxuan Sun, Jun Lin, Zhongfeng Wang
In recent years, studies on efficient inference of neural networks have become one of the most popular research fields. In order to reduce the required number of computations and weights, many efforts have been made to construct light weight networks (LWNs) where bottleneck-like operations (BLOs) have been widely adopted. However, most current hardware accelerators are not able to utilize the optimization space for BLOs. This paper firstly show that the conventional computational flows employed by most existing accelerators will incur extremely low resource utilization ratio due to the extremely high DRAM bandwidth requirements in these LWNs via both theoretic analysis and experimental results. To address this issue, a partial fusion strategy which can drastically reduce bandwidth requirement is proposed. Additionaly, Winograd algorithm is also employed to further reduce the computational complexity. Based on these, an efficient accelerator for BLO-based networks called Fast Architecture for Bottleneck-like based Convolutional neural networks (Fast-ABC) is proposed. Fast-ABC is implemented on Altera Stratix V GSMD8, and can achieve a very high throughput of up to 137 fps and 264 fps on ResNet-18 and MobileNetV2, respectively. Implementation results show that the proposed architecture significantly improve the throughput on LWNs compared with the prior arts with even much less resources cost.
近年来,神经网络的高效推理研究已成为一个热门的研究领域。为了减少所需的计算量和权重,人们在构建轻量级网络(LWNs)方面做出了许多努力,其中广泛采用了类瓶颈操作(BLOs)。然而,目前大多数硬件加速器都不能利用bloo的优化空间。本文首先通过理论分析和实验结果表明,由于LWNs对DRAM带宽的要求极高,大多数现有加速器采用的传统计算流程导致资源利用率极低。为了解决这一问题,提出了一种可以大幅降低带宽需求的部分融合策略。此外,为了进一步降低计算复杂度,还采用了Winograd算法。在此基础上,提出了一种高效的基于bloc的卷积神经网络加速器Fast Architecture for bottleneck - Based Convolutional neural networks (Fast- abc)。Fast-ABC是在Altera Stratix V GSMD8上实现的,在ResNet-18和MobileNetV2上分别可以达到137 fps和264 fps的高吞吐量。实现结果表明,与现有技术相比,该架构显著提高了LWNs的吞吐量,且资源成本更低。
{"title":"Fast-ABC: A Fast Architecture for Bottleneck-Like Based Convolutional Neural Networks","authors":"Xiaoru Xie, Fangxuan Sun, Jun Lin, Zhongfeng Wang","doi":"10.1109/ISVLSI.2019.00010","DOIUrl":"https://doi.org/10.1109/ISVLSI.2019.00010","url":null,"abstract":"In recent years, studies on efficient inference of neural networks have become one of the most popular research fields. In order to reduce the required number of computations and weights, many efforts have been made to construct light weight networks (LWNs) where bottleneck-like operations (BLOs) have been widely adopted. However, most current hardware accelerators are not able to utilize the optimization space for BLOs. This paper firstly show that the conventional computational flows employed by most existing accelerators will incur extremely low resource utilization ratio due to the extremely high DRAM bandwidth requirements in these LWNs via both theoretic analysis and experimental results. To address this issue, a partial fusion strategy which can drastically reduce bandwidth requirement is proposed. Additionaly, Winograd algorithm is also employed to further reduce the computational complexity. Based on these, an efficient accelerator for BLO-based networks called Fast Architecture for Bottleneck-like based Convolutional neural networks (Fast-ABC) is proposed. Fast-ABC is implemented on Altera Stratix V GSMD8, and can achieve a very high throughput of up to 137 fps and 264 fps on ResNet-18 and MobileNetV2, respectively. Implementation results show that the proposed architecture significantly improve the throughput on LWNs compared with the prior arts with even much less resources cost.","PeriodicalId":6703,"journal":{"name":"2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)","volume":"32 1","pages":"1-6"},"PeriodicalIF":0.0,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85850122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}