Optimizing convolutional neural networks on multi-core vector accelerator
Zhong Liu, Xin Xiao, Chen Li, Sheng Ma, Deng Rangyu
Parallel Computing, Volume 112, September 2022, Article 102945. DOI: 10.1016/j.parco.2022.102945
Citations: 3
Abstract
Vector accelerators have been widely used in scientific computing, and they also show great potential for accelerating convolutional neural networks (CNNs). However, previous general CNN-mapping methods introduce a large amount of intermediate data and additional conversions, and the resulting memory overhead causes a significant performance loss.
To address these issues and achieve high computational efficiency, this paper proposes an efficient CNN-mapping method dedicated to vector accelerators, consisting of: 1) a data layout method, which establishes a set of efficient data storage and computing models for various CNNs on vector accelerators and achieves high memory access efficiency and high vectorization efficiency; and 2) a conversion method, which converts the computation of convolutional and fully connected layers into large-scale matrix multiplication and converts the computation of pooling layers into row-wise matrix computation. All conversions are implemented by extracting rows from a two-dimensional matrix, giving high data access and transmission efficiency with no additional memory overhead or data conversion.
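The abstract does not spell out the lowering itself, but the standard way to turn a convolution into one large matrix multiplication is an im2col-style row extraction, which matches the "extracting rows from a two-dimensional matrix" description above. Below is a minimal NumPy sketch of that idea; the function name `im2col`, the single-channel input, and the stride-1, no-padding configuration are illustrative assumptions, not details from the paper.

```python
import numpy as np

def im2col(x, kh, kw):
    """Lower a single-channel 2-D input into a matrix whose rows are
    flattened kh x kw receptive fields (stride 1, no padding).
    NOTE: a hypothetical sketch, not the paper's actual layout."""
    h, w = x.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

# Convolution as one large matrix multiplication:
x = np.arange(36, dtype=np.float32).reshape(6, 6)   # input feature map
k = np.ones((3, 3), dtype=np.float32)               # single 3x3 kernel
y = (im2col(x, 3, 3) @ k.ravel()).reshape(4, 4)     # GEMM-style lowering

# Cross-check against a direct sliding-window convolution.
ref = np.array([[(x[i:i + 3, j:j + 3] * k).sum() for j in range(4)]
                for i in range(4)], dtype=np.float32)
assert np.allclose(y, ref)
```

Each extracted row is contiguous in memory, so the multiply-accumulate over rows maps directly onto vector lanes, which is presumably why lowering to matrix multiplication suits this class of hardware.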
Based on these methods, we design a vectorization mechanism that vectorizes the convolutional, pooling, and fully connected layers on a vector accelerator and can be applied to various CNN models. The mechanism takes full advantage of the parallel computing capability of the multi-core vector accelerator and further improves the performance of deep CNNs. The experimental results show that the average computational efficiency of the convolutional layers and the fully connected layers of AlexNet, VGG-19, GoogleNet, and ResNet-50 is 93.3% and 93.4%, respectively, and that the average data access efficiency of the pooling layers is 70%. Compared with NVIDIA inference GPUs, our accelerator achieves a 36.1% performance improvement and is comparable to the NVIDIA V100 GPU. Compared with the Matrix2000, which has a similar architecture, our accelerator achieves a 17-45% improvement in computational efficiency.
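In the same spirit, the "row computation of matrix" used for the pooling layers can be illustrated by gathering each pooling window into a matrix row and reducing row-wise, a reduction that vectorizes naturally. A hedged sketch, assuming non-overlapping max pooling on a single channel (window size equals stride); `pool_rows` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def pool_rows(x, p):
    """Gather non-overlapping p x p pooling windows as matrix rows
    (stride = p, single channel), so pooling becomes a row-wise reduce.
    NOTE: an illustrative sketch, not the paper's actual mechanism."""
    h, w = x.shape
    oh, ow = h // p, w // p
    rows = np.empty((oh * ow, p * p), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            rows[i * ow + j] = x[i * p:(i + 1) * p, j * p:(j + 1) * p].ravel()
    return rows

x = np.arange(16, dtype=np.float32).reshape(4, 4)
rows = pool_rows(x, 2)                 # one row per 2x2 window
y = rows.max(axis=1).reshape(2, 2)     # max pooling = row-wise max

assert np.allclose(y, np.array([[5., 7.], [13., 15.]], dtype=np.float32))
```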
About the journal:
Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high-performance architecture, system software, programming systems and tools, and applications. Within this context, the journal covers all aspects of high-end parallel computing, from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems.
Parallel Computing features original research work and review articles, as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies that reproduce prior publications and either confirm or disprove their results.
Particular technical areas of interest include, but are not limited to:
- System software for parallel computer systems, including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
- Enabling software, including debuggers, performance tools, and system and numeric libraries.
- General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
- Software engineering and productivity as it relates to parallel computing.
- Applications (including scientific computing, deep learning, and machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
- Performance measurement results on state-of-the-art systems.
- Approaches to effectively utilize large-scale parallel computing, including new algorithms or algorithm analysis with demonstrated relevance to real applications, using existing or next-generation parallel computer architectures.
- Parallel I/O systems, both hardware and software.
- Networking technology for the support of high-speed computing, demonstrating the impact of high-speed computation on parallel applications.