HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array

Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen
{"title":"HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array","authors":"Linghao Song, Jiachen Mao, Youwei Zhuo, Xuehai Qian, Hai Helen Li, Yiran Chen","doi":"10.1109/HPCA.2019.00027","DOIUrl":null,"url":null,"abstract":"With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration (especially inference) of DNNs is intensively studied both in academia and industry. However, we still face two challenges: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well-explored in recent accelerator designs. To truly provide high throughput and energy efficient acceleration for the training of deep and large models, we inevitably need to use multiple accelerators to explore the coarse-grain parallelism, compared to the fine-grain parallelism inside a layer considered in most of the existing architectures. It poses the key research question to seek the best organization of computation and dataflow among accelerators. In this paper, inspired by recent work in machine learning systems, we propose a solution HyPar to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors for the DNN accelerators. A partition constitutes the choice of parallelism for weighted layers. The optimization target is to search a partition that minimizes the total communication during training a complete DNN. To solve this problem, we propose a communication model to explain the source and amount of communications. Then, we use a hierarchical layer-wise dynamic programming method to search for the partition for each layer.","PeriodicalId":102050,"journal":{"name":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"80","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2019.00027","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 80

Abstract

With the rise of artificial intelligence in recent years, Deep Neural Networks (DNNs) have been widely used in many domains. To achieve high performance and energy efficiency, hardware acceleration of DNNs (especially inference) has been intensively studied in both academia and industry. However, two challenges remain: large DNN models and datasets, which incur frequent off-chip memory accesses; and the training of DNNs, which is not well explored in recent accelerator designs. To truly provide high-throughput and energy-efficient acceleration for training deep and large models, we inevitably need multiple accelerators that exploit coarse-grain parallelism, beyond the fine-grain parallelism inside a layer considered in most existing architectures. This poses the key research question of how best to organize computation and dataflow among the accelerators. In this paper, inspired by recent work in machine learning systems, we propose HyPar, a solution that determines layer-wise parallelism for deep neural network training on an array of DNN accelerators. HyPar partitions the feature map tensors (input and output), the kernel tensors, the gradient tensors, and the error tensors across the DNN accelerators. A partition constitutes the choice of parallelism for each weighted layer. The optimization target is to find a partition that minimizes the total communication during the training of a complete DNN. To solve this problem, we propose a communication model that explains the source and amount of communication. We then use a hierarchical layer-wise dynamic programming method to search for the partition of each layer.
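The search the abstract describes, a per-layer choice of parallelism driven by a communication model and resolved with layer-wise dynamic programming, can be sketched compactly. The Python below is an illustrative reconstruction under simplified assumptions, not the authors' implementation: it reduces each weighted layer's options to plain data parallelism versus model parallelism, and the cost formulas in `intra_layer_comm` and `inter_layer_comm`, as well as all names (`Layer`, `search_partition`) and the toy layer sizes, are hypothetical stand-ins for the paper's tensor-partition-based model.

```python
# Minimal sketch of a HyPar-style layer-wise partition search.
# Illustrative reconstruction only: the cost formulas are simplified
# stand-ins for the paper's communication model, not the authors' code.

from dataclasses import dataclass

# The two per-layer parallelism choices considered in this sketch.
DATA_PARALLEL = "data"    # partition feature maps / batch, replicate weights
MODEL_PARALLEL = "model"  # partition weights (kernels), replicate activations

@dataclass
class Layer:
    name: str
    fmap_size: int     # elements in the layer's output feature maps (whole batch)
    kernel_size: int   # elements in the weight (kernel) tensor

def intra_layer_comm(layer: Layer, choice: str, num_groups: int) -> float:
    """Communication incurred inside one layer during one training step.

    Assumed simplification: data parallelism must all-reduce the weight
    gradients across groups; model parallelism must gather partial results
    of the partitioned kernel.  HyPar derives these costs from the partitions
    of the feature-map, kernel, gradient, and error tensors.
    """
    if choice == DATA_PARALLEL:
        return layer.kernel_size * (num_groups - 1)
    return layer.fmap_size * (num_groups - 1)

def inter_layer_comm(prev: Layer, prev_choice: str, cur_choice: str,
                     num_groups: int) -> float:
    """Cost of converting the activation/error layout between adjacent layers.

    Assumed simplification: no conversion cost if both layers use the same
    parallelism; otherwise the feature maps flowing between them must be
    re-partitioned across the groups.
    """
    if prev_choice == cur_choice:
        return 0.0
    return prev.fmap_size * (num_groups - 1) / num_groups

def search_partition(layers: list[Layer], num_groups: int = 2):
    """Layer-wise dynamic programming over parallelism choices.

    best[i][c] is the minimum total communication for layers 0..i when layer
    i uses choice c; the recurrence adds the intra-layer cost plus the
    cheapest transition from layer i-1.
    """
    choices = (DATA_PARALLEL, MODEL_PARALLEL)
    best = [{c: intra_layer_comm(layers[0], c, num_groups) for c in choices}]
    trace = [{c: None for c in choices}]

    for i in range(1, len(layers)):
        best.append({})
        trace.append({})
        for c in choices:
            cand = {
                p: best[i - 1][p]
                   + inter_layer_comm(layers[i - 1], p, c, num_groups)
                   + intra_layer_comm(layers[i], c, num_groups)
                for p in choices
            }
            p_best = min(cand, key=cand.get)
            best[i][c] = cand[p_best]
            trace[i][c] = p_best

    # Backtrack the per-layer choices that minimize total communication.
    last = min(best[-1], key=best[-1].get)
    plan, c = [], last
    for i in range(len(layers) - 1, -1, -1):
        plan.append((layers[i].name, c))
        if trace[i][c] is not None:
            c = trace[i][c]
    return best[-1][last], list(reversed(plan))

if __name__ == "__main__":
    # Toy network: convolutional layers (large feature maps, small kernels)
    # tend toward data parallelism, fully connected layers (huge kernels,
    # small activations) toward model parallelism.
    net = [
        Layer("conv1", fmap_size=802816, kernel_size=1728),
        Layer("conv2", fmap_size=401408, kernel_size=36864),
        Layer("fc1",   fmap_size=4096,   kernel_size=16777216),
        Layer("fc2",   fmap_size=1000,   kernel_size=4096000),
    ]
    total, plan = search_partition(net, num_groups=2)
    print("total communication (elements):", total)
    for name, choice in plan:
        print(f"  {name}: {choice} parallelism")
```

The hierarchical aspect of HyPar applies this kind of layer-wise search recursively as the accelerator array is split level by level into two groups; the sketch above covers only a single two-group split.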