A layer-block-wise pipeline for memory and bandwidth reduction in distributed deep learning

Haruki Mori, Tetsuya Youkawa, S. Izumi, M. Yoshimoto, H. Kawaguchi, Atsuki Inoue
{"title":"A layer-block-wise pipeline for memory and bandwidth reduction in distributed deep learning","authors":"Haruki Mori, Tetsuya Youkawa, S. Izumi, M. Yoshimoto, H. Kawaguchi, Atsuki Inoue","doi":"10.1109/MLSP.2017.8168127","DOIUrl":null,"url":null,"abstract":"This paper describes a pipelined stochastic gradient descent (SGD) algorithm and its hardware architecture with a memory distributed structure. In the proposed architecture, a pipeline stage takes charge of multiple layers: a “layer block.” The layer-block-wise pipeline has much less weight parameters for network training than conventional multithreading because weight memory is distributed to workers assigned to pipeline stages. The memory capacity of 2.25 GB for the four-stage proposed pipeline is about half of the 3.82 GB for multithreading when a batch size is 32 in VGG-F. Unlike multithreaded data parallelism, no parameter server for weight update or shared I/O data bus is necessary. Therefore, the memory bandwidth is drastically reduced. The proposed four-stage pipeline only needs memory bandwidths of 36.3 MB and 17.0 MB per batch, respectively, for forward propagation and backpropagation processes, whereas four-thread multithreading requires a bandwidth of 974 MB overall for send and receive processes to unify its weight parameters. At the parallelization degree of four, the proposed pipeline maintains training convergence by a factor of 1.12, compared with the conventional multithreaded architecture although the memory capacity and the memory bandwidth are decreased.","PeriodicalId":6542,"journal":{"name":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","volume":"1 1","pages":"1-6"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MLSP.2017.8168127","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

This paper describes a pipelined stochastic gradient descent (SGD) algorithm and its hardware architecture with a memory distributed structure. In the proposed architecture, a pipeline stage takes charge of multiple layers: a “layer block.” The layer-block-wise pipeline has much less weight parameters for network training than conventional multithreading because weight memory is distributed to workers assigned to pipeline stages. The memory capacity of 2.25 GB for the four-stage proposed pipeline is about half of the 3.82 GB for multithreading when a batch size is 32 in VGG-F. Unlike multithreaded data parallelism, no parameter server for weight update or shared I/O data bus is necessary. Therefore, the memory bandwidth is drastically reduced. The proposed four-stage pipeline only needs memory bandwidths of 36.3 MB and 17.0 MB per batch, respectively, for forward propagation and backpropagation processes, whereas four-thread multithreading requires a bandwidth of 974 MB overall for send and receive processes to unify its weight parameters. At the parallelization degree of four, the proposed pipeline maintains training convergence by a factor of 1.12, compared with the conventional multithreaded architecture although the memory capacity and the memory bandwidth are decreased.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
分布式深度学习中减少内存和带宽的分层块管道
介绍了一种基于内存分布式结构的流水线随机梯度下降算法及其硬件结构。在提议的体系结构中,管道阶段负责多个层:一个“层块”。与传统多线程相比,分层块管道的网络训练权重参数要少得多,因为权重内存被分配给分配到管道阶段的工作人员。在VGG-F中,当批处理大小为32时,四级管道的内存容量为2.25 GB,大约是多线程的3.82 GB的一半。与多线程数据并行不同,权重更新或共享I/O数据总线不需要参数服务器。因此,内存带宽大大降低。所提出的四阶段管道每批仅需要36.3 MB和17.0 MB的内存带宽,用于前向传播和反向传播进程,而四线程多线程需要974 MB的带宽用于发送和接收进程以统一其权重参数。在并行度为4的情况下,与传统多线程架构相比,该管道的训练收敛性提高了1.12倍,尽管内存容量和内存带宽有所降低。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Classical quadrature rules via Gaussian processes Does speech enhancement work with end-to-end ASR objectives?: Experimental analysis of multichannel end-to-end ASR Differential mutual information forward search for multi-kernel discriminant-component selection with an application to privacy-preserving classification Partitioning in signal processing using the object migration automaton and the pursuit paradigm Inferring room semantics using acoustic monitoring
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1