Distributed Convolutional Neural Network Training on Mobile and Edge Clusters
Pranav Rama, Madison Threadgill, Andreas Gerstlauer
arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-11. https://doi.org/arxiv-2409.09083
Abstract
The training of deep and convolutional neural networks (DNNs/CNNs) is traditionally done on servers with powerful CPUs and GPUs. Recent efforts have emerged to localize machine learning tasks fully on the edge. This brings advantages in reduced latency and increased privacy, but it necessitates working with resource-constrained devices. Approaches for inference and training on mobile and edge devices based on pruning, quantization, or incremental and transfer learning trade off accuracy. Several works have instead explored distributing inference operations across mobile and edge clusters. However, the literature on distributed training at the edge is limited, and existing approaches all require a central, potentially powerful edge or cloud server for coordination or offloading. In this paper, we describe an approach for distributed CNN training exclusively on mobile and edge devices. Our approach is beneficial for the initial CNN layers, which are feature-map dominated. It is based on partitioning forward-inference and back-propagation operations among devices through tiling and fusing to maximize locality and expose communication- and memory-aware parallelism. We also introduce the concept of layer grouping to further fine-tune performance based on computation and communication trade-offs. Results show that, for a cluster of 2-6 quad-core Raspberry Pi 3 devices, training an object-detection CNN achieves a 2x-15x speedup with respect to a single core and up to an 8x reduction in memory usage per device, all without sacrificing accuracy. Grouping offers up to a 1.5x speedup depending on the reference profile and batch size.
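To make the tiling idea behind this kind of partitioning concrete, the following is a minimal sketch, not the authors' implementation: it spatially tiles the output feature map of a single 3x3 convolution across a set of "devices", where each device only needs its own strip of the input plus a one-row halo from each neighbor. The function names, the number of devices, and the single-channel setup are illustrative assumptions; the paper's actual scheme fuses multiple layers and also partitions back-propagation.

```python
# Illustrative sketch of spatial feature-map tiling for one 3x3 convolution.
# Each "device" owns a horizontal strip of the output and reads only its
# input rows plus a 1-row halo on each side (the data it would exchange
# with neighbors in a distributed setting). Names are hypothetical.
import numpy as np

def conv3x3_valid(x, w):
    """Single-channel 'valid' 3x3 convolution (no padding), pure NumPy."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)
    return out

def tiled_conv3x3(x, w, num_devices=3):
    """Partition output rows among devices; each tile computes locally
    from its input rows plus the halo rows it fetched from neighbors."""
    H, W = x.shape
    out_rows = H - 2                                   # rows of the full output
    bounds = np.linspace(0, out_rows, num_devices + 1, dtype=int)
    tiles = []
    for d in range(num_devices):
        lo, hi = bounds[d], bounds[d + 1]
        x_tile = x[lo:hi + 2, :]                       # own rows + 1-row halos
        tiles.append(conv3x3_valid(x_tile, w))
    return np.vstack(tiles)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    w = rng.standard_normal((3, 3))
    assert np.allclose(tiled_conv3x3(x, w, num_devices=4), conv3x3_valid(x, w))
    print("tiled result matches the monolithic convolution")
```

The sketch only shows why tiling preserves locality for feature-map-dominated layers: each tile's computation is independent once the small halo has been exchanged, which is what exposes communication- and memory-aware parallelism across devices.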