Distributed Convolutional Neural Network Training on Mobile and Edge Clusters
Pranav Rama, Madison Threadgill, Andreas Gerstlauer
arXiv - CS - Distributed, Parallel, and Cluster Computing, 2024-09-11. https://doi.org/arxiv-2409.09083
Abstract
The training of deep and convolutional neural networks (DNNs/CNNs) is traditionally done on servers with powerful CPUs and GPUs. Recent efforts have emerged to localize machine learning tasks fully on the edge. This brings advantages in reduced latency and increased privacy, but it necessitates working with resource-constrained devices. Approaches for inference and training on mobile and edge devices based on pruning, quantization, or incremental and transfer learning trade off accuracy. Several works have instead explored distributing inference operations across mobile and edge clusters. However, the literature on distributed training at the edge is limited, and existing approaches all require a central, potentially powerful edge or cloud server for coordination or offloading. In this paper, we describe an approach for distributed CNN training exclusively on mobile and edge devices. Our approach is beneficial for the initial CNN layers, which are feature-map dominated. It is based on partitioning forward-inference and back-propagation operations among devices through tiling and fusing to maximize locality and expose communication- and memory-aware parallelism. We also introduce the concept of layer grouping to further fine-tune performance based on computation and communication trade-offs. Results show that, for a cluster of 2-6 quad-core Raspberry Pi 3 devices, training an object-detection CNN achieves a 2x-15x speedup with respect to a single core and up to an 8x reduction in memory usage per device, all without sacrificing accuracy. Grouping offers up to a 1.5x speedup depending on the reference profile and batch size.
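To make the tiling idea behind this kind of partitioning concrete, the following is a minimal sketch, not the authors' implementation: it spatially tiles the output feature map of a single 3x3 convolution across a set of "devices", where each device only needs its own strip of the input plus a one-row halo from each neighbor. The function names, the number of devices, and the single-channel setup are illustrative assumptions; the paper's actual scheme fuses multiple layers and also partitions back-propagation.

```python
# Illustrative sketch of spatial feature-map tiling for one 3x3 convolution.
# Each "device" owns a horizontal strip of the output and reads only its
# input rows plus a 1-row halo on each side (the data it would exchange
# with neighbors in a distributed setting). Names are hypothetical.
import numpy as np

def conv3x3_valid(x, w):
    """Single-channel 'valid' 3x3 convolution (no padding), pure NumPy."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)
    return out

def tiled_conv3x3(x, w, num_devices=3):
    """Partition output rows among devices; each tile computes locally
    from its input rows plus the halo rows it fetched from neighbors."""
    H, W = x.shape
    out_rows = H - 2                                   # rows of the full output
    bounds = np.linspace(0, out_rows, num_devices + 1, dtype=int)
    tiles = []
    for d in range(num_devices):
        lo, hi = bounds[d], bounds[d + 1]
        x_tile = x[lo:hi + 2, :]                       # own rows + 1-row halos
        tiles.append(conv3x3_valid(x_tile, w))
    return np.vstack(tiles)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    w = rng.standard_normal((3, 3))
    assert np.allclose(tiled_conv3x3(x, w, num_devices=4), conv3x3_valid(x, w))
    print("tiled result matches the monolithic convolution")
```

The sketch only shows why tiling preserves locality for feature-map-dominated layers: each tile's computation is independent once the small halo has been exchanged, which is what exposes communication- and memory-aware parallelism across devices.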