Workload Allocation for Distributed Coded Machine Learning: From Offline Model-Based to Online Model-Free

Yuxuan Jiang, Qiang Ye, E. T. Fapi, Wenting Sun, Fudong Li

IEEE Internet of Things Magazine, vol. 44, no. 1, pp. 100–106, July 2024. DOI: 10.1109/IOTM.001.2300247
Abstract
Distributed machine learning (ML) is an important Internet-of-Things (IoT) application. In the traditional partitioned learning (PL) paradigm, a coordinator divides a high-dimensional dataset into subsets, which are processed on IoT devices. The execution time of PL can be severely bottlenecked by slow devices known as stragglers. To mitigate the negative impact of stragglers, distributed coded machine learning (DCML) was recently proposed to inject redundancy into the subsets using coding techniques. With this redundancy, the coordinator no longer requires the processing results from all devices, but only from a subgroup, so stragglers can be excluded. This article aims to bring the burgeoning field of DCML to the wider community. After outlining the principles of DCML, we focus on its workload allocation, which determines the appropriate level of injected redundancy to minimize the overall execution time. We highlight the fundamental trade-off and point out two critical design choices in workload allocation: model-based versus model-free, and offline versus online. Despite the predominance of offline model-based approaches in the literature, online model-free approaches also have a wide array of use case scenarios, but remain largely unexplored. At the end of the article, we propose the first online model-free workload allocation scheme for DCML, and identify future paths and opportunities along this direction.
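To make the coding idea behind DCML concrete, the sketch below shows a minimal (n, k) MDS-coded matrix-vector multiplication, a standard illustration in this literature rather than the scheme proposed in the article: the coordinator encodes k row blocks of the data matrix into n redundant coded blocks, each device computes one coded partial product, and any k of the n results suffice to recover the full product, so up to n − k stragglers can be ignored. The parameter values, the Vandermonde generator, and the responder set are illustrative assumptions.

```python
import numpy as np

# Hypothetical (n, k) MDS-coded matrix-vector multiplication A @ x.
n, k = 5, 3                      # 5 devices; any 3 results suffice (tolerates 2 stragglers)
m, d = 6, 4                      # A is m x d, with m divisible by k
rng = np.random.default_rng(0)
A, x = rng.standard_normal((m, d)), rng.standard_normal(d)

# 1. Coordinator splits A into k row blocks and encodes them with a
#    Vandermonde generator G (any k of its n rows form an invertible matrix).
blocks = np.split(A, k)                                          # k blocks, each (m/k) x d
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# 2. Each device computes its coded partial product (the injected redundancy).
partials = [c @ x for c in coded]                                # n results, each of length m/k

# 3. Coordinator decodes from the first k responders; stragglers are ignored.
fastest = [0, 2, 4]                                              # hypothetical set of fast devices
Y = np.stack([partials[i] for i in fastest])                     # k x (m/k)
B = np.linalg.solve(G[fastest, :], Y)                            # row j recovers blocks[j] @ x
print(np.allclose(B.reshape(-1), A @ x))                         # True: full product recovered
```

The workload-allocation question studied in the article sits on top of this picture: injecting more redundancy (larger n relative to k, or larger coded blocks) lets the coordinator wait for fewer devices, but increases each device's computation time, and the right balance depends on how device speeds are modeled and observed.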