Activation sparsity and dynamic pruning for split computing in edge AI
Janek Haberer, O. Landsiedel. DOI: 10.1145/3565010.3569066

Deep neural networks are getting larger and, therefore, harder to deploy on constrained IoT devices. Split computing provides a solution by splitting a network and placing the first few layers on the IoT device. The output of these layers is transmitted to the cloud, where inference continues. Earlier works indicate a high degree of sparsity in intermediate activation outputs; this paper analyzes and exploits that activation sparsity to reduce the communication overhead of transmitting intermediate data to the cloud. Specifically, we analyze the intermediate activations of two early layers of ResNet-50 on CIFAR-10 and ImageNet, focusing on sparsity to guide the choice of splitting point. We employ dynamic pruning of activations and feature maps and find that sparsity depends strongly on the size of a layer, and that weights do not correlate with activation sparsity in convolutional layers. Additionally, we show that sparse intermediate outputs can be compressed by a factor of 3.3X at an accuracy loss of 1.1% without any fine-tuning. With fine-tuning, the compression factor increases up to 14X at a total accuracy loss of 1%.
Towards predicting client benefit and contribution in federated learning from data imbalance
Christoph Düsing, P. Cimiano. DOI: 10.1145/3565010.3569063

Federated learning (FL) is a distributed learning paradigm that allows a cohort of clients to collaborate in jointly training a machine learning model. By design, FL assures data privacy for the clients involved, making it a natural fit for a wide range of real-world applications requiring data privacy. Despite its great potential and conceptual guarantees, FL has been found to suffer from unbalanced data, causing the overall performance of the final model to decrease and the contribution of individual clients to the federated model to vary greatly. Assuming that imbalance affects not only contribution but also the extent to which individual clients benefit from participating in FL, we investigate the predictive potential of data imbalance metrics for benefit and contribution. In particular, our approach comprises three phases: (1) we measure the data imbalance of clients while maintaining data privacy using secure aggregation, (2) we measure how individual clients benefit from FL participation and how valuable they are to the cohort, and (3) we train classifiers to rank pairs of clients with respect to benefit and contribution. The resulting classifiers rank pairs of clients with an accuracy of 0.71 and 0.65 for benefit and contribution, respectively. Thus, our approach contributes towards indicating the expected value of participation for individual clients and the cohort before they join.
Pelta: shielding transformers to mitigate evasion attacks in federated learning
Simon Queyrut, Yérom-David Bromberg, V. Schiavoni. DOI: 10.1145/3565010.3569064

The main premise of federated learning is that machine learning model updates are computed locally, in particular to preserve user data privacy, as raw data never leaves the perimeter of the device. This mechanism assumes that the aggregated global model is broadcast to collaborating, non-malicious nodes. However, without proper defenses, compromised clients can easily probe the model inside their local memory in search of adversarial examples. For instance, in image-based applications, adversarial examples are images perturbed imperceptibly (to the human eye) so that the local model misclassifies them; they can later be presented to a victim node's counterpart model to replicate the attack. To mitigate such malicious probing, we introduce Pelta, a novel shielding mechanism leveraging trusted hardware. By harnessing the capabilities of Trusted Execution Environments (TEEs), Pelta masks part of the back-propagation chain rule that attackers typically exploit to design malicious samples. We evaluate Pelta on a state-of-the-art ensemble model and demonstrate its effectiveness against the Self-Attention Gradient adversarial Attack.
Exploring learning rate scaling rules for distributed ML training on transient resources
Joel André, F. Strati, Ana Klimovic. DOI: 10.1145/3565010.3569067

Training Machine Learning (ML) models to convergence is a long-running and expensive procedure, as it requires large clusters of high-end accelerators such as GPUs and TPUs. Many ML frameworks have proposed elastic distributed training, which enables using transient resources such as spot VMs in the cloud, reducing the overall cost. However, the availability of transient resources varies over time, creating an inherently dynamic environment that requires special handling of training hyperparameters. Techniques such as gradient accumulation make it possible to keep the same hyperparameters across resource preemptions; however, sequentially accumulating gradients stalls synchronous distributed training. On the other hand, scaling the batch size according to the available resources requires tuning other hyperparameters, such as the learning rate. In this work, we study how learning rate scaling rules perform in dynamic environments where the batch size changes frequently and drastically, as we observed in real cloud clusters. We build a PyTorch-based system to evaluate Stochastic Gradient Descent on image recognition and object detection tasks under various learning rate scaling rules and resource availability traces. We observe minor or no degradation in model convergence when choosing the correct learning rate scaling rule. Identifying the appropriate scaling rule for a given model is non-trivial, and automating this decision remains an open question.
SmartNICs at edge for transient compute elasticity
D. Z. Tootaghaj, A. Mercian, V. Adarsh, M. Sharifian, P. Sharma. DOI: 10.1145/3565010.3569065

This paper proposes a new architecture that strategically harvests the untapped compute capacity of SmartNICs to offload transient microservice workload spikes, thereby reducing SLA violations while improving performance per unit of energy consumed. This is particularly important for ML workloads in edge deployments with stringent SLA requirements. Using this untapped compute capacity is preferable to deploying extra servers, as SmartNICs are economically and operationally more desirable. We propose SpikeOffload, a low-cost and scalable platform that leverages machine learning to predict spikes and orchestrates seamless offloading of generic microservice workloads to SmartNICs, eliminating the need to pre-deploy expensive, under-utilized host servers. Our SpikeOffload evaluation shows that SLA violations can be reduced by up to 20% for specific workloads. Furthermore, we demonstrate that for specific workloads our approach can reduce capital expenditure (CAPEX) by more than 40%, and that performance per unit of energy consumption can be improved by up to 2X.
Rethinking normalization methods in federated learning
Zhixu Du, Jingwei Sun, Ang Li, Pin-Yu Chen, Jianyi Zhang, H. Li, Yiran Chen. DOI: 10.1145/3565010.3569062

Federated learning (FL) is a popular distributed learning framework that can reduce privacy risks by not explicitly sharing private data. In this work, we explicitly uncover the external covariate shift problem in FL, which is caused by the independent local training processes on different devices. We demonstrate that external covariate shift leads to the obliteration of some devices' contributions to the global model. Further, we show that normalization layers are indispensable in FL, since their inherent properties can alleviate this problem. However, recent works have shown that batch normalization, a standard component of many deep neural networks, incurs an accuracy drop in the global model in FL, and the essential reason for this failure has been poorly studied. We unveil external covariate shift as the key reason why batch normalization is ineffective in FL, and show that layer normalization is a better choice, as it mitigates external covariate shift and improves the performance of the global model. We conduct experiments on CIFAR-10 under non-IID settings. The results demonstrate that models with layer normalization converge fastest and achieve the best or comparable accuracy across three different model architectures.
{"title":"Proceedings of the 3rd International Workshop on Distributed Machine Learning","authors":"","doi":"10.1145/3565010","DOIUrl":"https://doi.org/10.1145/3565010","url":null,"abstract":"","PeriodicalId":325359,"journal":{"name":"Proceedings of the 3rd International Workshop on Distributed Machine Learning","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131047024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}