{"title":"探索基于暂态资源的分布式机器学习训练的学习率缩放规则","authors":"Joel André, F. Strati, Ana Klimovic","doi":"10.1145/3565010.3569067","DOIUrl":null,"url":null,"abstract":"Training Machine Learning (ML) models to convergence is a long-running and expensive procedure, as it requires large clusters of high-end accelerators such as GPUs and TPUs. Many ML frameworks have proposed elastic distributed training, which enables using transient resources such as spot VMs in the cloud, reducing the overall cost. However, the availability of transient resources varies over time, creating an inherently dynamic environment that requires special handling of training hyperparameters. Techniques such as gradient accumulation enable using the same hyperparameters upon resource preemptions, however sequentially accumulating gradients stalls synchronous distributed training. On the other hand, scaling the batch size according to the available resources requires tuning of other hyperparameters, such as the learning rate. In this work, we study how learning rate scaling rules perform under dynamic environments when the batch size changes frequently and drastically, as we observed in real cloud clusters. We build a PyTorch-based system to evaluate Stochastic Gradient Descent on Image Recognition and Object Detection tasks under various learning rate scaling rules and resource availability traces. We observe minor or no degradation in model convergence when choosing the correct learning rate scaling rule. Identifying the appropriate scaling rule for a given model is non-trivial. Automating this decision remains an open question.","PeriodicalId":325359,"journal":{"name":"Proceedings of the 3rd International Workshop on Distributed Machine Learning","volume":"46 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Exploring learning rate scaling rules for distributed ML training on transient resources\",\"authors\":\"Joel André, F. Strati, Ana Klimovic\",\"doi\":\"10.1145/3565010.3569067\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Training Machine Learning (ML) models to convergence is a long-running and expensive procedure, as it requires large clusters of high-end accelerators such as GPUs and TPUs. Many ML frameworks have proposed elastic distributed training, which enables using transient resources such as spot VMs in the cloud, reducing the overall cost. However, the availability of transient resources varies over time, creating an inherently dynamic environment that requires special handling of training hyperparameters. Techniques such as gradient accumulation enable using the same hyperparameters upon resource preemptions, however sequentially accumulating gradients stalls synchronous distributed training. On the other hand, scaling the batch size according to the available resources requires tuning of other hyperparameters, such as the learning rate. In this work, we study how learning rate scaling rules perform under dynamic environments when the batch size changes frequently and drastically, as we observed in real cloud clusters. We build a PyTorch-based system to evaluate Stochastic Gradient Descent on Image Recognition and Object Detection tasks under various learning rate scaling rules and resource availability traces. We observe minor or no degradation in model convergence when choosing the correct learning rate scaling rule. 
Identifying the appropriate scaling rule for a given model is non-trivial. Automating this decision remains an open question.\",\"PeriodicalId\":325359,\"journal\":{\"name\":\"Proceedings of the 3rd International Workshop on Distributed Machine Learning\",\"volume\":\"46 3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 3rd International Workshop on Distributed Machine Learning\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3565010.3569067\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Workshop on Distributed Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3565010.3569067","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Exploring learning rate scaling rules for distributed ML training on transient resources
Training Machine Learning (ML) models to convergence is a long-running and expensive procedure, as it requires large clusters of high-end accelerators such as GPUs and TPUs. Many ML frameworks have proposed elastic distributed training, which enables using transient resources such as spot VMs in the cloud, reducing the overall cost. However, the availability of transient resources varies over time, creating an inherently dynamic environment that requires special handling of training hyperparameters. Techniques such as gradient accumulation enable using the same hyperparameters upon resource preemptions; however, accumulating gradients sequentially stalls synchronous distributed training. On the other hand, scaling the batch size according to the available resources requires tuning other hyperparameters, such as the learning rate. In this work, we study how learning rate scaling rules perform in dynamic environments where the batch size changes frequently and drastically, as we observed in real cloud clusters. We build a PyTorch-based system to evaluate Stochastic Gradient Descent on Image Recognition and Object Detection tasks under various learning rate scaling rules and resource availability traces. We observe minor or no degradation in model convergence when choosing the correct learning rate scaling rule. Identifying the appropriate scaling rule for a given model is non-trivial. Automating this decision remains an open question.
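
As a rough illustration of the kind of learning rate scaling rules discussed in the abstract, the PyTorch sketch below rescales the SGD learning rate when the effective global batch size changes after a preemption. The linear and square-root rules shown, the scaled_lr helper, and the specific batch sizes are illustrative assumptions for demonstration, not the authors' implementation.

    # Illustrative sketch (not the paper's system): adjust the SGD learning rate
    # when the global batch size changes because workers were preempted or added.
    import math

    import torch.nn as nn
    from torch.optim import SGD

    def scaled_lr(base_lr: float, base_batch: int, current_batch: int,
                  rule: str = "linear") -> float:
        """Return a learning rate adjusted for the current global batch size.

        rule="linear": lr proportional to batch size (linear scaling rule).
        rule="sqrt":   lr proportional to sqrt(batch size) (square-root scaling rule).
        """
        ratio = current_batch / base_batch
        if rule == "linear":
            return base_lr * ratio
        if rule == "sqrt":
            return base_lr * math.sqrt(ratio)
        raise ValueError(f"unknown rule: {rule}")

    # Toy model and optimizer; in the paper's setting this would be an
    # image-recognition or object-detection network trained with distributed SGD.
    model = nn.Linear(10, 2)
    optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Hypothetical event: training started with a global batch of 256, and a
    # preemption leaves enough workers for a global batch of 160.
    new_lr = scaled_lr(base_lr=0.1, base_batch=256, current_batch=160, rule="linear")
    for group in optimizer.param_groups:
        group["lr"] = new_lr

In an elastic training loop, a rescaling step like this would run on every resource-availability change; which rule (linear, square-root, or none) preserves convergence is exactly the question the paper studies empirically.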