{"title":"面向GPU集群深度学习应用的分层、批量同步随机梯度下降算法","authors":"Guojing Cong, Onkar Bhardwaj","doi":"10.1109/ICMLA.2017.00-56","DOIUrl":null,"url":null,"abstract":"The training data and models are becoming increasingly large in many deep-learning applications. Large-scale distributed processing is employed to accelerate training. Increasing the number of learners in synchronous and asynchronous stochastic gradient descent presents challenges to convergence and communication performance. We present our hierarchical, bulk-synchronous stochastic gradient algorithm that effectively balances execution time and accuracy for training in deep-learning applications on GPU clusters. It achieves much better convergence and execution time at scale in comparison to asynchronous stochastic gradient descent implementations. When deployed on a cluster of 128 GPUs, our implementation achieves up to 56 times speedups over the sequential stochastic gradient descent with similar test accuracy for our target application.","PeriodicalId":6636,"journal":{"name":"2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)","volume":"30 1","pages":"818-821"},"PeriodicalIF":0.0000,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"A Hierarchical, Bulk-Synchronous Stochastic Gradient Descent Algorithm for Deep-Learning Applications on GPU Clusters\",\"authors\":\"Guojing Cong, Onkar Bhardwaj\",\"doi\":\"10.1109/ICMLA.2017.00-56\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The training data and models are becoming increasingly large in many deep-learning applications. Large-scale distributed processing is employed to accelerate training. Increasing the number of learners in synchronous and asynchronous stochastic gradient descent presents challenges to convergence and communication performance. We present our hierarchical, bulk-synchronous stochastic gradient algorithm that effectively balances execution time and accuracy for training in deep-learning applications on GPU clusters. It achieves much better convergence and execution time at scale in comparison to asynchronous stochastic gradient descent implementations. 
When deployed on a cluster of 128 GPUs, our implementation achieves up to 56 times speedups over the sequential stochastic gradient descent with similar test accuracy for our target application.\",\"PeriodicalId\":6636,\"journal\":{\"name\":\"2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"30 1\",\"pages\":\"818-821\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2017.00-56\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2017.00-56","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Hierarchical, Bulk-Synchronous Stochastic Gradient Descent Algorithm for Deep-Learning Applications on GPU Clusters
Training data and models are becoming increasingly large in many deep-learning applications, and large-scale distributed processing is employed to accelerate training. Increasing the number of learners in synchronous and asynchronous stochastic gradient descent presents challenges to convergence and communication performance. We present a hierarchical, bulk-synchronous stochastic gradient descent algorithm that effectively balances execution time and accuracy when training deep-learning applications on GPU clusters. It achieves much better convergence and execution time at scale than asynchronous stochastic gradient descent implementations. When deployed on a cluster of 128 GPUs, our implementation achieves a speedup of up to 56x over sequential stochastic gradient descent while reaching similar test accuracy for our target application.
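The abstract does not spell out the algorithm's structure, but "hierarchical, bulk-synchronous" suggests a two-level scheme in which learners synchronize frequently within a group (for example, GPUs on one node) and less frequently across groups. The sketch below is a minimal, self-contained illustration of that general idea on a toy least-squares problem; the grouping, averaging rules, synchronization period, and all function names are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-level (hierarchical) bulk-synchronous SGD scheme.
# NOT the paper's exact algorithm: grouping, averaging rule, and sync period
# are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||X w - y||^2 over w.
n_samples, n_features = 2048, 16
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.01 * rng.normal(size=n_samples)


def stochastic_gradient(w, batch_idx):
    """Mini-batch gradient of the least-squares loss."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(batch_idx)


def hierarchical_bsp_sgd(n_groups=4, learners_per_group=4, steps=200,
                         batch_size=32, lr=0.01, group_sync_period=8):
    """Two-level bulk-synchronous SGD (illustrative).

    Level 1: learners inside a group average gradients every step
             (e.g., GPUs within one node).
    Level 2: groups average their model replicas every `group_sync_period`
             steps (e.g., across nodes), reducing inter-node communication.
    """
    # One model replica per group; learners in a group share the replica.
    group_models = [np.zeros(n_features) for _ in range(n_groups)]

    for step in range(steps):
        for g in range(n_groups):
            # Bulk-synchronous step within the group: each learner computes a
            # gradient on its own mini-batch, then the gradients are averaged.
            grads = []
            for _ in range(learners_per_group):
                batch_idx = rng.integers(0, n_samples, size=batch_size)
                grads.append(stochastic_gradient(group_models[g], batch_idx))
            group_models[g] -= lr * np.mean(grads, axis=0)

        # Periodic synchronization across groups: average the group replicas.
        if (step + 1) % group_sync_period == 0:
            consensus = np.mean(group_models, axis=0)
            group_models = [consensus.copy() for _ in range(n_groups)]

    return np.mean(group_models, axis=0)


w_est = hierarchical_bsp_sgd()
print("parameter error:", np.linalg.norm(w_est - w_true))
```

In a scheme of this shape, the cross-group synchronization period trades inter-node communication cost against how far the group replicas may drift apart before being reconciled, which is one way to read the execution-time versus accuracy balance the abstract refers to.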