Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems

Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang
{"title":"无需完全数据洗牌的随机梯度下降:在数据库内机器学习和深度学习系统中的应用","authors":"Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang","doi":"10.1007/s00778-024-00845-0","DOIUrl":null,"url":null,"abstract":"<p>Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on <i>block-addressable secondary storage</i> such as HDD and SSD, this full data shuffle leads to low I/O performance—the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel <i>two-level</i> data shuffling strategy named <span>CorgiPile</span>, which can <i>avoid</i> a full data shuffle while maintaining <i>comparable</i> convergence rate of SGD as if a full shuffle were performed. We further theoretically analyze the convergence behavior of <span>CorgiPile</span> and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate <span>CorgiPile</span> into PostgreSQL by introducing three new <i>physical</i> operators with optimizations. For deep learning systems, we extend single-process <span>CorgiPile</span> to multi-process <span>CorgiPile</span> for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that <span>CorgiPile</span> can achieve comparable convergence rate with the full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, <span>CorgiPile</span> is 1.6<span>\\(\\times \\)</span> <span>\\(-\\)</span>12.8<span>\\(\\times \\)</span> faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, <span>CorgiPile</span> is 1.5<span>\\(\\times \\)</span> faster than PyTorch with full data shuffle.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"44 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems\",\"authors\":\"Lijie Xu, Shuang Qiu, Binhang Yuan, Jiawei Jiang, Cedric Renggli, Shaoduo Gan, Kaan Kara, Guoliang Li, Ji Liu, Wentao Wu, Jieping Ye, Ce Zhang\",\"doi\":\"10.1007/s00778-024-00845-0\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. 
For in-DB ML systems and deep learning systems with large datasets stored on <i>block-addressable secondary storage</i> such as HDD and SSD, this full data shuffle leads to low I/O performance—the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel <i>two-level</i> data shuffling strategy named <span>CorgiPile</span>, which can <i>avoid</i> a full data shuffle while maintaining <i>comparable</i> convergence rate of SGD as if a full shuffle were performed. We further theoretically analyze the convergence behavior of <span>CorgiPile</span> and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate <span>CorgiPile</span> into PostgreSQL by introducing three new <i>physical</i> operators with optimizations. For deep learning systems, we extend single-process <span>CorgiPile</span> to multi-process <span>CorgiPile</span> for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that <span>CorgiPile</span> can achieve comparable convergence rate with the full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, <span>CorgiPile</span> is 1.6<span>\\\\(\\\\times \\\\)</span> <span>\\\\(-\\\\)</span>12.8<span>\\\\(\\\\times \\\\)</span> faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, <span>CorgiPile</span> is 1.5<span>\\\\(\\\\times \\\\)</span> faster than PyTorch with full data shuffle.</p>\",\"PeriodicalId\":501532,\"journal\":{\"name\":\"The VLDB Journal\",\"volume\":\"44 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The VLDB Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s00778-024-00845-0\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The VLDB Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00778-024-00845-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance: the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which avoids a full data shuffle while maintaining a convergence rate of SGD comparable to that of a full shuffle. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6x-12.8x faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5x faster than PyTorch with full data shuffle.
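To make the "two-level" idea concrete, the following is a minimal Python sketch of the general approach described in the abstract: the order of on-disk blocks is shuffled first (so each block is still read sequentially), and tuples are shuffled only within a small in-memory buffer of blocks before being handed to SGD. This is an illustrative sketch, not the paper's implementation or API; the names `two_level_shuffle`, `read_block`, and `buffer_blocks` are hypothetical.

```python
import random
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def two_level_shuffle(
    num_blocks: int,
    read_block: Callable[[int], Sequence[T]],
    buffer_blocks: int = 8,
    seed: int = 0,
) -> Iterator[T]:
    """Yield tuples in a two-level shuffled order:
    level 1 shuffles the order in which blocks are read (each block is still
    read sequentially); level 2 shuffles tuples inside a small in-memory
    buffer holding `buffer_blocks` blocks at a time."""
    rng = random.Random(seed)
    block_order: List[int] = list(range(num_blocks))
    rng.shuffle(block_order)                     # level 1: block-level shuffle

    buffer: List[T] = []
    for i, block_id in enumerate(block_order, start=1):
        buffer.extend(read_block(block_id))      # one sequential block read
        if i % buffer_blocks == 0:               # buffer is full
            rng.shuffle(buffer)                  # level 2: tuple-level shuffle
            yield from buffer
            buffer.clear()
    if buffer:                                   # flush the last partial buffer
        rng.shuffle(buffer)
        yield from buffer


# Illustrative usage: 10 fake blocks of 4 tuples each, buffered 3 blocks at a time.
if __name__ == "__main__":
    fake_blocks = [[b * 4 + t for t in range(4)] for b in range(10)]
    stream = two_level_shuffle(len(fake_blocks), lambda b: fake_blocks[b],
                               buffer_blocks=3)
    for sample in stream:
        pass  # feed `sample` (or a mini-batch of such samples) to one SGD step
```

The trade-off this sketch illustrates is the one the abstract describes: all disk reads remain block-sequential (good I/O), while randomness comes from the shuffled block order plus tuple-level shuffling inside the buffer, which the paper shows is enough to keep SGD's convergence rate comparable to a full shuffle.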

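The abstract also mentions extending single-process CorgiPile to a multi-process setting and integrating it into PyTorch. The sketch below is not the paper's integration; it only shows one plausible way to expose a two-level shuffle through PyTorch's data-loading machinery, reusing the hypothetical `two_level_shuffle` generator from the previous sketch and partitioning blocks across DataLoader workers.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class TwoLevelShuffleDataset(IterableDataset):
    """Hypothetical IterableDataset wrapper around a two-level shuffle.
    Each DataLoader worker gets a disjoint, strided slice of the blocks and
    runs the two-level shuffle independently over its own slice."""

    def __init__(self, num_blocks, read_block, buffer_blocks=8, seed=0):
        self.num_blocks = num_blocks
        self.read_block = read_block
        self.buffer_blocks = buffer_blocks
        self.seed = seed

    def __iter__(self):
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        num_workers = info.num_workers if info is not None else 1
        # Strided block partition: worker w reads blocks w, w+N, w+2N, ...
        my_blocks = list(range(worker_id, self.num_blocks, num_workers))
        # Reuse two_level_shuffle from the previous sketch over this slice.
        yield from two_level_shuffle(
            num_blocks=len(my_blocks),
            read_block=lambda i: self.read_block(my_blocks[i]),
            buffer_blocks=self.buffer_blocks,
            seed=self.seed + worker_id,          # different order per worker
        )


# Illustrative usage (assumes `read_block` loads one block of samples):
# loader = DataLoader(TwoLevelShuffleDataset(num_blocks, read_block),
#                     batch_size=None, num_workers=4)
```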
