Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads

Zhuangwei Kang, Ziran Min, Shuang Zhou, Yogesh D. Barve, A. Gokhale
{"title":"Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads","authors":"Zhuangwei Kang, Ziran Min, Shuang Zhou, Yogesh D. Barve, A. Gokhale","doi":"10.1109/ISORC58943.2023.00023","DOIUrl":null,"url":null,"abstract":"The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data-loading solution for deep learning training jobs. DLCache supports the low-latency and high-throughput I/O requirements of DL training jobs using cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane, which are seamlessly integrated with the Kubernetes ecosystem thereby providing ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache is designed with an on-the-fly and best-effort caching mechanism that can auto-scale the cache according to runtime configurations, resource constraints, and training speeds. DLCache considers both frequency and freshness of data access as well as data preparation costs in making effective cache eviction decisions that result in reduced completion time for deep learning workloads. Results of evaluating DLCache on the Imagenet-ILSVRC and LibriSpeech datasets under various runtime configurations and simulated GPU computation time experiments showed up to a 147.49% and 156.67% improvement in data loading throughput, respectively, compared to the popular PyTorch framework.","PeriodicalId":281426,"journal":{"name":"2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISORC58943.2023.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and the provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data loading solution for deep learning (DL) training jobs. DLCache supports the low-latency and high-throughput I/O requirements of DL training jobs, using cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane, all seamlessly integrated with the Kubernetes ecosystem, thereby providing ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache employs an on-the-fly, best-effort caching mechanism that auto-scales the cache according to runtime configurations, resource constraints, and training speeds. DLCache considers the frequency and freshness of data access, as well as data preparation costs, when making cache eviction decisions, resulting in reduced completion times for deep learning workloads. Evaluating DLCache on the ImageNet-ILSVRC and LibriSpeech datasets under various runtime configurations and simulated GPU computation times showed up to 147.49% and 156.67% improvements in data loading throughput, respectively, compared to the popular PyTorch framework.
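
The abstract says the cache auto-scales with runtime configurations, resource constraints, and training speed, but does not state the policy. The sketch below is a minimal illustration of one plausible best-effort rule, not DLCache's actual algorithm; the function name, thresholds, step size, and reserve are all hypothetical:

```python
def next_cache_size(current_bytes: int,
                    loader_tput: float,       # samples/s delivered by the data loader
                    gpu_tput: float,          # samples/s the training loop consumes
                    free_mem: int,            # free memory on the node, in bytes
                    step: int = 512 * 2**20,  # grow/shrink granularity: 512 MiB (illustrative)
                    reserve: int = 2 * 2**30  # memory always left to the trainer: 2 GiB (illustrative)
                    ) -> int:
    """Best-effort cache sizing: grow while data loading is the bottleneck
    and memory headroom remains; shrink once loading outpaces the GPU."""
    if loader_tput < gpu_tput and free_mem - step > reserve:
        return current_bytes + step      # loader-bound: cache more data
    if loader_tput > 1.2 * gpu_tput and current_bytes >= step:
        return current_bytes - step      # GPU-bound with margin: give memory back
    return current_bytes                 # steady state: leave the cache alone
```

The underlying idea is simply that enlarging the cache only helps while the data loader, rather than the GPU, is the bottleneck, which is why the rule compares loading throughput against training-consumption throughput.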
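Similarly, the eviction policy is described only as weighing access frequency, freshness, and data preparation cost. A weighted linear score is one simple way to combine those three signals; the sketch below is an illustration under that assumption, and every identifier and weight in it is hypothetical rather than taken from the paper:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    key: str
    size_bytes: int
    prep_cost_s: float          # time to re-fetch and decode the item from the cloud bucket
    hits: int = 0               # access frequency
    last_access: float = field(default_factory=time.time)

def eviction_score(e: CacheEntry, now: float,
                   w_freq: float = 1.0, w_fresh: float = 1.0,
                   w_prep: float = 1.0) -> float:
    """Lower score = better eviction candidate. Combines access frequency,
    freshness (recency), and the cost of re-preparing the item if it is
    requested again after eviction. Weights are illustrative."""
    freshness = 1.0 / (1.0 + (now - e.last_access))   # decays as the entry ages
    return w_freq * e.hits + w_fresh * freshness + w_prep * e.prep_cost_s

def evict_until(entries: list[CacheEntry], bytes_needed: int) -> list[CacheEntry]:
    """Evict the lowest-scoring entries until enough space is reclaimed."""
    now = time.time()
    victims, freed = [], 0
    for e in sorted(entries, key=lambda e: eviction_score(e, now)):
        if freed >= bytes_needed:
            break
        victims.append(e)
        freed += e.size_bytes
    return victims
```

Including the preparation cost in the score means a rarely used item that is expensive to re-fetch and decode can still outrank a cheap, frequently re-creatable one, which matches the abstract's stated goal of reducing overall job completion time rather than just maximizing hit rate.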