Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads

Zhuangwei Kang, Ziran Min, Shuang Zhou, Yogesh D. Barve, A. Gokhale
{"title":"Dataset Placement and Data Loading Optimizations for Cloud-Native Deep Learning Workloads","authors":"Zhuangwei Kang, Ziran Min, Shuang Zhou, Yogesh D. Barve, A. Gokhale","doi":"10.1109/ISORC58943.2023.00023","DOIUrl":null,"url":null,"abstract":"The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data-loading solution for deep learning training jobs. DLCache supports the low-latency and high-throughput I/O requirements of DL training jobs using cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane, which are seamlessly integrated with the Kubernetes ecosystem thereby providing ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache is designed with an on-the-fly and best-effort caching mechanism that can auto-scale the cache according to runtime configurations, resource constraints, and training speeds. DLCache considers both frequency and freshness of data access as well as data preparation costs in making effective cache eviction decisions that result in reduced completion time for deep learning workloads. Results of evaluating DLCache on the Imagenet-ILSVRC and LibriSpeech datasets under various runtime configurations and simulated GPU computation time experiments showed up to a 147.49% and 156.67% improvement in data loading throughput, respectively, compared to the popular PyTorch framework.","PeriodicalId":281426,"journal":{"name":"2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 26th International Symposium on Real-Time Distributed Computing (ISORC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISORC58943.2023.00023","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The primary challenge facing cloud-based deep learning systems is the need for efficient orchestration of large-scale datasets with diverse data formats and the provisioning of high-performance data loading capabilities. To that end, we present DLCache, a cloud-native dataset management and runtime-aware data loading solution for deep learning (DL) training jobs. DLCache supports the low-latency and high-throughput I/O requirements of DL training jobs, using cloud buckets as persistent data storage and a dedicated computation cluster for training. DLCache comprises four layers: a control plane, a metadata plane, an operator plane, and a multi-tier storage plane, all seamlessly integrated with the Kubernetes ecosystem, thereby providing ease of deployment, scalability, and self-healing. For efficient memory utilization, DLCache employs an on-the-fly, best-effort caching mechanism that auto-scales the cache according to runtime configurations, resource constraints, and training speeds. DLCache considers the frequency and freshness of data access, as well as data preparation costs, when making cache eviction decisions, resulting in reduced completion times for deep learning workloads. Evaluating DLCache on the ImageNet-ILSVRC and LibriSpeech datasets under various runtime configurations and simulated GPU computation times showed up to 147.49% and 156.67% improvements in data loading throughput, respectively, compared to the popular PyTorch framework.
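
The abstract says the cache auto-scales with runtime configurations, resource constraints, and training speed, but does not state the policy. The sketch below is a minimal illustration of one plausible best-effort rule, not DLCache's actual algorithm; the function name, thresholds, step size, and reserve are all hypothetical:

```python
def next_cache_size(current_bytes: int,
                    loader_tput: float,       # samples/s delivered by the data loader
                    gpu_tput: float,          # samples/s the training loop consumes
                    free_mem: int,            # free memory on the node, in bytes
                    step: int = 512 * 2**20,  # grow/shrink granularity: 512 MiB (illustrative)
                    reserve: int = 2 * 2**30  # memory always left to the trainer: 2 GiB (illustrative)
                    ) -> int:
    """Best-effort cache sizing: grow while data loading is the bottleneck
    and memory headroom remains; shrink once loading outpaces the GPU."""
    if loader_tput < gpu_tput and free_mem - step > reserve:
        return current_bytes + step      # loader-bound: cache more data
    if loader_tput > 1.2 * gpu_tput and current_bytes >= step:
        return current_bytes - step      # GPU-bound with margin: give memory back
    return current_bytes                 # steady state: leave the cache alone
```

The underlying idea is simply that enlarging the cache only helps while the data loader, rather than the GPU, is the bottleneck, which is why the rule compares loading throughput against training-consumption throughput.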
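Similarly, the eviction policy is described only as weighing access frequency, freshness, and data preparation cost. A weighted linear score is one simple way to combine those three signals; the sketch below is an illustration under that assumption, and every identifier and weight in it is hypothetical rather than taken from the paper:

```python
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    key: str
    size_bytes: int
    prep_cost_s: float          # time to re-fetch and decode the item from the cloud bucket
    hits: int = 0               # access frequency
    last_access: float = field(default_factory=time.time)

def eviction_score(e: CacheEntry, now: float,
                   w_freq: float = 1.0, w_fresh: float = 1.0,
                   w_prep: float = 1.0) -> float:
    """Lower score = better eviction candidate. Combines access frequency,
    freshness (recency), and the cost of re-preparing the item if it is
    requested again after eviction. Weights are illustrative."""
    freshness = 1.0 / (1.0 + (now - e.last_access))   # decays as the entry ages
    return w_freq * e.hits + w_fresh * freshness + w_prep * e.prep_cost_s

def evict_until(entries: list[CacheEntry], bytes_needed: int) -> list[CacheEntry]:
    """Evict the lowest-scoring entries until enough space is reclaimed."""
    now = time.time()
    victims, freed = [], 0
    for e in sorted(entries, key=lambda e: eviction_score(e, now)):
        if freed >= bytes_needed:
            break
        victims.append(e)
        freed += e.size_bytes
    return victims
```

Including the preparation cost in the score means a rarely used item that is expensive to re-fetch and decode can still outrank a cheap, frequently re-creatable one, which matches the abstract's stated goal of reducing overall job completion time rather than just maximizing hit rate.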