Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training

IF 1.5 · CAS Zone 3 (Computer Science) · JCR Q4, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · ACM Transactions on Architecture and Code Optimization · Pub Date: 2023-10-25 · DOI: 10.1145/3630108

Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei


Citations: 0

Abstract

In recent years, driven by increases in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models on accelerators such as GPUs often requires large amounts of data to be transferred iteratively from NVMe SSDs to GPU memory. Much recent work has focused on data transfer during the pre-processing phase and has introduced techniques such as multiprocessing and GPUDirect Storage (GDS) to accelerate it. However, tensor data produced during training (such as checkpoints, logs, and intermediate feature maps), whose transfer is also time-consuming, is often moved using traditional serial transfer methods with long I/O paths. In this paper, based on GDS technology, we built Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor I/O throughput, we optimized the traditional data I/O process. We also proposed a tensor I/O algorithm that is aware of both the data and the runtime context. During model training, Fastensor selects the most suitable data transfer tool for the current tensor from a candidate set of tools. The optimal tool is looked up in a dictionary generated by our adaptive exploration algorithm during the first few training iterations. We used Fastensor's unified interface to test the read/write bandwidth and energy consumption of different transfer tools for tensor blocks of different sizes. We found that the execution efficiency of a tensor transfer tool depends on both the tensor block size and the runtime context. We then deployed Fastensor in the widely used PyTorch deep learning framework. We showed that, with the same hardware configuration, Fastensor performs better in the typical scenarios of model parameter saving and intermediate feature map transfer. When used for model parameter saving, Fastensor achieves a 5.37x read performance improvement compared to torch.save(). When used for intermediate feature map transfer, Fastensor can increase the supported training batch size by 20x, while the total read and write speed is increased by 2.96x compared to the torch I/O API.
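The selection step described in the abstract (explore the candidate tools early, record the winner in a dictionary, then reuse it) can be illustrated with a minimal sketch. This is not Fastensor's implementation: the class name `ToolSelector`, the helper `size_bucket`, the power-of-two size bucketing, and the choice to explore the first time a (size, context) key is seen rather than during a fixed number of iterations are all assumptions made for illustration. Candidate tools are modeled as plain callables rather than real GDS or PyTorch I/O calls.

```python
import time
from typing import Callable, Dict, Hashable, Tuple

# All names here are hypothetical; none are taken from the Fastensor paper.
Tool = Callable[[bytes], None]   # a candidate transfer routine
Key = Tuple[int, Hashable]       # (size bucket, runtime context)


def size_bucket(num_bytes: int) -> int:
    """Group tensor sizes by power of two so similar sizes share one entry."""
    return max(num_bytes, 1).bit_length()


class ToolSelector:
    """Pick the fastest candidate tool per (size bucket, context) key."""

    def __init__(self, tools: Dict[str, Tool],
                 timer: Callable[[], float] = time.perf_counter):
        self.tools = tools
        self.timer = timer
        self.best: Dict[Key, str] = {}  # the exploration dictionary

    def transfer(self, payload: bytes, context: Hashable = "default") -> str:
        key = (size_bucket(len(payload)), context)
        name = self.best.get(key)
        if name is None:
            # Exploration: run every candidate once on this payload,
            # time each run, and remember the fastest for this key.
            timings = {}
            for candidate, fn in self.tools.items():
                start = self.timer()
                fn(payload)
                timings[candidate] = self.timer() - start
            name = min(timings, key=timings.get)
            self.best[key] = name
        else:
            # Exploitation: reuse the recorded winner without re-timing.
            self.tools[name](payload)
        return name
```

In this sketch a caller would construct `ToolSelector({"tool_a": fn_a, "tool_b": fn_b})` and call `transfer(data, context)` on every I/O request; the `context` argument stands in for whatever runtime state the real system observes.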




Source journal

ACM Transactions on Architecture and Code Optimization (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore: 3.60

Self-citation rate: 6.20%

Review time: 6-12 weeks

Journal description: ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.