Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
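To make the tradeoff concrete, the toy sketch below tabulates, for a single hypothetical job, how completion time and GPU-hour cost change with the number of rented GPUs under an assumed Amdahl-style speedup function; the speedup function and job size are illustrative assumptions, not the paper's model or its optimal policy.

```python
# Illustrative only: tabulate the cost/response-time tradeoff for one job.
# The speedup function and job size below are hypothetical, not from the paper.

def speedup(k: int, p: float = 0.9) -> float:
    """Amdahl-style speedup on k GPUs with assumed parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / k)

job_size = 100.0  # inherent work, in single-GPU hours (hypothetical)

for k in (1, 2, 4, 8, 16, 32):
    completion_time = job_size / speedup(k)  # hours until the job finishes
    gpu_hours = k * completion_time          # what the user pays for
    print(f"{k:>3} GPUs: finishes in {completion_time:7.2f} h, "
          f"costs {gpu_hours:8.2f} GPU-hours")
```

Beyond a handful of GPUs, each extra GPU buys little additional speedup but keeps adding to the GPU-hour bill, which is exactly the diminishing-returns tension that the budget constraint forces a rental policy to manage.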
{"title":"How to Rent GPUs on a Budget","authors":"Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter","doi":"arxiv-2406.15560","DOIUrl":"https://doi.org/arxiv-2406.15560","url":null,"abstract":"The explosion in Machine Learning (ML) over the past ten years has led to a\u0000dramatic increase in demand for GPUs to train ML models. Because it is\u0000prohibitively expensive for most users to build and maintain a large GPU\u0000cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have\u0000seen explosive growth in demand for renting cloud-based GPUs. In this\u0000cloud-computing paradigm, a user must specify their demand for GPUs at every\u0000moment in time, and will pay for every GPU-hour they use. ML training jobs are\u0000known to be parallelizable to different degrees. Given a stream of ML training\u0000jobs, a user typically wants to minimize the mean response time across all\u0000jobs. Here, the response time of a job denotes the time from when a job arrives\u0000until it is complete. Additionally, the user is constrained by some operating\u0000budget. Specifically, in this paper the user is constrained to use no more than\u0000$b$ GPUs per hour, over a long-run time average. The question is how to\u0000minimize mean response time while meeting the budget constraint. Because\u0000training jobs receive a diminishing marginal benefit from running on additional\u0000GPUs, allocating too many GPUs to a single training job can dramatically\u0000increase the overall cost paid by the user. Hence, an optimal rental policy\u0000must balance a tradeoff between training cost and mean response time. This\u0000paper derives the optimal rental policy for a stream of training jobs where the\u0000jobs have different levels of parallelizability (specified by a speedup\u0000function) and different job sizes (amounts of inherent work). We make almost no\u0000assumptions about the arrival process and about the job size distribution. Our\u0000optimal policy specifies how many GPUs to rent at every moment in time and how\u0000to allocate these GPUs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement: under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, no single speculation length works best for all workloads under different system loads. Based on these observations, we develop SmartSpec, a dynamic framework. SmartSpec dynamically determines the best speculation length for each request, from no speculation (0 tokens) to many tokens, and hence the associated speculative execution cost, based on a new metric called goodput, which characterizes the currently observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
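As a rough illustration of the goodput idea (not SmartSpec's actual formulas), the sketch below picks a speculation length by maximizing estimated tokens emitted per unit of step time, under a simplified independent-acceptance model and a toy latency model; the cost constants and acceptance assumption are placeholders.

```python
# Hedged sketch: choose a speculation length by maximizing estimated goodput
# (tokens emitted per unit time). The acceptance and latency models below are
# simplified placeholders, not SmartSpec's actual formulas.

def expected_tokens(k: int, acc: float) -> float:
    """Expected tokens emitted per verification step with k drafted tokens,
    assuming each draft token is accepted independently with probability acc
    (the accepted prefix plus one bonus/correction token)."""
    return sum(acc ** i for i in range(k + 1))

def step_latency(k: int, batch_size: int,
                 draft_cost: float = 1.0, verify_cost: float = 5.0) -> float:
    """Toy latency model: drafting k tokens plus one batched verification pass."""
    return k * draft_cost + verify_cost * (1.0 + 0.01 * batch_size * k)

def best_speculation_length(acc: float, batch_size: int, max_k: int = 8) -> int:
    goodput = {k: expected_tokens(k, acc) / step_latency(k, batch_size)
               for k in range(max_k + 1)}
    return max(goodput, key=goodput.get)

print(best_speculation_length(acc=0.8, batch_size=4))   # light load, accurate draft
print(best_speculation_length(acc=0.4, batch_size=64))  # heavy load, weak draft
```

Under light load with an accurate draft model the estimate favors longer speculation, while under heavy load or low acceptance it shrinks toward no speculation, matching the qualitative behavior described above.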
{"title":"Optimizing Speculative Decoding for Serving Large Language Models Using Goodput","authors":"Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang","doi":"arxiv-2406.14066","DOIUrl":"https://doi.org/arxiv-2406.14066","url":null,"abstract":"Reducing the inference latency of large language models (LLMs) is crucial,\u0000and speculative decoding (SD) stands out as one of the most effective\u0000techniques. Rather than letting the LLM generate all tokens directly,\u0000speculative decoding employs effective proxies to predict potential outputs,\u0000which are then verified by the LLM without compromising the generation quality.\u0000Yet, deploying SD in real online LLM serving systems (with continuous batching)\u0000does not always yield improvement -- under higher request rates or low\u0000speculation accuracy, it paradoxically increases latency. Furthermore, there is\u0000no best speculation length work for all workloads under different system loads.\u0000Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec\u0000dynamically determines the best speculation length for each request (from 0,\u0000i.e., no speculation, to many tokens) -- hence the associated speculative\u0000execution costs -- based on a new metric called goodput, which characterizes\u0000the current observed load of the entire system and the speculation accuracy. We\u0000show that SmartSpec consistently reduces average request latency by up to 3.2x\u0000compared to non-speculative decoding baselines across different sizes of target\u0000models, draft models, request rates, and datasets. Moreover, SmartSpec can be\u0000applied to different styles of speculative decoding, including traditional,\u0000model-based approaches as well as model-free methods like prompt lookup and\u0000tree-style decoding.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Large Language Model (LLM) services such as ChatGPT and Claude 3 have transformed business operations and academic research by effortlessly enabling new opportunities. However, due to data-sharing restrictions, sectors such as healthcare and finance prefer to deploy local LLM applications on costly hardware resources. This scenario requires balancing the effectiveness advantages of LLMs against significant financial burdens. Additionally, the rapid evolution of models increases the frequency and redundancy of benchmarking efforts. Existing benchmarking toolkits, which typically focus on effectiveness, often overlook economic considerations, making their findings less applicable to practical scenarios. To address these challenges, we introduce CEBench, an open-source toolkit specifically designed for multi-objective benchmarking that focuses on the critical trade-offs between expenditure and effectiveness required for LLM deployments. CEBench allows for easy modifications through configuration files, enabling stakeholders to effectively assess and optimize these trade-offs. This capability supports decision-making processes aimed at maximizing effectiveness while minimizing cost. By streamlining the evaluation process and emphasizing cost-effectiveness, CEBench seeks to facilitate the development of economically viable AI solutions across various industries and research fields. The code and demonstration are available at https://github.com/amademicnoboday12/CEBench.
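As a loose illustration of the expenditure-versus-effectiveness trade-off that CEBench is built to explore (this is not CEBench's API), the sketch below filters a set of hypothetical pipeline configurations down to the Pareto-optimal ones in (cost, accuracy); all names and numbers are made up.

```python
# Illustration only (not CEBench's API): keep the deployment options that are
# Pareto-optimal in (hourly cost, task accuracy). All entries are hypothetical.

results = [
    # (name, hourly cost in $, task accuracy)
    ("local-7B",    1.2, 0.71),
    ("local-13B",   2.9, 0.78),
    ("local-70B",  11.5, 0.84),
    ("hosted-api",  4.0, 0.86),
]

def pareto_front(rows):
    front = []
    for name, cost, acc in rows:
        dominated = any(c <= cost and a >= acc and (c, a) != (cost, acc)
                        for _, c, a in rows)
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda r: r[1])

for name, cost, acc in pareto_front(results):
    print(f"{name:<12} ${cost:5.2f}/h  accuracy={acc:.2f}")
```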
{"title":"CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines","authors":"Wenbo Sun, Jiaqi Wang, Qiming Guo, Ziyu Li, Wenlu Wang, Rihan Hai","doi":"arxiv-2407.12797","DOIUrl":"https://doi.org/arxiv-2407.12797","url":null,"abstract":"Online Large Language Model (LLM) services such as ChatGPT and Claude 3 have\u0000transformed business operations and academic research by effortlessly enabling\u0000new opportunities. However, due to data-sharing restrictions, sectors such as\u0000healthcare and finance prefer to deploy local LLM applications using costly\u0000hardware resources. This scenario requires a balance between the effectiveness\u0000advantages of LLMs and significant financial burdens. Additionally, the rapid\u0000evolution of models increases the frequency and redundancy of benchmarking\u0000efforts. Existing benchmarking toolkits, which typically focus on\u0000effectiveness, often overlook economic considerations, making their findings\u0000less applicable to practical scenarios. To address these challenges, we\u0000introduce CEBench, an open-source toolkit specifically designed for\u0000multi-objective benchmarking that focuses on the critical trade-offs between\u0000expenditure and effectiveness required for LLM deployments. CEBench allows for\u0000easy modifications through configuration files, enabling stakeholders to\u0000effectively assess and optimize these trade-offs. This strategic capability\u0000supports crucial decision-making processes aimed at maximizing effectiveness\u0000while minimizing cost impacts. By streamlining the evaluation process and\u0000emphasizing cost-effectiveness, CEBench seeks to facilitate the development of\u0000economically viable AI solutions across various industries and research fields.\u0000The code and demonstration are available in\u0000url{https://github.com/amademicnoboday12/CEBench}.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141746437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
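A minimal sketch of the overlap idea in (iii), under stated assumptions: the model state is snapshotted cheaply in memory and handed to a background thread, so the next iterations proceed while the previous checkpoint is persisted. The thread-and-pickle mechanism is a generic stand-in; FastPersist's NVMe and parallel-SSD write optimizations are not modeled here.

```python
# Minimal sketch of overlapping checkpoint writes with training computation.
# Generic stand-in only; FastPersist's NVMe/parallel-SSD optimizations are not shown.
import copy, pickle, threading

def persist(snapshot, path):
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)           # slow I/O runs off the critical path

def checkpoint_async(model_state, path):
    snapshot = copy.deepcopy(model_state)  # cheap in-memory copy on the critical path
    t = threading.Thread(target=persist, args=(snapshot, path), daemon=True)
    t.start()
    return t

model_state = {"step": 0, "weights": [0.0] * 100_000}  # toy stand-in for real state
pending = None
for step in range(1, 4):
    model_state["step"] = step             # stand-in for one training iteration
    if pending is not None:
        pending.join()                     # make sure the previous write finished
    pending = checkpoint_async(model_state, f"ckpt_{step}.pkl")
pending.join()
```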
{"title":"FastPersist: Accelerating Model Checkpointing in Deep Learning","authors":"Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He","doi":"arxiv-2406.13768","DOIUrl":"https://doi.org/arxiv-2406.13768","url":null,"abstract":"Model checkpoints are critical Deep Learning (DL) artifacts that enable fault\u0000tolerance for training and downstream applications, such as inference. However,\u0000writing checkpoints to persistent storage, and other I/O aspects of DL\u0000training, are mostly ignored by compute-focused optimization efforts for faster\u0000training of rapidly growing models and datasets. Towards addressing this\u0000imbalance, we propose FastPersist to accelerate checkpoint creation in DL\u0000training. FastPersist combines three novel techniques: (i) NVMe optimizations\u0000for faster checkpoint writes to SSDs, (ii) efficient write parallelism using\u0000the available SSDs in training environments, and (iii) overlapping\u0000checkpointing with independent training computations. Our evaluation using real\u0000world dense and sparse DL models shows that FastPersist creates checkpoints in\u0000persistent storage up to 116x faster than baseline, and enables per-iteration\u0000checkpointing with negligible overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic
For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, GPU chipsets manufactured by Intel and AMD have recently cut into this market and can now be found in some of the world's fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at Oak Ridge National Laboratory in Tennessee as the top system in the world. This system features AMD Instinct MI250X GPUs and is currently the only true exascale computer in the world. The first framework that enabled support for heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009. Since then, a number of frameworks have been developed to support vendor-agnostic heterogeneous environments, including OpenMP, OpenCL, Kokkos, and SYCL. SYCL, which combines the concepts of OpenCL with the flexibility of single-source C++, is one of the more promising programming models for heterogeneous computing devices. One key advantage of this framework is that it provides a higher-level programming interface that abstracts away more of the hardware details than the other frameworks do. This makes SYCL easier to learn and to maintain across multiple architectures and vendors. In recent years, there has been growing interest in using heterogeneous computing architectures to accelerate molecular dynamics simulations. Some of the more popular molecular dynamics packages include Amber, NAMD, and GROMACS. However, to the best of our knowledge, only GROMACS has been successfully ported to SYCL to date. In this paper, we compare the performance of GROMACS compiled with the SYCL and CUDA frameworks on a variety of standard GROMACS benchmarks. In addition, we compare its performance across three Nvidia GPU chipsets: the P100, V100, and A100.
{"title":"A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models","authors":"L. Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic","doi":"arxiv-2406.10362","DOIUrl":"https://doi.org/arxiv-2406.10362","url":null,"abstract":"For many years, systems running Nvidia-based GPU architectures have dominated\u0000the heterogeneous supercomputer landscape. However, recently GPU chipsets\u0000manufactured by Intel and AMD have cut into this market and can now be found in\u0000some of the worlds fastest supercomputers. The June 2023 edition of the TOP500\u0000list of supercomputers ranks the Frontier supercomputer at the Oak Ridge\u0000National Laboratory in Tennessee as the top system in the world. This system\u0000features AMD Instinct 250 X GPUs and is currently the only true exascale\u0000computer in the world.The first framework that enabled support for\u0000heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009.\u0000Since then a number of frameworks have been developed to support vendor\u0000agnostic heterogeneous environments including OpenMP, OpenCL, Kokkos, and SYCL.\u0000SYCL, which combines the concepts of OpenCL with the flexibility of\u0000single-source C++, is one of the more promising programming models for\u0000heterogeneous computing devices. One key advantage of this framework is that it\u0000provides a higher-level programming interface that abstracts away many of the\u0000hardware details than the other frameworks. This makes SYCL easier to learn and\u0000to maintain across multiple architectures and vendors. In n recent years, there\u0000has been growing interest in using heterogeneous computing architectures to\u0000accelerate molecular dynamics simulations. Some of the more popular molecular\u0000dynamics simulations include Amber, NAMD, and Gromacs. However, to the best of\u0000our knowledge, only Gromacs has been successfully ported to SYCL to date. In\u0000this paper, we compare the performance of GROMACS compiled using the SYCL and\u0000CUDA frameworks for a variety of standard GROMACS benchmarks. In addition, we\u0000compare its performance across three different Nvidia GPU chipsets, P100, V100,\u0000and A100.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we propose a dynamic model of Common Cause Failures (CCF) that generates common cause events over time. The proposed model is a generalization of the Binomial Failure Rate model (Atwood model) that can generate staggered failures of multiple components due to a common cause. We implement the model using the statechart formalism; a similar implementation can be adopted in other modeling languages such as Petri nets or hybrid stochastic automata. The presented model was integrated into a Dynamic PRA study.
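For intuition, the Monte Carlo sketch below follows the classic Binomial Failure Rate setup (common cause shocks arrive as a Poisson process and each component fails independently with some probability at each shock) and adds a random per-component delay so failures triggered by one shock appear staggered in time. The exponential delay is an assumed stagger mechanism for illustration only, not the paper's exact generalization.

```python
# Illustrative Monte Carlo sketch of staggered common cause failures.
# Shock arrivals and per-shock failures follow the Binomial Failure Rate
# (Atwood) setup; the exponential per-component delay is an assumed stagger
# mechanism, not the paper's exact model.
import random

def simulate_ccf(n_components=4, shock_rate=0.01, p_fail=0.5,
                 mean_stagger=2.0, horizon=1000.0, seed=1):
    rng = random.Random(seed)
    events = []                                  # (failure time, component id)
    t = 0.0
    while True:
        t += rng.expovariate(shock_rate)         # next common cause shock
        if t > horizon:
            break
        for comp in range(n_components):
            if rng.random() < p_fail:            # component affected by this shock
                delay = rng.expovariate(1.0 / mean_stagger)
                events.append((t + delay, comp))
    return sorted(events)

for when, comp in simulate_ccf()[:10]:
    print(f"t={when:8.2f}  component {comp} fails")
```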
{"title":"Modeling Common Cause Failure in Dynamic PRA","authors":"Claudia PicocoEDF R&D, Valentin RychkovEDF R&D","doi":"arxiv-2406.08879","DOIUrl":"https://doi.org/arxiv-2406.08879","url":null,"abstract":"In this paper we propose a dynamic model of Common Cause Failures (CCF) that\u0000allows to generate common cause events in time. The proposed model is a\u0000generalization of Binomial Failure Rate Model (Atwood model) that can generate\u0000staggered failures of multiple components due to a common cause. We implement\u0000the model using statechart formalism, a similar implementation can be adopted\u0000in other modeling languages like Petri Nets or Hybrid Stochastic Automata. The\u0000presented model was integrated in a Dynamic PRA study.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Jonathan Gietz, Konstantin Lübeck, Oliver Bringmann
Statistical models are widely used to estimate the performance of commercial off-the-shelf (COTS) AI hardware accelerators. However, training statistical performance models often requires vast amounts of data, which demands a significant time investment and can be difficult when hardware availability is limited. To alleviate this problem, we propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy. Our approach leverages knowledge of the target hardware architecture and initial parameter sweeps to identify a set of Performance Representatives (PR) for deep neural network (DNN) layers. These PRs are then used for benchmarking, building a statistical performance model, and making estimations. This targeted approach drastically reduces the number of training samples needed, compared to random sampling, while achieving better estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-DNN estimations with fewer than 10,000 training samples. The results demonstrate the superiority of our method for single-layer estimations compared to models trained with randomly sampled datasets of the same size.
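The sketch below illustrates the overall flow under stated assumptions: sweep layer configurations, select one representative per group (a simple MACs-based bucketing stands in for the paper's architecture-aware selection), benchmark only the representatives, and fit a simple statistical model used to estimate all remaining configurations.

```python
# Hedged sketch of the Performance Representatives idea: benchmark a few
# representative layer configurations and fit a model to estimate the rest.
# The bucketing rule, the linear latency model, and the synthetic "measurement"
# are stand-ins for the paper's architecture-aware selection and real benchmarks.
import random

random.seed(0)
configs = [{"macs": random.randint(1, 512) * 1_000_000} for _ in range(200)]

def measure_latency(cfg):
    """Stand-in for running the layer on the accelerator (returns ms)."""
    return 0.004 * cfg["macs"] / 1e6 + random.uniform(-0.05, 0.05) + 0.3

# 1) Pick representatives: one configuration per MACs bucket, not random samples.
buckets = {}
for cfg in configs:
    buckets.setdefault(cfg["macs"] // 64_000_000, cfg)
representatives = list(buckets.values())

# 2) Benchmark only the representatives and fit latency = a * MACs + b.
xs = [cfg["macs"] / 1e6 for cfg in representatives]
ys = [measure_latency(cfg) for cfg in representatives]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - a * x_mean

# 3) Estimate any other configuration without benchmarking it.
print(f"benchmarked {len(representatives)} representatives out of {len(configs)} configs")
print(f"estimated latency for a 300 MMAC layer: {a * 300 + b:.3f} ms")
```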
{"title":"It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives","authors":"Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Jonathan Gietz, Konstantin Lübeck, Oliver Bringmann","doi":"arxiv-2406.08330","DOIUrl":"https://doi.org/arxiv-2406.08330","url":null,"abstract":"Statistical models are widely used to estimate the performance of commercial\u0000off-the-shelf (COTS) AI hardware accelerators. However, training of statistical\u0000performance models often requires vast amounts of data, leading to a\u0000significant time investment and can be difficult in case of limited hardware\u0000availability. To alleviate this problem, we propose a novel performance\u0000modeling methodology that significantly reduces the number of training samples\u0000while maintaining good accuracy. Our approach leverages knowledge of the target\u0000hardware architecture and initial parameter sweeps to identify a set of\u0000Performance Representatives (PR) for deep neural network (DNN) layers. These\u0000PRs are then used for benchmarking, building a statistical performance model,\u0000and making estimations. This targeted approach drastically reduces the number\u0000of training samples needed, opposed to random sampling, to achieve a better\u0000estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) of as\u0000low as 0.02% for single-layer estimations and 0.68% for whole DNN estimations\u0000with less than 10000 training samples. The results demonstrate the superiority\u0000of our method for single-layer estimations compared to models trained with\u0000randomly sampled datasets of the same size.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim
As DNNs are widely adopted in various application domains while demanding increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models represented in the ONNX graph format generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly capture contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 384x over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.
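Since ONNXim consumes models in the ONNX graph format, a typical workflow starts by exporting the DNN from its training framework; the minimal PyTorch sketch below produces such a file. The toy model and file name are hypothetical, and how the exported file is then passed to the simulator is documented in the ONNXim repository rather than shown here.

```python
# Minimal sketch: export a toy PyTorch model to the ONNX format that the
# simulator takes as input. Model, shapes, and file name are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)   # batch size is fixed by the trace
torch.onnx.export(model, dummy_input, "toy_cnn.onnx", opset_version=17)
print("wrote toy_cnn.onnx")                 # point the simulator at this file
```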
{"title":"ONNXim: A Fast, Cycle-level Multi-core NPU Simulator","authors":"Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim","doi":"arxiv-2406.08051","DOIUrl":"https://doi.org/arxiv-2406.08051","url":null,"abstract":"As DNNs are widely adopted in various application domains while demanding\u0000increasingly higher compute and memory requirements, designing efficient and\u0000performant NPUs (Neural Processing Units) is becoming more important. However,\u0000existing architectural NPU simulators lack support for high-speed simulation,\u0000multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or\u0000different deep learning frameworks. To address these limitations, this work\u0000proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN\u0000serving systems. It takes DNN models represented in the ONNX graph format\u0000generated from various deep learning frameworks for ease of simulation. In\u0000addition, based on the observation that typical NPU cores process tensor tiles\u0000from on-chip scratchpad memory with deterministic compute latency, we forgo a\u0000detailed modeling for the computation while still preserving simulation\u0000accuracy. ONNXim also preserves dependencies between compute and tile DMAs.\u0000Meanwhile, the DRAM and NoC are modeled in cycle-level to properly model\u0000contention among multiple cores that can execute different DNN models for\u0000multi-tenancy. Consequently, ONNXim is significantly faster than existing\u0000simulators (e.g., by up to 384x over Accel-sim) and enables various case\u0000studies, such as multi-tenant NPUs, that were previously impractical due to\u0000slow speed and/or lack of functionalities. ONNXim is publicly available at\u0000https://github.com/PSAL-POSTECH/ONNXim.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
Training Large Language Models (LLMs) is extremely memory-hungry. To address this problem, existing work such as ZeRO-Offload exploits the combination of CPU and GPU memory for the training process. Such techniques largely democratize billion-scale model training, making it possible to train with a few consumer graphics cards. However, based on our observations, existing frameworks often provide coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and I/O. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43x to 2.71x compared to state-of-the-art training systems.
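The sketch below illustrates the general chunk-based offloading idea in a framework-agnostic way: parameters are grouped into fixed-size chunks kept in CPU memory and staged onto the GPU only while their layer needs them, with evicted chunks written back. The chunk size, the FIFO eviction rule, and the access pattern are illustrative assumptions, not ProTrain's actual policy.

```python
# Hedged sketch of chunk-based model state management with CPU-GPU staging.
# Chunk size, FIFO eviction, and the access pattern are assumptions for
# illustration, not ProTrain's actual policy.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
CHUNK_ELEMS = 1_000_000
MAX_GPU_CHUNKS = 2

# Model state as CPU-resident chunks (a stand-in for real parameter groups).
chunks = [torch.zeros(CHUNK_ELEMS, pin_memory=torch.cuda.is_available())
          for _ in range(8)]
resident = {}                                    # chunk index -> staged copy

def fetch(idx):
    """Stage a chunk onto the compute device, evicting the oldest if needed."""
    if idx not in resident:
        if len(resident) >= MAX_GPU_CHUNKS:
            victim = next(iter(resident))        # FIFO eviction
            chunks[victim].copy_(resident.pop(victim).to("cpu"))  # write back
        resident[idx] = chunks[idx].to(device, non_blocking=True, copy=True)
    return resident[idx]

for touched in [0, 1, 2, 1, 3]:                  # chunks used by successive layers
    fetch(touched).add_(1.0)                     # stand-in for compute on the chunk
print("staged chunks:", sorted(resident))
```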
{"title":"ProTrain: Efficient LLM Training via Memory-Aware Techniques","authors":"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu","doi":"arxiv-2406.08334","DOIUrl":"https://doi.org/arxiv-2406.08334","url":null,"abstract":"It is extremely memory-hungry to train Large Language Models (LLM). To solve\u0000this problem, existing work exploits the combination of CPU and GPU for the\u0000training process, such as ZeRO-Offload. Such a technique largely democratizes\u0000billion-scale model training, making it possible to train with few consumer\u0000graphics cards. However, based on our observation, existing frameworks often\u0000provide coarse-grained memory management and require experienced experts in\u0000configuration tuning, leading to suboptimal hardware utilization and\u0000performance. This paper proposes ProTrain, a novel training system that\u0000intelligently balances memory usage and performance by coordinating memory,\u0000computation, and IO. ProTrain achieves adaptive memory management through\u0000Chunk-Based Model State Management and Block-Wise Activation Management, guided\u0000by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not\u0000change the training algorithm and thus does not compromise accuracy.\u0000Experiments show that ProTrain improves training throughput by 1.43$times$ to\u00002.71$times$ compared to the SOTA training systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-hop reasoning (MHR) is a process in artificial intelligence and natural language processing where a system needs to make multiple inferential steps to arrive at a conclusion or answer. In the context of knowledge graphs or databases, it involves traversing multiple linked entities and relationships to understand complex queries or perform tasks requiring a deeper understanding. Multi-hop reasoning is a critical function in various applications, including question answering, knowledge base completion, and link prediction, and has garnered significant interest in artificial intelligence, machine learning, and graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale graphs, diverging from the traditional emphasis on accuracy, which is an orthogonal goal. We introduce a novel parallel algorithm that harnesses domain-specific learned embeddings to efficiently identify the top K paths between vertices in a knowledge graph, finding the best answers to a three-hop query. Our contributions are: (1) a new parallel algorithm that enhances MHR performance, scalability, and efficiency; (2) empirical results demonstrating the algorithm's superior performance on leading-edge Intel and AMD architectures. We showcase the algorithm's practicality through a case study on identifying academic affiliations of potential Turing Award laureates in Deep Learning, highlighting its capability to handle intricate entity relationships. This demonstrates the potential of our approach for enabling high-performance MHR, useful for navigating the growing complexity of modern knowledge graphs.
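As a toy illustration of embedding-guided, parallel top-K path search (not the paper's algorithm or data), the sketch below scores candidate three-hop paths by dot products of random vertex embeddings and expands the search from each first-hop neighbor in parallel, merging the per-branch top-K results at the end.

```python
# Illustrative sketch of parallel top-K three-hop path search guided by
# embeddings. The toy graph, random embeddings, and dot-product scoring are
# assumptions for illustration, not the paper's algorithm or data.
import heapq, random
from concurrent.futures import ThreadPoolExecutor

random.seed(0)
N, K = 50, 5
emb = {v: [random.gauss(0, 1) for _ in range(8)] for v in range(N)}
adj = {v: random.sample(range(N), 6) for v in range(N)}    # toy knowledge graph

def score(path):
    """Sum of dot products between consecutive vertices on the path."""
    return sum(sum(a * b for a, b in zip(emb[u], emb[v]))
               for u, v in zip(path, path[1:]))

def expand(first_hop, source):
    """Enumerate 3-hop paths through one first-hop neighbor; keep a local top-K."""
    paths = [(source, first_hop, v, w)
             for v in adj[first_hop] for w in adj[v]
             if len({source, first_hop, v, w}) == 4]        # simple paths only
    return heapq.nlargest(K, ((score(p), p) for p in paths))

source = 0
with ThreadPoolExecutor() as pool:                          # one branch per thread
    partials = pool.map(lambda h: expand(h, source), adj[source])
top = heapq.nlargest(K, (item for part in partials for item in part))
for s, p in top:
    print(f"score={s:6.2f}  path={p}")
```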
{"title":"Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis","authors":"Jesmin Jahan Tithi, Fabio Checconi, Fabrizio Petrini","doi":"arxiv-2406.07727","DOIUrl":"https://doi.org/arxiv-2406.07727","url":null,"abstract":"Multi-hop reasoning (MHR) is a process in artificial intelligence and natural\u0000language processing where a system needs to make multiple inferential steps to\u0000arrive at a conclusion or answer. In the context of knowledge graphs or\u0000databases, it involves traversing multiple linked entities and relationships to\u0000understand complex queries or perform tasks requiring a deeper understanding.\u0000Multi-hop reasoning is a critical function in various applications, including\u0000question answering, knowledge base completion, and link prediction. It has\u0000garnered significant interest in artificial intelligence, machine learning, and\u0000graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale\u0000graphs, diverging from the traditional emphasis on accuracy which is an\u0000orthogonal goal. We introduce a novel parallel algorithm that harnesses\u0000domain-specific learned embeddings to efficiently identify the top K paths\u0000between vertices in a knowledge graph to find the best answers to a three-hop\u0000query. Our contributions are: (1) We present a new parallel algorithm to\u0000enhance MHR performance, scalability and efficiency. (2) We demonstrate the\u0000algorithm's superior performance on leading-edge Intel and AMD architectures\u0000through empirical results. We showcase the algorithm's practicality through a case study on identifying\u0000academic affiliations of potential Turing Award laureates in Deep Learning,\u0000highlighting its capability to handle intricate entity relationships. This\u0000demonstrates the potential of our approach to enabling high-performance MHR,\u0000useful to navigate the growing complexity of modern knowledge graphs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"193 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}