Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in demand for renting cloud-based GPUs. In this cloud-computing paradigm, a user must specify their demand for GPUs at every moment in time, and will pay for every GPU-hour they use. ML training jobs are known to be parallelizable to different degrees. Given a stream of ML training jobs, a user typically wants to minimize the mean response time across all jobs. Here, the response time of a job denotes the time from when a job arrives until it is complete. Additionally, the user is constrained by some operating budget. Specifically, in this paper the user is constrained to use no more than $b$ GPUs per hour, over a long-run time average. The question is how to minimize mean response time while meeting the budget constraint. Because training jobs receive a diminishing marginal benefit from running on additional GPUs, allocating too many GPUs to a single training job can dramatically increase the overall cost paid by the user. Hence, an optimal rental policy must balance a tradeoff between training cost and mean response time. This paper derives the optimal rental policy for a stream of training jobs where the jobs have different levels of parallelizability (specified by a speedup function) and different job sizes (amounts of inherent work). We make almost no assumptions about the arrival process and about the job size distribution. Our optimal policy specifies how many GPUs to rent at every moment in time and how to allocate these GPUs.
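To make the tradeoff concrete, the toy sketch below tabulates, for a single hypothetical job, how completion time and GPU-hour cost change with the number of rented GPUs under an assumed Amdahl-style speedup function; the speedup function and job size are illustrative assumptions, not the paper's model or its optimal policy.

```python
# Illustrative only: tabulate the cost/response-time tradeoff for one job.
# The speedup function and job size below are hypothetical, not from the paper.

def speedup(k: int, p: float = 0.9) -> float:
    """Amdahl-style speedup on k GPUs with assumed parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / k)

job_size = 100.0  # inherent work, in single-GPU hours (hypothetical)

for k in (1, 2, 4, 8, 16, 32):
    completion_time = job_size / speedup(k)  # hours until the job finishes
    gpu_hours = k * completion_time          # what the user pays for
    print(f"{k:>3} GPUs: finishes in {completion_time:7.2f} h, "
          f"costs {gpu_hours:8.2f} GPU-hours")
```

Beyond a handful of GPUs, each extra GPU buys little additional speedup but keeps adding to the GPU-hour bill, which is exactly the diminishing-returns tension that the budget constraint forces a rental policy to manage.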
{"title":"How to Rent GPUs on a Budget","authors":"Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter","doi":"arxiv-2406.15560","DOIUrl":"https://doi.org/arxiv-2406.15560","url":null,"abstract":"The explosion in Machine Learning (ML) over the past ten years has led to a\u0000dramatic increase in demand for GPUs to train ML models. Because it is\u0000prohibitively expensive for most users to build and maintain a large GPU\u0000cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have\u0000seen explosive growth in demand for renting cloud-based GPUs. In this\u0000cloud-computing paradigm, a user must specify their demand for GPUs at every\u0000moment in time, and will pay for every GPU-hour they use. ML training jobs are\u0000known to be parallelizable to different degrees. Given a stream of ML training\u0000jobs, a user typically wants to minimize the mean response time across all\u0000jobs. Here, the response time of a job denotes the time from when a job arrives\u0000until it is complete. Additionally, the user is constrained by some operating\u0000budget. Specifically, in this paper the user is constrained to use no more than\u0000$b$ GPUs per hour, over a long-run time average. The question is how to\u0000minimize mean response time while meeting the budget constraint. Because\u0000training jobs receive a diminishing marginal benefit from running on additional\u0000GPUs, allocating too many GPUs to a single training job can dramatically\u0000increase the overall cost paid by the user. Hence, an optimal rental policy\u0000must balance a tradeoff between training cost and mean response time. This\u0000paper derives the optimal rental policy for a stream of training jobs where the\u0000jobs have different levels of parallelizability (specified by a speedup\u0000function) and different job sizes (amounts of inherent work). We make almost no\u0000assumptions about the arrival process and about the job size distribution. Our\u0000optimal policy specifies how many GPUs to rent at every moment in time and how\u0000to allocate these GPUs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement: under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, no single speculation length works best for all workloads under different system loads. Based on these observations, we develop SmartSpec, a dynamic framework. SmartSpec dynamically determines the best speculation length for each request, from no speculation (0 tokens) to many tokens, and hence the associated speculative execution cost, based on a new metric called goodput, which characterizes the currently observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
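As a rough illustration of the goodput idea (not SmartSpec's actual formulas), the sketch below picks a speculation length by maximizing estimated tokens emitted per unit of step time, under a simplified independent-acceptance model and a toy latency model; the cost constants and acceptance assumption are placeholders.

```python
# Hedged sketch: choose a speculation length by maximizing estimated goodput
# (tokens emitted per unit time). The acceptance and latency models below are
# simplified placeholders, not SmartSpec's actual formulas.

def expected_tokens(k: int, acc: float) -> float:
    """Expected tokens emitted per verification step with k drafted tokens,
    assuming each draft token is accepted independently with probability acc
    (the accepted prefix plus one bonus/correction token)."""
    return sum(acc ** i for i in range(k + 1))

def step_latency(k: int, batch_size: int,
                 draft_cost: float = 1.0, verify_cost: float = 5.0) -> float:
    """Toy latency model: drafting k tokens plus one batched verification pass."""
    return k * draft_cost + verify_cost * (1.0 + 0.01 * batch_size * k)

def best_speculation_length(acc: float, batch_size: int, max_k: int = 8) -> int:
    goodput = {k: expected_tokens(k, acc) / step_latency(k, batch_size)
               for k in range(max_k + 1)}
    return max(goodput, key=goodput.get)

print(best_speculation_length(acc=0.8, batch_size=4))   # light load, accurate draft
print(best_speculation_length(acc=0.4, batch_size=64))  # heavy load, weak draft
```

Under light load with an accurate draft model the estimate favors longer speculation, while under heavy load or low acceptance it shrinks toward no speculation, matching the qualitative behavior described above.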
{"title":"Optimizing Speculative Decoding for Serving Large Language Models Using Goodput","authors":"Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang","doi":"arxiv-2406.14066","DOIUrl":"https://doi.org/arxiv-2406.14066","url":null,"abstract":"Reducing the inference latency of large language models (LLMs) is crucial,\u0000and speculative decoding (SD) stands out as one of the most effective\u0000techniques. Rather than letting the LLM generate all tokens directly,\u0000speculative decoding employs effective proxies to predict potential outputs,\u0000which are then verified by the LLM without compromising the generation quality.\u0000Yet, deploying SD in real online LLM serving systems (with continuous batching)\u0000does not always yield improvement -- under higher request rates or low\u0000speculation accuracy, it paradoxically increases latency. Furthermore, there is\u0000no best speculation length work for all workloads under different system loads.\u0000Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec\u0000dynamically determines the best speculation length for each request (from 0,\u0000i.e., no speculation, to many tokens) -- hence the associated speculative\u0000execution costs -- based on a new metric called goodput, which characterizes\u0000the current observed load of the entire system and the speculation accuracy. We\u0000show that SmartSpec consistently reduces average request latency by up to 3.2x\u0000compared to non-speculative decoding baselines across different sizes of target\u0000models, draft models, request rates, and datasets. Moreover, SmartSpec can be\u0000applied to different styles of speculative decoding, including traditional,\u0000model-based approaches as well as model-free methods like prompt lookup and\u0000tree-style decoding.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"11 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online Large Language Model (LLM) services such as ChatGPT and Claude 3 have transformed business operations and academic research by effortlessly enabling new opportunities. However, due to data-sharing restrictions, sectors such as healthcare and finance prefer to deploy local LLM applications on costly hardware resources. This scenario requires balancing the effectiveness advantages of LLMs against significant financial burdens. Additionally, the rapid evolution of models increases the frequency and redundancy of benchmarking efforts. Existing benchmarking toolkits, which typically focus on effectiveness, often overlook economic considerations, making their findings less applicable to practical scenarios. To address these challenges, we introduce CEBench, an open-source toolkit specifically designed for multi-objective benchmarking that focuses on the critical trade-offs between expenditure and effectiveness required for LLM deployments. CEBench allows for easy modifications through configuration files, enabling stakeholders to effectively assess and optimize these trade-offs. This capability supports decision-making processes aimed at maximizing effectiveness while minimizing cost. By streamlining the evaluation process and emphasizing cost-effectiveness, CEBench seeks to facilitate the development of economically viable AI solutions across various industries and research fields. The code and demonstration are available at https://github.com/amademicnoboday12/CEBench.
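As a loose illustration of the expenditure-versus-effectiveness trade-off that CEBench is built to explore (this is not CEBench's API), the sketch below filters a set of hypothetical pipeline configurations down to the Pareto-optimal ones in (cost, accuracy); all names and numbers are made up.

```python
# Illustration only (not CEBench's API): keep the deployment options that are
# Pareto-optimal in (hourly cost, task accuracy). All entries are hypothetical.

results = [
    # (name, hourly cost in $, task accuracy)
    ("local-7B",    1.2, 0.71),
    ("local-13B",   2.9, 0.78),
    ("local-70B",  11.5, 0.84),
    ("hosted-api",  4.0, 0.86),
]

def pareto_front(rows):
    front = []
    for name, cost, acc in rows:
        dominated = any(c <= cost and a >= acc and (c, a) != (cost, acc)
                        for _, c, a in rows)
        if not dominated:
            front.append((name, cost, acc))
    return sorted(front, key=lambda r: r[1])

for name, cost, acc in pareto_front(results):
    print(f"{name:<12} ${cost:5.2f}/h  accuracy={acc:.2f}")
```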
{"title":"CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines","authors":"Wenbo Sun, Jiaqi Wang, Qiming Guo, Ziyu Li, Wenlu Wang, Rihan Hai","doi":"arxiv-2407.12797","DOIUrl":"https://doi.org/arxiv-2407.12797","url":null,"abstract":"Online Large Language Model (LLM) services such as ChatGPT and Claude 3 have\u0000transformed business operations and academic research by effortlessly enabling\u0000new opportunities. However, due to data-sharing restrictions, sectors such as\u0000healthcare and finance prefer to deploy local LLM applications using costly\u0000hardware resources. This scenario requires a balance between the effectiveness\u0000advantages of LLMs and significant financial burdens. Additionally, the rapid\u0000evolution of models increases the frequency and redundancy of benchmarking\u0000efforts. Existing benchmarking toolkits, which typically focus on\u0000effectiveness, often overlook economic considerations, making their findings\u0000less applicable to practical scenarios. To address these challenges, we\u0000introduce CEBench, an open-source toolkit specifically designed for\u0000multi-objective benchmarking that focuses on the critical trade-offs between\u0000expenditure and effectiveness required for LLM deployments. CEBench allows for\u0000easy modifications through configuration files, enabling stakeholders to\u0000effectively assess and optimize these trade-offs. This strategic capability\u0000supports crucial decision-making processes aimed at maximizing effectiveness\u0000while minimizing cost impacts. By streamlining the evaluation process and\u0000emphasizing cost-effectiveness, CEBench seeks to facilitate the development of\u0000economically viable AI solutions across various industries and research fields.\u0000The code and demonstration are available in\u0000url{https://github.com/amademicnoboday12/CEBench}.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141746437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He
Model checkpoints are critical Deep Learning (DL) artifacts that enable fault tolerance for training and downstream applications, such as inference. However, writing checkpoints to persistent storage, and other I/O aspects of DL training, are mostly ignored by compute-focused optimization efforts for faster training of rapidly growing models and datasets. Towards addressing this imbalance, we propose FastPersist to accelerate checkpoint creation in DL training. FastPersist combines three novel techniques: (i) NVMe optimizations for faster checkpoint writes to SSDs, (ii) efficient write parallelism using the available SSDs in training environments, and (iii) overlapping checkpointing with independent training computations. Our evaluation using real world dense and sparse DL models shows that FastPersist creates checkpoints in persistent storage up to 116x faster than baseline, and enables per-iteration checkpointing with negligible overhead.
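A minimal sketch of the overlap idea in (iii), under stated assumptions: the model state is snapshotted cheaply in memory and handed to a background thread, so the next iterations proceed while the previous checkpoint is persisted. The thread-and-pickle mechanism is a generic stand-in; FastPersist's NVMe and parallel-SSD write optimizations are not modeled here.

```python
# Minimal sketch of overlapping checkpoint writes with training computation.
# Generic stand-in only; FastPersist's NVMe/parallel-SSD optimizations are not shown.
import copy, pickle, threading

def persist(snapshot, path):
    with open(path, "wb") as f:
        pickle.dump(snapshot, f)           # slow I/O runs off the critical path

def checkpoint_async(model_state, path):
    snapshot = copy.deepcopy(model_state)  # cheap in-memory copy on the critical path
    t = threading.Thread(target=persist, args=(snapshot, path), daemon=True)
    t.start()
    return t

model_state = {"step": 0, "weights": [0.0] * 100_000}  # toy stand-in for real state
pending = None
for step in range(1, 4):
    model_state["step"] = step             # stand-in for one training iteration
    if pending is not None:
        pending.join()                     # make sure the previous write finished
    pending = checkpoint_async(model_state, f"ckpt_{step}.pkl")
pending.join()
```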
{"title":"FastPersist: Accelerating Model Checkpointing in Deep Learning","authors":"Guanhua Wang, Olatunji Ruwase, Bing Xie, Yuxiong He","doi":"arxiv-2406.13768","DOIUrl":"https://doi.org/arxiv-2406.13768","url":null,"abstract":"Model checkpoints are critical Deep Learning (DL) artifacts that enable fault\u0000tolerance for training and downstream applications, such as inference. However,\u0000writing checkpoints to persistent storage, and other I/O aspects of DL\u0000training, are mostly ignored by compute-focused optimization efforts for faster\u0000training of rapidly growing models and datasets. Towards addressing this\u0000imbalance, we propose FastPersist to accelerate checkpoint creation in DL\u0000training. FastPersist combines three novel techniques: (i) NVMe optimizations\u0000for faster checkpoint writes to SSDs, (ii) efficient write parallelism using\u0000the available SSDs in training environments, and (iii) overlapping\u0000checkpointing with independent training computations. Our evaluation using real\u0000world dense and sparse DL models shows that FastPersist creates checkpoints in\u0000persistent storage up to 116x faster than baseline, and enables per-iteration\u0000checkpointing with negligible overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"62 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
L. Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic
For many years, systems running Nvidia-based GPU architectures have dominated the heterogeneous supercomputer landscape. However, GPU chipsets manufactured by Intel and AMD have recently cut into this market and can now be found in some of the world's fastest supercomputers. The June 2023 edition of the TOP500 list of supercomputers ranks the Frontier supercomputer at Oak Ridge National Laboratory in Tennessee as the top system in the world. This system features AMD Instinct MI250X GPUs and is currently the only true exascale computer in the world. The first framework that enabled support for heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009. Since then, a number of frameworks have been developed to support vendor-agnostic heterogeneous environments, including OpenMP, OpenCL, Kokkos, and SYCL. SYCL, which combines the concepts of OpenCL with the flexibility of single-source C++, is one of the more promising programming models for heterogeneous computing devices. One key advantage of this framework is that it provides a higher-level programming interface that abstracts away more of the hardware details than the other frameworks do. This makes SYCL easier to learn and to maintain across multiple architectures and vendors. In recent years, there has been growing interest in using heterogeneous computing architectures to accelerate molecular dynamics simulations. Some of the more popular molecular dynamics packages include Amber, NAMD, and GROMACS. However, to the best of our knowledge, only GROMACS has been successfully ported to SYCL to date. In this paper, we compare the performance of GROMACS compiled with the SYCL and CUDA frameworks on a variety of standard GROMACS benchmarks. In addition, we compare its performance across three Nvidia GPU chipsets: the P100, V100, and A100.
{"title":"A Comparison of the Performance of the Molecular Dynamics Simulation Package GROMACS Implemented in the SYCL and CUDA Programming Models","authors":"L. Apanasevich, Yogesh Kale, Himanshu Sharma, Ana Marija Sokovic","doi":"arxiv-2406.10362","DOIUrl":"https://doi.org/arxiv-2406.10362","url":null,"abstract":"For many years, systems running Nvidia-based GPU architectures have dominated\u0000the heterogeneous supercomputer landscape. However, recently GPU chipsets\u0000manufactured by Intel and AMD have cut into this market and can now be found in\u0000some of the worlds fastest supercomputers. The June 2023 edition of the TOP500\u0000list of supercomputers ranks the Frontier supercomputer at the Oak Ridge\u0000National Laboratory in Tennessee as the top system in the world. This system\u0000features AMD Instinct 250 X GPUs and is currently the only true exascale\u0000computer in the world.The first framework that enabled support for\u0000heterogeneous platforms across multiple hardware vendors was OpenCL, in 2009.\u0000Since then a number of frameworks have been developed to support vendor\u0000agnostic heterogeneous environments including OpenMP, OpenCL, Kokkos, and SYCL.\u0000SYCL, which combines the concepts of OpenCL with the flexibility of\u0000single-source C++, is one of the more promising programming models for\u0000heterogeneous computing devices. One key advantage of this framework is that it\u0000provides a higher-level programming interface that abstracts away many of the\u0000hardware details than the other frameworks. This makes SYCL easier to learn and\u0000to maintain across multiple architectures and vendors. In n recent years, there\u0000has been growing interest in using heterogeneous computing architectures to\u0000accelerate molecular dynamics simulations. Some of the more popular molecular\u0000dynamics simulations include Amber, NAMD, and Gromacs. However, to the best of\u0000our knowledge, only Gromacs has been successfully ported to SYCL to date. In\u0000this paper, we compare the performance of GROMACS compiled using the SYCL and\u0000CUDA frameworks for a variety of standard GROMACS benchmarks. In addition, we\u0000compare its performance across three different Nvidia GPU chipsets, P100, V100,\u0000and A100.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141507214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper we propose a dynamic model of Common Cause Failures (CCF) that generates common cause events over time. The proposed model is a generalization of the Binomial Failure Rate model (Atwood model) that can generate staggered failures of multiple components due to a common cause. We implement the model using the statechart formalism; a similar implementation can be adopted in other modeling languages such as Petri nets or hybrid stochastic automata. The presented model was integrated into a Dynamic PRA study.
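For intuition, the Monte Carlo sketch below follows the classic Binomial Failure Rate setup (common cause shocks arrive as a Poisson process and each component fails independently with some probability at each shock) and adds a random per-component delay so failures triggered by one shock appear staggered in time. The exponential delay is an assumed stagger mechanism for illustration only, not the paper's exact generalization.

```python
# Illustrative Monte Carlo sketch of staggered common cause failures.
# Shock arrivals and per-shock failures follow the Binomial Failure Rate
# (Atwood) setup; the exponential per-component delay is an assumed stagger
# mechanism, not the paper's exact model.
import random

def simulate_ccf(n_components=4, shock_rate=0.01, p_fail=0.5,
                 mean_stagger=2.0, horizon=1000.0, seed=1):
    rng = random.Random(seed)
    events = []                                  # (failure time, component id)
    t = 0.0
    while True:
        t += rng.expovariate(shock_rate)         # next common cause shock
        if t > horizon:
            break
        for comp in range(n_components):
            if rng.random() < p_fail:            # component affected by this shock
                delay = rng.expovariate(1.0 / mean_stagger)
                events.append((t + delay, comp))
    return sorted(events)

for when, comp in simulate_ccf()[:10]:
    print(f"t={when:8.2f}  component {comp} fails")
```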
{"title":"Modeling Common Cause Failure in Dynamic PRA","authors":"Claudia PicocoEDF R&D, Valentin RychkovEDF R&D","doi":"arxiv-2406.08879","DOIUrl":"https://doi.org/arxiv-2406.08879","url":null,"abstract":"In this paper we propose a dynamic model of Common Cause Failures (CCF) that\u0000allows to generate common cause events in time. The proposed model is a\u0000generalization of Binomial Failure Rate Model (Atwood model) that can generate\u0000staggered failures of multiple components due to a common cause. We implement\u0000the model using statechart formalism, a similar implementation can be adopted\u0000in other modeling languages like Petri Nets or Hybrid Stochastic Automata. The\u0000presented model was integrated in a Dynamic PRA study.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"98 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Jonathan Gietz, Konstantin Lübeck, Oliver Bringmann
Statistical models are widely used to estimate the performance of commercial off-the-shelf (COTS) AI hardware accelerators. However, training statistical performance models often requires vast amounts of data, which demands a significant time investment and can be difficult when hardware availability is limited. To alleviate this problem, we propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy. Our approach leverages knowledge of the target hardware architecture and initial parameter sweeps to identify a set of Performance Representatives (PR) for deep neural network (DNN) layers. These PRs are then used for benchmarking, building a statistical performance model, and making estimations. This targeted approach drastically reduces the number of training samples needed, compared to random sampling, while achieving better estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-DNN estimations with fewer than 10,000 training samples. The results demonstrate the superiority of our method for single-layer estimations compared to models trained with randomly sampled datasets of the same size.
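The sketch below illustrates the overall flow under stated assumptions: sweep layer configurations, select one representative per group (a simple MACs-based bucketing stands in for the paper's architecture-aware selection), benchmark only the representatives, and fit a simple statistical model used to estimate all remaining configurations.

```python
# Hedged sketch of the Performance Representatives idea: benchmark a few
# representative layer configurations and fit a model to estimate the rest.
# The bucketing rule, the linear latency model, and the synthetic "measurement"
# are stand-ins for the paper's architecture-aware selection and real benchmarks.
import random

random.seed(0)
configs = [{"macs": random.randint(1, 512) * 1_000_000} for _ in range(200)]

def measure_latency(cfg):
    """Stand-in for running the layer on the accelerator (returns ms)."""
    return 0.004 * cfg["macs"] / 1e6 + random.uniform(-0.05, 0.05) + 0.3

# 1) Pick representatives: one configuration per MACs bucket, not random samples.
buckets = {}
for cfg in configs:
    buckets.setdefault(cfg["macs"] // 64_000_000, cfg)
representatives = list(buckets.values())

# 2) Benchmark only the representatives and fit latency = a * MACs + b.
xs = [cfg["macs"] / 1e6 for cfg in representatives]
ys = [measure_latency(cfg) for cfg in representatives]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - a * x_mean

# 3) Estimate any other configuration without benchmarking it.
print(f"benchmarked {len(representatives)} representatives out of {len(configs)} configs")
print(f"estimated latency for a 300 MMAC layer: {a * 300 + b:.3f} ms")
```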
{"title":"It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives","authors":"Alexander Louis-Ferdinand Jung, Jannik Steinmetz, Jonathan Gietz, Konstantin Lübeck, Oliver Bringmann","doi":"arxiv-2406.08330","DOIUrl":"https://doi.org/arxiv-2406.08330","url":null,"abstract":"Statistical models are widely used to estimate the performance of commercial\u0000off-the-shelf (COTS) AI hardware accelerators. However, training of statistical\u0000performance models often requires vast amounts of data, leading to a\u0000significant time investment and can be difficult in case of limited hardware\u0000availability. To alleviate this problem, we propose a novel performance\u0000modeling methodology that significantly reduces the number of training samples\u0000while maintaining good accuracy. Our approach leverages knowledge of the target\u0000hardware architecture and initial parameter sweeps to identify a set of\u0000Performance Representatives (PR) for deep neural network (DNN) layers. These\u0000PRs are then used for benchmarking, building a statistical performance model,\u0000and making estimations. This targeted approach drastically reduces the number\u0000of training samples needed, opposed to random sampling, to achieve a better\u0000estimation accuracy. We achieve a Mean Absolute Percentage Error (MAPE) of as\u0000low as 0.02% for single-layer estimations and 0.68% for whole DNN estimations\u0000with less than 10000 training samples. The results demonstrate the superiority\u0000of our method for single-layer estimations compared to models trained with\u0000randomly sampled datasets of the same size.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"19 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim
As DNNs are widely adopted in various application domains while demanding increasingly higher compute and memory requirements, designing efficient and performant NPUs (Neural Processing Units) is becoming more important. However, existing architectural NPU simulators lack support for high-speed simulation, multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or different deep learning frameworks. To address these limitations, this work proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN serving systems. For ease of simulation, it takes DNN models represented in the ONNX graph format generated from various deep learning frameworks. In addition, based on the observation that typical NPU cores process tensor tiles from on-chip scratchpad memory with deterministic compute latency, we forgo detailed modeling of the computation while still preserving simulation accuracy. ONNXim also preserves dependencies between compute and tile DMAs. Meanwhile, the DRAM and NoC are modeled at cycle level to properly capture contention among multiple cores that can execute different DNN models for multi-tenancy. Consequently, ONNXim is significantly faster than existing simulators (e.g., by up to 384x over Accel-sim) and enables various case studies, such as multi-tenant NPUs, that were previously impractical due to slow speed and/or lack of functionality. ONNXim is publicly available at https://github.com/PSAL-POSTECH/ONNXim.
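Since ONNXim consumes models in the ONNX graph format, a typical workflow starts by exporting the DNN from its training framework; the minimal PyTorch sketch below produces such a file. The toy model and file name are hypothetical, and how the exported file is then passed to the simulator is documented in the ONNXim repository rather than shown here.

```python
# Minimal sketch: export a toy PyTorch model to the ONNX format that the
# simulator takes as input. Model, shapes, and file name are hypothetical.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).eval()

dummy_input = torch.randn(1, 3, 224, 224)   # batch size is fixed by the trace
torch.onnx.export(model, dummy_input, "toy_cnn.onnx", opset_version=17)
print("wrote toy_cnn.onnx")                 # point the simulator at this file
```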
{"title":"ONNXim: A Fast, Cycle-level Multi-core NPU Simulator","authors":"Hyungkyu Ham, Wonhyuk Yang, Yunseon Shin, Okkyun Woo, Guseul Heo, Sangyeop Lee, Jongse Park, Gwangsun Kim","doi":"arxiv-2406.08051","DOIUrl":"https://doi.org/arxiv-2406.08051","url":null,"abstract":"As DNNs are widely adopted in various application domains while demanding\u0000increasingly higher compute and memory requirements, designing efficient and\u0000performant NPUs (Neural Processing Units) is becoming more important. However,\u0000existing architectural NPU simulators lack support for high-speed simulation,\u0000multi-core modeling, multi-tenant scenarios, detailed DRAM/NoC modeling, and/or\u0000different deep learning frameworks. To address these limitations, this work\u0000proposes ONNXim, a fast cycle-level simulator for multi-core NPUs in DNN\u0000serving systems. It takes DNN models represented in the ONNX graph format\u0000generated from various deep learning frameworks for ease of simulation. In\u0000addition, based on the observation that typical NPU cores process tensor tiles\u0000from on-chip scratchpad memory with deterministic compute latency, we forgo a\u0000detailed modeling for the computation while still preserving simulation\u0000accuracy. ONNXim also preserves dependencies between compute and tile DMAs.\u0000Meanwhile, the DRAM and NoC are modeled in cycle-level to properly model\u0000contention among multiple cores that can execute different DNN models for\u0000multi-tenancy. Consequently, ONNXim is significantly faster than existing\u0000simulators (e.g., by up to 384x over Accel-sim) and enables various case\u0000studies, such as multi-tenant NPUs, that were previously impractical due to\u0000slow speed and/or lack of functionalities. ONNXim is publicly available at\u0000https://github.com/PSAL-POSTECH/ONNXim.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu
Training Large Language Models (LLMs) is extremely memory-hungry. To address this problem, existing work such as ZeRO-Offload exploits the combination of CPU and GPU memory for the training process. Such techniques largely democratize billion-scale model training, making it possible to train with a few consumer graphics cards. However, based on our observations, existing frameworks often provide coarse-grained memory management and require experienced experts for configuration tuning, leading to suboptimal hardware utilization and performance. This paper proposes ProTrain, a novel training system that intelligently balances memory usage and performance by coordinating memory, computation, and I/O. ProTrain achieves adaptive memory management through Chunk-Based Model State Management and Block-Wise Activation Management, guided by a Memory-Aware Runtime Profiler, without user intervention. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43x to 2.71x compared to state-of-the-art training systems.
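The sketch below illustrates the general chunk-based offloading idea in a framework-agnostic way: parameters are grouped into fixed-size chunks kept in CPU memory and staged onto the GPU only while their layer needs them, with evicted chunks written back. The chunk size, the FIFO eviction rule, and the access pattern are illustrative assumptions, not ProTrain's actual policy.

```python
# Hedged sketch of chunk-based model state management with CPU-GPU staging.
# Chunk size, FIFO eviction, and the access pattern are assumptions for
# illustration, not ProTrain's actual policy.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
CHUNK_ELEMS = 1_000_000
MAX_GPU_CHUNKS = 2

# Model state as CPU-resident chunks (a stand-in for real parameter groups).
chunks = [torch.zeros(CHUNK_ELEMS, pin_memory=torch.cuda.is_available())
          for _ in range(8)]
resident = {}                                    # chunk index -> staged copy

def fetch(idx):
    """Stage a chunk onto the compute device, evicting the oldest if needed."""
    if idx not in resident:
        if len(resident) >= MAX_GPU_CHUNKS:
            victim = next(iter(resident))        # FIFO eviction
            chunks[victim].copy_(resident.pop(victim).to("cpu"))  # write back
        resident[idx] = chunks[idx].to(device, non_blocking=True, copy=True)
    return resident[idx]

for touched in [0, 1, 2, 1, 3]:                  # chunks used by successive layers
    fetch(touched).add_(1.0)                     # stand-in for compute on the chunk
print("staged chunks:", sorted(resident))
```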
{"title":"ProTrain: Efficient LLM Training via Memory-Aware Techniques","authors":"Hanmei Yang, Jin Zhou, Yao Fu, Xiaoqun Wang, Ramine Roane, Hui Guan, Tongping Liu","doi":"arxiv-2406.08334","DOIUrl":"https://doi.org/arxiv-2406.08334","url":null,"abstract":"It is extremely memory-hungry to train Large Language Models (LLM). To solve\u0000this problem, existing work exploits the combination of CPU and GPU for the\u0000training process, such as ZeRO-Offload. Such a technique largely democratizes\u0000billion-scale model training, making it possible to train with few consumer\u0000graphics cards. However, based on our observation, existing frameworks often\u0000provide coarse-grained memory management and require experienced experts in\u0000configuration tuning, leading to suboptimal hardware utilization and\u0000performance. This paper proposes ProTrain, a novel training system that\u0000intelligently balances memory usage and performance by coordinating memory,\u0000computation, and IO. ProTrain achieves adaptive memory management through\u0000Chunk-Based Model State Management and Block-Wise Activation Management, guided\u0000by a Memory-Aware Runtime Profiler without user intervention. ProTrain does not\u0000change the training algorithm and thus does not compromise accuracy.\u0000Experiments show that ProTrain improves training throughput by 1.43$times$ to\u00002.71$times$ compared to the SOTA training systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141530851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-hop reasoning (MHR) is a process in artificial intelligence and natural language processing where a system needs to make multiple inferential steps to arrive at a conclusion or answer. In the context of knowledge graphs or databases, it involves traversing multiple linked entities and relationships to understand complex queries or perform tasks requiring a deeper understanding. Multi-hop reasoning is a critical function in various applications, including question answering, knowledge base completion, and link prediction, and has garnered significant interest in artificial intelligence, machine learning, and graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale graphs, diverging from the traditional emphasis on accuracy, which is an orthogonal goal. We introduce a novel parallel algorithm that harnesses domain-specific learned embeddings to efficiently identify the top K paths between vertices in a knowledge graph, finding the best answers to a three-hop query. Our contributions are: (1) a new parallel algorithm that enhances MHR performance, scalability, and efficiency; (2) empirical results demonstrating the algorithm's superior performance on leading-edge Intel and AMD architectures. We showcase the algorithm's practicality through a case study on identifying academic affiliations of potential Turing Award laureates in Deep Learning, highlighting its capability to handle intricate entity relationships. This demonstrates the potential of our approach for enabling high-performance MHR, useful for navigating the growing complexity of modern knowledge graphs.
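As a toy illustration of embedding-guided, parallel top-K path search (not the paper's algorithm or data), the sketch below scores candidate three-hop paths by dot products of random vertex embeddings and expands the search from each first-hop neighbor in parallel, merging the per-branch top-K results at the end.

```python
# Illustrative sketch of parallel top-K three-hop path search guided by
# embeddings. The toy graph, random embeddings, and dot-product scoring are
# assumptions for illustration, not the paper's algorithm or data.
import heapq, random
from concurrent.futures import ThreadPoolExecutor

random.seed(0)
N, K = 50, 5
emb = {v: [random.gauss(0, 1) for _ in range(8)] for v in range(N)}
adj = {v: random.sample(range(N), 6) for v in range(N)}    # toy knowledge graph

def score(path):
    """Sum of dot products between consecutive vertices on the path."""
    return sum(sum(a * b for a, b in zip(emb[u], emb[v]))
               for u, v in zip(path, path[1:]))

def expand(first_hop, source):
    """Enumerate 3-hop paths through one first-hop neighbor; keep a local top-K."""
    paths = [(source, first_hop, v, w)
             for v in adj[first_hop] for w in adj[v]
             if len({source, first_hop, v, w}) == 4]        # simple paths only
    return heapq.nlargest(K, ((score(p), p) for p in paths))

source = 0
with ThreadPoolExecutor() as pool:                          # one branch per thread
    partials = pool.map(lambda h: expand(h, source), adj[source])
top = heapq.nlargest(K, (item for part in partials for item in part))
for s, p in top:
    print(f"score={s:6.2f}  path={p}")
```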
{"title":"Efficient Parallel Multi-Hop Reasoning: A Scalable Approach for Knowledge Graph Analysis","authors":"Jesmin Jahan Tithi, Fabio Checconi, Fabrizio Petrini","doi":"arxiv-2406.07727","DOIUrl":"https://doi.org/arxiv-2406.07727","url":null,"abstract":"Multi-hop reasoning (MHR) is a process in artificial intelligence and natural\u0000language processing where a system needs to make multiple inferential steps to\u0000arrive at a conclusion or answer. In the context of knowledge graphs or\u0000databases, it involves traversing multiple linked entities and relationships to\u0000understand complex queries or perform tasks requiring a deeper understanding.\u0000Multi-hop reasoning is a critical function in various applications, including\u0000question answering, knowledge base completion, and link prediction. It has\u0000garnered significant interest in artificial intelligence, machine learning, and\u0000graph analytics. This paper focuses on optimizing MHR for time efficiency on large-scale\u0000graphs, diverging from the traditional emphasis on accuracy which is an\u0000orthogonal goal. We introduce a novel parallel algorithm that harnesses\u0000domain-specific learned embeddings to efficiently identify the top K paths\u0000between vertices in a knowledge graph to find the best answers to a three-hop\u0000query. Our contributions are: (1) We present a new parallel algorithm to\u0000enhance MHR performance, scalability and efficiency. (2) We demonstrate the\u0000algorithm's superior performance on leading-edge Intel and AMD architectures\u0000through empirical results. We showcase the algorithm's practicality through a case study on identifying\u0000academic affiliations of potential Turing Award laureates in Deep Learning,\u0000highlighting its capability to handle intricate entity relationships. This\u0000demonstrates the potential of our approach to enabling high-performance MHR,\u0000useful to navigate the growing complexity of modern knowledge graphs.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"193 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141516734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}