The time-consuming nature of training and deploying complex Machine Learning (ML) and Deep Learning (DL) models continues to pose significant challenges across a variety of applications. These challenges are particularly pronounced in the federated domain, where optimising models for individual nodes is especially difficult. Many methods have been developed to tackle this problem, aiming to reduce training cost and time while maintaining efficient optimisation. Three strategies suited to this challenge are Active Learning, Knowledge Distillation, and Local Memorization. These methods enable the adoption of smaller models that require fewer computational resources and allow models to be personalised with local insights, thereby improving the effectiveness of existing models. The present study examines the fundamental principles of these three approaches and proposes an advanced Federated Learning system that applies different personalisation methods to improve the accuracy of AI models and enhance user experience in real-time NG-IoT applications, investigating the efficacy of these techniques in both the local and federated domain. The results of the original and optimised models are then compared in both local and federated contexts through a comparative analysis. The post-analysis shows encouraging outcomes for optimising and personalising the models with the suggested techniques.
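The abstract names Knowledge Distillation as one of the three personalisation techniques without detailing it. As a hedged illustration, the sketch below shows a standard distillation loss (temperature-scaled KL divergence blended with cross-entropy) that a small local model could minimise against a larger global model; the temperature and weighting values are assumed hyperparameters, and this is not the paper's specific formulation.

```python
# Minimal knowledge-distillation loss sketch (PyTorch); a generic formulation,
# not the specific personalisation pipeline described in the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence (scaled by T^2) with hard-label cross-entropy."""
    soft_targets = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_targets, teacher_probs, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example: a small local (student) model learning from a larger global (teacher) model.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```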
{"title":"Applied Federated Model Personalisation in the Industrial Domain: A Comparative Study","authors":"Ilias Siniosoglou, Vasileios Argyriou, George Fragulis, Panagiotis Fouliras, Georgios Th. Papadopoulos, Anastasios Lytos, Panagiotis Sarigiannidis","doi":"arxiv-2409.06904","DOIUrl":"https://doi.org/arxiv-2409.06904","url":null,"abstract":"The time-consuming nature of training and deploying complicated Machine and\u0000Deep Learning (DL) models for a variety of applications continues to pose\u0000significant challenges in the field of Machine Learning (ML). These challenges\u0000are particularly pronounced in the federated domain, where optimizing models\u0000for individual nodes poses significant difficulty. Many methods have been\u0000developed to tackle this problem, aiming to reduce training expenses and time\u0000while maintaining efficient optimisation. Three suggested strategies to tackle\u0000this challenge include Active Learning, Knowledge Distillation, and Local\u0000Memorization. These methods enable the adoption of smaller models that require\u0000fewer computational resources and allow for model personalization with local\u0000insights, thereby improving the effectiveness of current models. The present\u0000study delves into the fundamental principles of these three approaches and\u0000proposes an advanced Federated Learning System that utilises different\u0000Personalisation methods towards improving the accuracy of AI models and\u0000enhancing user experience in real-time NG-IoT applications, investigating the\u0000efficacy of these techniques in the local and federated domain. The results of\u0000the original and optimised models are then compared in both local and federated\u0000contexts using a comparison analysis. The post-analysis shows encouraging\u0000outcomes when it comes to optimising and personalising the models with the\u0000suggested techniques.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini
Robustly estimating energy consumption in High-Performance Computing (HPC) is essential for assessing the energy footprint of modern workloads, particularly in fields such as Artificial Intelligence (AI) research, development, and deployment. The extensive use of supercomputers for AI training has heightened concerns about energy consumption and carbon emissions. Existing energy estimation tools often assume exclusive use of computing nodes, a premise that becomes problematic with the advent of supercomputers integrating microservices, as seen in initiatives like Acceleration as a Service (XaaS) and cloud computing. This work investigates the impact of executed instructions on overall power consumption, providing insights into the comprehensive behaviour of HPC systems. We introduce two novel mathematical models to estimate a process's energy consumption based on the total node energy, process usage, and a normalised vector of the probability distribution of instruction types for CPU and GPU processes. Our approach enables energy accounting for specific processes without the need for isolation. Our models demonstrate high accuracy, predicting CPU power consumption with a mere 1.9% error. For GPU predictions, the models achieve a central relative error of 9.7%, showing a clear tendency to fit the test data accurately. These results pave the way for new tools to measure and account for energy consumption in shared supercomputing environments.
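The abstract does not reproduce the two mathematical models, so the following is only a plausible sketch, assuming a simple linear attribution: the node energy is scaled by the process's usage share and by an intensity factor derived from the normalised instruction-type distribution. The instruction categories and weights are illustrative placeholders, not coefficients from the paper.

```python
# Hedged sketch: attribute node-level energy to a process using its usage share
# and a normalised instruction-type distribution. The per-instruction-type
# weights below are illustrative placeholders, not values from the paper.
import numpy as np

def process_energy(node_energy_j, usage_share, instr_distribution, instr_weights):
    """Estimate process energy (J) from total node energy, the process's share of
    node activity, and a normalised vector over instruction types."""
    p = np.asarray(instr_distribution, dtype=float)
    p = p / p.sum()                      # ensure the distribution is normalised
    w = np.asarray(instr_weights, dtype=float)
    intensity = float(p @ w)             # energy intensity implied by the instruction mix
    return node_energy_j * usage_share * intensity

# Example with made-up numbers: 500 kJ node energy, 30% usage share,
# instruction mix over [scalar, vector, memory, branch] with assumed weights.
e = process_energy(5.0e5, 0.30, [0.4, 0.3, 0.2, 0.1], [0.9, 1.3, 1.1, 0.7])
print(f"estimated process energy: {e:.1f} J")
```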
{"title":"A Comprehensive Analysis of Process Energy Consumption on Multi-Socket Systems with GPUs","authors":"Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini","doi":"arxiv-2409.04941","DOIUrl":"https://doi.org/arxiv-2409.04941","url":null,"abstract":"Robustly estimating energy consumption in High-Performance Computing (HPC) is\u0000essential for assessing the energy footprint of modern workloads, particularly\u0000in fields such as Artificial Intelligence (AI) research, development, and\u0000deployment. The extensive use of supercomputers for AI training has heightened\u0000concerns about energy consumption and carbon emissions. Existing energy\u0000estimation tools often assume exclusive use of computing nodes, a premise that\u0000becomes problematic with the advent of supercomputers integrating\u0000microservices, as seen in initiatives like Acceleration as a Service (XaaS) and\u0000cloud computing. This work investigates the impact of executed instructions on overall power\u0000consumption, providing insights into the comprehensive behaviour of HPC\u0000systems. We introduce two novel mathematical models to estimate a process's\u0000energy consumption based on the total node energy, process usage, and a\u0000normalised vector of the probability distribution of instruction types for CPU\u0000and GPU processes. Our approach enables energy accounting for specific\u0000processes without the need for isolation. Our models demonstrate high accuracy, predicting CPU power consumption with a\u0000mere 1.9% error. For GPU predictions, the models achieve a central relative\u0000error of 9.7%, showing a clear tendency to fit the test data accurately. These\u0000results pave the way for new tools to measure and account for energy\u0000consumption in shared supercomputing environments.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various models and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results show that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily due to data transfer. For most typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing near-zero overhead.
{"title":"Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study","authors":"Jianwei Zhu, Hang Yin, Shunfan Zhou","doi":"arxiv-2409.03992","DOIUrl":"https://doi.org/arxiv-2409.03992","url":null,"abstract":"This report evaluates the performance impact of enabling Trusted Execution\u0000Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference\u0000tasks. We benchmark the overhead introduced by TEE mode across various models\u0000and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers\u0000via PCIe. Our results show that while there is minimal computational overhead\u0000within the GPU, the overall performance penalty is primarily due to data\u0000transfer. For most typical LLM queries, the overhead remains below 5%, with\u0000larger models and longer sequences experiencing near-zero overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"176 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manuel de Castro, Francisco J. Andújar, Roberto R. Osorio, Rocío Carratalá-Sáez, Diego R. Llanos
As interest in FPGA-based accelerators for HPC applications increases, new challenges arise, especially concerning programming and portability issues. This paper aims to provide a snapshot of the current state of FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing FPGA solutions for HPC (SYCL and OpenCL) when using them to port a highly parallel application to FPGAs, using both ND-range and single-task kernels. The general recommendation for developers targeting FPGAs is to write single-task kernels, as these are commonly regarded as better suited to such hardware. However, we discovered that, when using high-level approaches such as OpenCL and SYCL to program a highly parallel application with no FPGA-tailored optimizations, ND-range kernels significantly outperform single-task codes. Specifically, while SYCL struggles to produce efficient FPGA implementations of applications described as single-task codes, its performance excels with ND-range kernels, an unexpectedly favorable result.
{"title":"Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL","authors":"Manuel de Castro, Francisco J. andújar, Roberto R. Osorio, Rocío Carratalá-Sáez, Diego R. Llanos","doi":"arxiv-2409.03391","DOIUrl":"https://doi.org/arxiv-2409.03391","url":null,"abstract":"As the interest in FPGA-based accelerators for HPC applications increases,\u0000new challenges also arise, especially concerning different programming and\u0000portability issues. This paper aims to provide a snapshot of the current state\u0000of the FPGA tooling and its problems. To do so, we evaluate the performance\u0000portability of two frameworks for developing FPGA solutions for HPC (SYCL and\u0000OpenCL) when using them to port a highly-parallel application to FPGAs, using\u0000both ND-range and single-task type of kernels. The developer's general recommendation when using FPGAs is to develop\u0000single-task kernels for them, as they are commonly regarded as more suited for\u0000such hardware. However, we discovered that, when using high-level approaches\u0000such as OpenCL and SYCL to program a highly-parallel application with no\u0000FPGA-tailored optimizations, ND-range kernels significantly outperform\u0000single-task codes. Specifically, while SYCL struggles to produce efficient FPGA\u0000implementations of applications described as single-task codes, its performance\u0000excels with ND-range kernels, a result that was unexpectedly favorable.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhe Wang, Zhen Wang, Jianwen Wu, Wangzhong Xiao, Yidong Chen, Zihua Feng, Dian Yang, Hongchen Liu, Bo Liang, Jiaojiao Fu
In order to accurately identify the performance status of mobile devices and finely adjust the user experience, a real-time performance perception and evaluation method was studied, based on TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) combined with the entropy weighting method and time-series model construction. After collecting the performance characteristics of various mobile devices, the device performance profile was fitted using PCA (principal component analysis) dimensionality reduction and feature-engineering methods such as descriptive time-series analysis. The ability of performance features and profiles to describe the real-time performance status of devices was studied by applying the TOPSIS method with multi-level weighting. A time-series model was constructed for the feature set under objective weighting, and performance-status perception results at multiple sensitivities (real-time, short-term, long-term) were produced, yielding real-time performance evaluation data and long-term, stable performance prediction data. Finally, by configuring dynamic A/B experiments and overlaying fine-grained power-reduction strategies, the usability of the method was verified, and the accuracy of device performance-status identification and prediction was compared against alternative profile-feature approaches, including dimensionality-reduction time-series modeling, the TOPSIS method with entropy weighting, subjective weighting, and the HMA method. The results show that accurate real-time performance perception can greatly enhance business value, and that this research is effective in application and has forward-looking significance.
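As a minimal sketch of the core scoring machinery named in the abstract, the code below implements entropy-weighted TOPSIS over a small decision matrix; the criteria, their benefit/cost directions, and the example values are assumptions for illustration and do not reproduce the paper's PCA-reduced feature set or time-series models.

```python
# Hedged sketch of entropy-weighted TOPSIS scoring for device performance status.
import numpy as np

def entropy_weights(X):
    """Objective criterion weights from Shannon entropy of the normalised columns."""
    P = X / X.sum(axis=0)
    P = np.clip(P, 1e-12, None)
    e = -(P * np.log(P)).sum(axis=0) / np.log(X.shape[0])
    d = 1.0 - e
    return d / d.sum()

def topsis_scores(X, benefit):
    """Closeness to the ideal solution; benefit[j] is True for higher-is-better criteria."""
    R = X / np.linalg.norm(X, axis=0)            # vector-normalise each criterion
    V = R * entropy_weights(X)                   # apply entropy-derived weights
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - worst, axis=1)
    return d_neg / (d_pos + d_neg)

# Example: 4 devices x 3 illustrative criteria (frame rate up, temperature down, free memory up).
X = np.array([[60, 38, 2.1], [45, 44, 1.2], [55, 41, 1.8], [30, 48, 0.8]], dtype=float)
print(topsis_scores(X, benefit=np.array([True, False, True])))
```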
{"title":"Application Research On Real-Time Perception Of Device Performance Status","authors":"Zhe Wang, Zhen Wang, Jianwen Wu, Wangzhong Xiao, Yidong Chen, Zihua Feng, Dian Yang, Hongchen Liu, Bo Liang, Jiaojiao Fu","doi":"arxiv-2409.03218","DOIUrl":"https://doi.org/arxiv-2409.03218","url":null,"abstract":"In order to accurately identify the performance status of mobile devices and\u0000finely adjust the user experience, a real-time performance perception\u0000evaluation method based on TOPSIS (Technique for Order Preference by Similarity\u0000to Ideal Solution) combined with entropy weighting method and time series model\u0000construction was studied. After collecting the performance characteristics of\u0000various mobile devices, the device performance profile was fitted by using PCA\u0000(principal component analysis) dimensionality reduction and feature engineering\u0000methods such as descriptive time series analysis. The ability of performance\u0000features and profiles to describe the real-time performance status of devices\u0000was understood and studied by applying the TOPSIS method and multi-level\u0000weighting processing. A time series model was constructed for the feature set\u0000under objective weighting, and multiple sensitivity (real-time, short-term,\u0000long-term) performance status perception results were provided to obtain\u0000real-time performance evaluation data and long-term stable performance\u0000prediction data. Finally, by configuring dynamic AB experiments and overlaying\u0000fine-grained power reduction strategies, the usability of the method was\u0000verified, and the accuracy of device performance status identification and\u0000prediction was compared with the performance of the profile features including\u0000dimensionality reduction time series modeling, TOPSIS method and entropy\u0000weighting method, subjective weighting, HMA method. The results show that\u0000accurate real-time performance perception results can greatly enhance business\u0000value, and this research has application effectiveness and certain\u0000forward-looking significance.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI+X has been the de facto standard for distributed memory parallel programming. It is widely used primarily as an explicit two-sided communication model, which often leads to complex and error-prone code. Alternatively, the PGAS model utilizes efficient one-sided communication and more intuitive communication primitives. In this paper, we present a novel approach that integrates PGAS concepts into the OpenMP programming model, leveraging the LLVM compiler infrastructure and the GASNet-EX communication library. Our model addresses the complexity associated with traditional MPI+OpenMP programming models while ensuring excellent performance and scalability. We evaluate our approach using a set of micro-benchmarks and application kernels on two distinct platforms: Ookami from Stony Brook University and NERSC Perlmutter. The results demonstrate that DiOMP achieves superior bandwidth and lower latency compared to MPI+OpenMP: up to 25% higher bandwidth and up to 45% lower latency. DiOMP offers a promising alternative to the traditional MPI+OpenMP hybrid programming model, providing a more productive and efficient way to develop high-performance parallel applications for distributed memory systems.
{"title":"Towards a Scalable and Efficient PGAS-based Distributed OpenMP","authors":"Baodi Shan, Mauricio Araya-Polo, Barbara Chapman","doi":"arxiv-2409.02830","DOIUrl":"https://doi.org/arxiv-2409.02830","url":null,"abstract":"MPI+X has been the de facto standard for distributed memory parallel\u0000programming. It is widely used primarily as an explicit two-sided communication\u0000model, which often leads to complex and error-prone code. Alternatively, PGAS\u0000model utilizes efficient one-sided communication and more intuitive\u0000communication primitives. In this paper, we present a novel approach that\u0000integrates PGAS concepts into the OpenMP programming model, leveraging the LLVM\u0000compiler infrastructure and the GASNet-EX communication library. Our model\u0000addresses the complexity associated with traditional MPI+OpenMP programming\u0000models while ensuring excellent performance and scalability. We evaluate our\u0000approach using a set of micro-benchmarks and application kernels on two\u0000distinct platforms: Ookami from Stony Brook University and NERSC Perlmutter.\u0000The results demonstrate that DiOMP achieves superior bandwidth and lower\u0000latency compared to MPI+OpenMP, up to 25% higher bandwidth and down to 45% on\u0000latency. DiOMP offers a promising alternative to the traditional MPI+OpenMP\u0000hybrid programming model, towards providing a more productive and efficient way\u0000to develop high-performance parallel applications for distributed memory\u0000systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30B/70B models demonstrate significant improvements in efficiency. Specifically, the proposed technique reduces time consumption by approximately 35% on a 4090 GPU and by roughly 15% on an A800 GPU during the prefill stage of LLM inference.
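A conceptual sketch of sequence-level overlap follows: the sequence is split into chunks so that chunk i's computation proceeds while chunk i-1's communication completes in the background. The sleeps stand in for per-chunk transformer compute and tensor-parallel all-reduces; this illustrates only the overlap structure, not the paper's GPU implementation.

```python
# Conceptual sketch of sequence-level computation/communication overlap during prefill.
# Stand-ins (time.sleep) replace real GEMMs and all-reduces; not the paper's code.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):
    time.sleep(0.05)           # placeholder for per-chunk attention/MLP compute
    return chunk

def communicate(chunk):
    time.sleep(0.05)           # placeholder for the all-reduce of that chunk's activations
    return chunk

def prefill(sequence, num_chunks=4):
    size = len(sequence) // num_chunks
    chunks = [sequence[i * size:(i + 1) * size] for i in range(num_chunks)]
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as comm:
        for chunk in chunks:
            out = compute(chunk)                   # compute the current chunk
            if pending is not None:
                results.append(pending.result())   # previous chunk's comm ran in parallel
            pending = comm.submit(communicate, out)
        results.append(pending.result())
    return results

start = time.time()
prefill(list(range(1024)))
print(f"overlapped prefill took {time.time() - start:.2f}s")
```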
{"title":"ISO: Overlap of Computation and Communication within Seqenence For LLM Inference","authors":"Bin Xiao, Lei Su","doi":"arxiv-2409.11155","DOIUrl":"https://doi.org/arxiv-2409.11155","url":null,"abstract":"In the realm of Large Language Model (LLM) inference, the inherent structure\u0000of transformer models coupled with the multi-GPU tensor parallelism strategy\u0000leads to a sequential execution of computation and communication. This results\u0000in substantial underutilization of computing resources during the communication\u0000phase. To mitigate this inefficiency, various techniques have been developed to\u0000optimize the use of computational power throughout the communication process.\u0000These strategies primarily involve overlapping matrix computations and\u0000communications, as well as interleaving micro-batches across different\u0000requests. Nonetheless, these approaches either fall short of achieving ideal\u0000overlap or impose certain limitations on their application. To overcome these\u0000challenges, this paper introduces a novel strategy for\u0000computation-communication overlap that operates at the sequence level. This\u0000method not only enhances the degree of overlap but also minimizes the\u0000constraints on its applicability. Experimental evaluations conducted using\u000030b/70b models have demonstrated significant improvements in efficiency.\u0000Specifically, the proposed technique has been shown to reduce time consumption\u0000by approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the\u0000prefill stage of LLM inference.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven Jiaxun Tang, Mingcan Xiang, Yang Wang, Bo Wu, Jianjun Chen, Tongping Liu
Performance analysis is challenging because different components (e.g., different libraries and applications) of a complex system can interact with each other, yet few existing tools focus on understanding such interactions. To bridge this gap, we propose a novel analysis method, Cross Flow Analysis (XFA), that monitors the interactions/flows across these components. We also built the Scaler profiler, which provides a holistic view of the time spent on each component (e.g., library or application) and every API inside each component. This paper proposes multiple new techniques, such as the Universal Shadow Table and Relation-Aware Data Folding, which enable Scaler to achieve low runtime overhead, low memory overhead, and high profiling accuracy. Based on our extensive experimental results, Scaler detects multiple previously unknown performance issues inside widely used applications, and will therefore be a useful complement to existing work. The reproduction package, including the source code, benchmarks, and evaluation scripts, can be found at https://doi.org/10.5281/zenodo.13336658.
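To make the idea of attributing time to components and their APIs concrete, here is a hedged, Python-level stand-in: each wrapped function records cumulative time keyed by (component, API). Scaler itself intercepts native libraries with the techniques named above; this sketch only illustrates the kind of holistic view such attribution produces.

```python
# Hedged illustration of per-component, per-API time attribution; a generic
# Python-level interposition, not Scaler's native interception mechanism.
import time
from collections import defaultdict

api_time = defaultdict(float)   # (component, api) -> cumulative seconds

def traced(component):
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                api_time[(component, fn.__name__)] += time.perf_counter() - t0
        return inner
    return wrap

@traced("compression_lib")
def compress(data):
    return bytes(reversed(data))    # stand-in for a library API

@traced("application")
def handle_request(payload):
    return compress(payload)        # application flow crossing into the library

handle_request(b"example payload")
for (component, api), seconds in sorted(api_time.items()):
    print(f"{component:>15s}.{api}: {seconds * 1e6:.1f} us")
```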
{"title":"Scaler: Efficient and Effective Cross Flow Analysis","authors":"StevenJiaxun, Tang, Mingcan Xiang, Yang Wang, Bo Wu, Jianjun Chen, Tongping Liu","doi":"arxiv-2409.00854","DOIUrl":"https://doi.org/arxiv-2409.00854","url":null,"abstract":"Performance analysis is challenging as different components (e.g.,different\u0000libraries, and applications) of a complex system can interact with each other.\u0000However, few existing tools focus on understanding such interactions. To bridge\u0000this gap, we propose a novel analysis method \"Cross Flow Analysis (XFA)\" that\u0000monitors the interactions/flows across these components. We also built the\u0000Scaler profiler that provides a holistic view of the time spent on each\u0000component (e.g., library or application) and every API inside each component.\u0000This paper proposes multiple new techniques, such as Universal Shadow Table,\u0000and Relation-Aware Data Folding. These techniques enable Scaler to achieve low\u0000runtime overhead, low memory overhead, and high profiling accuracy. Based on\u0000our extensive experimental results, Scaler detects multiple unknown performance\u0000issues inside widely-used applications, and therefore will be a useful\u0000complement to existing work. The reproduction package including the source code, benchmarks, and\u0000evaluation scripts, can be found at https://doi.org/10.5281/zenodo.13336658.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Qi, H. Moore, N. Hogade, D. Milojicic, C. Bash, S. Pasricha
Serverless computing is an emerging cloud computing paradigm that can reduce costs for cloud providers and their customers. However, serverless cloud platforms have stringent performance requirements (due to the need to execute short-duration functions in a timely manner) and a growing carbon footprint. Traditional carbon-reducing techniques, such as shutting down idle containers, can reduce performance by increasing the cold-start latencies of containers required in the future, which can cause higher violation rates of service level objectives (SLOs). Conversely, traditional latency-reduction approaches of prewarming containers or keeping them alive when not in use can improve performance but increase the associated carbon footprint of the serverless cluster platform. To strike a balance between sustainability and performance, in this paper, we propose a novel carbon- and SLO-aware framework called CASA to schedule and autoscale containers in a serverless cloud computing cluster. Experimental results indicate that CASA reduces the operational carbon footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation rate by up to 1.4x compared to the state-of-the-art.
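The trade-off CASA navigates can be made concrete with a small, hedged decision sketch: keep a warm container alive only if the expected carbon-equivalent penalty of a future cold start (and its SLO impact) outweighs the carbon cost of idling it. The linear cost model, probabilities, and thresholds below are illustrative assumptions, not CASA's actual policy.

```python
# Hedged sketch of a carbon- vs. cold-start keep-alive decision; all numbers are illustrative.
def keep_container_warm(idle_power_w, horizon_s, carbon_intensity_g_per_kwh,
                        p_invocation, cold_start_penalty_g_equiv):
    """Return True if the expected cold-start penalty outweighs the idle carbon cost."""
    idle_carbon_g = idle_power_w * horizon_s / 3600.0 / 1000.0 * carbon_intensity_g_per_kwh
    expected_penalty_g = p_invocation * cold_start_penalty_g_equiv
    return expected_penalty_g > idle_carbon_g

# Example: 5 W idle container, 10-minute horizon, 400 gCO2/kWh grid,
# 60% chance of another invocation, cold start valued at 2 g CO2-equivalent.
print(keep_container_warm(5.0, 600, 400.0, 0.6, 2.0))
```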
{"title":"CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing","authors":"S. Qi, H. Moore, N. Hogade, D. Milojicic, C. Bash, S. Pasricha","doi":"arxiv-2409.00550","DOIUrl":"https://doi.org/arxiv-2409.00550","url":null,"abstract":"Serverless computing is an emerging cloud computing paradigm that can reduce\u0000costs for cloud providers and their customers. However, serverless cloud\u0000platforms have stringent performance requirements (due to the need to execute\u0000short duration functions in a timely manner) and a growing carbon footprint.\u0000Traditional carbon-reducing techniques such as shutting down idle containers\u0000can reduce performance by increasing cold-start latencies of containers\u0000required in the future. This can cause higher violation rates of service level\u0000objectives (SLOs). Conversely, traditional latency-reduction approaches of\u0000prewarming containers or keeping them alive when not in use can improve\u0000performance but increase the associated carbon footprint of the serverless\u0000cluster platform. To strike a balance between sustainability and performance,\u0000in this paper, we propose a novel carbon- and SLO-aware framework called CASA\u0000to schedule and autoscale containers in a serverless cloud computing cluster.\u0000Experimental results indicate that CASA reduces the operational carbon\u0000footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation\u0000rate by up to 1.4x compared to the state-of-the-art.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andreas Herten, Sebastian Achilles, Damian Alvarez, Jayesh Badwaik, Eric Behle, Mathis Bode, Thomas Breuer, Daniel Caviedes-Voullième, Mehdi Cherti, Adel Dabah, Salem El Sayed, Wolfgang Frings, Ana Gonzalez-Nicolas, Eric B. Gregory, Kaveh Haghighi Mood, Thorsten Hater, Jenia Jitsev, Chelsea Maria John, Jan H. Meinke, Catrin I. Meyer, Pavel Mezentsev, Jan-Oliver Mirus, Stepan Nassyr, Carolin Penke, Manoel Römmer, Ujjwal Sinha, Benedikt von St. Vieth, Olaf Stein, Estela Suarez, Dennis Willsch, Ilya Zhukov
Benchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements in benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility. In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open source software with this work at https://github.com/FZJ-JSC/jubench.
{"title":"Application-Driven Exascale: The JUPITER Benchmark Suite","authors":"Andreas Herten, Sebastian Achilles, Damian Alvarez, Jayesh Badwaik, Eric Behle, Mathis Bode, Thomas Breuer, Daniel Caviedes-Voullième, Mehdi Cherti, Adel Dabah, Salem El Sayed, Wolfgang Frings, Ana Gonzalez-Nicolas, Eric B. Gregory, Kaveh Haghighi Mood, Thorsten Hater, Jenia Jitsev, Chelsea Maria John, Jan H. Meinke, Catrin I. Meyer, Pavel Mezentsev, Jan-Oliver Mirus, Stepan Nassyr, Carolin Penke, Manoel Römmer, Ujjwal Sinha, Benedikt von St. Vieth, Olaf Stein, Estela Suarez, Dennis Willsch, Ilya Zhukov","doi":"arxiv-2408.17211","DOIUrl":"https://doi.org/arxiv-2408.17211","url":null,"abstract":"Benchmarks are essential in the design of modern HPC installations, as they\u0000define key aspects of system components. Beyond synthetic workloads, it is\u0000crucial to include real applications that represent user requirements into\u0000benchmark suites, to guarantee high usability and widespread adoption of a new\u0000system. Given the significant investments in leadership-class supercomputers of\u0000the exascale era, this is even more important and necessitates alignment with a\u0000vision of Open Science and reproducibility. In this work, we present the\u0000JUPITER Benchmark Suite, which incorporates 16 applications from various\u0000domains. It was designed for and used in the procurement of JUPITER, the first\u0000European exascale supercomputer. We identify requirements and challenges and\u0000outline the project and software infrastructure setup. We provide descriptions\u0000and scalability studies of selected applications and a set of key takeaways.\u0000The JUPITER Benchmark Suite is released as open source software with this work\u0000at https://github.com/FZJ-JSC/jubench.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}