The time-consuming nature of training and deploying complex Machine Learning (ML) and Deep Learning (DL) models continues to pose significant challenges across a variety of applications. These challenges are particularly pronounced in the federated domain, where optimising models for individual nodes is especially difficult. Many methods have been developed to tackle this problem, aiming to reduce training cost and time while maintaining efficient optimisation. Three strategies suited to this challenge are Active Learning, Knowledge Distillation, and Local Memorization. These methods enable the adoption of smaller models that require fewer computational resources and allow models to be personalised with local insights, thereby improving the effectiveness of existing models. The present study examines the fundamental principles of these three approaches and proposes an advanced Federated Learning system that applies different personalisation methods to improve the accuracy of AI models and enhance user experience in real-time NG-IoT applications, investigating the efficacy of these techniques in both the local and federated domain. The results of the original and optimised models are then compared in both local and federated contexts through a comparative analysis. The post-analysis shows encouraging outcomes for optimising and personalising the models with the suggested techniques.
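The abstract names Knowledge Distillation as one of the three personalisation techniques without detailing it. As a hedged illustration, the sketch below shows a standard distillation loss (temperature-scaled KL divergence blended with cross-entropy) that a small local model could minimise against a larger global model; the temperature and weighting values are assumed hyperparameters, and this is not the paper's specific formulation.

```python
# Minimal knowledge-distillation loss sketch (PyTorch); a generic formulation,
# not the specific personalisation pipeline described in the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL divergence (scaled by T^2) with hard-label cross-entropy."""
    soft_targets = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_targets, teacher_probs, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example: a small local (student) model learning from a larger global (teacher) model.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```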
{"title":"Applied Federated Model Personalisation in the Industrial Domain: A Comparative Study","authors":"Ilias Siniosoglou, Vasileios Argyriou, George Fragulis, Panagiotis Fouliras, Georgios Th. Papadopoulos, Anastasios Lytos, Panagiotis Sarigiannidis","doi":"arxiv-2409.06904","DOIUrl":"https://doi.org/arxiv-2409.06904","url":null,"abstract":"The time-consuming nature of training and deploying complicated Machine and\u0000Deep Learning (DL) models for a variety of applications continues to pose\u0000significant challenges in the field of Machine Learning (ML). These challenges\u0000are particularly pronounced in the federated domain, where optimizing models\u0000for individual nodes poses significant difficulty. Many methods have been\u0000developed to tackle this problem, aiming to reduce training expenses and time\u0000while maintaining efficient optimisation. Three suggested strategies to tackle\u0000this challenge include Active Learning, Knowledge Distillation, and Local\u0000Memorization. These methods enable the adoption of smaller models that require\u0000fewer computational resources and allow for model personalization with local\u0000insights, thereby improving the effectiveness of current models. The present\u0000study delves into the fundamental principles of these three approaches and\u0000proposes an advanced Federated Learning System that utilises different\u0000Personalisation methods towards improving the accuracy of AI models and\u0000enhancing user experience in real-time NG-IoT applications, investigating the\u0000efficacy of these techniques in the local and federated domain. The results of\u0000the original and optimised models are then compared in both local and federated\u0000contexts using a comparison analysis. The post-analysis shows encouraging\u0000outcomes when it comes to optimising and personalising the models with the\u0000suggested techniques.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195452","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini
Robustly estimating energy consumption in High-Performance Computing (HPC) is essential for assessing the energy footprint of modern workloads, particularly in fields such as Artificial Intelligence (AI) research, development, and deployment. The extensive use of supercomputers for AI training has heightened concerns about energy consumption and carbon emissions. Existing energy estimation tools often assume exclusive use of computing nodes, a premise that becomes problematic with the advent of supercomputers integrating microservices, as seen in initiatives like Acceleration as a Service (XaaS) and cloud computing. This work investigates the impact of executed instructions on overall power consumption, providing insights into the comprehensive behaviour of HPC systems. We introduce two novel mathematical models to estimate a process's energy consumption based on the total node energy, process usage, and a normalised vector of the probability distribution of instruction types for CPU and GPU processes. Our approach enables energy accounting for specific processes without the need for isolation. Our models demonstrate high accuracy, predicting CPU power consumption with a mere 1.9% error. For GPU predictions, the models achieve a central relative error of 9.7%, showing a clear tendency to fit the test data accurately. These results pave the way for new tools to measure and account for energy consumption in shared supercomputing environments.
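The abstract does not reproduce the two mathematical models, so the following is only a plausible sketch, assuming a simple linear attribution: the node energy is scaled by the process's usage share and by an intensity factor derived from the normalised instruction-type distribution. The instruction categories and weights are illustrative placeholders, not coefficients from the paper.

```python
# Hedged sketch: attribute node-level energy to a process using its usage share
# and a normalised instruction-type distribution. The per-instruction-type
# weights below are illustrative placeholders, not values from the paper.
import numpy as np

def process_energy(node_energy_j, usage_share, instr_distribution, instr_weights):
    """Estimate process energy (J) from total node energy, the process's share of
    node activity, and a normalised vector over instruction types."""
    p = np.asarray(instr_distribution, dtype=float)
    p = p / p.sum()                      # ensure the distribution is normalised
    w = np.asarray(instr_weights, dtype=float)
    intensity = float(p @ w)             # energy intensity implied by the instruction mix
    return node_energy_j * usage_share * intensity

# Example with made-up numbers: 500 kJ node energy, 30% usage share,
# instruction mix over [scalar, vector, memory, branch] with assumed weights.
e = process_energy(5.0e5, 0.30, [0.4, 0.3, 0.2, 0.1], [0.9, 1.3, 1.1, 0.7])
print(f"estimated process energy: {e:.1f} J")
```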
{"title":"A Comprehensive Analysis of Process Energy Consumption on Multi-Socket Systems with GPUs","authors":"Luis G. León-Vega, Niccolò Tosato, Stefano Cozzini","doi":"arxiv-2409.04941","DOIUrl":"https://doi.org/arxiv-2409.04941","url":null,"abstract":"Robustly estimating energy consumption in High-Performance Computing (HPC) is\u0000essential for assessing the energy footprint of modern workloads, particularly\u0000in fields such as Artificial Intelligence (AI) research, development, and\u0000deployment. The extensive use of supercomputers for AI training has heightened\u0000concerns about energy consumption and carbon emissions. Existing energy\u0000estimation tools often assume exclusive use of computing nodes, a premise that\u0000becomes problematic with the advent of supercomputers integrating\u0000microservices, as seen in initiatives like Acceleration as a Service (XaaS) and\u0000cloud computing. This work investigates the impact of executed instructions on overall power\u0000consumption, providing insights into the comprehensive behaviour of HPC\u0000systems. We introduce two novel mathematical models to estimate a process's\u0000energy consumption based on the total node energy, process usage, and a\u0000normalised vector of the probability distribution of instruction types for CPU\u0000and GPU processes. Our approach enables energy accounting for specific\u0000processes without the need for isolation. Our models demonstrate high accuracy, predicting CPU power consumption with a\u0000mere 1.9% error. For GPU predictions, the models achieve a central relative\u0000error of 9.7%, showing a clear tendency to fit the test data accurately. These\u0000results pave the way for new tools to measure and account for energy\u0000consumption in shared supercomputing environments.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This report evaluates the performance impact of enabling Trusted Execution Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference tasks. We benchmark the overhead introduced by TEE mode across various models and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers via PCIe. Our results show that while there is minimal computational overhead within the GPU, the overall performance penalty is primarily due to data transfer. For most typical LLM queries, the overhead remains below 5%, with larger models and longer sequences experiencing near-zero overhead.
{"title":"Confidential Computing on nVIDIA H100 GPU: A Performance Benchmark Study","authors":"Jianwei Zhu, Hang Yin, Shunfan Zhou","doi":"arxiv-2409.03992","DOIUrl":"https://doi.org/arxiv-2409.03992","url":null,"abstract":"This report evaluates the performance impact of enabling Trusted Execution\u0000Environments (TEE) on NVIDIA H100 GPUs for large language model (LLM) inference\u0000tasks. We benchmark the overhead introduced by TEE mode across various models\u0000and token lengths, focusing on the bottleneck caused by CPU-GPU data transfers\u0000via PCIe. Our results show that while there is minimal computational overhead\u0000within the GPU, the overall performance penalty is primarily due to data\u0000transfer. For most typical LLM queries, the overhead remains below 5%, with\u0000larger models and longer sequences experiencing near-zero overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"176 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manuel de Castro, Francisco J. Andújar, Roberto R. Osorio, Rocío Carratalá-Sáez, Diego R. Llanos
As interest in FPGA-based accelerators for HPC applications increases, new challenges arise, especially concerning programming and portability issues. This paper aims to provide a snapshot of the current state of FPGA tooling and its problems. To do so, we evaluate the performance portability of two frameworks for developing FPGA solutions for HPC (SYCL and OpenCL) when using them to port a highly parallel application to FPGAs, using both ND-range and single-task kernels. The general recommendation for developers targeting FPGAs is to write single-task kernels, as these are commonly regarded as better suited to such hardware. However, we discovered that, when using high-level approaches such as OpenCL and SYCL to program a highly parallel application with no FPGA-tailored optimizations, ND-range kernels significantly outperform single-task codes. Specifically, while SYCL struggles to produce efficient FPGA implementations of applications described as single-task codes, its performance excels with ND-range kernels, an unexpectedly favorable result.
{"title":"Challenging Portability Paradigms: FPGA Acceleration Using SYCL and OpenCL","authors":"Manuel de Castro, Francisco J. andújar, Roberto R. Osorio, Rocío Carratalá-Sáez, Diego R. Llanos","doi":"arxiv-2409.03391","DOIUrl":"https://doi.org/arxiv-2409.03391","url":null,"abstract":"As the interest in FPGA-based accelerators for HPC applications increases,\u0000new challenges also arise, especially concerning different programming and\u0000portability issues. This paper aims to provide a snapshot of the current state\u0000of the FPGA tooling and its problems. To do so, we evaluate the performance\u0000portability of two frameworks for developing FPGA solutions for HPC (SYCL and\u0000OpenCL) when using them to port a highly-parallel application to FPGAs, using\u0000both ND-range and single-task type of kernels. The developer's general recommendation when using FPGAs is to develop\u0000single-task kernels for them, as they are commonly regarded as more suited for\u0000such hardware. However, we discovered that, when using high-level approaches\u0000such as OpenCL and SYCL to program a highly-parallel application with no\u0000FPGA-tailored optimizations, ND-range kernels significantly outperform\u0000single-task codes. Specifically, while SYCL struggles to produce efficient FPGA\u0000implementations of applications described as single-task codes, its performance\u0000excels with ND-range kernels, a result that was unexpectedly favorable.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhe Wang, Zhen Wang, Jianwen Wu, Wangzhong Xiao, Yidong Chen, Zihua Feng, Dian Yang, Hongchen Liu, Bo Liang, Jiaojiao Fu
In order to accurately identify the performance status of mobile devices and finely adjust the user experience, a real-time performance perception and evaluation method was studied, based on TOPSIS (Technique for Order Preference by Similarity to Ideal Solution) combined with the entropy weighting method and time-series model construction. After collecting the performance characteristics of various mobile devices, the device performance profile was fitted using PCA (principal component analysis) dimensionality reduction and feature-engineering methods such as descriptive time-series analysis. The ability of performance features and profiles to describe the real-time performance status of devices was studied by applying the TOPSIS method with multi-level weighting. A time-series model was constructed for the feature set under objective weighting, and performance-status perception results at multiple sensitivities (real-time, short-term, long-term) were produced, yielding real-time performance evaluation data and long-term, stable performance prediction data. Finally, by configuring dynamic A/B experiments and overlaying fine-grained power-reduction strategies, the usability of the method was verified, and the accuracy of device performance-status identification and prediction was compared against alternative profile-feature approaches, including dimensionality-reduction time-series modeling, the TOPSIS method with entropy weighting, subjective weighting, and the HMA method. The results show that accurate real-time performance perception can greatly enhance business value, and that this research is effective in application and has forward-looking significance.
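As a minimal sketch of the core scoring machinery named in the abstract, the code below implements entropy-weighted TOPSIS over a small decision matrix; the criteria, their benefit/cost directions, and the example values are assumptions for illustration and do not reproduce the paper's PCA-reduced feature set or time-series models.

```python
# Hedged sketch of entropy-weighted TOPSIS scoring for device performance status.
import numpy as np

def entropy_weights(X):
    """Objective criterion weights from Shannon entropy of the normalised columns."""
    P = X / X.sum(axis=0)
    P = np.clip(P, 1e-12, None)
    e = -(P * np.log(P)).sum(axis=0) / np.log(X.shape[0])
    d = 1.0 - e
    return d / d.sum()

def topsis_scores(X, benefit):
    """Closeness to the ideal solution; benefit[j] is True for higher-is-better criteria."""
    R = X / np.linalg.norm(X, axis=0)            # vector-normalise each criterion
    V = R * entropy_weights(X)                   # apply entropy-derived weights
    ideal = np.where(benefit, V.max(axis=0), V.min(axis=0))
    worst = np.where(benefit, V.min(axis=0), V.max(axis=0))
    d_pos = np.linalg.norm(V - ideal, axis=1)
    d_neg = np.linalg.norm(V - worst, axis=1)
    return d_neg / (d_pos + d_neg)

# Example: 4 devices x 3 illustrative criteria (frame rate up, temperature down, free memory up).
X = np.array([[60, 38, 2.1], [45, 44, 1.2], [55, 41, 1.8], [30, 48, 0.8]], dtype=float)
print(topsis_scores(X, benefit=np.array([True, False, True])))
```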
{"title":"Application Research On Real-Time Perception Of Device Performance Status","authors":"Zhe Wang, Zhen Wang, Jianwen Wu, Wangzhong Xiao, Yidong Chen, Zihua Feng, Dian Yang, Hongchen Liu, Bo Liang, Jiaojiao Fu","doi":"arxiv-2409.03218","DOIUrl":"https://doi.org/arxiv-2409.03218","url":null,"abstract":"In order to accurately identify the performance status of mobile devices and\u0000finely adjust the user experience, a real-time performance perception\u0000evaluation method based on TOPSIS (Technique for Order Preference by Similarity\u0000to Ideal Solution) combined with entropy weighting method and time series model\u0000construction was studied. After collecting the performance characteristics of\u0000various mobile devices, the device performance profile was fitted by using PCA\u0000(principal component analysis) dimensionality reduction and feature engineering\u0000methods such as descriptive time series analysis. The ability of performance\u0000features and profiles to describe the real-time performance status of devices\u0000was understood and studied by applying the TOPSIS method and multi-level\u0000weighting processing. A time series model was constructed for the feature set\u0000under objective weighting, and multiple sensitivity (real-time, short-term,\u0000long-term) performance status perception results were provided to obtain\u0000real-time performance evaluation data and long-term stable performance\u0000prediction data. Finally, by configuring dynamic AB experiments and overlaying\u0000fine-grained power reduction strategies, the usability of the method was\u0000verified, and the accuracy of device performance status identification and\u0000prediction was compared with the performance of the profile features including\u0000dimensionality reduction time series modeling, TOPSIS method and entropy\u0000weighting method, subjective weighting, HMA method. The results show that\u0000accurate real-time performance perception results can greatly enhance business\u0000value, and this research has application effectiveness and certain\u0000forward-looking significance.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI+X has been the de facto standard for distributed memory parallel programming. It is widely used primarily as an explicit two-sided communication model, which often leads to complex and error-prone code. Alternatively, the PGAS model utilizes efficient one-sided communication and more intuitive communication primitives. In this paper, we present a novel approach that integrates PGAS concepts into the OpenMP programming model, leveraging the LLVM compiler infrastructure and the GASNet-EX communication library. Our model addresses the complexity associated with traditional MPI+OpenMP programming models while ensuring excellent performance and scalability. We evaluate our approach using a set of micro-benchmarks and application kernels on two distinct platforms: Ookami from Stony Brook University and NERSC Perlmutter. The results demonstrate that DiOMP achieves superior bandwidth and lower latency compared to MPI+OpenMP: up to 25% higher bandwidth and up to 45% lower latency. DiOMP offers a promising alternative to the traditional MPI+OpenMP hybrid programming model, providing a more productive and efficient way to develop high-performance parallel applications for distributed memory systems.
{"title":"Towards a Scalable and Efficient PGAS-based Distributed OpenMP","authors":"Baodi Shan, Mauricio Araya-Polo, Barbara Chapman","doi":"arxiv-2409.02830","DOIUrl":"https://doi.org/arxiv-2409.02830","url":null,"abstract":"MPI+X has been the de facto standard for distributed memory parallel\u0000programming. It is widely used primarily as an explicit two-sided communication\u0000model, which often leads to complex and error-prone code. Alternatively, PGAS\u0000model utilizes efficient one-sided communication and more intuitive\u0000communication primitives. In this paper, we present a novel approach that\u0000integrates PGAS concepts into the OpenMP programming model, leveraging the LLVM\u0000compiler infrastructure and the GASNet-EX communication library. Our model\u0000addresses the complexity associated with traditional MPI+OpenMP programming\u0000models while ensuring excellent performance and scalability. We evaluate our\u0000approach using a set of micro-benchmarks and application kernels on two\u0000distinct platforms: Ookami from Stony Brook University and NERSC Perlmutter.\u0000The results demonstrate that DiOMP achieves superior bandwidth and lower\u0000latency compared to MPI+OpenMP, up to 25% higher bandwidth and down to 45% on\u0000latency. DiOMP offers a promising alternative to the traditional MPI+OpenMP\u0000hybrid programming model, towards providing a more productive and efficient way\u0000to develop high-performance parallel applications for distributed memory\u0000systems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in substantial underutilization of computing resources during the communication phase. To mitigate this inefficiency, various techniques have been developed to optimize the use of computational power throughout the communication process. These strategies primarily involve overlapping matrix computations and communications, as well as interleaving micro-batches across different requests. Nonetheless, these approaches either fall short of achieving ideal overlap or impose certain limitations on their application. To overcome these challenges, this paper introduces a novel strategy for computation-communication overlap that operates at the sequence level. This method not only enhances the degree of overlap but also minimizes the constraints on its applicability. Experimental evaluations conducted using 30B/70B models demonstrate significant improvements in efficiency. Specifically, the proposed technique reduces time consumption by approximately 35% on a 4090 GPU and by roughly 15% on an A800 GPU during the prefill stage of LLM inference.
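A conceptual sketch of sequence-level overlap follows: the sequence is split into chunks so that chunk i's computation proceeds while chunk i-1's communication completes in the background. The sleeps stand in for per-chunk transformer compute and tensor-parallel all-reduces; this illustrates only the overlap structure, not the paper's GPU implementation.

```python
# Conceptual sketch of sequence-level computation/communication overlap during prefill.
# Stand-ins (time.sleep) replace real GEMMs and all-reduces; not the paper's code.
import time
from concurrent.futures import ThreadPoolExecutor

def compute(chunk):
    time.sleep(0.05)           # placeholder for per-chunk attention/MLP compute
    return chunk

def communicate(chunk):
    time.sleep(0.05)           # placeholder for the all-reduce of that chunk's activations
    return chunk

def prefill(sequence, num_chunks=4):
    size = len(sequence) // num_chunks
    chunks = [sequence[i * size:(i + 1) * size] for i in range(num_chunks)]
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as comm:
        for chunk in chunks:
            out = compute(chunk)                   # compute the current chunk
            if pending is not None:
                results.append(pending.result())   # previous chunk's comm ran in parallel
            pending = comm.submit(communicate, out)
        results.append(pending.result())
    return results

start = time.time()
prefill(list(range(1024)))
print(f"overlapped prefill took {time.time() - start:.2f}s")
```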
{"title":"ISO: Overlap of Computation and Communication within Seqenence For LLM Inference","authors":"Bin Xiao, Lei Su","doi":"arxiv-2409.11155","DOIUrl":"https://doi.org/arxiv-2409.11155","url":null,"abstract":"In the realm of Large Language Model (LLM) inference, the inherent structure\u0000of transformer models coupled with the multi-GPU tensor parallelism strategy\u0000leads to a sequential execution of computation and communication. This results\u0000in substantial underutilization of computing resources during the communication\u0000phase. To mitigate this inefficiency, various techniques have been developed to\u0000optimize the use of computational power throughout the communication process.\u0000These strategies primarily involve overlapping matrix computations and\u0000communications, as well as interleaving micro-batches across different\u0000requests. Nonetheless, these approaches either fall short of achieving ideal\u0000overlap or impose certain limitations on their application. To overcome these\u0000challenges, this paper introduces a novel strategy for\u0000computation-communication overlap that operates at the sequence level. This\u0000method not only enhances the degree of overlap but also minimizes the\u0000constraints on its applicability. Experimental evaluations conducted using\u000030b/70b models have demonstrated significant improvements in efficiency.\u0000Specifically, the proposed technique has been shown to reduce time consumption\u0000by approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the\u0000prefill stage of LLM inference.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Steven Jiaxun Tang, Mingcan Xiang, Yang Wang, Bo Wu, Jianjun Chen, Tongping Liu
Performance analysis is challenging because different components (e.g., different libraries and applications) of a complex system can interact with each other, yet few existing tools focus on understanding such interactions. To bridge this gap, we propose a novel analysis method, Cross Flow Analysis (XFA), that monitors the interactions/flows across these components. We also built the Scaler profiler, which provides a holistic view of the time spent on each component (e.g., library or application) and every API inside each component. This paper proposes multiple new techniques, such as the Universal Shadow Table and Relation-Aware Data Folding, which enable Scaler to achieve low runtime overhead, low memory overhead, and high profiling accuracy. Based on our extensive experimental results, Scaler detects multiple previously unknown performance issues inside widely used applications, and will therefore be a useful complement to existing work. The reproduction package, including the source code, benchmarks, and evaluation scripts, can be found at https://doi.org/10.5281/zenodo.13336658.
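To make the idea of attributing time to components and their APIs concrete, here is a hedged, Python-level stand-in: each wrapped function records cumulative time keyed by (component, API). Scaler itself intercepts native libraries with the techniques named above; this sketch only illustrates the kind of holistic view such attribution produces.

```python
# Hedged illustration of per-component, per-API time attribution; a generic
# Python-level interposition, not Scaler's native interception mechanism.
import time
from collections import defaultdict

api_time = defaultdict(float)   # (component, api) -> cumulative seconds

def traced(component):
    def wrap(fn):
        def inner(*args, **kwargs):
            t0 = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                api_time[(component, fn.__name__)] += time.perf_counter() - t0
        return inner
    return wrap

@traced("compression_lib")
def compress(data):
    return bytes(reversed(data))    # stand-in for a library API

@traced("application")
def handle_request(payload):
    return compress(payload)        # application flow crossing into the library

handle_request(b"example payload")
for (component, api), seconds in sorted(api_time.items()):
    print(f"{component:>15s}.{api}: {seconds * 1e6:.1f} us")
```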
{"title":"Scaler: Efficient and Effective Cross Flow Analysis","authors":"StevenJiaxun, Tang, Mingcan Xiang, Yang Wang, Bo Wu, Jianjun Chen, Tongping Liu","doi":"arxiv-2409.00854","DOIUrl":"https://doi.org/arxiv-2409.00854","url":null,"abstract":"Performance analysis is challenging as different components (e.g.,different\u0000libraries, and applications) of a complex system can interact with each other.\u0000However, few existing tools focus on understanding such interactions. To bridge\u0000this gap, we propose a novel analysis method \"Cross Flow Analysis (XFA)\" that\u0000monitors the interactions/flows across these components. We also built the\u0000Scaler profiler that provides a holistic view of the time spent on each\u0000component (e.g., library or application) and every API inside each component.\u0000This paper proposes multiple new techniques, such as Universal Shadow Table,\u0000and Relation-Aware Data Folding. These techniques enable Scaler to achieve low\u0000runtime overhead, low memory overhead, and high profiling accuracy. Based on\u0000our extensive experimental results, Scaler detects multiple unknown performance\u0000issues inside widely-used applications, and therefore will be a useful\u0000complement to existing work. The reproduction package including the source code, benchmarks, and\u0000evaluation scripts, can be found at https://doi.org/10.5281/zenodo.13336658.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"43 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Qi, H. Moore, N. Hogade, D. Milojicic, C. Bash, S. Pasricha
Serverless computing is an emerging cloud computing paradigm that can reduce costs for cloud providers and their customers. However, serverless cloud platforms have stringent performance requirements (due to the need to execute short-duration functions in a timely manner) and a growing carbon footprint. Traditional carbon-reducing techniques, such as shutting down idle containers, can reduce performance by increasing the cold-start latencies of containers required in the future, which can cause higher violation rates of service level objectives (SLOs). Conversely, traditional latency-reduction approaches of prewarming containers or keeping them alive when not in use can improve performance but increase the associated carbon footprint of the serverless cluster platform. To strike a balance between sustainability and performance, in this paper, we propose a novel carbon- and SLO-aware framework called CASA to schedule and autoscale containers in a serverless cloud computing cluster. Experimental results indicate that CASA reduces the operational carbon footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation rate by up to 1.4x compared to the state-of-the-art.
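The trade-off CASA navigates can be made concrete with a small, hedged decision sketch: keep a warm container alive only if the expected carbon-equivalent penalty of a future cold start (and its SLO impact) outweighs the carbon cost of idling it. The linear cost model, probabilities, and thresholds below are illustrative assumptions, not CASA's actual policy.

```python
# Hedged sketch of a carbon- vs. cold-start keep-alive decision; all numbers are illustrative.
def keep_container_warm(idle_power_w, horizon_s, carbon_intensity_g_per_kwh,
                        p_invocation, cold_start_penalty_g_equiv):
    """Return True if the expected cold-start penalty outweighs the idle carbon cost."""
    idle_carbon_g = idle_power_w * horizon_s / 3600.0 / 1000.0 * carbon_intensity_g_per_kwh
    expected_penalty_g = p_invocation * cold_start_penalty_g_equiv
    return expected_penalty_g > idle_carbon_g

# Example: 5 W idle container, 10-minute horizon, 400 gCO2/kWh grid,
# 60% chance of another invocation, cold start valued at 2 g CO2-equivalent.
print(keep_container_warm(5.0, 600, 400.0, 0.6, 2.0))
```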
{"title":"CASA: A Framework for SLO and Carbon-Aware Autoscaling and Scheduling in Serverless Cloud Computing","authors":"S. Qi, H. Moore, N. Hogade, D. Milojicic, C. Bash, S. Pasricha","doi":"arxiv-2409.00550","DOIUrl":"https://doi.org/arxiv-2409.00550","url":null,"abstract":"Serverless computing is an emerging cloud computing paradigm that can reduce\u0000costs for cloud providers and their customers. However, serverless cloud\u0000platforms have stringent performance requirements (due to the need to execute\u0000short duration functions in a timely manner) and a growing carbon footprint.\u0000Traditional carbon-reducing techniques such as shutting down idle containers\u0000can reduce performance by increasing cold-start latencies of containers\u0000required in the future. This can cause higher violation rates of service level\u0000objectives (SLOs). Conversely, traditional latency-reduction approaches of\u0000prewarming containers or keeping them alive when not in use can improve\u0000performance but increase the associated carbon footprint of the serverless\u0000cluster platform. To strike a balance between sustainability and performance,\u0000in this paper, we propose a novel carbon- and SLO-aware framework called CASA\u0000to schedule and autoscale containers in a serverless cloud computing cluster.\u0000Experimental results indicate that CASA reduces the operational carbon\u0000footprint of a FaaS cluster by up to 2.6x while also reducing the SLO violation\u0000rate by up to 1.4x compared to the state-of-the-art.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andreas Herten, Sebastian Achilles, Damian Alvarez, Jayesh Badwaik, Eric Behle, Mathis Bode, Thomas Breuer, Daniel Caviedes-Voullième, Mehdi Cherti, Adel Dabah, Salem El Sayed, Wolfgang Frings, Ana Gonzalez-Nicolas, Eric B. Gregory, Kaveh Haghighi Mood, Thorsten Hater, Jenia Jitsev, Chelsea Maria John, Jan H. Meinke, Catrin I. Meyer, Pavel Mezentsev, Jan-Oliver Mirus, Stepan Nassyr, Carolin Penke, Manoel Römmer, Ujjwal Sinha, Benedikt von St. Vieth, Olaf Stein, Estela Suarez, Dennis Willsch, Ilya Zhukov
Benchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements in benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility. In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open source software with this work at https://github.com/FZJ-JSC/jubench.
{"title":"Application-Driven Exascale: The JUPITER Benchmark Suite","authors":"Andreas Herten, Sebastian Achilles, Damian Alvarez, Jayesh Badwaik, Eric Behle, Mathis Bode, Thomas Breuer, Daniel Caviedes-Voullième, Mehdi Cherti, Adel Dabah, Salem El Sayed, Wolfgang Frings, Ana Gonzalez-Nicolas, Eric B. Gregory, Kaveh Haghighi Mood, Thorsten Hater, Jenia Jitsev, Chelsea Maria John, Jan H. Meinke, Catrin I. Meyer, Pavel Mezentsev, Jan-Oliver Mirus, Stepan Nassyr, Carolin Penke, Manoel Römmer, Ujjwal Sinha, Benedikt von St. Vieth, Olaf Stein, Estela Suarez, Dennis Willsch, Ilya Zhukov","doi":"arxiv-2408.17211","DOIUrl":"https://doi.org/arxiv-2408.17211","url":null,"abstract":"Benchmarks are essential in the design of modern HPC installations, as they\u0000define key aspects of system components. Beyond synthetic workloads, it is\u0000crucial to include real applications that represent user requirements into\u0000benchmark suites, to guarantee high usability and widespread adoption of a new\u0000system. Given the significant investments in leadership-class supercomputers of\u0000the exascale era, this is even more important and necessitates alignment with a\u0000vision of Open Science and reproducibility. In this work, we present the\u0000JUPITER Benchmark Suite, which incorporates 16 applications from various\u0000domains. It was designed for and used in the procurement of JUPITER, the first\u0000European exascale supercomputer. We identify requirements and challenges and\u0000outline the project and software infrastructure setup. We provide descriptions\u0000and scalability studies of selected applications and a set of key takeaways.\u0000The JUPITER Benchmark Suite is released as open source software with this work\u0000at https://github.com/FZJ-JSC/jubench.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}