Metaheuristic algorithms are essential for solving complex optimization problems in different fields. However, comparing and rating these algorithms remains difficult because of the wide range of performance metrics and problem dimensions usually involved. Moreover, nonparametric statistical methods and post hoc tests are time-consuming, especially when we only need to identify the top performers among many algorithms. The Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank metaheuristic algorithms based on their performance across many criteria and dimensions. HRA employs a hierarchical framework that begins with collecting performance metrics on various benchmark functions and dimensions. Rank-based normalization is applied to each performance measure to ensure comparability, and robust TOPSIS aggregation combines these rankings at several hierarchical levels, resulting in a comprehensive ranking of the algorithms. Our study uses data from the CEC 2017 competition to demonstrate the robustness and efficacy of the HRA framework. It examines 30 benchmark functions and evaluates the performance of 13 metaheuristic algorithms on five performance indicators in four distinct dimensions. This study highlights the potential of HRA to clarify the comparative advantages and disadvantages of the algorithms, simplifying practitioners' choice of the most appropriate algorithm for a given optimization problem.
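As a concrete illustration of the two building blocks named above, the sketch below rank-normalizes a performance matrix and aggregates it with a classic TOPSIS closeness score. It is a minimal sketch assuming lower raw scores are better and equal criterion weights; it uses plain TOPSIS rather than the robust variant, and the toy data are placeholders, not the paper's HRA implementation or CEC 2017 results.

```python
import numpy as np

def rank_normalize(scores):
    """Rank algorithms per criterion (1 = best); lower raw score assumed better."""
    # argsort twice yields 0-based ranks along each column; +1 makes them 1-based
    return np.argsort(np.argsort(scores, axis=0), axis=0) + 1.0

def topsis(matrix, weights=None):
    """Classic TOPSIS closeness coefficients for a cost-type decision matrix."""
    m = np.asarray(matrix, dtype=float)
    w = np.ones(m.shape[1]) / m.shape[1] if weights is None else np.asarray(weights, dtype=float)
    norm = m / np.linalg.norm(m, axis=0)            # vector normalization per criterion
    v = norm * w                                    # weighted normalized matrix
    ideal_best, ideal_worst = v.min(axis=0), v.max(axis=0)  # cost criteria: smaller is better
    d_best = np.linalg.norm(v - ideal_best, axis=1)
    d_worst = np.linalg.norm(v - ideal_worst, axis=1)
    return d_worst / (d_best + d_worst)             # higher = closer to the ideal point

# toy example: 4 algorithms x 3 performance measures (lower raw value = better)
scores = np.array([[0.1, 3.0, 12.0],
                   [0.4, 1.0,  9.0],
                   [0.2, 2.0, 15.0],
                   [0.9, 5.0, 20.0]])
ranks = rank_normalize(scores)
closeness = topsis(ranks)
print("final ordering (best first):", np.argsort(-closeness))
```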
{"title":"HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms","authors":"Evgenia-Maria K. Goula, Dimitris G. Sotiropoulos","doi":"arxiv-2409.11617","DOIUrl":"https://doi.org/arxiv-2409.11617","url":null,"abstract":"Metaheuristic algorithms are essential for solving complex optimization\u0000problems in different fields. However, the difficulty in comparing and rating\u0000these algorithms remains due to the wide range of performance metrics and\u0000problem dimensions usually involved. On the other hand, nonparametric\u0000statistical methods and post hoc tests are time-consuming, especially when we\u0000only need to identify the top performers among many algorithms. The\u0000Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank\u0000metaheuristic algorithms based on their performance across many criteria and\u0000dimensions. The HRA employs a hierarchical framework that begins with\u0000collecting performance metrics on various benchmark functions and dimensions.\u0000Rank-based normalization is employed for each performance measure to ensure\u0000comparability and the robust TOPSIS aggregation is applied to combine these\u0000rankings at several hierarchical levels, resulting in a comprehensive ranking\u0000of the algorithms. Our study uses data from the CEC 2017 competition to\u0000demonstrate the robustness and efficacy of the HRA framework. It examines 30\u0000benchmark functions and evaluates the performance of 13 metaheuristic\u0000algorithms across five performance indicators in four distinct dimensions. This\u0000presentation highlights the potential of the HRA to enhance the interpretation\u0000of the comparative advantages and disadvantages of various algorithms by\u0000simplifying practitioners' choices of the most appropriate algorithm for\u0000certain optimization problems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen
Graph neural networks (GNNs) are a type of neural network capable of learning on graph-structured data. However, training GNNs on large-scale graphs is challenging due to iterative aggregations of high-dimensional features from neighboring vertices within sparse graph structures, combined with neural network operations. The sparsity of graphs frequently results in suboptimal memory access patterns and longer training time. Graph reordering is an optimization strategy that aims to improve the graph data layout. It has been shown to speed up graph analytics workloads, but its effect on the performance of GNN training has not been investigated yet. Generalizing reordering results to GNN performance is nontrivial, as multiple aspects must be considered: GNN hyper-parameters such as the number of layers, the number of hidden dimensions, and the feature size used in the GNN model; neural network operations; large intermediate vertex states; and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12 reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric and Deep Graph Library. Our results show that graph reordering is effective in reducing training time for both CPU- and GPU-based training. Further, we find that GNN hyper-parameters influence the effectiveness of reordering, that reordering metrics play an important role in selecting a reordering strategy, that lightweight reordering performs better for GPU-based than for CPU-based training, and that the invested reordering time can in many cases be amortized.
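To illustrate what a reordering strategy does to the data layout, the sketch below applies one lightweight strategy (relabeling vertices by descending degree) to a CSR adjacency matrix and permutes the feature matrix consistently. This is a generic sketch, not necessarily one of the 12 strategies evaluated in the paper; the toy graph and feature sizes are placeholders.

```python
import numpy as np
import scipy.sparse as sp

def degree_sort_permutation(adj: sp.csr_matrix) -> np.ndarray:
    """Lightweight reordering: relabel vertices by descending degree."""
    degrees = np.diff(adj.indptr)          # number of stored neighbors per vertex in CSR
    return np.argsort(-degrees)            # new order: high-degree vertices first

def apply_reordering(adj: sp.csr_matrix, features: np.ndarray, perm: np.ndarray):
    """Permute rows/columns of the adjacency and the feature matrix consistently."""
    adj_perm = adj[perm][:, perm].tocsr()  # symmetric permutation of the graph
    feat_perm = features[perm]
    return adj_perm, feat_perm

# toy undirected graph: 5 vertices, a few edges stored in both directions
rows = np.array([0, 1, 1, 2, 3, 4, 2, 0])
cols = np.array([1, 0, 2, 1, 4, 3, 0, 2])
adj = sp.csr_matrix((np.ones_like(rows, dtype=np.float32), (rows, cols)), shape=(5, 5))
features = np.random.rand(5, 16).astype(np.float32)

perm = degree_sort_permutation(adj)
adj_r, feat_r = apply_reordering(adj, features, perm)
# adj_r and feat_r would then be handed to the GNN system (e.g., PyG or DGL) for training
```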
{"title":"Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study","authors":"Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen","doi":"arxiv-2409.11129","DOIUrl":"https://doi.org/arxiv-2409.11129","url":null,"abstract":"Graph neural networks (GNNs) are a type of neural network capable of learning\u0000on graph-structured data. However, training GNNs on large-scale graphs is\u0000challenging due to iterative aggregations of high-dimensional features from\u0000neighboring vertices within sparse graph structures combined with neural\u0000network operations. The sparsity of graphs frequently results in suboptimal\u0000memory access patterns and longer training time. Graph reordering is an\u0000optimization strategy aiming to improve the graph data layout. It has shown to\u0000be effective to speed up graph analytics workloads, but its effect on the\u0000performance of GNN training has not been investigated yet. The generalization\u0000of reordering to GNN performance is nontrivial, as multiple aspects must be\u0000considered: GNN hyper-parameters such as the number of layers, the number of\u0000hidden dimensions, and the feature size used in the GNN model, neural network\u0000operations, large intermediate vertex states, and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12\u0000reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric\u0000and Deep Graph Library. Our results show that graph reordering is effective in\u0000reducing training time for CPU- and GPU-based training, respectively. Further,\u0000we find that GNN hyper-parameters influence the effectiveness of reordering,\u0000that reordering metrics play an important role in selecting a reordering\u0000strategy, that lightweight reordering performs better for GPU-based than for\u0000CPU-based training, and that invested reordering time can in many cases be\u0000amortized.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr
The variety of today's multicore architectures motivates researchers to explore parallel scientific applications on different platforms. Load imbalance is one performance issue that can prevent parallel applications from fully exploiting the computational power of these platforms. Ondes3D is a scientific application for seismic wave simulation used to assess the geological impact of earthquakes. Its parallelism relies on applying a regular domain decomposition to the given geological domain and distributing each sub-domain to an MPI rank. Previous works investigated the significant spatial and temporal imbalance in Ondes3D and suggested new parallelization and load-balancing techniques to minimize it. However, none explored its execution on different architectures. Our paper evaluates the performance of Ondes3D for two earthquake scenarios on eight different multicore architectures, including Intel, AMD, and ARM processors. We measure the load distribution per MPI rank, evaluate the temporal load imbalance, and compare the execution of the application's kernels. Our results show that the temporal load imbalance in Ondes3D depends on the architecture chosen, with some platforms minimizing such imbalance more effectively.
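One common way to quantify temporal load imbalance from per-rank kernel timings is the ratio of the maximum to the mean load per timestep. The sketch below computes this metric on synthetic timings; the metric and data are illustrative assumptions, not necessarily the exact measure or measurements used in the paper.

```python
import numpy as np

def load_imbalance(per_rank_times: np.ndarray) -> np.ndarray:
    """Per-timestep imbalance: max over ranks divided by mean over ranks.
    per_rank_times has shape (timesteps, ranks); 1.0 means perfectly balanced."""
    return per_rank_times.max(axis=1) / per_rank_times.mean(axis=1)

# illustrative data: 100 timesteps, 8 MPI ranks, rank 3 systematically 30% slower
rng = np.random.default_rng(0)
times = rng.normal(1.0, 0.05, size=(100, 8))
times[:, 3] *= 1.3

imb = load_imbalance(times)
print(f"mean temporal imbalance: {imb.mean():.2f}  (1.00 = balanced)")
```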
{"title":"Temporal Load Imbalance on Ondes3D Seismic Simulator for Different Multicore Architectures","authors":"Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr","doi":"arxiv-2409.11392","DOIUrl":"https://doi.org/arxiv-2409.11392","url":null,"abstract":"The variety of today's multicore architectures motivates researchers to\u0000explore parallel scientific applications on different platforms. Load imbalance\u0000is one performance issue that can prejudice parallel applications from\u0000exploiting the computational power of these platforms. Ondes3D is a scientific\u0000application for seismic wave simulation used to assess the geological impact of\u0000earthquakes. Its parallelism relies on applying a regular domain decomposition\u0000in the geological domain provided and distributing each sub-domain to MPI\u0000ranks. Previous works investigate the significant spatial and temporal\u0000imbalance in Ondes3D and suggest new parallelization and load balancing\u0000techniques to minimize them. However, none explored its execution on different\u0000architectures. Our paper evaluates the performance of Ondes3D for two\u0000earthquake scenarios on eight different multicore architectures, including\u0000Intel, AMD, and ARM processors. We measure the load distribution per MPI rank,\u0000evaluate the temporal load imbalance, and compare the execution of the\u0000application's kernels. Our results show that the temporal load imbalance in\u0000Ondes3D depends on the architecture chosen, with some platforms minimizing such\u0000imbalance more effectively.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights into how best to exploit multi-GPU systems.
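As one concrete example from this landscape, the sketch below uses a user-level library (NCCL, driven through torch.distributed) to all-reduce GPU-resident buffers directly between devices, without explicit staging through host memory. It assumes the standard torchrun launch conventions (RANK, WORLD_SIZE, LOCAL_RANK environment variables) and is only an illustration of one point in the design space the paper surveys.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")   # NCCL provides the GPU-to-GPU data path

    # each rank contributes a GPU-resident tensor; no explicit host copies in user code
    x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # data moves over NVLink/PCIe/network

    if dist.get_rank() == 0:
        print("all-reduce result (first element):", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
# launch example: torchrun --nproc_per_node=4 allreduce_demo.py
```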
{"title":"The Landscape of GPU-Centric Communication","authors":"Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov","doi":"arxiv-2409.09874","DOIUrl":"https://doi.org/arxiv-2409.09874","url":null,"abstract":"n recent years, GPUs have become the preferred accelerators for HPC and ML\u0000applications due to their parallelism and fast memory bandwidth. While GPUs\u0000boost computation, inter-GPU communication can create scalability bottlenecks,\u0000especially as the number of GPUs per node and cluster grows. Traditionally, the\u0000CPU managed multi-GPU communication, but advancements in GPU-centric\u0000communication now challenge this CPU dominance by reducing its involvement,\u0000granting GPUs more autonomy in communication tasks, and addressing mismatches\u0000in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on\u0000vendor mechanisms and user-level library supports. It aims to clarify the\u0000complexities and diverse options in this field, define the terminology, and\u0000categorize existing approaches within and across nodes. The paper discusses\u0000vendor-provided mechanisms for communication and memory management in multi-GPU\u0000execution and reviews major communication libraries, their benefits,\u0000challenges, and performance insights. Then, it explores key research paradigms,\u0000future outlooks, and open research questions. By extensively describing\u0000GPU-centric communication techniques across the software and hardware stacks,\u0000we provide researchers, programmers, engineers, and library designers insights\u0000on how to exploit multi-GPU systems at their best.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study presents the first global analysis of on-demand video streaming over Low Earth Orbit (LEO) satellite networks, using data from over one million households across 85 countries. We highlight Starlink's role as a major LEO provider, enhancing connectivity in underserved regions. Our findings reveal that while overall video quality on Starlink matches that of traditional networks, the inherent variability in LEO conditions -- such as throughput fluctuations and packet loss -- leads to an increase in bitrate switches and rebuffers. To further improve the quality of experience for the LEO community, we manipulate existing congestion control and adaptive bitrate streaming algorithms using simulation and real A/B tests deployed on over one million households. Our results underscore the need for video streaming and congestion control algorithms to adapt to rapidly evolving network landscapes, ensuring high-quality service across diverse and dynamic network types.
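The adaptive bitrate algorithms mentioned above typically map client-side state to a rung of the bitrate ladder. The sketch below shows a minimal buffer-based selection rule in the spirit of such schemes; the reservoir/cushion thresholds and the bitrate ladder are illustrative placeholders, not the algorithms manipulated in the study.

```python
def select_bitrate(buffer_s, ladder_kbps, reservoir_s=5.0, cushion_s=20.0):
    """Buffer-based ABR: map current buffer occupancy to a rung of the bitrate ladder.
    Below the reservoir pick the lowest bitrate; above the cushion pick the highest;
    interpolate linearly in between (in the spirit of buffer-based ABR schemes)."""
    ladder = sorted(ladder_kbps)
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= cushion_s:
        return ladder[-1]
    frac = (buffer_s - reservoir_s) / (cushion_s - reservoir_s)
    return ladder[min(int(frac * len(ladder)), len(ladder) - 1)]

# example: how the rule reacts to a fluctuating, LEO-like buffer level
ladder = [300, 750, 1500, 3000, 6000]   # kbps, placeholder encoding ladder
for buf in [2, 8, 15, 25]:
    print(f"buffer={buf:>2}s -> {select_bitrate(buf, ladder)} kbps")
```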
{"title":"A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink","authors":"Liz Izhikevich, Reese Enghardt, Te-Yuan Huang, Renata Teixeira","doi":"arxiv-2409.09846","DOIUrl":"https://doi.org/arxiv-2409.09846","url":null,"abstract":"This study presents the first global analysis of on-demand video streaming\u0000over Low Earth Orbit (LEO) satellite networks, using data from over one million\u0000households across 85 countries. We highlight Starlink's role as a major LEO\u0000provider, enhancing connectivity in underserved regions. Our findings reveal\u0000that while overall video quality on Starlink matches that of traditional\u0000networks, the inherent variability in LEO conditions -- such as throughput\u0000fluctuations and packet loss -- leads to an increase in bitrate switches and\u0000rebuffers. To further improve the quality of experience for the LEO community,\u0000we manipulate existing congestion control and adaptive bitrate streaming\u0000algorithms using simulation and real A/B tests deployed on over one million\u0000households. Our results underscore the need for video streaming and congestion\u0000control algorithms to adapt to rapidly evolving network landscapes, ensuring\u0000high-quality service across diverse and dynamic network types.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"211 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance of 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
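To make the idea of a fast analytical performance model concrete, the sketch below estimates the cycle count of a GEMM mapped onto a parameterizable systolic array from its tiling. The fill/drain term and all parameters are simplifying assumptions for illustration, not the models generated by the paper's approach.

```python
import math

def systolic_matmul_cycles(M, N, K, rows, cols, fill_drain=True):
    """Estimate cycles for an MxK @ KxN matmul on a rows x cols systolic array.
    The operands are tiled so each tile produces a rows x cols output block;
    filling and draining the array adds roughly rows + cols cycles per tile."""
    tiles_m = math.ceil(M / rows)
    tiles_n = math.ceil(N / cols)
    cycles_per_tile = K + (rows + cols if fill_drain else 0)
    return tiles_m * tiles_n * cycles_per_tile

# example: a 256x256x256 GEMM on a 16x16 array clocked at 1 GHz
cycles = systolic_matmul_cycles(256, 256, 256, rows=16, cols=16)
print(f"~{cycles} cycles, which is {cycles / 1e9 * 1e6:.1f} us at 1 GHz")
```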
{"title":"Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators","authors":"Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann","doi":"arxiv-2409.08595","DOIUrl":"https://doi.org/arxiv-2409.08595","url":null,"abstract":"Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices\u0000is a challenging task that requires tailored hardware accelerator architectures\u0000and a clear understanding of their performance characteristics when executing\u0000the intended AI workload. To facilitate this, we present an automated\u0000generation approach for fast performance models to accurately estimate the\u0000latency of a DNN mapped onto systematically modeled and concisely described\u0000accelerator architectures. Using our accelerator architecture description\u0000method, we modeled representative DNN accelerators such as Gemmini, UltraTrail,\u0000Plasticine-derived, and a parameterizable systolic array. Together with DNN\u0000mappings for those modeled architectures, we perform a combined DNN/hardware\u0000dependency graph analysis, which enables us, in the best case, to evaluate only\u0000154 loop kernel iterations to estimate the performance for 4.19 billion\u0000instructions achieving a significant speedup. We outperform regression and\u0000analytical models in terms of mean absolute percentage error (MAPE) compared to\u0000simulation results, while being several magnitudes faster than an RTL\u0000simulation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno
Closed queuing networks with finite capacity buffers and skip-over policies are fundamental models in the performance evaluation of computer and communication systems. This technical report presents the details of computational algorithms to derive the key performance metrics for such networks. The primary focus is on the efficient computation of the normalization constant, which is critical for determining the steady-state probabilities of the network states under investigation. A convolution algorithm is proposed, which paves the way for the computation of key performance indices, such as queue length distribution and throughput, accommodating the intricacies introduced by finite capacity constraints and skip-over mechanisms. Finally, an extension of the traditional Mean Value Analysis algorithm addressing numerical stability is provided. The approaches discussed here make the investigation of large-scale networks feasible and enable the development of robust implementations of these techniques for practical use.
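For networks without finite-capacity constraints, the classical convolution (Buzen) algorithm computes the normalization constant with a simple recurrence; the report's contribution extends this style of computation to finite buffers with skip-over, which is not shown here. Below is a minimal sketch of the classical baseline with illustrative service demands.

```python
import numpy as np

def buzen_normalization(demands, N):
    """Normalization constants G(0..N) for a closed product-form network with
    single-server, load-independent stations (Buzen's convolution algorithm)."""
    g = np.zeros(N + 1)
    g[0] = 1.0
    for D in demands:              # fold in one station at a time
        for n in range(1, N + 1):  # g_m(n) = g_{m-1}(n) + D_m * g_m(n-1)
            g[n] += D * g[n - 1]
    return g

# example: 3 stations with service demands (visit ratio times mean service time)
demands = [0.4, 0.3, 0.2]
N = 10                             # customer population
G = buzen_normalization(demands, N)
throughput = G[N - 1] / G[N]       # system throughput with N customers
print(f"G(N) = {G[N]:.4e}, throughput X({N}) = {throughput:.3f}")
```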
{"title":"Computational Algorithms for the Product Form Solution of Closed Queuing Networks with Finite Buffers and Skip-Over Policy","authors":"Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno","doi":"arxiv-2409.08075","DOIUrl":"https://doi.org/arxiv-2409.08075","url":null,"abstract":"Closed queuing networks with finite capacity buffers and skip-over policies\u0000are fundamental models in the performance evaluation of computer and\u0000communication systems. This technical report presents the details of\u0000computational algorithms to derive the key performance metrics for such\u0000networks. The primary focus is on the efficient computation of the\u0000normalization constant, which is critical for determining the steady-state\u0000probabilities of the network states under investigation. A convolution\u0000algorithm is proposed, which paves the way for the computation of key\u0000performance indices, such as queue length distribution and throughput,\u0000accommodating the intricacies introduced by finite capacity constraints and\u0000skip-over mechanisms. Finally, an extension of the traditional Mean Value\u0000Analysis algorithm addressing numerical stability is provided. The approaches\u0000discussed here allow make the investigation of large-scale networks feasible\u0000and enable the development of robust implementations of these techniques for\u0000practical use.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work, we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities, strengths, and weaknesses of a single core, we extend our comparison with a variety of microbenchmarks and the capabilities of a full node. The "write-allocate (WA) evasion" feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.
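A back-of-the-envelope model makes the benefit of WA evasion concrete: for a STREAM-triad-like kernel a[i] = b[i] + s*c[i] in double precision, a write-allocate adds an extra 8-byte read of the store target per element, so traffic drops from 32 to 24 bytes per element (25%) when write allocates are avoided. The sketch below only encodes this counting argument with an illustrative array size; it is not a measurement from the paper.

```python
def triad_traffic_bytes(n, dtype_bytes=8, write_allocate=True):
    """Main-memory traffic for a STREAM-triad-like kernel a[i] = b[i] + s*c[i].
    Per element: load b, load c, store a; with write-allocate the cache line
    holding a[i] is additionally read from memory before being written."""
    per_elem = 3 * dtype_bytes + (dtype_bytes if write_allocate else 0)
    return n * per_elem

n = 100_000_000
with_wa = triad_traffic_bytes(n, write_allocate=True)      # 32 bytes per element
without_wa = triad_traffic_bytes(n, write_allocate=False)  # 24 bytes per element
print(f"with WA: {with_wa / 1e9:.1f} GB, with WA evasion or NT stores: {without_wa / 1e9:.1f} GB "
      f"({100 * (1 - without_wa / with_wa):.0f}% less traffic)")
```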
{"title":"Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa","authors":"Jan Laukemann, Georg Hager, Gerhard Wellein","doi":"arxiv-2409.08108","DOIUrl":"https://doi.org/arxiv-2409.08108","url":null,"abstract":"With Nvidia's release of the Grace Superchip, all three big semiconductor\u0000companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for\u0000the best CPU. In this work we analyze the performance of these state-of-the-art\u0000CPUs and create an accurate in-core performance model for their\u0000microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open\u0000Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA.\u0000Starting from the peculiarities and up- and downsides of a single core, we\u0000extend our comparison by a variety of microbenchmarks and the capabilities of a\u0000full node. The \"write-allocate (WA) evasion\" feature, which can automatically\u0000reduce the memory traffic caused by write misses, receives special attention;\u0000we show that the Grace Superchip has a next-to-optimal implementation of WA\u0000evasion, and that the only way to avoid write allocates on Zen 4 is the\u0000explicit use of non-temporal stores.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing
Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models such as Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries for power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy-Efficient Edge Ensembling framework for building ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing the system failure rate by up to 40% while ensuring higher average output quality. Finally, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
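To illustrate what an energy-aware model selection policy can look like, the sketch below greedily picks ensemble members by accuracy gain per millijoule until an energy budget is exhausted. The greedy rule, costs, and gains are illustrative assumptions, not the E-QUARTIC policy or its measured numbers.

```python
def select_ensemble_members(energy_budget_mj, member_costs_mj, member_accuracy_gain):
    """Greedy energy-aware selection: add ensemble members with the best
    accuracy-gain-per-millijoule ratio until the energy budget is exhausted."""
    order = sorted(range(len(member_costs_mj)),
                   key=lambda i: member_accuracy_gain[i] / member_costs_mj[i],
                   reverse=True)
    chosen, remaining = [], energy_budget_mj
    for i in order:
        if member_costs_mj[i] <= remaining:
            chosen.append(i)
            remaining -= member_costs_mj[i]
    return chosen

# illustrative: 4 small CNNs with per-inference energy costs and accuracy contributions
costs = [1.2, 0.8, 0.8, 1.5]          # mJ per inference (placeholder values)
gains = [0.020, 0.018, 0.015, 0.022]  # marginal accuracy contribution (placeholder values)
print(select_ensemble_members(energy_budget_mj=2.5, member_costs_mj=costs,
                              member_accuracy_gain=gains))
```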
{"title":"E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning","authors":"Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing","doi":"arxiv-2409.08369","DOIUrl":"https://doi.org/arxiv-2409.08369","url":null,"abstract":"Ensemble learning is a meta-learning approach that combines the predictions\u0000of multiple learners, demonstrating improved accuracy and robustness.\u0000Nevertheless, ensembling models like Convolutional Neural Networks (CNNs)\u0000result in high memory and computing overhead, preventing their deployment in\u0000embedded systems. These devices are usually equipped with small batteries that\u0000provide power supply and might include energy-harvesting modules that extract\u0000energy from the environment. In this work, we propose E-QUARTIC, a novel Energy\u0000Efficient Edge Ensembling framework to build ensembles of CNNs targeting\u0000Artificial Intelligence (AI)-based embedded systems. Our design outperforms\u0000single-instance CNN baselines and state-of-the-art edge AI solutions, improving\u0000accuracy and adapting to varying energy conditions while maintaining similar\u0000memory requirements. Then, we leverage the multi-CNN structure of the designed\u0000ensemble to implement an energy-aware model selection policy in\u0000energy-harvesting AI systems. We show that our solution outperforms the\u0000state-of-the-art by reducing system failure rate by up to 40% while ensuring\u0000higher average output qualities. Ultimately, we show that the proposed design\u0000enables concurrent on-device training and high-quality inference execution at\u0000the edge, limiting the performance and energy overheads to less than 0.04%.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal ability and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long contexts requires caching massive key and value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference with infinite context on a single GPU. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs, which we call "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM achieves superior streaming reasoning quality compared to existing methods such as StreamingLLM and a 2x speedup over H2O.
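A minimal sketch of a size-constrained KV cache in the spirit described above: keep the most recent tokens plus the earlier tokens with the highest accumulated attention scores. The scoring, window, and limit are illustrative assumptions; this is not the Inf-MLLM implementation and it omits the proposed attention bias.

```python
import numpy as np

def evict_kv_cache(attn_scores, cache_limit, recent_window):
    """Return indices of tokens to keep: the most recent `recent_window` tokens
    plus the highest-scoring earlier tokens, up to `cache_limit` entries total.
    `attn_scores[i]` is an accumulated attention weight for cached token i."""
    n = len(attn_scores)
    recent = list(range(max(0, n - recent_window), n))
    budget = max(cache_limit - len(recent), 0)
    earlier = np.argsort(attn_scores[: max(0, n - recent_window)])[::-1][:budget]
    return sorted(set(earlier.tolist()) | set(recent))

# toy example: 20 cached tokens, keep at most 8 (4 recent + 4 most attended-to)
rng = np.random.default_rng(1)
scores = rng.random(20)
keep = evict_kv_cache(scores, cache_limit=8, recent_window=4)
print("kept token positions:", keep)
```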
{"title":"Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU","authors":"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo","doi":"arxiv-2409.09086","DOIUrl":"https://doi.org/arxiv-2409.09086","url":null,"abstract":"Multimodal Large Language Models (MLLMs) are distinguished by their\u0000multimodal comprehensive ability and widely used in many real-world\u0000applications including GPT-4o, autonomous driving and robotics. Despite their\u0000impressive performance, the multimodal inputs always incur long context. The\u0000inference under long context requires caching massive Key and Value states (KV\u0000cache) of previous tokens, which introduces high latency and excessive memory\u0000consumption. Due to this reason, it is challenging to deploy streaming\u0000inference of MLLMs on edge devices, which largely constrains the power and\u0000usage of MLLMs in real-world applications. In this paper, we introduce\u0000Inf-MLLM, an efficient inference framework for MLLMs, which enable streaming\u0000inference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\u0000our key observation of the attention pattern in both LLMs and MLLMs called\u0000\"attention saddles\". Thanks to the newly discovered attention pattern, Inf-MLLM\u0000maintains a size-constrained KV cache by dynamically caching recent tokens and\u0000relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\u0000approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM\u0000enables multiple LLMs and MLLMs to achieve stable performance over 4M-token\u0000long texts and multi-round conversations with 1-hour-long videos on a single\u0000GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\u0000existing methods such as StreamingLLM and 2x speedup than H2O.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}