Metaheuristic algorithms are essential for solving complex optimization problems in different fields. However, the difficulty in comparing and rating these algorithms remains due to the wide range of performance metrics and problem dimensions usually involved. On the other hand, nonparametric statistical methods and post hoc tests are time-consuming, especially when we only need to identify the top performers among many algorithms. The Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank metaheuristic algorithms based on their performance across many criteria and dimensions. HRA employs a hierarchical framework that begins with collecting performance metrics on various benchmark functions and dimensions. Rank-based normalization is applied to each performance measure to ensure comparability, and robust TOPSIS aggregation is used to combine these rankings at several hierarchical levels, resulting in a comprehensive ranking of the algorithms. Our study uses data from the CEC 2017 competition to demonstrate the robustness and efficacy of the HRA framework. It examines 30 benchmark functions and evaluates the performance of 13 metaheuristic algorithms across five performance indicators in four distinct dimensions. This presentation highlights the potential of HRA to enhance the interpretation of the comparative advantages and disadvantages of various algorithms, simplifying practitioners' choice of the most appropriate algorithm for certain optimization problems.
{"title":"HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms","authors":"Evgenia-Maria K. Goula, Dimitris G. Sotiropoulos","doi":"arxiv-2409.11617","DOIUrl":"https://doi.org/arxiv-2409.11617","url":null,"abstract":"Metaheuristic algorithms are essential for solving complex optimization\u0000problems in different fields. However, the difficulty in comparing and rating\u0000these algorithms remains due to the wide range of performance metrics and\u0000problem dimensions usually involved. On the other hand, nonparametric\u0000statistical methods and post hoc tests are time-consuming, especially when we\u0000only need to identify the top performers among many algorithms. The\u0000Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank\u0000metaheuristic algorithms based on their performance across many criteria and\u0000dimensions. The HRA employs a hierarchical framework that begins with\u0000collecting performance metrics on various benchmark functions and dimensions.\u0000Rank-based normalization is employed for each performance measure to ensure\u0000comparability and the robust TOPSIS aggregation is applied to combine these\u0000rankings at several hierarchical levels, resulting in a comprehensive ranking\u0000of the algorithms. Our study uses data from the CEC 2017 competition to\u0000demonstrate the robustness and efficacy of the HRA framework. It examines 30\u0000benchmark functions and evaluates the performance of 13 metaheuristic\u0000algorithms across five performance indicators in four distinct dimensions. This\u0000presentation highlights the potential of the HRA to enhance the interpretation\u0000of the comparative advantages and disadvantages of various algorithms by\u0000simplifying practitioners' choices of the most appropriate algorithm for\u0000certain optimization problems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen
Graph neural networks (GNNs) are a type of neural network capable of learning on graph-structured data. However, training GNNs on large-scale graphs is challenging due to the iterative aggregation of high-dimensional features from neighboring vertices within sparse graph structures, combined with neural network operations. The sparsity of graphs frequently results in suboptimal memory access patterns and longer training times. Graph reordering is an optimization strategy that aims to improve the graph data layout. It has been shown to be effective in speeding up graph analytics workloads, but its effect on the performance of GNN training has not been investigated yet. Generalizing reordering results to GNN performance is nontrivial, as multiple aspects must be considered: GNN hyper-parameters such as the number of layers, the number of hidden dimensions, and the feature size used in the GNN model; neural network operations; large intermediate vertex states; and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12 reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric and Deep Graph Library. Our results show that graph reordering is effective in reducing training time for both CPU- and GPU-based training. Further, we find that GNN hyper-parameters influence the effectiveness of reordering, that reordering metrics play an important role in selecting a reordering strategy, that lightweight reordering performs better for GPU-based than for CPU-based training, and that the invested reordering time can in many cases be amortized.
{"title":"Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study","authors":"Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen","doi":"arxiv-2409.11129","DOIUrl":"https://doi.org/arxiv-2409.11129","url":null,"abstract":"Graph neural networks (GNNs) are a type of neural network capable of learning\u0000on graph-structured data. However, training GNNs on large-scale graphs is\u0000challenging due to iterative aggregations of high-dimensional features from\u0000neighboring vertices within sparse graph structures combined with neural\u0000network operations. The sparsity of graphs frequently results in suboptimal\u0000memory access patterns and longer training time. Graph reordering is an\u0000optimization strategy aiming to improve the graph data layout. It has shown to\u0000be effective to speed up graph analytics workloads, but its effect on the\u0000performance of GNN training has not been investigated yet. The generalization\u0000of reordering to GNN performance is nontrivial, as multiple aspects must be\u0000considered: GNN hyper-parameters such as the number of layers, the number of\u0000hidden dimensions, and the feature size used in the GNN model, neural network\u0000operations, large intermediate vertex states, and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12\u0000reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric\u0000and Deep Graph Library. Our results show that graph reordering is effective in\u0000reducing training time for CPU- and GPU-based training, respectively. Further,\u0000we find that GNN hyper-parameters influence the effectiveness of reordering,\u0000that reordering metrics play an important role in selecting a reordering\u0000strategy, that lightweight reordering performs better for GPU-based than for\u0000CPU-based training, and that invested reordering time can in many cases be\u0000amortized.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr
The variety of today's multicore architectures motivates researchers to explore parallel scientific applications on different platforms. Load imbalance is one performance issue that can prevent parallel applications from fully exploiting the computational power of these platforms. Ondes3D is a scientific application for seismic wave simulation used to assess the geological impact of earthquakes. Its parallelism relies on applying a regular domain decomposition to the geological domain provided and distributing each sub-domain to MPI ranks. Previous works investigated the significant spatial and temporal imbalance in Ondes3D and suggested new parallelization and load balancing techniques to minimize them. However, none explored its execution on different architectures. Our paper evaluates the performance of Ondes3D for two earthquake scenarios on eight different multicore architectures, including Intel, AMD, and ARM processors. We measure the load distribution per MPI rank, evaluate the temporal load imbalance, and compare the execution of the application's kernels. Our results show that the temporal load imbalance in Ondes3D depends on the architecture chosen, with some platforms minimizing such imbalance more effectively.
{"title":"Temporal Load Imbalance on Ondes3D Seismic Simulator for Different Multicore Architectures","authors":"Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr","doi":"arxiv-2409.11392","DOIUrl":"https://doi.org/arxiv-2409.11392","url":null,"abstract":"The variety of today's multicore architectures motivates researchers to\u0000explore parallel scientific applications on different platforms. Load imbalance\u0000is one performance issue that can prejudice parallel applications from\u0000exploiting the computational power of these platforms. Ondes3D is a scientific\u0000application for seismic wave simulation used to assess the geological impact of\u0000earthquakes. Its parallelism relies on applying a regular domain decomposition\u0000in the geological domain provided and distributing each sub-domain to MPI\u0000ranks. Previous works investigate the significant spatial and temporal\u0000imbalance in Ondes3D and suggest new parallelization and load balancing\u0000techniques to minimize them. However, none explored its execution on different\u0000architectures. Our paper evaluates the performance of Ondes3D for two\u0000earthquake scenarios on eight different multicore architectures, including\u0000Intel, AMD, and ARM processors. We measure the load distribution per MPI rank,\u0000evaluate the temporal load imbalance, and compare the execution of the\u0000application's kernels. Our results show that the temporal load imbalance in\u0000Ondes3D depends on the architecture chosen, with some platforms minimizing such\u0000imbalance more effectively.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and per cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. It then explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights on how to exploit multi-GPU systems at their best.
{"title":"The Landscape of GPU-Centric Communication","authors":"Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov","doi":"arxiv-2409.09874","DOIUrl":"https://doi.org/arxiv-2409.09874","url":null,"abstract":"n recent years, GPUs have become the preferred accelerators for HPC and ML\u0000applications due to their parallelism and fast memory bandwidth. While GPUs\u0000boost computation, inter-GPU communication can create scalability bottlenecks,\u0000especially as the number of GPUs per node and cluster grows. Traditionally, the\u0000CPU managed multi-GPU communication, but advancements in GPU-centric\u0000communication now challenge this CPU dominance by reducing its involvement,\u0000granting GPUs more autonomy in communication tasks, and addressing mismatches\u0000in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on\u0000vendor mechanisms and user-level library supports. It aims to clarify the\u0000complexities and diverse options in this field, define the terminology, and\u0000categorize existing approaches within and across nodes. The paper discusses\u0000vendor-provided mechanisms for communication and memory management in multi-GPU\u0000execution and reviews major communication libraries, their benefits,\u0000challenges, and performance insights. Then, it explores key research paradigms,\u0000future outlooks, and open research questions. By extensively describing\u0000GPU-centric communication techniques across the software and hardware stacks,\u0000we provide researchers, programmers, engineers, and library designers insights\u0000on how to exploit multi-GPU systems at their best.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study presents the first global analysis of on-demand video streaming over Low Earth Orbit (LEO) satellite networks, using data from over one million households across 85 countries. We highlight Starlink's role as a major LEO provider, enhancing connectivity in underserved regions. Our findings reveal that while overall video quality on Starlink matches that of traditional networks, the inherent variability in LEO conditions -- such as throughput fluctuations and packet loss -- leads to an increase in bitrate switches and rebuffers. To further improve the quality of experience for the LEO community, we manipulate existing congestion control and adaptive bitrate streaming algorithms using simulation and real A/B tests deployed on over one million households. Our results underscore the need for video streaming and congestion control algorithms to adapt to rapidly evolving network landscapes, ensuring high-quality service across diverse and dynamic network types.
{"title":"A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink","authors":"Liz Izhikevich, Reese Enghardt, Te-Yuan Huang, Renata Teixeira","doi":"arxiv-2409.09846","DOIUrl":"https://doi.org/arxiv-2409.09846","url":null,"abstract":"This study presents the first global analysis of on-demand video streaming\u0000over Low Earth Orbit (LEO) satellite networks, using data from over one million\u0000households across 85 countries. We highlight Starlink's role as a major LEO\u0000provider, enhancing connectivity in underserved regions. Our findings reveal\u0000that while overall video quality on Starlink matches that of traditional\u0000networks, the inherent variability in LEO conditions -- such as throughput\u0000fluctuations and packet loss -- leads to an increase in bitrate switches and\u0000rebuffers. To further improve the quality of experience for the LEO community,\u0000we manipulate existing congestion control and adaptive bitrate streaming\u0000algorithms using simulation and real A/B tests deployed on over one million\u0000households. Our results underscore the need for video streaming and congestion\u0000control algorithms to adapt to rapidly evolving network landscapes, ensuring\u0000high-quality service across diverse and dynamic network types.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"211 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, a Plasticine-derived design, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance of 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
{"title":"Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators","authors":"Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann","doi":"arxiv-2409.08595","DOIUrl":"https://doi.org/arxiv-2409.08595","url":null,"abstract":"Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices\u0000is a challenging task that requires tailored hardware accelerator architectures\u0000and a clear understanding of their performance characteristics when executing\u0000the intended AI workload. To facilitate this, we present an automated\u0000generation approach for fast performance models to accurately estimate the\u0000latency of a DNN mapped onto systematically modeled and concisely described\u0000accelerator architectures. Using our accelerator architecture description\u0000method, we modeled representative DNN accelerators such as Gemmini, UltraTrail,\u0000Plasticine-derived, and a parameterizable systolic array. Together with DNN\u0000mappings for those modeled architectures, we perform a combined DNN/hardware\u0000dependency graph analysis, which enables us, in the best case, to evaluate only\u0000154 loop kernel iterations to estimate the performance for 4.19 billion\u0000instructions achieving a significant speedup. We outperform regression and\u0000analytical models in terms of mean absolute percentage error (MAPE) compared to\u0000simulation results, while being several magnitudes faster than an RTL\u0000simulation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno
Closed queuing networks with finite capacity buffers and skip-over policies are fundamental models in the performance evaluation of computer and communication systems. This technical report presents the details of computational algorithms to derive the key performance metrics for such networks. The primary focus is on the efficient computation of the normalization constant, which is critical for determining the steady-state probabilities of the network states under investigation. A convolution algorithm is proposed, which paves the way for the computation of key performance indices, such as queue length distribution and throughput, accommodating the intricacies introduced by finite capacity constraints and skip-over mechanisms. Finally, an extension of the traditional Mean Value Analysis algorithm addressing numerical stability is provided. The approaches discussed here make the investigation of large-scale networks feasible and enable the development of robust implementations of these techniques for practical use.
{"title":"Computational Algorithms for the Product Form Solution of Closed Queuing Networks with Finite Buffers and Skip-Over Policy","authors":"Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno","doi":"arxiv-2409.08075","DOIUrl":"https://doi.org/arxiv-2409.08075","url":null,"abstract":"Closed queuing networks with finite capacity buffers and skip-over policies\u0000are fundamental models in the performance evaluation of computer and\u0000communication systems. This technical report presents the details of\u0000computational algorithms to derive the key performance metrics for such\u0000networks. The primary focus is on the efficient computation of the\u0000normalization constant, which is critical for determining the steady-state\u0000probabilities of the network states under investigation. A convolution\u0000algorithm is proposed, which paves the way for the computation of key\u0000performance indices, such as queue length distribution and throughput,\u0000accommodating the intricacies introduced by finite capacity constraints and\u0000skip-over mechanisms. Finally, an extension of the traditional Mean Value\u0000Analysis algorithm addressing numerical stability is provided. The approaches\u0000discussed here allow make the investigation of large-scale networks feasible\u0000and enable the development of robust implementations of these techniques for\u0000practical use.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities and up- and downsides of a single core, we extend our comparison by a variety of microbenchmarks and the capabilities of a full node. The "write-allocate (WA) evasion" feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.
{"title":"Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa","authors":"Jan Laukemann, Georg Hager, Gerhard Wellein","doi":"arxiv-2409.08108","DOIUrl":"https://doi.org/arxiv-2409.08108","url":null,"abstract":"With Nvidia's release of the Grace Superchip, all three big semiconductor\u0000companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for\u0000the best CPU. In this work we analyze the performance of these state-of-the-art\u0000CPUs and create an accurate in-core performance model for their\u0000microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open\u0000Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA.\u0000Starting from the peculiarities and up- and downsides of a single core, we\u0000extend our comparison by a variety of microbenchmarks and the capabilities of a\u0000full node. The \"write-allocate (WA) evasion\" feature, which can automatically\u0000reduce the memory traffic caused by write misses, receives special attention;\u0000we show that the Grace Superchip has a next-to-optimal implementation of WA\u0000evasion, and that the only way to avoid write allocates on Zen 4 is the\u0000explicit use of non-temporal stores.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing
Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models like Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries that provide the power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy Efficient Edge Ensembling framework to build ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing the system failure rate by up to 40% while ensuring higher average output quality. Ultimately, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
{"title":"E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning","authors":"Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing","doi":"arxiv-2409.08369","DOIUrl":"https://doi.org/arxiv-2409.08369","url":null,"abstract":"Ensemble learning is a meta-learning approach that combines the predictions\u0000of multiple learners, demonstrating improved accuracy and robustness.\u0000Nevertheless, ensembling models like Convolutional Neural Networks (CNNs)\u0000result in high memory and computing overhead, preventing their deployment in\u0000embedded systems. These devices are usually equipped with small batteries that\u0000provide power supply and might include energy-harvesting modules that extract\u0000energy from the environment. In this work, we propose E-QUARTIC, a novel Energy\u0000Efficient Edge Ensembling framework to build ensembles of CNNs targeting\u0000Artificial Intelligence (AI)-based embedded systems. Our design outperforms\u0000single-instance CNN baselines and state-of-the-art edge AI solutions, improving\u0000accuracy and adapting to varying energy conditions while maintaining similar\u0000memory requirements. Then, we leverage the multi-CNN structure of the designed\u0000ensemble to implement an energy-aware model selection policy in\u0000energy-harvesting AI systems. We show that our solution outperforms the\u0000state-of-the-art by reducing system failure rate by up to 40% while ensuring\u0000higher average output qualities. Ultimately, we show that the proposed design\u0000enables concurrent on-device training and high-quality inference execution at\u0000the edge, limiting the performance and energy overheads to less than 0.04%.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal ability and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long contexts requires caching massive Key and Value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference of MLLMs on a single GPU with infinite context. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs, called "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach to enable MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance on 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality compared to existing methods such as StreamingLLM and a 2x speedup over H2O.
{"title":"Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU","authors":"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo","doi":"arxiv-2409.09086","DOIUrl":"https://doi.org/arxiv-2409.09086","url":null,"abstract":"Multimodal Large Language Models (MLLMs) are distinguished by their\u0000multimodal comprehensive ability and widely used in many real-world\u0000applications including GPT-4o, autonomous driving and robotics. Despite their\u0000impressive performance, the multimodal inputs always incur long context. The\u0000inference under long context requires caching massive Key and Value states (KV\u0000cache) of previous tokens, which introduces high latency and excessive memory\u0000consumption. Due to this reason, it is challenging to deploy streaming\u0000inference of MLLMs on edge devices, which largely constrains the power and\u0000usage of MLLMs in real-world applications. In this paper, we introduce\u0000Inf-MLLM, an efficient inference framework for MLLMs, which enable streaming\u0000inference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\u0000our key observation of the attention pattern in both LLMs and MLLMs called\u0000\"attention saddles\". Thanks to the newly discovered attention pattern, Inf-MLLM\u0000maintains a size-constrained KV cache by dynamically caching recent tokens and\u0000relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\u0000approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM\u0000enables multiple LLMs and MLLMs to achieve stable performance over 4M-token\u0000long texts and multi-round conversations with 1-hour-long videos on a single\u0000GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\u0000existing methods such as StreamingLLM and 2x speedup than H2O.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}