Metaheuristic algorithms are essential for solving complex optimization problems in different fields. However, comparing and rating these algorithms remains difficult because of the wide range of performance metrics and problem dimensions usually involved. Moreover, nonparametric statistical methods and post hoc tests are time-consuming, especially when we only need to identify the top performers among many algorithms. The Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank metaheuristic algorithms based on their performance across many criteria and dimensions. HRA employs a hierarchical framework that begins with collecting performance metrics on various benchmark functions and dimensions. Rank-based normalization is applied to each performance measure to ensure comparability, and robust TOPSIS aggregation combines these rankings at several hierarchical levels, resulting in a comprehensive ranking of the algorithms. Our study uses data from the CEC 2017 competition to demonstrate the robustness and efficacy of the HRA framework. It examines 30 benchmark functions and evaluates the performance of 13 metaheuristic algorithms on five performance indicators in four distinct dimensions. This study highlights the potential of HRA to clarify the comparative advantages and disadvantages of the algorithms, simplifying practitioners' choice of the most appropriate algorithm for a given optimization problem.
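As a concrete illustration of the two building blocks named above, the sketch below rank-normalizes a performance matrix and aggregates it with a classic TOPSIS closeness score. It is a minimal sketch assuming lower raw scores are better and equal criterion weights; it uses plain TOPSIS rather than the robust variant, and the toy data are placeholders, not the paper's HRA implementation or CEC 2017 results.

```python
import numpy as np

def rank_normalize(scores):
    """Rank algorithms per criterion (1 = best); lower raw score assumed better."""
    # argsort twice yields 0-based ranks along each column; +1 makes them 1-based
    return np.argsort(np.argsort(scores, axis=0), axis=0) + 1.0

def topsis(matrix, weights=None):
    """Classic TOPSIS closeness coefficients for a cost-type decision matrix."""
    m = np.asarray(matrix, dtype=float)
    w = np.ones(m.shape[1]) / m.shape[1] if weights is None else np.asarray(weights, dtype=float)
    norm = m / np.linalg.norm(m, axis=0)            # vector normalization per criterion
    v = norm * w                                    # weighted normalized matrix
    ideal_best, ideal_worst = v.min(axis=0), v.max(axis=0)  # cost criteria: smaller is better
    d_best = np.linalg.norm(v - ideal_best, axis=1)
    d_worst = np.linalg.norm(v - ideal_worst, axis=1)
    return d_worst / (d_best + d_worst)             # higher = closer to the ideal point

# toy example: 4 algorithms x 3 performance measures (lower raw value = better)
scores = np.array([[0.1, 3.0, 12.0],
                   [0.4, 1.0,  9.0],
                   [0.2, 2.0, 15.0],
                   [0.9, 5.0, 20.0]])
ranks = rank_normalize(scores)
closeness = topsis(ranks)
print("final ordering (best first):", np.argsort(-closeness))
```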
{"title":"HRA: A Multi-Criteria Framework for Ranking Metaheuristic Optimization Algorithms","authors":"Evgenia-Maria K. Goula, Dimitris G. Sotiropoulos","doi":"arxiv-2409.11617","DOIUrl":"https://doi.org/arxiv-2409.11617","url":null,"abstract":"Metaheuristic algorithms are essential for solving complex optimization\u0000problems in different fields. However, the difficulty in comparing and rating\u0000these algorithms remains due to the wide range of performance metrics and\u0000problem dimensions usually involved. On the other hand, nonparametric\u0000statistical methods and post hoc tests are time-consuming, especially when we\u0000only need to identify the top performers among many algorithms. The\u0000Hierarchical Rank Aggregation (HRA) algorithm aims to efficiently rank\u0000metaheuristic algorithms based on their performance across many criteria and\u0000dimensions. The HRA employs a hierarchical framework that begins with\u0000collecting performance metrics on various benchmark functions and dimensions.\u0000Rank-based normalization is employed for each performance measure to ensure\u0000comparability and the robust TOPSIS aggregation is applied to combine these\u0000rankings at several hierarchical levels, resulting in a comprehensive ranking\u0000of the algorithms. Our study uses data from the CEC 2017 competition to\u0000demonstrate the robustness and efficacy of the HRA framework. It examines 30\u0000benchmark functions and evaluates the performance of 13 metaheuristic\u0000algorithms across five performance indicators in four distinct dimensions. This\u0000presentation highlights the potential of the HRA to enhance the interpretation\u0000of the comparative advantages and disadvantages of various algorithms by\u0000simplifying practitioners' choices of the most appropriate algorithm for\u0000certain optimization problems.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen
Graph neural networks (GNNs) are a type of neural network capable of learning on graph-structured data. However, training GNNs on large-scale graphs is challenging due to iterative aggregations of high-dimensional features from neighboring vertices within sparse graph structures, combined with neural network operations. The sparsity of graphs frequently results in suboptimal memory access patterns and longer training time. Graph reordering is an optimization strategy that aims to improve the graph data layout. It has been shown to speed up graph analytics workloads, but its effect on the performance of GNN training has not been investigated yet. Generalizing reordering results to GNN performance is nontrivial, as multiple aspects must be considered: GNN hyper-parameters such as the number of layers, the number of hidden dimensions, and the feature size used in the GNN model; neural network operations; large intermediate vertex states; and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12 reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric and Deep Graph Library. Our results show that graph reordering is effective in reducing training time for both CPU- and GPU-based training. Further, we find that GNN hyper-parameters influence the effectiveness of reordering, that reordering metrics play an important role in selecting a reordering strategy, that lightweight reordering performs better for GPU-based than for CPU-based training, and that the invested reordering time can in many cases be amortized.
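To illustrate what a reordering strategy does to the data layout, the sketch below applies one lightweight strategy (relabeling vertices by descending degree) to a CSR adjacency matrix and permutes the feature matrix consistently. This is a generic sketch, not necessarily one of the 12 strategies evaluated in the paper; the toy graph and feature sizes are placeholders.

```python
import numpy as np
import scipy.sparse as sp

def degree_sort_permutation(adj: sp.csr_matrix) -> np.ndarray:
    """Lightweight reordering: relabel vertices by descending degree."""
    degrees = np.diff(adj.indptr)          # number of stored neighbors per vertex in CSR
    return np.argsort(-degrees)            # new order: high-degree vertices first

def apply_reordering(adj: sp.csr_matrix, features: np.ndarray, perm: np.ndarray):
    """Permute rows/columns of the adjacency and the feature matrix consistently."""
    adj_perm = adj[perm][:, perm].tocsr()  # symmetric permutation of the graph
    feat_perm = features[perm]
    return adj_perm, feat_perm

# toy undirected graph: 5 vertices, a few edges stored in both directions
rows = np.array([0, 1, 1, 2, 3, 4, 2, 0])
cols = np.array([1, 0, 2, 1, 4, 3, 0, 2])
adj = sp.csr_matrix((np.ones_like(rows, dtype=np.float32), (rows, cols)), shape=(5, 5))
features = np.random.rand(5, 16).astype(np.float32)

perm = degree_sort_permutation(adj)
adj_r, feat_r = apply_reordering(adj, features, perm)
# adj_r and feat_r would then be handed to the GNN system (e.g., PyG or DGL) for training
```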
{"title":"Can Graph Reordering Speed Up Graph Neural Network Training? An Experimental Study","authors":"Nikolai Merkel, Pierre Toussing, Ruben Mayer, Hans-Arno Jacobsen","doi":"arxiv-2409.11129","DOIUrl":"https://doi.org/arxiv-2409.11129","url":null,"abstract":"Graph neural networks (GNNs) are a type of neural network capable of learning\u0000on graph-structured data. However, training GNNs on large-scale graphs is\u0000challenging due to iterative aggregations of high-dimensional features from\u0000neighboring vertices within sparse graph structures combined with neural\u0000network operations. The sparsity of graphs frequently results in suboptimal\u0000memory access patterns and longer training time. Graph reordering is an\u0000optimization strategy aiming to improve the graph data layout. It has shown to\u0000be effective to speed up graph analytics workloads, but its effect on the\u0000performance of GNN training has not been investigated yet. The generalization\u0000of reordering to GNN performance is nontrivial, as multiple aspects must be\u0000considered: GNN hyper-parameters such as the number of layers, the number of\u0000hidden dimensions, and the feature size used in the GNN model, neural network\u0000operations, large intermediate vertex states, and GPU acceleration. In our work, we close this gap by performing an empirical evaluation of 12\u0000reordering strategies in two state-of-the-art GNN systems, PyTorch Geometric\u0000and Deep Graph Library. Our results show that graph reordering is effective in\u0000reducing training time for CPU- and GPU-based training, respectively. Further,\u0000we find that GNN hyper-parameters influence the effectiveness of reordering,\u0000that reordering metrics play an important role in selecting a reordering\u0000strategy, that lightweight reordering performs better for GPU-based than for\u0000CPU-based training, and that invested reordering time can in many cases be\u0000amortized.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr
The variety of today's multicore architectures motivates researchers to explore parallel scientific applications on different platforms. Load imbalance is one performance issue that can prevent parallel applications from fully exploiting the computational power of these platforms. Ondes3D is a scientific application for seismic wave simulation used to assess the geological impact of earthquakes. Its parallelism relies on applying a regular domain decomposition to the given geological domain and distributing each sub-domain to an MPI rank. Previous works investigated the significant spatial and temporal imbalance in Ondes3D and suggested new parallelization and load-balancing techniques to minimize it. However, none explored its execution on different architectures. Our paper evaluates the performance of Ondes3D for two earthquake scenarios on eight different multicore architectures, including Intel, AMD, and ARM processors. We measure the load distribution per MPI rank, evaluate the temporal load imbalance, and compare the execution of the application's kernels. Our results show that the temporal load imbalance in Ondes3D depends on the architecture chosen, with some platforms minimizing such imbalance more effectively.
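One common way to quantify temporal load imbalance from per-rank kernel timings is the ratio of the maximum to the mean load per timestep. The sketch below computes this metric on synthetic timings; the metric and data are illustrative assumptions, not necessarily the exact measure or measurements used in the paper.

```python
import numpy as np

def load_imbalance(per_rank_times: np.ndarray) -> np.ndarray:
    """Per-timestep imbalance: max over ranks divided by mean over ranks.
    per_rank_times has shape (timesteps, ranks); 1.0 means perfectly balanced."""
    return per_rank_times.max(axis=1) / per_rank_times.mean(axis=1)

# illustrative data: 100 timesteps, 8 MPI ranks, rank 3 systematically 30% slower
rng = np.random.default_rng(0)
times = rng.normal(1.0, 0.05, size=(100, 8))
times[:, 3] *= 1.3

imb = load_imbalance(times)
print(f"mean temporal imbalance: {imb.mean():.2f}  (1.00 = balanced)")
```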
{"title":"Temporal Load Imbalance on Ondes3D Seismic Simulator for Different Multicore Architectures","authors":"Ana Luisa Veroneze Solórzano, Philippe Olivier Alexandre Navaux, Lucas Mello Schnorr","doi":"arxiv-2409.11392","DOIUrl":"https://doi.org/arxiv-2409.11392","url":null,"abstract":"The variety of today's multicore architectures motivates researchers to\u0000explore parallel scientific applications on different platforms. Load imbalance\u0000is one performance issue that can prejudice parallel applications from\u0000exploiting the computational power of these platforms. Ondes3D is a scientific\u0000application for seismic wave simulation used to assess the geological impact of\u0000earthquakes. Its parallelism relies on applying a regular domain decomposition\u0000in the geological domain provided and distributing each sub-domain to MPI\u0000ranks. Previous works investigate the significant spatial and temporal\u0000imbalance in Ondes3D and suggest new parallelization and load balancing\u0000techniques to minimize them. However, none explored its execution on different\u0000architectures. Our paper evaluates the performance of Ondes3D for two\u0000earthquake scenarios on eight different multicore architectures, including\u0000Intel, AMD, and ARM processors. We measure the load distribution per MPI rank,\u0000evaluate the temporal load imbalance, and compare the execution of the\u0000application's kernels. Our results show that the temporal load imbalance in\u0000Ondes3D depends on the architecture chosen, with some platforms minimizing such\u0000imbalance more effectively.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"10 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now challenge this CPU dominance by reducing its involvement, granting GPUs more autonomy in communication tasks, and addressing mismatches between multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on vendor mechanisms and user-level library support. It aims to clarify the complexities and diverse options in this field, define the terminology, and categorize existing approaches within and across nodes. The paper discusses vendor-provided mechanisms for communication and memory management in multi-GPU execution and reviews major communication libraries, their benefits, challenges, and performance insights. Then, it explores key research paradigms, future outlooks, and open research questions. By extensively describing GPU-centric communication techniques across the software and hardware stacks, we provide researchers, programmers, engineers, and library designers with insights into how best to exploit multi-GPU systems.
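As one concrete example from this landscape, the sketch below uses a user-level library (NCCL, driven through torch.distributed) to all-reduce GPU-resident buffers directly between devices, without explicit staging through host memory. It assumes the standard torchrun launch conventions (RANK, WORLD_SIZE, LOCAL_RANK environment variables) and is only an illustration of one point in the design space the paper surveys.

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every spawned process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")   # NCCL provides the GPU-to-GPU data path

    # each rank contributes a GPU-resident tensor; no explicit host copies in user code
    x = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # data moves over NVLink/PCIe/network

    if dist.get_rank() == 0:
        print("all-reduce result (first element):", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
# launch example: torchrun --nproc_per_node=4 allreduce_demo.py
```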
{"title":"The Landscape of GPU-Centric Communication","authors":"Didem Unat, Ilyas Turimbetov, Mohammed Kefah Taha Issa, Doğan Sağbili, Flavio Vella, Daniele De Sensi, Ismayil Ismayilov","doi":"arxiv-2409.09874","DOIUrl":"https://doi.org/arxiv-2409.09874","url":null,"abstract":"n recent years, GPUs have become the preferred accelerators for HPC and ML\u0000applications due to their parallelism and fast memory bandwidth. While GPUs\u0000boost computation, inter-GPU communication can create scalability bottlenecks,\u0000especially as the number of GPUs per node and cluster grows. Traditionally, the\u0000CPU managed multi-GPU communication, but advancements in GPU-centric\u0000communication now challenge this CPU dominance by reducing its involvement,\u0000granting GPUs more autonomy in communication tasks, and addressing mismatches\u0000in multi-GPU communication and computation. This paper provides a landscape of GPU-centric communication, focusing on\u0000vendor mechanisms and user-level library supports. It aims to clarify the\u0000complexities and diverse options in this field, define the terminology, and\u0000categorize existing approaches within and across nodes. The paper discusses\u0000vendor-provided mechanisms for communication and memory management in multi-GPU\u0000execution and reviews major communication libraries, their benefits,\u0000challenges, and performance insights. Then, it explores key research paradigms,\u0000future outlooks, and open research questions. By extensively describing\u0000GPU-centric communication techniques across the software and hardware stacks,\u0000we provide researchers, programmers, engineers, and library designers insights\u0000on how to exploit multi-GPU systems at their best.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study presents the first global analysis of on-demand video streaming over Low Earth Orbit (LEO) satellite networks, using data from over one million households across 85 countries. We highlight Starlink's role as a major LEO provider, enhancing connectivity in underserved regions. Our findings reveal that while overall video quality on Starlink matches that of traditional networks, the inherent variability in LEO conditions -- such as throughput fluctuations and packet loss -- leads to an increase in bitrate switches and rebuffers. To further improve the quality of experience for the LEO community, we manipulate existing congestion control and adaptive bitrate streaming algorithms using simulation and real A/B tests deployed on over one million households. Our results underscore the need for video streaming and congestion control algorithms to adapt to rapidly evolving network landscapes, ensuring high-quality service across diverse and dynamic network types.
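The adaptive bitrate algorithms mentioned above typically map client-side state to a rung of the bitrate ladder. The sketch below shows a minimal buffer-based selection rule in the spirit of such schemes; the reservoir/cushion thresholds and the bitrate ladder are illustrative placeholders, not the algorithms manipulated in the study.

```python
def select_bitrate(buffer_s, ladder_kbps, reservoir_s=5.0, cushion_s=20.0):
    """Buffer-based ABR: map current buffer occupancy to a rung of the bitrate ladder.
    Below the reservoir pick the lowest bitrate; above the cushion pick the highest;
    interpolate linearly in between (in the spirit of buffer-based ABR schemes)."""
    ladder = sorted(ladder_kbps)
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= cushion_s:
        return ladder[-1]
    frac = (buffer_s - reservoir_s) / (cushion_s - reservoir_s)
    return ladder[min(int(frac * len(ladder)), len(ladder) - 1)]

# example: how the rule reacts to a fluctuating, LEO-like buffer level
ladder = [300, 750, 1500, 3000, 6000]   # kbps, placeholder encoding ladder
for buf in [2, 8, 15, 25]:
    print(f"buffer={buf:>2}s -> {select_bitrate(buf, ladder)} kbps")
```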
{"title":"A Global Perspective on the Past, Present, and Future of Video Streaming over Starlink","authors":"Liz Izhikevich, Reese Enghardt, Te-Yuan Huang, Renata Teixeira","doi":"arxiv-2409.09846","DOIUrl":"https://doi.org/arxiv-2409.09846","url":null,"abstract":"This study presents the first global analysis of on-demand video streaming\u0000over Low Earth Orbit (LEO) satellite networks, using data from over one million\u0000households across 85 countries. We highlight Starlink's role as a major LEO\u0000provider, enhancing connectivity in underserved regions. Our findings reveal\u0000that while overall video quality on Starlink matches that of traditional\u0000networks, the inherent variability in LEO conditions -- such as throughput\u0000fluctuations and packet loss -- leads to an increase in bitrate switches and\u0000rebuffers. To further improve the quality of experience for the LEO community,\u0000we manipulate existing congestion control and adaptive bitrate streaming\u0000algorithms using simulation and real A/B tests deployed on over one million\u0000households. Our results underscore the need for video streaming and congestion\u0000control algorithms to adapt to rapidly evolving network landscapes, ensuring\u0000high-quality service across diverse and dynamic network types.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"211 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance of 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
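To make the idea of a fast analytical performance model concrete, the sketch below estimates the cycle count of a GEMM mapped onto a parameterizable systolic array from its tiling. The fill/drain term and all parameters are simplifying assumptions for illustration, not the models generated by the paper's approach.

```python
import math

def systolic_matmul_cycles(M, N, K, rows, cols, fill_drain=True):
    """Estimate cycles for an MxK @ KxN matmul on a rows x cols systolic array.
    The operands are tiled so each tile produces a rows x cols output block;
    filling and draining the array adds roughly rows + cols cycles per tile."""
    tiles_m = math.ceil(M / rows)
    tiles_n = math.ceil(N / cols)
    cycles_per_tile = K + (rows + cols if fill_drain else 0)
    return tiles_m * tiles_n * cycles_per_tile

# example: a 256x256x256 GEMM on a 16x16 array clocked at 1 GHz
cycles = systolic_matmul_cycles(256, 256, 256, rows=16, cols=16)
print(f"~{cycles} cycles, which is {cycles / 1e9 * 1e6:.1f} us at 1 GHz")
```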
{"title":"Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators","authors":"Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann","doi":"arxiv-2409.08595","DOIUrl":"https://doi.org/arxiv-2409.08595","url":null,"abstract":"Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices\u0000is a challenging task that requires tailored hardware accelerator architectures\u0000and a clear understanding of their performance characteristics when executing\u0000the intended AI workload. To facilitate this, we present an automated\u0000generation approach for fast performance models to accurately estimate the\u0000latency of a DNN mapped onto systematically modeled and concisely described\u0000accelerator architectures. Using our accelerator architecture description\u0000method, we modeled representative DNN accelerators such as Gemmini, UltraTrail,\u0000Plasticine-derived, and a parameterizable systolic array. Together with DNN\u0000mappings for those modeled architectures, we perform a combined DNN/hardware\u0000dependency graph analysis, which enables us, in the best case, to evaluate only\u0000154 loop kernel iterations to estimate the performance for 4.19 billion\u0000instructions achieving a significant speedup. We outperform regression and\u0000analytical models in terms of mean absolute percentage error (MAPE) compared to\u0000simulation results, while being several magnitudes faster than an RTL\u0000simulation.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"75 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno
Closed queuing networks with finite capacity buffers and skip-over policies are fundamental models in the performance evaluation of computer and communication systems. This technical report presents the details of computational algorithms to derive the key performance metrics for such networks. The primary focus is on the efficient computation of the normalization constant, which is critical for determining the steady-state probabilities of the network states under investigation. A convolution algorithm is proposed, which paves the way for the computation of key performance indices, such as queue length distribution and throughput, accommodating the intricacies introduced by finite capacity constraints and skip-over mechanisms. Finally, an extension of the traditional Mean Value Analysis algorithm addressing numerical stability is provided. The approaches discussed here make the investigation of large-scale networks feasible and enable the development of robust implementations of these techniques for practical use.
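For networks without finite-capacity constraints, the classical convolution (Buzen) algorithm computes the normalization constant with a simple recurrence; the report's contribution extends this style of computation to finite buffers with skip-over, which is not shown here. Below is a minimal sketch of the classical baseline with illustrative service demands.

```python
import numpy as np

def buzen_normalization(demands, N):
    """Normalization constants G(0..N) for a closed product-form network with
    single-server, load-independent stations (Buzen's convolution algorithm)."""
    g = np.zeros(N + 1)
    g[0] = 1.0
    for D in demands:              # fold in one station at a time
        for n in range(1, N + 1):  # g_m(n) = g_{m-1}(n) + D_m * g_m(n-1)
            g[n] += D * g[n - 1]
    return g

# example: 3 stations with service demands (visit ratio times mean service time)
demands = [0.4, 0.3, 0.2]
N = 10                             # customer population
G = buzen_normalization(demands, N)
throughput = G[N - 1] / G[N]       # system throughput with N customers
print(f"G(N) = {G[N]:.4e}, throughput X({N}) = {throughput:.3f}")
```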
{"title":"Computational Algorithms for the Product Form Solution of Closed Queuing Networks with Finite Buffers and Skip-Over Policy","authors":"Gianfranco Balbo, Andrea Marin, Diletta Olliaro, Matteo Sereno","doi":"arxiv-2409.08075","DOIUrl":"https://doi.org/arxiv-2409.08075","url":null,"abstract":"Closed queuing networks with finite capacity buffers and skip-over policies\u0000are fundamental models in the performance evaluation of computer and\u0000communication systems. This technical report presents the details of\u0000computational algorithms to derive the key performance metrics for such\u0000networks. The primary focus is on the efficient computation of the\u0000normalization constant, which is critical for determining the steady-state\u0000probabilities of the network states under investigation. A convolution\u0000algorithm is proposed, which paves the way for the computation of key\u0000performance indices, such as queue length distribution and throughput,\u0000accommodating the intricacies introduced by finite capacity constraints and\u0000skip-over mechanisms. Finally, an extension of the traditional Mean Value\u0000Analysis algorithm addressing numerical stability is provided. The approaches\u0000discussed here allow make the investigation of large-scale networks feasible\u0000and enable the development of robust implementations of these techniques for\u0000practical use.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"84 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142195484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With Nvidia's release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for the best CPU. In this work, we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities, strengths, and weaknesses of a single core, we extend our comparison with a variety of microbenchmarks and the capabilities of a full node. The "write-allocate (WA) evasion" feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.
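A back-of-the-envelope model makes the benefit of WA evasion concrete: for a STREAM-triad-like kernel a[i] = b[i] + s*c[i] in double precision, a write-allocate adds an extra 8-byte read of the store target per element, so traffic drops from 32 to 24 bytes per element (25%) when write allocates are avoided. The sketch below only encodes this counting argument with an illustrative array size; it is not a measurement from the paper.

```python
def triad_traffic_bytes(n, dtype_bytes=8, write_allocate=True):
    """Main-memory traffic for a STREAM-triad-like kernel a[i] = b[i] + s*c[i].
    Per element: load b, load c, store a; with write-allocate the cache line
    holding a[i] is additionally read from memory before being written."""
    per_elem = 3 * dtype_bytes + (dtype_bytes if write_allocate else 0)
    return n * per_elem

n = 100_000_000
with_wa = triad_traffic_bytes(n, write_allocate=True)      # 32 bytes per element
without_wa = triad_traffic_bytes(n, write_allocate=False)  # 24 bytes per element
print(f"with WA: {with_wa / 1e9:.1f} GB, with WA evasion or NT stores: {without_wa / 1e9:.1f} GB "
      f"({100 * (1 - without_wa / with_wa):.0f}% less traffic)")
```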
{"title":"Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa","authors":"Jan Laukemann, Georg Hager, Gerhard Wellein","doi":"arxiv-2409.08108","DOIUrl":"https://doi.org/arxiv-2409.08108","url":null,"abstract":"With Nvidia's release of the Grace Superchip, all three big semiconductor\u0000companies in HPC (AMD, Intel, Nvidia) are currently competing in the race for\u0000the best CPU. In this work we analyze the performance of these state-of-the-art\u0000CPUs and create an accurate in-core performance model for their\u0000microarchitectures Zen 4, Golden Cove, and Neoverse V2, extending the Open\u0000Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA.\u0000Starting from the peculiarities and up- and downsides of a single core, we\u0000extend our comparison by a variety of microbenchmarks and the capabilities of a\u0000full node. The \"write-allocate (WA) evasion\" feature, which can automatically\u0000reduce the memory traffic caused by write misses, receives special attention;\u0000we show that the Grace Superchip has a next-to-optimal implementation of WA\u0000evasion, and that the only way to avoid write allocates on Zen 4 is the\u0000explicit use of non-temporal stores.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142225117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing
Ensemble learning is a meta-learning approach that combines the predictions of multiple learners, demonstrating improved accuracy and robustness. Nevertheless, ensembling models such as Convolutional Neural Networks (CNNs) results in high memory and computing overhead, preventing their deployment in embedded systems. These devices are usually equipped with small batteries for power supply and might include energy-harvesting modules that extract energy from the environment. In this work, we propose E-QUARTIC, a novel Energy-Efficient Edge Ensembling framework for building ensembles of CNNs targeting Artificial Intelligence (AI)-based embedded systems. Our design outperforms single-instance CNN baselines and state-of-the-art edge AI solutions, improving accuracy and adapting to varying energy conditions while maintaining similar memory requirements. We then leverage the multi-CNN structure of the designed ensemble to implement an energy-aware model selection policy in energy-harvesting AI systems. We show that our solution outperforms the state-of-the-art by reducing the system failure rate by up to 40% while ensuring higher average output quality. Finally, we show that the proposed design enables concurrent on-device training and high-quality inference execution at the edge, limiting the performance and energy overheads to less than 0.04%.
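To illustrate what an energy-aware model selection policy can look like, the sketch below greedily picks ensemble members by accuracy gain per millijoule until an energy budget is exhausted. The greedy rule, costs, and gains are illustrative assumptions, not the E-QUARTIC policy or its measured numbers.

```python
def select_ensemble_members(energy_budget_mj, member_costs_mj, member_accuracy_gain):
    """Greedy energy-aware selection: add ensemble members with the best
    accuracy-gain-per-millijoule ratio until the energy budget is exhausted."""
    order = sorted(range(len(member_costs_mj)),
                   key=lambda i: member_accuracy_gain[i] / member_costs_mj[i],
                   reverse=True)
    chosen, remaining = [], energy_budget_mj
    for i in order:
        if member_costs_mj[i] <= remaining:
            chosen.append(i)
            remaining -= member_costs_mj[i]
    return chosen

# illustrative: 4 small CNNs with per-inference energy costs and accuracy contributions
costs = [1.2, 0.8, 0.8, 1.5]          # mJ per inference (placeholder values)
gains = [0.020, 0.018, 0.015, 0.022]  # marginal accuracy contribution (placeholder values)
print(select_ensemble_members(energy_budget_mj=2.5, member_costs_mj=costs,
                              member_accuracy_gain=gains))
```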
{"title":"E-QUARTIC: Energy Efficient Edge Ensemble of Convolutional Neural Networks for Resource-Optimized Learning","authors":"Le Zhang, Onat Gungor, Flavio Ponzina, Tajana Rosing","doi":"arxiv-2409.08369","DOIUrl":"https://doi.org/arxiv-2409.08369","url":null,"abstract":"Ensemble learning is a meta-learning approach that combines the predictions\u0000of multiple learners, demonstrating improved accuracy and robustness.\u0000Nevertheless, ensembling models like Convolutional Neural Networks (CNNs)\u0000result in high memory and computing overhead, preventing their deployment in\u0000embedded systems. These devices are usually equipped with small batteries that\u0000provide power supply and might include energy-harvesting modules that extract\u0000energy from the environment. In this work, we propose E-QUARTIC, a novel Energy\u0000Efficient Edge Ensembling framework to build ensembles of CNNs targeting\u0000Artificial Intelligence (AI)-based embedded systems. Our design outperforms\u0000single-instance CNN baselines and state-of-the-art edge AI solutions, improving\u0000accuracy and adapting to varying energy conditions while maintaining similar\u0000memory requirements. Then, we leverage the multi-CNN structure of the designed\u0000ensemble to implement an energy-aware model selection policy in\u0000energy-harvesting AI systems. We show that our solution outperforms the\u0000state-of-the-art by reducing system failure rate by up to 40% while ensuring\u0000higher average output qualities. Ultimately, we show that the proposed design\u0000enables concurrent on-device training and high-quality inference execution at\u0000the edge, limiting the performance and energy overheads to less than 0.04%.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Large Language Models (MLLMs) are distinguished by their comprehensive multimodal ability and are widely used in many real-world applications, including GPT-4o, autonomous driving, and robotics. Despite their impressive performance, multimodal inputs always incur long contexts. Inference under long contexts requires caching massive key and value states (KV cache) of previous tokens, which introduces high latency and excessive memory consumption. For this reason, it is challenging to deploy streaming inference of MLLMs on edge devices, which largely constrains the power and usage of MLLMs in real-world applications. In this paper, we introduce Inf-MLLM, an efficient inference framework for MLLMs that enables streaming inference with infinite context on a single GPU. Inf-MLLM is based on our key observation of the attention pattern in both LLMs and MLLMs, which we call "attention saddles". Thanks to the newly discovered attention pattern, Inf-MLLM maintains a size-constrained KV cache by dynamically caching recent tokens and relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel approach that enables MLLMs to capture long-term dependencies. We show that Inf-MLLM enables multiple LLMs and MLLMs to achieve stable performance over 4M-token-long texts and multi-round conversations with 1-hour-long videos on a single GPU. In addition, Inf-MLLM achieves superior streaming reasoning quality compared to existing methods such as StreamingLLM and a 2x speedup over H2O.
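A minimal sketch of a size-constrained KV cache in the spirit described above: keep the most recent tokens plus the earlier tokens with the highest accumulated attention scores. The scoring, window, and limit are illustrative assumptions; this is not the Inf-MLLM implementation and it omits the proposed attention bias.

```python
import numpy as np

def evict_kv_cache(attn_scores, cache_limit, recent_window):
    """Return indices of tokens to keep: the most recent `recent_window` tokens
    plus the highest-scoring earlier tokens, up to `cache_limit` entries total.
    `attn_scores[i]` is an accumulated attention weight for cached token i."""
    n = len(attn_scores)
    recent = list(range(max(0, n - recent_window), n))
    budget = max(cache_limit - len(recent), 0)
    earlier = np.argsort(attn_scores[: max(0, n - recent_window)])[::-1][:budget]
    return sorted(set(earlier.tolist()) | set(recent))

# toy example: 20 cached tokens, keep at most 8 (4 recent + 4 most attended-to)
rng = np.random.default_rng(1)
scores = rng.random(20)
keep = evict_kv_cache(scores, cache_limit=8, recent_window=4)
print("kept token positions:", keep)
```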
{"title":"Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU","authors":"Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo","doi":"arxiv-2409.09086","DOIUrl":"https://doi.org/arxiv-2409.09086","url":null,"abstract":"Multimodal Large Language Models (MLLMs) are distinguished by their\u0000multimodal comprehensive ability and widely used in many real-world\u0000applications including GPT-4o, autonomous driving and robotics. Despite their\u0000impressive performance, the multimodal inputs always incur long context. The\u0000inference under long context requires caching massive Key and Value states (KV\u0000cache) of previous tokens, which introduces high latency and excessive memory\u0000consumption. Due to this reason, it is challenging to deploy streaming\u0000inference of MLLMs on edge devices, which largely constrains the power and\u0000usage of MLLMs in real-world applications. In this paper, we introduce\u0000Inf-MLLM, an efficient inference framework for MLLMs, which enable streaming\u0000inference of MLLM on a single GPU with infinite context. Inf-MLLM is based on\u0000our key observation of the attention pattern in both LLMs and MLLMs called\u0000\"attention saddles\". Thanks to the newly discovered attention pattern, Inf-MLLM\u0000maintains a size-constrained KV cache by dynamically caching recent tokens and\u0000relevant tokens. Furthermore, Inf-MLLM proposes attention bias, a novel\u0000approach to enable MLLMs to capture long-term dependency. We show that Inf-MLLM\u0000enables multiple LLMs and MLLMs to achieve stable performance over 4M-token\u0000long texts and multi-round conversations with 1-hour-long videos on a single\u0000GPU. In addition, Inf-MLLM exhibits superior streaming reasoning quality than\u0000existing methods such as StreamingLLM and 2x speedup than H2O.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"42 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142256006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}