PageRank is a metric that assigns importance to each vertex of a graph based on its neighbors and their scores. Recently, there has been increasing interest in computing PageRank on dynamic graphs, where the graph structure evolves due to edge insertions and deletions. However, traditional barrier-based approaches for updating PageRanks encounter significant wait times on certain graph structures, leading to high overall runtimes. Additionally, the growing trend of multicore architectures with increased core counts has raised concerns about random thread delays and failures. In this study, we propose a lock-free algorithm for updating PageRank scores on dynamic graphs. First, we introduce our Dynamic Frontier (DF) approach, which identifies and processes, with minimal overhead, the vertices whose PageRanks are likely to change. Subsequently, we integrate DF with our lock-free and fault-tolerant PageRank ($DF_{LF}$), incorporating a helping mechanism among threads between its two phases. Experimental results demonstrate that $DF_{LF}$ not only eliminates waiting times at iteration barriers but also withstands random thread delays and crashes. On average, it is 4.6x faster than lock-free Naive-dynamic PageRank ($ND_{LF}$).
{"title":"Lock-Free Computation of PageRank in Dynamic Graphs","authors":"Subhajit Sahu","doi":"arxiv-2407.19562","DOIUrl":"https://doi.org/arxiv-2407.19562","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-28","publicationTypes":"Journal Article"}
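To make the Dynamic Frontier idea concrete, here is a minimal single-threaded sketch. It is not the paper's lock-free $DF_{LF}$ algorithm: it only illustrates marking the vertices touched by an edge batch as affected and growing that set when a rank changes by more than a tolerance. The function name, the `frontier_tol` parameter, the default values, and the assumption that every vertex has at least one out-edge (for example, a self-loop) are all illustrative choices, not details taken from the abstract.

```python
def dynamic_frontier_pagerank(out_edges, ranks, changed_edges,
                              damping=0.85, tol=1e-10,
                              frontier_tol=1e-6, max_iter=500):
    """Update PageRank after an edge batch, recomputing only affected vertices.

    out_edges:     dict vertex -> list of out-neighbors in the UPDATED graph
                   (assumed non-empty for every vertex, e.g. via self-loops).
    ranks:         dict vertex -> rank from before the update.
    changed_edges: iterable of (u, v) edges that were inserted or deleted.
    """
    n = len(out_edges)

    # Build in-neighbor lists once so each vertex can pull from its sources.
    in_edges = {v: [] for v in out_edges}
    for u, targets in out_edges.items():
        for v in targets:
            in_edges[v].append(u)

    ranks = dict(ranks)  # do not mutate the caller's dictionary

    # Initial frontier: targets of changed edges, plus all out-neighbors of
    # their sources (the source's out-degree changed, so its whole
    # contribution is redistributed).
    affected = set()
    for u, v in changed_edges:
        affected.add(v)
        affected.update(out_edges[u])

    for _ in range(max_iter):
        max_delta = 0.0
        expanded = set()
        for v in sorted(affected):
            pulled = sum(ranks[u] / len(out_edges[u]) for u in in_edges[v])
            new_rank = (1.0 - damping) / n + damping * pulled
            delta = abs(new_rank - ranks[v])
            ranks[v] = new_rank  # in-place update, no iteration barrier
            max_delta = max(max_delta, delta)
            # Grow the frontier only where the change is significant.
            if delta > frontier_tol:
                expanded.update(out_edges[v])
        affected |= expanded
        if max_delta <= tol:
            break
    return ranks
```

Passing every edge of the graph as `changed_edges` marks all vertices as affected, which reduces the sketch to an ordinary full recomputation and gives a convenient reference for checking an incremental update.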
Ryan Barron (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Maksim E. Eren (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Manish Bhattarai (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Ismael Boureima (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Cynthia Matuszek (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Boian S. Alexandrov (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA)
In several Machine Learning (ML) clustering and dimensionality reduction approaches, such as non-negative matrix factorization (NMF), RESCAL, and K-Means clustering, users must select a hyper-parameter k to define the number of clusters or components that yield an ideal separation of samples or clean clusters. This selection, while difficult, is crucial to avoid overfitting or underfitting the data. Several ML applications use scoring methods (e.g., Silhouette and Davies-Bouldin scores) to evaluate the cluster pattern stability for a specific k. The score is calculated for different trials over a range of k, and the ideal k is heuristically selected as the value before the model starts overfitting, indicated by a drop or increase in the score resembling an elbow curve plot. While the grid-search method can be used to accurately find a good k value, visiting a range of k can become time-consuming and computationally resource-intensive. In this paper, we introduce the Binary Bleed method based on binary search, which significantly reduces the k search space for these grid-search ML algorithms by truncating the target k values from the search space using a heuristic with thresholding over the scores. Binary Bleed is designed to work with single-node serial, single-node multi-processing, and distributed computing resources. In our experiments, we demonstrate the reduced search space gain over a naive sequential search of the ideal k and the accuracy of Binary Bleed in identifying the correct k for NMFk, K-Means pyDNMFk, and pyDRESCALk with Silhouette and Davies-Bouldin scores. We make our implementation of Binary Bleed for the NMF algorithm available on GitHub.
{"title":"Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection","authors":"Ryan Barron, Maksim E. Eren, Manish Bhattarai, Ismael Boureima, Cynthia Matuszek, Boian S. Alexandrov","doi":"arxiv-2407.19125","DOIUrl":"https://doi.org/arxiv-2407.19125","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-26","publicationTypes":"Journal Article"}
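The pruning idea behind Binary Bleed can be illustrated with a short serial sketch. It is written from the abstract alone and is not the authors' GitHub implementation: candidates are visited in binary-search order, a score at or above a selection threshold discards all smaller k, and, as an assumed early-stop rule, a score at or below a stop threshold discards all larger k. The names `binary_bleed`, `select_threshold`, and `stop_threshold` are hypothetical, and the sketch assumes a higher score is better, as with the Silhouette score.

```python
def binary_bleed(ks, score_fn, select_threshold, stop_threshold=None):
    """Search for the largest k whose score reaches `select_threshold`.

    Returns (best_k, visited), where `visited` lists the k values that were
    actually scored, in the order they were evaluated.
    """
    ks = sorted(ks)
    best = None
    visited = []
    floor, ceil = 0, len(ks) - 1      # index window that is still alive
    stack = [(0, len(ks) - 1)]        # index ranges waiting to be explored

    while stack:
        lo, hi = stack.pop()
        # Clip the range to the live window; pruned candidates are skipped.
        lo, hi = max(lo, floor), min(hi, ceil)
        if lo > hi:
            continue
        mid = (lo + hi) // 2
        score = score_fn(ks[mid])
        visited.append(ks[mid])

        if score >= select_threshold:
            best = ks[mid]
            floor = mid + 1           # prune every smaller k
        elif stop_threshold is not None and score <= stop_threshold:
            ceil = mid - 1            # assumed early stop: prune larger k

        # Push the upper half first so the lower half is explored next.
        stack.append((mid + 1, hi))
        stack.append((lo, mid - 1))

    return best, visited


def toy_score(k):
    """Hypothetical score: clusters are stable up to k = 7, then collapse."""
    return 0.9 if k <= 7 else 0.1


best_k, visited_ks = binary_bleed(range(1, 31), toy_score,
                                  select_threshold=0.5, stop_threshold=0.2)
print(best_k, len(visited_ks))
```

On this toy score the search evaluates only a handful of the 30 candidates. Without the stop threshold, every k above the selected one would still be visited, which is why the second threshold matters for the search-space reduction in this sketch.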
Federated Learning (FL) has emerged as a transformative approach for enabling distributed machine learning while preserving user privacy, yet it faces challenges like communication inefficiencies and reliance on centralized infrastructures, leading to increased latency and costs. This paper presents a novel FL methodology that overcomes these limitations by eliminating the dependency on edge servers, employing a server-assisted Proximity Evaluation for dynamic cluster formation based on data similarity, performance indices, and geographical proximity. Our integrated approach enhances operational efficiency and scalability through a Hybrid Decentralized Aggregation Protocol, which merges local model training with peer-to-peer weight exchange and a centralized final aggregation managed by a dynamically elected driver node, significantly curtailing global communication overhead. Additionally, the methodology includes Decentralized Driver Selection, Check-pointing to reduce network traffic, and a Health Status Verification Mechanism for system robustness. Validated using the breast cancer dataset, our architecture not only demonstrates a nearly tenfold reduction in communication overhead but also shows remarkable improvements in reducing training latency and energy consumption while maintaining high learning performance, offering a scalable, efficient, and privacy-preserving solution for the future of federated learning ecosystems.
{"title":"SCALE: Self-regulated Clustered federAted LEarning in a Homogeneous Environment","authors":"Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder, Zahidur Talukder, Syed Bahauddin","doi":"arxiv-2407.18387","DOIUrl":"https://doi.org/arxiv-2407.18387","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-25","publicationTypes":"Journal Article"}
The design of energy-efficient, high-performance, and reliable Convolutional Neural Network (CNN) accelerators involves significant challenges due to complex power and thermal management issues. This paper introduces SAfEPaTh, a novel system-level approach for accurately estimating power and temperature in tile-based CNN accelerators. By addressing both steady-state and transient-state scenarios, SAfEPaTh effectively captures the dynamic effects of pipeline bubbles in interlayer pipelines, utilizing real CNN workloads for comprehensive evaluation. Unlike traditional methods, it eliminates the need for circuit-level simulations or on-chip measurements. Our methodology leverages TANIA, a cutting-edge hybrid digital-analog tile-based accelerator featuring analog-in-memory computing cores alongside digital cores. Through rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's capability to accurately estimate power and temperature within 500 seconds, encompassing CNN model accelerator mapping exploration and detailed power and thermal estimations. This efficiency and accuracy make SAfEPaTh an invaluable tool for designers, enabling them to optimize performance while adhering to stringent power and thermal constraints. Furthermore, SAfEPaTh's adaptability extends its utility across various CNN models and accelerator architectures, underscoring its broad applicability in the field. This study contributes significantly to the advancement of energy-efficient and reliable CNN accelerator designs, addressing critical challenges in dynamic power and thermal management.
{"title":"SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator","authors":"Yukai Chen, Simei Yang, Debjyoti Bhattacharjee, Francky Catthoor, Arindam Mallik","doi":"arxiv-2407.17623","DOIUrl":"https://doi.org/arxiv-2407.17623","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-24","publicationTypes":"Journal Article"}
This paper explores the adaptation of Transformer-based models for edge devices through the quantisation and hardware acceleration of the ARM Keyword Transformer (KWT) model on a RISC-V platform. The model was targeted to run in 64 kB of RAM in bare-metal C using a custom-developed edge AI library. KWT-1 was retrained to be 369 times smaller, with only a 10% loss in accuracy, by reducing the output classes from 35 to 2. The retraining and quantisation reduced the model size from 2.42 MB to 1.65 kB. The integration of custom RISC-V instructions that accelerate GELU and SoftMax operations enabled a 5x speedup, and thus a ~5x power reduction in inference, with inference clock cycle counts decreasing from 26 million to 5.5 million while incurring a small area overhead of approximately 29%. The results demonstrate a viable method for porting and accelerating Transformer-based models in low-power IoT devices.
{"title":"KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer","authors":"Aness Al-Qawlaq, Ajay Kumar M, Deepu John","doi":"arxiv-2407.16026","DOIUrl":"https://doi.org/arxiv-2407.16026","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-22","publicationTypes":"Journal Article"}
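The figures quoted in the KWT-Tiny abstract can be cross-checked with a few lines of arithmetic. The snippet below uses only numbers from the abstract and assumes decimal units (1 MB = 10^6 bytes, 1 kB = 10^3 bytes); the variable names are illustrative.

```python
# Model size before and after retraining plus quantisation (from the abstract).
size_before_bytes = 2.42e6   # 2.42 MB, assuming 1 MB = 10**6 bytes
size_after_bytes = 1.65e3    # 1.65 kB, assuming 1 kB = 10**3 bytes
total_size_ratio = size_before_bytes / size_after_bytes

# The abstract attributes a 369x reduction to retraining alone, so the
# remaining factor is what quantisation would have to contribute.
retraining_ratio = 369
residual_ratio = total_size_ratio / retraining_ratio

# Inference cycle counts before and after the custom RISC-V instructions.
cycles_before = 26e6
cycles_after = 5.5e6
speedup = cycles_before / cycles_after

print(f"total size reduction: {total_size_ratio:.0f}x")
print(f"residual factor after retraining: {residual_ratio:.2f}x")
print(f"cycle-count speedup: {speedup:.2f}x")
```

The total size reduction comes to about 1467x, so the 369x figure covers retraining only. The residual factor of about 4 would be consistent with, for example, a move from 32-bit to 8-bit weights, although the abstract does not state the bit widths. The cycle counts give a speedup of about 4.7x, which the abstract rounds to 5x.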
Behavioural simulation is deployed in the CAD flow to verify the functional correctness of a Register Transfer Level (RTL) design. Metadata extracted from behavioural simulation could be used to optimise and/or speed up subsequent steps in the hardware design flow. In this paper, we propose Simopt, a tool flow that extracts simulation metadata to improve the timing performance of the design by introducing latency awareness during the placement phase and subsequently improving the routing time of the post-placed netlist using vendor tools. For our experiments, we adapt the open-source Yosys flow to perform Simopt-aware placement. Our results show that using the Simopt pass in the design implementation flow reduces the latency of the design by up to 38.2%.
{"title":"Simopt -- Simulation pass for Speculative Optimisation of FPGA-CAD flow","authors":"Eashan Wadhwa, Shanker Shreejith","doi":"arxiv-2408.12676","DOIUrl":"https://doi.org/arxiv-2408.12676","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-22","publicationTypes":"Journal Article"}
Susana Rebolledo Ruiz, Borja Perez, Jose Luis Bosque, Peter Hsu
The Bicameral Cache is a cache organization proposal for a vector architecture that segregates data according to their access type, distinguishing scalar from vector references. Its aim is to prevent the two types of references from interfering with each other's data locality, with a special focus on prioritizing the performance of vector references. The proposed system incorporates an additional, non-polluting prefetching mechanism that helps populate the long vector cache lines in advance, increasing the hit rate by further exploiting the spatial locality of vector data. It was evaluated on the Cavatools simulator against a conventional cache, over several typical vector benchmarks and vector lengths. The results showed that the proposed cache speeds up stride-1 vector benchmarks while hardly affecting non-stride-1 ones. In addition, the prefetching feature consistently provided an additional benefit.
{"title":"The Bicameral Cache: a split cache for vector architectures","authors":"Susana Rebolledo Ruiz, Borja Perez, Jose Luis Bosque, Peter Hsu","doi":"arxiv-2407.15440","DOIUrl":"https://doi.org/arxiv-2407.15440","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-22","publicationTypes":"Journal Article"}
Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li
Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contention, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at the software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that, compared to state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on demand for optimal performance.
{"title":"Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference","authors":"Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li","doi":"arxiv-2407.13996","DOIUrl":"https://doi.org/arxiv-2407.13996","journal":{"name":"arXiv - CS - Performance"},"publicationDate":"2024-07-19","publicationTypes":"Journal Article"}
The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address their high memory and computational requirements. Moreover, given that tasks come with different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach which provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 tokens per second. This enhancement comes with a marginal perplexity increase from 2.62 to 2.80, 6.48 to 7.24, and 3.24 to 3.53 for the WikiText2, PTB, and C4 datasets, respectively, under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.
{"title":"Mixture of Experts with Mixture of Precisions for Tuning Quality of Service","authors":"HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi","doi":"arxiv-2407.14417","DOIUrl":"https://doi.org/arxiv-2407.14417","url":null,"abstract":"The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address their high memory and computational requirements challenges. Moreover, given that tasks come in different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach which provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 token per second. This enhancement comes with a marginal perplexity increase of 2.62 to 2.80, 6.48 to 7.24, and 3.24 to 3.53 for WikiText2, PTB, and C4 datasets respectively under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
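The abstract above speaks of exploring the Pareto frontier between throughput and model quality. As a rough illustration of what that selection step might look like, the sketch below filters a set of candidate serving configurations down to the non-dominated ones. The `ServingConfig` type, its field names, and the `pareto_frontier` function are hypothetical and are not taken from the paper; how the candidates are generated and measured is left out entirely.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class ServingConfig:
    """Hypothetical record for one measured configuration."""
    name: str            # label, e.g. how many experts are quantized
    throughput: float    # tokens per second (higher is better)
    perplexity: float    # model quality proxy (lower is better)


def pareto_frontier(configs):
    """Return the configurations that no other configuration dominates.

    A configuration dominates another if it is at least as fast and at
    least as accurate, and strictly better in one of the two. Exact
    duplicates are collapsed to a single entry.
    """
    frontier = []
    best_perplexity = inf
    # Visit the fastest configurations first; among equally fast ones,
    # visit the most accurate first.
    ordered = sorted(configs, key=lambda c: (-c.throughput, c.perplexity))
    for config in ordered:
        # Everything seen so far is at least as fast, so this one is
        # worth keeping only if it is strictly more accurate.
        if config.perplexity < best_perplexity:
            frontier.append(config)
            best_perplexity = config.perplexity
    return frontier
```

A serving system along these lines could then pick, from the returned list, the fastest configuration whose perplexity stays within a user-defined bound.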
Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the streaming multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. Compared to state-of-the-art prior work, it reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of the GPT3 model for training and inference on H100, where neither GPT3 nor H100 was used to train the framework.
{"title":"Data-driven Forecasting of Deep Learning Performance on GPUs","authors":"Seonho Lee, Amar Phanishayee, Divya Mahajan","doi":"arxiv-2407.13853","DOIUrl":"https://doi.org/arxiv-2407.13853","url":null,"abstract":"Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends or multilayer perceptrons to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of GPT3 model for training and inference on H100, compared to state-of-the-art prior works, where both GPT3 and H100 were not used to train the framework.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
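The abstract above describes predicting latency per tile, bounding the prediction with fundamental performance laws, and aggregating to a kernel-level estimate. One plausible reading of that idea, and only a sketch, is a roofline-style lower bound per tile that is scaled by a utilization factor and multiplied by the number of execution waves. The abstract does not give NeuSight's actual formulation; every name below (`Gpu`, `Tile`, `roofline_time`, `kernel_latency`) is hypothetical, and the `utilization` argument stands in for whatever a learned model would predict.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical per-SM hardware description."""
    peak_flops_per_sm: float   # FLOP/s one SM can sustain at best
    bandwidth_per_sm: float    # bytes/s of memory traffic per SM
    num_sms: int               # number of streaming multiprocessors


@dataclass(frozen=True)
class Tile:
    """Work done by one tile of a kernel."""
    flops: float               # floating-point operations in the tile
    bytes_moved: float         # bytes read from and written to memory


def roofline_time(tile, gpu):
    """Lower bound on tile time: the slower of compute and memory."""
    compute_time = tile.flops / gpu.peak_flops_per_sm
    memory_time = tile.bytes_moved / gpu.bandwidth_per_sm
    return max(compute_time, memory_time)


def tile_latency(tile, gpu, utilization):
    """Scale the roofline bound by a utilization in (0, 1].

    Because utilization cannot exceed 1, the estimate can never fall
    below the roofline bound.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return roofline_time(tile, gpu) / utilization


def kernel_latency(tile, num_tiles, gpu, utilization):
    """Aggregate identical tiles, assuming one tile per SM per wave."""
    waves = math.ceil(num_tiles / gpu.num_sms)
    return waves * tile_latency(tile, gpu, utilization)
```

The appeal of this kind of decomposition is that the learned component only has to predict a bounded quantity, so a poor prediction on an unseen GPU cannot produce a latency below what the hardware limits allow.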