
arXiv - CS - Performance: Latest Publications

Lock-Free Computation of PageRank in Dynamic Graphs
Pub Date : 2024-07-28 DOI: arxiv-2407.19562
Subhajit Sahu
PageRank is a metric that assigns importance to the vertices of a graph based on its neighbors and their scores. Recently, there has been increasing interest in computing PageRank on dynamic graphs, where the graph structure evolves due to edge insertions and deletions. However, traditional barrier-based approaches for updating PageRanks encounter significant wait times on certain graph structures, leading to high overall runtimes. Additionally, the growing trend of multicore architectures with increased core counts has raised concerns about random thread delays and failures. In this study, we propose a lock-free algorithm for updating PageRank scores on dynamic graphs. First, we introduce our Dynamic Frontier (DF) approach, which identifies and processes vertices likely to change PageRanks with minimal overhead. Subsequently, we integrate DF with our lock-free and fault-tolerant PageRank ($DF_{LF}$), incorporating a helping mechanism among threads between its two phases. Experimental results demonstrate that $DF_{LF}$ not only eliminates waiting times at iteration barriers but also withstands random thread delays and crashes. On average, it is 4.6x faster than lock-free Naive-dynamic PageRank ($ND_{LF}$).
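As a rough illustration of the Dynamic Frontier idea, here is a minimal sequential Python sketch: after a batch of edge updates, only affected vertices are reprocessed, and the frontier expands when a vertex's rank changes beyond a tolerance. Function names, the convergence scheme, and the assumption that every vertex has an out-edge are assumptions for illustration; the lock-free, multi-threaded machinery that is the paper's actual contribution is omitted.

```python
# Sequential sketch of a Dynamic Frontier PageRank update (illustrative only).
def dynamic_frontier_pagerank(out_edges, in_edges, ranks, changed_edges,
                              alpha=0.85, tol=1e-10, max_iter=100):
    n = len(ranks)
    # Seed the frontier with the endpoints of inserted/deleted edges.
    frontier = {u for u, _ in changed_edges} | {v for _, v in changed_edges}
    for _ in range(max_iter):
        if not frontier:
            break
        next_frontier = set()
        for v in frontier:
            # Recompute v's rank from its in-neighbours (assumes no dangling vertices).
            r = (1 - alpha) / n + alpha * sum(
                ranks[u] / len(out_edges[u]) for u in in_edges[v])
            if abs(r - ranks[v]) > tol:      # rank moved: propagate the change
                ranks[v] = r
                next_frontier.update(out_edges[v])
        frontier = next_frontier
    return ranks
```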
Citations: 0
Binary Bleed: Fast Distributed and Parallel Method for Automatic Model Selection
Pub Date : 2024-07-26 DOI: arxiv-2407.19125
Ryan Barron (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Maksim E. Eren (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Manish Bhattarai (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Ismael Boureima (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA), Cynthia Matuszek (Advanced Research in Cyber Systems, Los Alamos National Laboratory, Los Alamos, USA; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Maryland, USA), Boian S. Alexandrov (Theoretical Division, Los Alamos National Laboratory, Los Alamos, USA)
In several Machine Learning (ML) clustering and dimensionality reduction approaches, such as non-negative matrix factorization (NMF), RESCAL, and K-Means clustering, users must select a hyper-parameter k to define the number of clusters or components that yield an ideal separation of samples or clean clusters. This selection, while difficult, is crucial to avoid overfitting or underfitting the data. Several ML applications use scoring methods (e.g., Silhouette and Davies-Bouldin scores) to evaluate the cluster pattern stability for a specific k. The score is calculated for different trials over a range of k, and the ideal k is heuristically selected as the value before the model starts overfitting, indicated by a drop or increase in the score resembling an elbow curve plot. While the grid-search method can be used to accurately find a good k value, visiting a range of k can become time-consuming and computationally resource-intensive. In this paper, we introduce the Binary Bleed method based on binary search, which significantly reduces the k search space for these grid-search ML algorithms by truncating the target k values from the search space using a heuristic with thresholding over the scores. Binary Bleed is designed to work with single-node serial, single-node multi-processing, and distributed computing resources. In our experiments, we demonstrate the reduced search space gain over a naive sequential search of the ideal k and the accuracy of Binary Bleed in identifying the correct k for NMFk, K-Means, pyDNMFk, and pyDRESCALk with Silhouette and Davies-Bouldin scores. We make our implementation of Binary Bleed for the NMF algorithm available on GitHub.
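As a rough illustration of the idea — and assuming, as the elbow heuristic does, that the score stays acceptable for small k and degrades past the ideal k — a binary search over the k grid might look like the sketch below. `score_fn` and the threshold are placeholders, not the paper's API.

```python
# Single-node sketch of binary-search pruning over the k grid (illustrative).
def binary_bleed(k_values, score_fn, threshold):
    """Return the largest k whose score is still above `threshold`."""
    lo, hi = 0, len(k_values) - 1
    best_k = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if score_fn(k_values[mid]) >= threshold:
            best_k = k_values[mid]
            lo = mid + 1          # score still good: prune all smaller k
        else:
            hi = mid - 1          # score degraded: prune all larger k
    return best_k

# Hypothetical usage with scikit-learn's KMeans and the Silhouette score:
# from sklearn.cluster import KMeans
# from sklearn.metrics import silhouette_score
# score_fn = lambda k: silhouette_score(X, KMeans(n_clusters=k).fit_predict(X))
# best_k = binary_bleed(list(range(2, 50)), score_fn, threshold=0.5)
```

Each probe prunes half of the remaining grid, so the number of models that must be trained drops from O(|K|) to O(log |K|).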
Citations: 0
SCALE: Self-regulated Clustered federAted LEarning in a Homogeneous Environment
Pub Date : 2024-07-25 DOI: arxiv-2407.18387
Sai Puppala, Ismail Hossain, Md Jahangir Alam, Sajedul Talukder, Zahidur Talukder, Syed Bahauddin
Federated Learning (FL) has emerged as a transformative approach for enabling distributed machine learning while preserving user privacy, yet it faces challenges like communication inefficiencies and reliance on centralized infrastructures, leading to increased latency and costs. This paper presents a novel FL methodology that overcomes these limitations by eliminating the dependency on edge servers, employing a server-assisted Proximity Evaluation for dynamic cluster formation based on data similarity, performance indices, and geographical proximity. Our integrated approach enhances operational efficiency and scalability through a Hybrid Decentralized Aggregation Protocol, which merges local model training with peer-to-peer weight exchange and a centralized final aggregation managed by a dynamically elected driver node, significantly curtailing global communication overhead. Additionally, the methodology includes Decentralized Driver Selection, Check-pointing to reduce network traffic, and a Health Status Verification Mechanism for system robustness. Validated using the breast cancer dataset, our architecture not only demonstrates a nearly tenfold reduction in communication overhead but also shows remarkable improvements in reducing training latency and energy consumption while maintaining high learning performance, offering a scalable, efficient, and privacy-preserving solution for the future of federated learning ecosystems.
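To make the Proximity Evaluation step concrete, here is a toy sketch that scores client pairs on data similarity, a performance index, and geographic distance, then groups clients greedily. The weights, score definitions, and threshold are assumptions for illustration; the abstract does not specify this interface.

```python
import numpy as np

# Toy proximity score over data similarity, performance, and geography
# (weights and component definitions are illustrative assumptions).
def proximity_score(a, b, w_data=0.5, w_perf=0.3, w_geo=0.2):
    data_sim = float(np.dot(a["label_hist"], b["label_hist"])
                     / (np.linalg.norm(a["label_hist"])
                        * np.linalg.norm(b["label_hist"])))  # cosine similarity
    perf_sim = 1.0 - abs(a["perf"] - b["perf"])              # perf index in [0, 1]
    geo_sim = 1.0 / (1.0 + np.hypot(a["x"] - b["x"], a["y"] - b["y"]))
    return w_data * data_sim + w_perf * perf_sim + w_geo * geo_sim

# Greedy cluster formation: join the first cluster close to every member.
def form_clusters(clients, threshold=0.7):
    clusters = []
    for c in clients:
        for cluster in clusters:
            if all(proximity_score(c, m) >= threshold for m in cluster):
                cluster.append(c)
                break
        else:
            clusters.append([c])    # no close cluster: start a new one
    return clusters
```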
Citations: 0
SAfEPaTh: A System-Level Approach for Efficient Power and Thermal Estimation of Convolutional Neural Network Accelerator
Pub Date : 2024-07-24 DOI: arxiv-2407.17623
Yukai Chen, Simei Yang, Debjyoti Bhattacharjee, Francky Catthoor, Arindam Mallik
The design of energy-efficient, high-performance, and reliable Convolutional Neural Network (CNN) accelerators involves significant challenges due to complex power and thermal management issues. This paper introduces SAfEPaTh, a novel system-level approach for accurately estimating power and temperature in tile-based CNN accelerators. By addressing both steady-state and transient-state scenarios, SAfEPaTh effectively captures the dynamic effects of pipeline bubbles in interlayer pipelines, utilizing real CNN workloads for comprehensive evaluation. Unlike traditional methods, it eliminates the need for circuit-level simulations or on-chip measurements. Our methodology leverages TANIA, a cutting-edge hybrid digital-analog tile-based accelerator featuring analog-in-memory computing cores alongside digital cores. Through rigorous simulation results using the ResNet18 model, we demonstrate SAfEPaTh's capability to accurately estimate power and temperature within 500 seconds, encompassing CNN model accelerator mapping exploration and detailed power and thermal estimations. This efficiency and accuracy make SAfEPaTh an invaluable tool for designers, enabling them to optimize performance while adhering to stringent power and thermal constraints. Furthermore, SAfEPaTh's adaptability extends its utility across various CNN models and accelerator architectures, underscoring its broad applicability in the field. This study contributes significantly to the advancement of energy-efficient and reliable CNN accelerator designs, addressing critical challenges in dynamic power and thermal management.
Citations: 0
KWT-Tiny: RISC-V Accelerated, Embedded Keyword Spotting Transformer
Pub Date : 2024-07-22 DOI: arxiv-2407.16026
Aness Al-Qawlaq, Ajay Kumar M, Deepu John
This paper explores the adaptation of Transformer-based models for edge devices through the quantisation and hardware acceleration of the ARM Keyword Transformer (KWT) model on a RISC-V platform. The model was targeted to run on 64 kB RAM in bare-metal C using a custom-developed edge AI library. KWT-1 was retrained to be 369 times smaller, with only a 10% loss in accuracy, through reducing output classes from 35 to 2. The retraining and quantisation reduced model size from 2.42 MB to 1.65 kB. The integration of custom RISC-V instructions that accelerated GELU and SoftMax operations enabled a 5x speedup and thus ~5x power reduction in inference, with inference clock cycle counts decreasing from 26 million to 5.5 million clock cycles, while incurring a small area overhead of approximately 29%. The results demonstrate a viable method for porting and accelerating Transformer-based models in low-power IoT devices.
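The 2.42 MB to 1.65 kB figure combines retraining (369x) with quantisation; a plain symmetric int8 post-training quantisation, sketched below, is the kind of step involved. The paper's exact scheme (bit width, per-tensor vs per-channel scaling) is not specified in the abstract, so treat this as an assumption.

```python
import numpy as np

# Symmetric per-tensor int8 quantisation of a weight matrix (illustrative).
def quantize_int8(weights: np.ndarray):
    scale = max(np.abs(weights).max(), 1e-8) / 127.0   # map max |w| to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale                                    # dequantise as q * scale

w = np.random.randn(64, 32).astype(np.float32)
q, s = quantize_int8(w)
print(f"fp32: {w.nbytes} B -> int8: {q.nbytes} B")     # 4x smaller per tensor
```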
Citations: 0
Simopt -- Simulation pass for Speculative Optimisation of FPGA-CAD flow
Pub Date : 2024-07-22 DOI: arxiv-2408.12676
Eashan Wadhwa, Shanker Shreejith
Behavioural simulation is deployed in the CAD flow to verify the functional correctness of a Register Transfer Level (RTL) design. Metadata extracted from behavioural simulation can be used to optimise and/or speed up subsequent steps in the hardware design flow. In this paper, we propose Simopt, a tool flow that extracts simulation metadata to improve the timing performance of the design by introducing latency awareness during the placement phase and subsequently improving the routing time of the post-placed netlist using vendor tools. For our experiments, we adapt the open-source Yosys flow to perform Simopt-aware placement. Our results show that using the Simopt pass in the design implementation flow results in up to a 38.2% reduction in the latency (timing performance) of the design.
Citations: 0
The Bicameral Cache: a split cache for vector architectures
Pub Date : 2024-07-22 DOI: arxiv-2407.15440
Susana Rebolledo Ruiz, Borja Perez, Jose Luis Bosque, Peter Hsu
The Bicameral Cache is a cache organization proposal for a vector architecture that segregates data according to their access type, distinguishing scalar from vector references. Its aim is to keep both types of references from interfering with each other's data locality, with a special focus on prioritizing the performance of vector references. The proposed system incorporates an additional, non-polluting prefetching mechanism that helps populate the long vector cache lines in advance, increasing the hit rate by further exploiting the spatial locality of vector data. Its evaluation was conducted on the Cavatools simulator, comparing its performance to that of a standard conventional cache over different typical vector benchmarks for several vector lengths. The results proved the proposed cache speeds up performance on stride-1 vector benchmarks while hardly impacting non-stride-1 ones. In addition, the prefetching feature consistently provided additional value.
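A toy model of the split organization: two independent direct-mapped stores, one with short scalar lines and one with long vector lines, so neither reference type evicts the other's data. The geometry and replacement policy here are illustrative assumptions, not the evaluated design.

```python
# Toy split cache: scalar and vector references use separate stores.
class BicameralCacheModel:
    def __init__(self, scalar_lines=256, scalar_line_b=64,
                 vector_lines=64, vector_line_b=512):
        self.parts = {
            "scalar": (scalar_lines, scalar_line_b, [None] * scalar_lines),
            "vector": (vector_lines, vector_line_b, [None] * vector_lines),
        }
        self.hits = self.misses = 0

    def access(self, addr, kind):          # kind: "scalar" or "vector"
        nlines, line_b, tags = self.parts[kind]
        line = addr // line_b              # cache line the address maps to
        idx, tag = line % nlines, line // nlines
        if tags[idx] == tag:
            self.hits += 1
        else:
            self.misses += 1
            tags[idx] = tag                # fill on miss (direct-mapped)
```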
Citations: 0
Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference
Pub Date : 2024-07-19 DOI: arxiv-2407.13996
Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li
Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contention, existing GPU sharing solutions are unable to avoid resource conflicts among concurrently executing tasks, failing to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at the software level. Through comprehensive reverse engineering, Missile first reveals a general VRAM channel hash mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that compared to the state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x BE job throughput, and allocates PCIe bus bandwidth to tenants on demand for optimal performance.
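The VRAM isolation described above boils down to coloring: pages are handed to a tenant only if they map onto that tenant's reserved channels, so colocated tenants never contend on a channel. The sketch below uses a placeholder hash; the actual reverse-engineered NVIDIA mapping is the paper's contribution and is not reproduced here.

```python
NUM_CHANNELS = 32   # illustrative channel count, not a measured value

def channel_of(page_addr):
    # Placeholder hash standing in for the reverse-engineered channel mapping.
    return (page_addr >> 12) % NUM_CHANNELS

def colored_alloc(free_pages, tenant_channels, n):
    """Pick n free pages that live only on the tenant's VRAM channels."""
    out = []
    for p in free_pages:
        if channel_of(p) in tenant_channels:
            out.append(p)
            if len(out) == n:
                break
    return out

# e.g. an LS tenant reserves channels 0-23, leaving 24-31 for a BE tenant:
ls_pages = colored_alloc(range(0, 1 << 24, 4096), set(range(24)), n=16)
```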
Citations: 0
Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
Pub Date : 2024-07-19 DOI: arxiv-2407.14417
HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi
The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address their high memory and computational requirements. Moreover, given that tasks come with different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach which provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 tokens per second. This enhancement comes with a marginal perplexity increase of 2.62 to 2.80, 6.48 to 7.24, and 3.24 to 3.53 for the WikiText2, PTB, and C4 datasets respectively under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.
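The configuration space can be pictured with a back-of-the-envelope planner: given a GPU memory budget, choose how many experts stay in full precision and how many are quantised, keeping the largest full-precision count that fits. All sizes and the 4x quantisation ratio below are illustrative assumptions, not measurements from the paper.

```python
# Illustrative planner for the quantised/full-precision expert split.
def plan_experts(num_experts, expert_bytes_fp16, gpu_budget_bytes,
                 quant_ratio=4):             # e.g. fp16 -> int4 is ~4x smaller
    expert_bytes_q = expert_bytes_fp16 // quant_ratio
    best = None
    for n_fp16 in range(num_experts + 1):    # try every split, keep max feasible
        n_q = num_experts - n_fp16
        mem = n_fp16 * expert_bytes_fp16 + n_q * expert_bytes_q
        if mem <= gpu_budget_bytes:          # more fp16 experts => better quality
            best = {"fp16_experts": n_fp16, "quantized_experts": n_q, "bytes": mem}
    return best

# Mixtral-8x7B-like numbers (illustrative): 8 experts, ~14 GB each in fp16,
# on a 40 GB A100 budget.
print(plan_experts(8, 14 * 2**30, gpu_budget_bytes=40 * 2**30))
```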
Citations: 0
Data-driven Forecasting of Deep Learning Performance on GPUs
Pub Date : 2024-07-18 DOI: arxiv-2407.13853
Seonho Lee, Amar Phanishayee, Divya Mahajan
Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs' parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about the performance of new model architectures on existing GPUs, existing models on new GPUs, and new model architectures on new GPUs. To address these questions, we introduce NeuSight, a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution. The framework leverages both GPU hardware behavior and software library optimizations to estimate end-to-end performance. Previous work uses regression models that capture linear trends, or multilayer perceptrons, to predict the overall latency of deep learning kernels on GPUs. These approaches suffer from higher error percentages when forecasting performance on unseen models and new GPUs. Instead, NeuSight decomposes the prediction problem into smaller problems, bounding the prediction through fundamental performance laws. NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU. Tile-granularity predictions are determined using a machine learning approach and aggregated to estimate end-to-end latency. NeuSight outperforms prior work across various deep learning workloads and the latest GPUs. It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of the GPT3 model for training and inference on H100, compared to state-of-the-art prior works, where neither GPT3 nor H100 was used to train the framework.
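Conceptually, the tile decomposition reduces a kernel-latency prediction to a per-tile prediction times the number of waves needed to run all tiles on the available SMs. The sketch below assumes a fixed per-tile predictor and omits the paper's bounding performance laws; the tile sizes and per-tile latency are illustrative.

```python
import math

# Tile-wise latency estimate: tiles run in waves across the SMs.
def predict_kernel_latency(total_tiles, num_sms, predict_tile_latency, tile):
    waves = math.ceil(total_tiles / num_sms)        # full GPU per wave
    return waves * predict_tile_latency(tile)       # seconds

# Usage sketch: a 4096x4096 GEMM split into 128x128 output tiles.
tiles = math.ceil(4096 / 128) ** 2                  # 1024 tiles
latency = predict_kernel_latency(tiles, num_sms=132,   # H100 has 132 SMs
                                 predict_tile_latency=lambda t: 5e-6,
                                 tile={"m": 128, "n": 128, "k": 4096})
print(f"predicted latency: {latency * 1e3:.2f} ms")
```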
Citations: 0