IEEE Transactions on Parallel and Distributed Systems: Latest Articles (Page 2)

HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2025-01-01 DOI: 10.1109/TPDS.2024.3522781

Saiman Dahal;Pratyush Dhingra;Krishu Kumar Thapa;Partha Pratim Pande;Ananth Kalyanaraman

Transformer models have become widely popular in numerous applications, and especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit any hidden relations within data, making it widely applicable for a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). Most of these efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge to address. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training would necessitate write operations to NVM cells, posing a significant barrier for considering stand-alone NVM-based PIM architectures. In this paper, we present HpT, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, our approach dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, our approach harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach is aimed at adaptively reducing both runtime and energy costs during the training phase, without compromising on quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach is able to significantly reduce training time (up to $11.9\times$) and energy consumption (up to $12.05\times$), compared to the corresponding full-parameter training executing on only GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.
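The software-layer idea in the abstract, switching from full-parameter training to a lower-parameter mode based on intrinsic dimensionality, can be illustrated with a small sketch. This is not the authors' implementation: the toy regression problem, the sizes `n`, `D`, `d`, the random projection `P`, and the fixed switch point after 20 steps are all assumptions made for illustration. The abstract does not say how HpT decides when to switch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem standing in for a model with D trainable parameters.
# n, D and d are hypothetical sizes chosen only for this illustration.
n, D, d = 200, 50, 10
X = rng.normal(size=(n, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=n)

def loss(theta):
    r = X @ theta - y
    return 0.5 * float(r @ r) / n

def grad(theta):
    return X.T @ (X @ theta - y) / n

# Phase 1: full-parameter gradient descent (all D weights change every step).
theta = np.zeros(D)
lr_full = n / np.linalg.norm(X, 2) ** 2        # 1/L with L = sigma_max(X)^2 / n
for _ in range(20):                            # assumed switch point
    theta -= lr_full * grad(theta)
loss_full_phase = loss(theta)

# Phase 2: freeze theta0 and train only z, where theta = theta0 + P @ z.
theta0 = theta.copy()
P = rng.normal(size=(D, d)) / np.sqrt(D)       # fixed random projection, never updated
z = np.zeros(d)
lr_sub = n / np.linalg.norm(X @ P, 2) ** 2     # 1/L for the d-dimensional problem
for _ in range(100):
    z -= lr_sub * (P.T @ grad(theta0 + P @ z)) # chain rule: dL/dz = P^T dL/dtheta
loss_sub_phase = loss(theta0 + P @ z)
```

In the second phase only the `d` entries of `z` are updated, which is the property that would reduce write traffic to weight storage. Whether `theta0` and `P` would reside on ReRAM devices in HpT is not stated in the abstract.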


Vol. 36, No. 3, pp. 407-421

Citations: 0

Sparrow: Expediting Smart Contract Execution for Blockchain Sharding via Inter-Shard Caching

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-26 DOI: 10.1109/TPDS.2024.3522016

Junyuan Liang;Peiyuan Yao;Wuhui Chen;Zicong Hong;Jianting Zhang;Ting Cai;Min Sun;Zibin Zheng

Sharding is a promising solution to scale blockchain by separating the system into multiple shards to process transactions in parallel. However, due to state separation and shard isolation, it is still challenging to efficiently support smart contracts on a blockchain sharding system where smart contracts can interact with each other, involving states maintained by multiple shards. Specifically, existing sharding systems adopt a costly multi-step collaboration mechanism to execute smart contracts, resulting in long latency and low throughput. This article proposes Sparrow, a blockchain sharding protocol achieving one-step execution for smart contracts. To break shard isolation, inspired by non-local hotspot data caching in traditional databases, we propose a new idea of inter-shard caching, allowing a shard to prefetch and cache frequently accessed contract states of other shards. The miner can thus use the inter-shard cache to pre-execute a pending transaction, retrieve all its contract invocations, and commit it to multiple shards in one step. Particularly, we first propose a speculative dispersal cache synchronisation mechanism for efficient and secure cache synchronization across shards in Byzantine environments. Then, we propose a multi-branch exploration mechanism to solve the rollback problem during the optimistic one-step execution of contract invocations with dependencies. We also present a series of conflict resolution mechanisms to decrease the rollback caused by inherent transaction conflicts. We implement prototypes for Sparrow and existing sharding systems, and the evaluation shows that Sparrow improves the throughput by $2.44\times$ and reduces the transaction latency by 30% compared with the existing sharding systems.


Vol. 36, No. 3, pp. 377-390

Citations: 0

High Performance Householder QR Factorization on Emerging GPU Architectures Using Tensor Cores

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-25 DOI: 10.1109/TPDS.2024.3522776

Yuhan Leng;Gaoyuan Zou;Hansheng Wang;Panruo Wu;Shaoshuai Zhang

Since 2017, NVIDIA GPUs have been equipped with specialized units known as Tensor Cores, which demonstrate remarkable efficiency in processing matrix multiplications (GEMMs). Beyond GEMMs, researchers have explored the potential applications of Tensor Cores in matrix factorization, such as QR factorization. However, the inner GEMMs in QR factorization are typically tall and skinny. Compared to compute-bound square GEMMs, these tall and skinny GEMMs are memory bound, leading to suboptimal performance on Tensor Cores. To solve this problem, we show that recursive QR factorization can convert the tall and skinny GEMMs into relatively square and large GEMMs, resulting in better performance on Tensor Cores. In addition, we extend the FP16 Tensor Core-based QR factorization to accommodate FP32 and FP64 on FP16 and INT8 Tensor Cores, respectively. Furthermore, to address the loss of orthogonality in the preceding Tensor Core-based QR factorization, we transition from the Gram-Schmidt to the Householder algorithm while preserving high performance. According to our experimental evaluation on NVIDIA's A100 and GeForce RTX 3090 GPUs, our FP64, FP32, and FP16 factorizations are up to 6.22x, 8.67x, and 4.03x faster, respectively, than the current state-of-the-art implementations.
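To make concrete the claim that recursion turns tall and skinny GEMMs into larger ones, here is a minimal NumPy sketch of a recursive QR factorization. It is an illustration under stated assumptions, not the paper's algorithm. The recursion splits the columns in half and uses a block Gram-Schmidt-style update, which is the simpler variant that the authors move away from in favor of Householder. The function name `recursive_qr` and the `base_cols` cutoff are hypothetical, and nothing here runs on Tensor Cores.

```python
import numpy as np

def recursive_qr(A, base_cols=8):
    """Return Q (m x n, orthonormal columns) and R (n x n, upper triangular)
    with A = Q @ R. Assumes m >= n."""
    m, n = A.shape
    if n <= base_cols:
        # Base case: LAPACK Householder QR on a narrow panel.
        return np.linalg.qr(A, mode="reduced")
    k = n // 2
    Q1, R11 = recursive_qr(A[:, :k], base_cols)
    # The two products below are the GEMMs. Near the top of the recursion
    # their output is k x (n - k), i.e. much wider than a narrow panel update.
    R12 = Q1.T @ A[:, k:]
    A2 = A[:, k:] - Q1 @ R12
    Q2, R22 = recursive_qr(A2, base_cols)
    Q = np.hstack([Q1, Q2])
    R = np.block([[R11, R12], [np.zeros((n - k, k)), R22]])
    return Q, R

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 64))
Q, R = recursive_qr(A)
```

For a well-conditioned input such as the random matrix above, `Q` stays orthonormal to rounding error. For ill-conditioned inputs this Gram-Schmidt-style recursion can lose orthogonality, which is the issue the abstract cites as motivation for a Householder formulation.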

Vol. 36, No. 3, pp. 422-436

Citations: 0

Integrated and Fungible Scheduling of Deep Learning Workloads Using Multi-Agent Reinforcement Learning

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-25 DOI: 10.1109/TPDS.2024.3522333

Jialun Li;Danyang Xiao;Diying Yang;Xuan Mo;Weigang Wu

GPU clusters have been widely used to co-locate various deep learning (DL) workloads in a multi-tenant way. Although such resource sharing can significantly reduce training cost, resource contention and interference among co-located workloads make task scheduling very complex and challenging. To simplify the scheduling problem, existing algorithms usually divide the procedure of scheduling into two sub-tasks, i.e., task placement and resource allocation, and allocate resources according to pre-defined and fixed resource demands. However, such a paradigm significantly constrains the selection of potential scheduling solutions. In this article, we present MAIFS, a novel multi-agent reinforcement learning-based scheduling algorithm that handles task placement and resource allocation in an integrated manner, and allows fungible resource allocation based on the resource sensitivity of DL workloads. The core of MAIFS lies in two mechanisms. The multi-agent attention mechanism is designed to learn and share inter-related resource state features observed by different agents, which enables agents to explore fungible resource allocation solutions. The dynamic coordination graph mechanism is designed to coordinate the interactive task placement decisions of agents during integrated scheduling, so as to mitigate potential task conflicts. Simulated experiments using two large-scale production DL workload traces and physical deployment experiments on a Kubernetes-based GPU cluster show that MAIFS can outperform state-of-the-art scheduling algorithms by up to 44% in terms of makespan and 46% in terms of job completion time (JCT).

Vol. 36, No. 3, pp. 391-406

Citations: 0

ViTeGNN: Towards Versatile Inference of Temporal Graph Neural Networks on FPGA

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-24 DOI: 10.1109/TPDS.2024.3521897

Hongkuan Zhou;Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna

Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and contextual information on temporal graphs, outperforming other methods in many high-impact downstream tasks. However, achieving high-performance TGNN inference in production environments is challenging because TGNN models suffer from high computation complexity and intrinsic temporal data dependency that hinders data parallelism. In addition, real-world TGNN applications have different latency and throughput requirements. This work presents ViTeGNN, a versatile TGNN inference solution for memory-based TGNNs on FPGAs. ViTeGNN performs algorithm-model-architecture co-design to meet the latency and throughput requirements of real-world TGNN applications. Besides the vanilla inference mode ViTeGNN-bal that updates embeddings for nodes interacting with others, we propose ViTeGNN-lat and ViTeGNN-thpt, optimized for latency and throughput. Our model optimizations include a lightweight method to compute attention scores and a related temporal neighbor pruning strategy to reduce computation and memory accesses. These are holistically coupled with key hardware optimizations that leverage the FPGA hardware. We propose a novel hardware module to execute the complex neighbor update process efficiently. To ensure similar accuracy vis-à-vis the original model, the simplified models are trained using the knowledge distillation technique. We propose a unified hardware design that supports all of these three inference modes without FPGA reconfiguration. Enabled by our flexible hardware architecture, we further propose ViTeGNN-auto, which automatically selects the best inference mode at runtime based on latency and throughput requirements, guided by our accurate performance model. We evaluate the performance of the proposed hardware accelerator on five real-world datasets. ViTeGNN-bal reduces the computation complexity by an average of 62% and memory accesses by an average of 36% with only 0.0042 accuracy loss. Compared with state-of-the-art implementations on CPU and GPU, our FPGA implementation achieves $53.9/26.0/16.1\times$ speedup and $8.2/4.0/2.5\times$ speedup for ViTeGNN-lat/-bal/-thpt, respectively.

Vol. 36, No. 3, pp. 502-519

Citations: 0

Online Elastic Resource Provisioning With QoS Guarantee in Container-Based Cloud Computing

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-24 DOI: 10.1109/TPDS.2024.3522085

Shuaibing Lu;Ran Yan;Jie Wu;Jackson Yang;Xinyu Deng;Shen Wu;Zhi Cai;Juan Fang

In cloud data centers, the exponential growth of data places increasing demands on computing, storage, and network resources, especially in multi-tenant environments. While this growth is crucial for ensuring Quality of Service (QoS), it also introduces challenges such as fluctuating resource requirements and static container configurations, which can lead to resource underutilization and high energy consumption. This article addresses online resource provisioning and efficient scheduling for multi-tenant environments, aiming to minimize energy consumption while balancing elasticity and QoS requirements. To address this, we propose a novel optimization framework that reformulates the resource provisioning problem into a more manageable form. By reducing the original multi-constraint optimization to a container placement problem, we apply the interior-point barrier method to simplify the optimization, integrating constraints directly into the objective function for efficient computation. We also introduce elasticity as a key parameter to balance energy consumption with autonomous resource scaling, ensuring that resource consolidation does not compromise system flexibility. The proposed Energy-Efficient and Elastic Resource Provisioning (EEP) framework comprises three main modules: a distributed resource management module that employs vertical partitioning and dynamic leader election for adaptive resource allocation; a prediction module using $\omega$-step prediction for accurate resource demand forecasting; and an elastic scheduling module that dynamically adjusts to tenant scaling needs, optimizing resource allocation and minimizing energy consumption. Extensive experiments across diverse cloud scenarios demonstrate that the EEP framework significantly improves energy efficiency and resource utilization compared to established baselines, supporting sustainable cloud management practices.
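The phrase "integrating constraints directly into the objective function" describes the general interior-point barrier technique. A minimal sketch on a one-variable toy problem may help. The problem (minimize x subject to x >= 1), the function name `barrier_minimize`, and the schedule of barrier weights `t` are assumptions for illustration and are unrelated to the EEP formulation itself. The constraint is folded into the objective as t*x - log(x - 1), whose minimizer x = 1 + 1/t approaches the constrained optimum x = 1 as t grows.

```python
import math

def barrier_minimize(t_values=(1.0, 10.0, 100.0, 1e3, 1e4, 1e5, 1e6), x0=2.0):
    """Minimize x subject to x >= 1 using the log-barrier objective
    f_t(x) = t * x - log(x - 1), solved by damped Newton for increasing t."""
    def f(x, t):
        return t * x - math.log(x - 1.0)

    x = x0
    for t in t_values:                        # outer loop: tighten the barrier
        for _ in range(100):                  # inner loop: damped Newton steps
            g = t - 1.0 / (x - 1.0)           # first derivative of f_t
            h = 1.0 / (x - 1.0) ** 2          # second derivative of f_t (always > 0)
            step = -g / h
            if abs(step) < 1e-12:
                break
            alpha = 1.0
            # Backtrack until the iterate is strictly feasible and f_t does not increase.
            while x + alpha * step <= 1.0 or f(x + alpha * step, t) > f(x, t):
                alpha *= 0.5
            x += alpha * step
    return x

x_star = barrier_minimize()
```

Each outer iteration starts from the previous solution, so the iterate stays strictly inside the feasible region throughout. That is the defining property of a barrier method, as opposed to a penalty method that allows constraint violations.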

Vol. 36, No. 3, pp. 361-376

Citations: 0

CAT: Cellular Automata on Tensor Cores

IF 5.6 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-20 DOI: 10.1109/TPDS.2024.3520395

Cristóbal A. Navarro;Felipe A. Quezada;Enzo Meneses;Héctor Ferrada;Nancy Hitschfeld

Cellular automata (CA) are simulation models that can produce complex emergent behaviors from simple local rules. Although state-of-the-art GPU solutions are already fast due to their data-parallel nature, their performance can rapidly degrade in CA with a large neighborhood radius. With the inclusion of tensor cores across the entire GPU ecosystem, interest has grown in finding ways to leverage these fast units outside the field of artificial intelligence, which was their original purpose. In this work, we present CAT, a GPU tensor core approach that can accelerate CA in which the cell transition function acts on a weighted summation of its neighborhood. CAT is evaluated theoretically, using an extended PRAM cost model, as well as empirically, using the Larger Than Life (LTL) family of CA as case studies. The results confirm that the cost model is accurate, showing that CAT exhibits constant time throughout the entire radius range $1 \leq r \leq 16$, and its theoretical speedups agree with the empirical results. At low radius $r=1,2$, CAT is competitive and is only surpassed by the fastest state-of-the-art GPU solution. Starting from $r=3$, CAT progressively outperforms all other approaches, reaching speedups of up to $101\times$ over a GPU baseline and up to $\sim\!14\times$ over the fastest state-of-the-art GPU approach. In terms of energy efficiency, CAT is competitive in the range $1 \leq r \leq 4$, and from $r \geq 5$ it is the most energy-efficient approach. As for performance scaling across GPU architectures, CAT shows a promising trend that, if it continues in future generations, would increase its performance at a higher rate than classical GPU solutions. A CPU version of CAT was also explored, using the recently introduced AMX instructions. Although its performance is still below that of GPU tensor cores, it is a promising approach, as it can still outperform some GPU approaches at large radius. The results obtained in this work position CAT as an approach with great potential for scientists who need to study emerging phenomena in CA with a large neighborhood radius, both on the GPU and on the CPU.
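
To illustrate why a neighborhood summation maps naturally onto matrix-multiply hardware, the following is a minimal pure-Python sketch, not the authors' implementation. It assumes a square grid, a Moore neighborhood with unit weights, dead cells outside the grid, and sums that include the center cell; the names `band_matrix`, `neighborhood_sums`, and `ltl_step` are hypothetical. The two matrix products stand in for the step that tensor cores would execute.

```python
def band_matrix(n, r):
    """n x n matrix with ones where |i - j| <= r (cells outside the grid count as dead)."""
    return [[1 if abs(i - j) <= r else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product; this is the operation a tensor core would accelerate."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def neighborhood_sums(grid, r):
    """Sum over each cell's (2r+1) x (2r+1) neighborhood, center cell included."""
    band = band_matrix(len(grid), r)
    # band @ grid sums vertically, (...) @ band sums horizontally.
    return matmul(matmul(band, grid), band)


def ltl_step(grid, r, birth, survive):
    """One Larger-Than-Life-style step; birth and survive are inclusive (lo, hi) ranges."""
    sums = neighborhood_sums(grid, r)
    n = len(grid)
    nxt = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lo, hi = survive if grid[i][j] else birth
            nxt[i][j] = 1 if lo <= sums[i][j] <= hi else 0
    return nxt


# Conway's Game of Life as the r = 1 special case (center counted in the sum).
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
next_gen = ltl_step(blinker, r=1, birth=(3, 3), survive=(3, 4))
```

In this formulation the radius only changes the contents of the band matrix, not the number of matrix products, which is one intuition for the radius-independent running time that the abstract reports.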


Journal article. IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 341-355, 2024-12-20. https://doi.org/10.1109/TPDS.2024.3520395

Citations: 0

AsyncFedGAN: An Efficient and Staleness-Aware Asynchronous Federated Learning Framework for Generative Adversarial Networks

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-20 DOI: 10.1109/TPDS.2024.3521016

Daniel Manu;Abee Alazzwi;Jingjing Yao;Youzuo Lin;Xiang Sun

Generative Adversarial Networks (GANs) are deep learning models that learn and generate new samples similar to existing ones. Traditionally, GANs are trained in centralized data centers, raising data privacy concerns due to the need for clients to upload their data. To address this, Federated Learning (FL) integrates with GANs, allowing collaborative training without sharing local data. However, this integration is complex because GANs involve two interdependent models—the generator and the discriminator—while FL typically handles a single model over distributed datasets. In this article, we propose a novel asynchronous FL framework for GANs, called AsyncFedGAN, designed to efficiently and distributively train both models tailored for molecule generation. AsyncFedGAN addresses the challenges of training interactive models, resolves the straggler issue in synchronous FL, reduces model staleness in asynchronous FL, and lowers client energy consumption. Our extensive simulations for molecular discovery show that AsyncFedGAN achieves convergence with proper settings, outperforms baseline methods, and balances model performance with client energy usage.
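
The abstract does not give the aggregation rule, so the following is only a generic sketch of staleness-aware asynchronous aggregation, not AsyncFedGAN's actual method. It assumes a mixing rate that decays as `base_rate / (1 + staleness)`, a common choice in asynchronous federated learning; the function names and the flat parameter lists are hypothetical. For a GAN, the same update would be applied separately to the generator and to the discriminator.

```python
def staleness_weight(base_rate, staleness):
    """Assumed decay rule: the staler the client update, the smaller its influence."""
    return base_rate / (1.0 + staleness)


def async_update(global_params, client_params, global_round, client_round, base_rate=0.5):
    """Mix one client's parameters into the global model as soon as they arrive."""
    staleness = global_round - client_round
    if staleness < 0:
        raise ValueError("client cannot be ahead of the server")
    alpha = staleness_weight(base_rate, staleness)
    return [(1.0 - alpha) * g + alpha * c for g, c in zip(global_params, client_params)]


# A fresh update (staleness 0) moves the global model halfway toward the client.
fresh = async_update([0.0, 2.0], [1.0, 0.0], global_round=3, client_round=3)
# An update that is 4 rounds old is mixed in with weight 0.5 / 5 = 0.1.
stale = async_update([0.0, 2.0], [1.0, 0.0], global_round=7, client_round=3)
```

Because the server never waits for slow clients, stragglers do not block a round; the decaying weight is what limits the damage from their outdated parameters.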

Journal article. IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 3, pp. 553-569, 2024-12-20. https://doi.org/10.1109/TPDS.2024.3521016

Citations: 0

UMPIPE: Unequal Microbatches-Based Pipeline Parallelism for Deep Neural Network Training

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-11 DOI: 10.1109/TPDS.2024.3515804

Guangyao Zhou;Wenhong Tian;Rajkumar Buyya;Kui Wu

The increasing need for large-scale deep neural networks (DNNs) has made parallel training an area of intensive focus. One effective method, microbatch-based pipeline parallelism (notably GPipe), accelerates parallel training in various architectures. However, existing parallel training architectures normally use equal data partitioning (EDP), where each layer's process maintains identical microbatch sizes. EDP may hinder training speed because different processes often require different optimal microbatch sizes. To address this, we introduce UMPIPE, a novel framework for unequal microbatches-based pipeline parallelism. UMPIPE enables unequal data partitions (UEDP) across processes to optimize resource utilization. We develop a recurrence formula to calculate the time cost in UMPIPE by considering both computation and communication processes. To further enhance UMPIPE's efficiency, we propose the Dual-Chromosome Genetic Algorithm for UMPIPE (DGAP), which accounts for the independent time costs of forward and backward propagation. Furthermore, we present TiDGAP, a two-level improvement on DGAP. TiDGAP accelerates the process by simultaneously calculating the end time for multiple individuals and microbatches using matrix operations. Our extensive experiments validate the dual-chromosome strategy's optimization benefits and TiDGAP's acceleration capabilities. TiDGAP can achieve better training schemes than baselines such as the local greedy algorithm and global greedy-based dynamic programming. Compared to (GPipe, PipeDream), UMPIPE achieves increases in training speed of $(13.89, 11.09)\%$ for GPT1-14, $(17.11, 7.96)\%$ for VGG16, and $\geq (170, 100)\%$ for simulation networks.
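
The recurrence itself is not given in the abstract. As a rough illustration of what such a time-cost recurrence looks like, here is a generic forward-pass-only sketch for a linear pipeline; it is an assumption-laden simplification, not UMPIPE's formula, which also covers backward propagation and unequal partitions per process. The names `pipeline_end_times`, `compute`, and `comm` are hypothetical.

```python
def pipeline_end_times(compute, comm):
    """End time of every (stage, microbatch) pair in a forward-only pipeline.

    compute[s][m]: time stage s needs for microbatch m (may differ per microbatch).
    comm[s]:       transfer time from stage s to stage s + 1.
    """
    stages, micro = len(compute), len(compute[0])
    end = [[0.0] * micro for _ in range(stages)]
    for s in range(stages):
        for m in range(micro):
            # Input arrives once the previous stage finishes and the transfer completes.
            input_ready = end[s - 1][m] + comm[s - 1] if s > 0 else 0.0
            # The stage itself is busy until it finishes the previous microbatch.
            stage_free = end[s][m - 1] if m > 0 else 0.0
            end[s][m] = max(input_ready, stage_free) + compute[s][m]
    return end


# Two stages, three microbatches; the second stage is the bottleneck.
end = pipeline_end_times(compute=[[1, 1, 1], [2, 2, 2]], comm=[0.5])
makespan = end[-1][-1]
```

A search procedure such as a genetic algorithm would evaluate a recurrence of this kind many times, once per candidate partitioning, which is why the abstract emphasizes computing end times for many individuals at once.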


Journal article. IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 293-307, 2024-12-11. https://doi.org/10.1109/TPDS.2024.3515804

Citations: 0

Fine-Grained QoS Control via Tightly-Coupled Bandwidth Monitoring and Regulation for FPGA-Based Heterogeneous SoCs

IF 5.6 Zone 2 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-12-09 DOI: 10.1109/TPDS.2024.3513416

Giacomo Valente;Gianluca Brilli;Tania Di Mascio;Alessandro Capotondi;Paolo Burgio;Paolo Valente;Andrea Marongiu

Commercial embedded systems increasingly rely on heterogeneous architectures that integrate general-purpose multi-core processors and various hardware accelerators on the same chip. This provides the high performance required by modern applications at low cost and low power consumption, but at the same time poses new challenges. Hardware resource sharing at various levels, and in particular at the main memory controller level, results in slower execution of the application tasks, ultimately making the system unpredictable from a timing point of view. To enable the adoption of heterogeneous systems-on-chip (SoCs) in the domain of timing-critical applications, several hardware and software approaches have been proposed, with bandwidth regulation based on monitoring and throttling being one of the most widely adopted. Existing solutions, however, are either too coarse-grained, limiting the control over the computing engines' activities, or strongly platform-dependent, addressing the problem only for specific SoCs. This article proposes an innovative approach that can accurately control main memory bandwidth usage in FPGA-based heterogeneous SoCs. In particular, it controls system bandwidth by connecting a runtime bandwidth regulation component to FPGA-based accelerators. Our solution offers dynamically configurable, fine-grained bandwidth regulation, which adapts to the varying requirements of the application over time, at a very low overhead. Furthermore, it is entirely platform-independent and can be integrated with any FPGA-based accelerator. Developed at the register-transfer level using a reference SoC platform, it is designed for easy compatibility with any FPGA-based SoC. Experimental results on the Xilinx Zynq UltraScale+ platform demonstrate that our approach (i) is more than $100\times$ faster than loosely-coupled, software-controlled regulators; (ii) is capable of exploiting the system bandwidth 28.7% more efficiently than tightly-coupled hardware regulators (e.g., ARM CoreLink QoS-400, where available); and (iii) enables task co-scheduling solutions that are not feasible with state-of-the-art bandwidth regulation methods.
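
For readers unfamiliar with monitoring-and-throttling regulation, the sketch below models the general idea in software: a per-window byte budget that stalls a requester once it is exhausted. It is a behavioral illustration only, under assumed parameters, and does not reflect the register-transfer-level design described in the article; the class name `BandwidthRegulator` and its fields are hypothetical.

```python
class BandwidthRegulator:
    """Behavioral model of a budget-per-window memory bandwidth regulator."""

    def __init__(self, budget_bytes, window_cycles):
        self.budget_bytes = budget_bytes    # bytes allowed per regulation window
        self.window_cycles = window_cycles  # window length in clock cycles
        self.current_window = 0
        self.used_bytes = 0                 # monitor: bytes granted in this window

    def reconfigure(self, budget_bytes):
        """Change the budget at run time, e.g. when the application phase changes."""
        self.budget_bytes = budget_bytes

    def request(self, cycle, nbytes):
        """Return True if the transaction may proceed, False if it is throttled."""
        window = cycle // self.window_cycles
        if window != self.current_window:
            # A new window starts: replenish the budget.
            self.current_window = window
            self.used_bytes = 0
        if self.used_bytes + nbytes > self.budget_bytes:
            return False
        self.used_bytes += nbytes
        return True


regulator = BandwidthRegulator(budget_bytes=128, window_cycles=100)
grants = [
    regulator.request(cycle=0, nbytes=64),    # fits in the budget
    regulator.request(cycle=10, nbytes=64),   # budget now fully used
    regulator.request(cycle=20, nbytes=64),   # throttled until the next window
    regulator.request(cycle=100, nbytes=64),  # new window, budget replenished
]
```

The shorter the window and the closer the regulator sits to the accelerator's memory port, the finer the control; a software regulator driven by interrupts reacts far more slowly, which is consistent with the speed gap the abstract reports.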


Journal article. IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 2, pp. 326-340, 2024-12-09. https://doi.org/10.1109/TPDS.2024.3513416

Citations: 0