IEEE Transactions on Parallel and Distributed Systems: Latest Articles (Page 10)

Reproducibility of the DaCe Framework on NPBench Benchmarks

IF 5.3, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-12 DOI: 10.1109/tpds.2024.3427130

Anish Govind, Yuchen Jing, Stefanie Dao, Michael Granado, Rachel Handran, Davit Margarian, Matthew Mikhailov, Danny Vo, Matei-Alexandru Gardus, Khai Vu, Derek Bouius, Bryan Chin, Mahidhar Tatineni, Mary Thomas

Citations: 0

Cost-Effective Server Deployment for Multi-Access Edge Networks: A Cooperative Scheme

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-11 DOI: 10.1109/TPDS.2024.3426523

Rong Cong;Zhiwei Zhao;Linyuanqi Zhang;Geyong Min

The combination of 5G/6G and edge computing has been envisioned as a promising paradigm to empower pervasive and intensive computing for the Internet-of-Things (IoT). High deployment cost is one of the major obstacles to realizing 5G/6G edge computing. Most existing works try to deploy the minimum number of edge servers to cover a target area by avoiding coverage overlaps. However, under this framework, the resource requirement per server is driven up sharply by the peak demand that occurs as the workload varies. Even worse, most resources are left under-utilized most of the time. To address this problem, we propose CoopEdge, a cost-effective server deployment scheme for cooperative multi-access edge computing. The key idea of CoopEdge is to allow overlapped servers to be deployed so that they handle variable requested workloads in a cooperative manner. In this way, the peak demands can be dispersed across multiple servers, and the resource requirement for each server can be greatly reduced. We propose a Two-step Incremental Deployment (TID) algorithm to jointly decide the server deployment and cooperation policies. For scenarios involving multiple network operators that are unwilling to cooperate with each other, we further extend the TID algorithm to a distributed TID algorithm based on game theory. Extensive evaluation experiments are conducted based on the measurement results of seven real-world edge applications. The results show that, compared with the state-of-the-art work, CoopEdge reduces the deployment cost by 38.7% and improves resource utilization by 36.2%, and the proposed distributed algorithm achieves a deployment cost comparable to that of CoopEdge, especially for small-coverage servers.
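
The peak-dispersion idea in the abstract can be illustrated with a toy calculation. The sketch below is not the TID algorithm; the workload numbers are made up, and it assumes the two servers fully overlap so that any request can be served by either one and the combined load can be split evenly.

```python
# Toy illustration (hypothetical numbers): capacity each server needs with and
# without cooperation between two servers whose coverage areas overlap.

# Hourly workload of two neighbouring areas; their peaks fall at different hours.
area_a = [2, 3, 9, 4, 2, 1]
area_b = [1, 2, 3, 4, 8, 2]

# Non-overlapping deployment: each server must absorb its own area's peak.
isolated = [max(area_a), max(area_b)]

# Cooperative deployment (assumption: the combined load is split evenly),
# so each server is sized for half of the combined peak.
combined_peak = max(a + b for a, b in zip(area_a, area_b))
cooperative = [combined_peak / 2, combined_peak / 2]

saved = sum(isolated) - sum(cooperative)
print("isolated capacity per server:   ", isolated)
print("cooperative capacity per server:", cooperative)
print("total capacity saved:", saved)
```

Because the two peaks do not coincide, the combined peak is smaller than the sum of the individual peaks, which is where the saving comes from in this simplified model.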

Volume 35, Issue 9, pp. 1583-1597

Citations: 0

Adaptive QoS-Aware Microservice Deployment With Excessive Loads via Intra- and Inter-Datacenter Scheduling

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-10 DOI: 10.1109/TPDS.2024.3425931

Jiuchen Shi;Kaihua Fu;Jiawen Wang;Quan Chen;Deze Zeng;Minyi Guo

User-facing applications often experience excessive loads and are shifting towards the microservice architecture. To fully utilize heterogeneous resources, current datacenters have adopted the disaggregated storage and compute architecture, in which the storage and compute clusters are suited to deploying stateful and stateless microservices, respectively. Moreover, when the local datacenter has insufficient resources to host excessive loads, a reasonable solution is to move some microservices to remote datacenters. However, it is nontrivial to decide the appropriate microservice deployment inside the local datacenter and the appropriate migration to remote datacenters, because microservices differ in their characteristics and the local datacenter experiences varying resource contention. We therefore propose ELIS, an intra- and inter-datacenter scheduling system that ensures the Quality-of-Service (QoS) of the microservice application while minimizing network bandwidth usage and computational resource usage. ELIS comprises a resource manager, a cross-cluster microservice deployer, and a reward-based microservice migrator. The resource manager allocates near-optimal resources for microservices while ensuring QoS. The microservice deployer places the microservices across the storage and compute clusters in the local datacenter to minimize network bandwidth usage while satisfying the microservice resource demand. The microservice migrator migrates some microservices to remote datacenters when local resources cannot absorb the excessive loads. Experimental results show that ELIS ensures the QoS of user-facing applications. Meanwhile, it reduces the public network bandwidth usage, the remote computational resource usage, and the local network bandwidth usage by 49.6%, 48.5%, and 60.7% on average, respectively.

Volume 35, Issue 9, pp. 1565-1582

Citations: 0

Toward Materials Genome Big-Data: A Blockchain-Based Secure Storage and Efficient Retrieval Method

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-10 DOI: 10.1109/TPDS.2024.3426275

Ran Wang;Cheng Xu;Xiaotong Zhang

With the advent of data-driven materials R&D, more and more countries have begun to build materials Big Data sharing platforms to support the design and development of new materials. On such platforms, storage and retrieval are the basis of resource mining and analysis. However, efficient storage and retrieval are hard to achieve because material data are multimodal, heterogeneous, and discrete. At the same time, because security mechanisms are lacking, ensuring the integrity and reliability of the original data is also a significant problem for researchers. To address these issues, this paper proposes a blockchain-based secure storage and efficient retrieval scheme. It introduces the Improved Merkle Tree (MMT) structure into the block and maps the on-chain transaction data to the original data in the off-chain cloud through a material data template. Experimental results show that the proposed MMT structure improves retrieval efficiency without significantly affecting block creation efficiency. MMT also outperforms state-of-the-art retrieval methods in efficiency, especially for range retrieval. The proposed method is therefore better suited to the needs of materials Big Data sharing platforms and significantly improves retrieval efficiency.
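
For readers unfamiliar with the underlying structure, the following is a minimal sketch of a standard Merkle tree root computation. It is not the paper's MMT variant, whose details the abstract does not give; the record names are hypothetical, and duplicating the last node on odd levels is one common convention among several.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Compute the root hash of a plain binary Merkle tree."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Hypothetical off-chain records; only the root would need to be stored on-chain.
records = [b"sample-001", b"sample-002", b"sample-003"]
root = merkle_root(records)
tampered = merkle_root([b"sample-001", b"sample-00X", b"sample-003"])
print(root.hex())
print("tampering detected:", root != tampered)
```

Any change to an off-chain record changes the root, which is how a small on-chain hash can vouch for the integrity of much larger data kept elsewhere.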

Volume 35, Issue 9, pp. 1630-1643

Citations: 0

RADAR: A Skew-Resistant and Hotness-Aware Ordered Index Design for Processing-in-Memory Systems

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-09 DOI: 10.1109/TPDS.2024.3424853

Yifan Hua;Shengan Zheng;Weihan Kong;Cong Zhou;Kaixin Huang;Ruoyan Ma;Linpeng Huang

Pointer chasing becomes the performance bottleneck for today's in-memory indexes due to the memory wall. Emerging processing-in-memory (PIM) technologies are promising to mitigate this bottleneck, by enabling low-latency memory access and aggregated memory bandwidth scaling with the number of PIM modules. Prior PIM-based indexes adopt a fixed granularity to partition the key space and maintain static heights of skiplist nodes among PIM modules to accelerate index operations on skiplist, neglecting the changes in skewness and hotness of data access patterns during runtime. In this article, we present RADAR, an innovative PIM-friendly skiplist that dynamically partitions the key space among PIM modules to adapt to varying skewness. An offline learning-based model is employed to catch hotness changes to adjust the heights of skiplist nodes. In multiple datasets, RADAR achieves up to 198.2x performance improvement and consumes 47.4% less memory than state-of-the-art designs on real PIM hardware.
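
The contrast between fixed-granularity and skew-adaptive partitioning of a key space can be sketched as follows. This is only an illustration of the general idea, not RADAR's partitioning algorithm; the access trace, the function names, and the quantile-based split are all assumptions made for the example.

```python
from bisect import bisect_right
from collections import Counter


def fixed_boundaries(key_min, key_max, parts):
    """Equal-width split of the key space (fixed granularity)."""
    width = (key_max - key_min) / parts
    return [key_min + width * i for i in range(1, parts)]


def skew_aware_boundaries(accessed_keys, parts):
    """Split at access quantiles so each partition sees a similar load."""
    ordered = sorted(accessed_keys)
    step = len(ordered) / parts
    return [ordered[int(step * i)] for i in range(1, parts)]


def load_per_partition(accessed_keys, boundaries):
    """Count how many accesses land in each partition."""
    counts = Counter(bisect_right(boundaries, k) for k in accessed_keys)
    return [counts.get(p, 0) for p in range(len(boundaries) + 1)]


# Hypothetical skewed trace: 80 accesses hit keys 0..9, 20 are spread over 10..99.
accesses = [i % 10 for i in range(80)] + [10 + (i * 9) % 90 for i in range(20)]

fixed_load = load_per_partition(accesses, fixed_boundaries(0, 100, 4))
adaptive_load = load_per_partition(accesses, skew_aware_boundaries(accesses, 4))
print("fixed granularity load:", fixed_load)
print("skew-aware load:       ", adaptive_load)
```

With equal-width ranges, one partition (standing in for one PIM module) receives most of the accesses; splitting at access quantiles spreads the same trace far more evenly.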

Citations: 0

Acceleration of Multi-Body Molecular Dynamics With Customized Parallel Dataflow

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-07-08 DOI: 10.1109/TPDS.2024.3420441

Quan Deng;Qiang Liu;Ming Yuan;Xiaohui Duan;Lin Gan;Jinzhe Yang;Wenlai Zhao;Zhenxiang Zhang;Guiming Wu;Wayne Luk;Haohuan Fu;Guangwen Yang

FPGAs are drawing increasing attention for molecular dynamics (MD) problems and have already been applied to two-body potentials and force fields composed of them, with performance competitive with traditional platforms such as CPUs and GPUs. However, as far as we know, FPGA solutions for more complex, real-world MD problems such as multi-body potentials are seldom seen. This work explores the prospects of state-of-the-art FPGAs in accelerating multi-body potentials. An FPGA-based accelerator with customized parallel dataflow is designed that features multi-body potential computation, motion update, and internode communication. Major contributions include: (1) parallelization applied at different levels of the accelerator; (2) an optimized dataflow mixing atom-level pipeline and cell-level pipeline to achieve high throughput; (3) a mixed-precision method using different precision at different stages of simulations; and (4) a communication-efficient method for internode communication. Experiments show that our single-node accelerator is over 2.7× faster than an 8-core CPU design, achieving 20.501 ns/day on a 55,296-atom system for the Tersoff simulation. In power efficiency, our accelerator is 28.9× higher than the I7-11700 and 4.8× higher than the RTX 3090 when running the same test case.

Volume 35, Issue 12, pp. 2297-2314

Citations: 0

Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-06-24 DOI: 10.1109/TPDS.2024.3418620

Sheng Qi;Chao Jin;Mosharaf Chowdhury;Zhenming Liu;Xuanzhe Liu;Xin Jin

Disaggregating compute from storage is an emerging trend in cloud computing. Effectively utilizing resources in both the compute and storage pools is the key to high performance. The state-of-the-art scheduler provides optimal scheduling decisions for workloads with homogeneous tasks. However, cloud applications often generate a mix of tasks with diverse compute and IO characteristics, resulting in sub-optimal performance for existing solutions. We present Pyxis, a system that provides optimal scheduling decisions for mixed workloads in disaggregated datacenters with theoretical guarantees. Pyxis is capable of maximizing overall throughput while meeting latency SLOs. Pyxis decouples the scheduling of different tasks. Our insight is that the optimal solution has an “all-or-nothing” structure that can be captured by a single turning point in the spectrum of tasks. Based on task characteristics, the turning point partitions the tasks either all to storage nodes or all to compute nodes (none to storage nodes). We theoretically prove that the optimal solution has such a structure, and design an online algorithm with sub-second convergence. We implement a prototype of Pyxis. Experiments on CloudLab with various synthetic and application workloads show that Pyxis improves the throughput by 3–21× over the state-of-the-art solution.
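
The "turning point" idea can be made concrete with a small search over a sorted task list. The sketch below is a toy model and not Pyxis's online algorithm: the cost model, the capacities, the task list, and the ordering key (network traffic saved per CPU second) are all assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    cpu: float        # CPU seconds per request (hypothetical)
    input_mb: float   # data read from storage per request (hypothetical)
    output_mb: float  # result size per request (hypothetical)


# Hypothetical capacities available per second of wall-clock time.
STORAGE_CPU = 4.0
COMPUTE_CPU = 16.0
NETWORK_MB = 100.0


def bottleneck(tasks, k):
    """Highest resource utilisation when tasks[:k] run on storage nodes."""
    pushed, pulled = tasks[:k], tasks[k:]
    storage = sum(t.cpu for t in pushed) / STORAGE_CPU
    compute = sum(t.cpu for t in pulled) / COMPUTE_CPU
    traffic = sum(t.output_mb for t in pushed) + sum(t.input_mb for t in pulled)
    return max(storage, compute, traffic / NETWORK_MB)


def best_turning_point(tasks):
    """Order tasks (assumed key) and pick the split with the lowest bottleneck."""
    ordered = sorted(
        tasks, key=lambda t: (t.input_mb - t.output_mb) / t.cpu, reverse=True
    )
    k = min(range(len(ordered) + 1), key=lambda i: bottleneck(ordered, i))
    return ordered, k


tasks = [
    Task("filter", cpu=1.0, input_mb=80.0, output_mb=5.0),
    Task("aggregate", cpu=2.0, input_mb=60.0, output_mb=1.0),
    Task("encode", cpu=6.0, input_mb=10.0, output_mb=10.0),
    Task("train", cpu=8.0, input_mb=20.0, output_mb=2.0),
]
ordered, k = best_turning_point(tasks)
print("run on storage nodes:", [t.name for t in ordered[:k]])
print("run on compute nodes:", [t.name for t in ordered[k:]])
print("bottleneck utilisation:", bottleneck(ordered, k))
```

In this model every task type goes entirely to one side of the split, so the search space is a single index rather than a per-task fraction, which is the appeal of an all-or-nothing structure.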

Volume 35, Issue 9, pp. 1536-1550

Citations: 0

Exploiting Temporal-Unrolled Parallelism for Energy-Efficient SNN Acceleration

IF 5.6, Zone 2 (Computer Science), Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-06-18 DOI: 10.1109/TPDS.2024.3415712

Fangxin Liu;Zongwu Wang;Wenbo Zhao;Ning Yang;Yongbiao Chen;Shiyuan Huang;Haomin Li;Tao Yang;Songwen Pei;Xiaoyao Liang;Li Jiang

Event-driven spiking neural networks (SNNs) have demonstrated significant potential for achieving high energy and area efficiency. However, existing SNN accelerators suffer from issues such as high latency and energy consumption due to serial accumulation-comparison operations. This is mainly because SNN neurons integrate spikes, accumulate membrane potential, and generate output spikes when the potential exceeds a threshold. To address this, one approach is to leverage the sparsity of SNN spikes to reduce the number of time steps. However, this method can result in imbalanced workloads among neurons and limit the utilization of processing elements (PEs). In this paper, we present SATO, a temporal-parallel SNN accelerator that enables parallel accumulation of membrane potential for all time steps. SATO adopts a two-stage pipeline methodology, effectively decoupling neuron computations. This not only maintains accuracy but also unveils opportunities for fine-grained parallelism. By dividing the neuron computation into distinct stages, SATO enables the concurrent execution of spike accumulation for each time step, leveraging the parallel processing capabilities of modern hardware architectures. This not only enhances the overall efficiency of the accelerator but also reduces latency by exploiting parallelism at a granular level. The architecture of SATO includes a novel binary adder-search tree for generating the output spike train, effectively decoupling the chronological dependence in the accumulation-comparison operation. Furthermore, SATO employs a bucket-sort-based method to evenly distribute compressed workloads to all PEs, maximizing data locality of input spike trains. Experimental results on various SNN models demonstrate that SATO outperforms the well-known accelerator, the 8-bit version of “Eyeriss” by $20.7\times$ in terms of speedup and $6.0\times$ energy-saving, on average. Compared to the state-of-the-art SNN accelerator “SpinalFlow”, SATO can also achieve $4.6\times$ performance gain and $3.1\times$ energy reduction on average, which is quite impressive for inference.
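
The serial accumulation-comparison dependency, and how it can be split into an accumulation stage and a search stage, can be shown in software. This is a simplified analogy and not SATO's hardware design: it assumes a neuron without leak that fires at most once (so no reset is modelled), and the input values are made up.

```python
from itertools import accumulate


def serial_if_neuron(inputs, threshold):
    """Serial accumulate-and-compare: every step waits for the previous one."""
    potential = 0
    for t, current in enumerate(inputs):
        potential += current
        if potential > threshold:
            return t  # time step of the first output spike
    return None


def unrolled_if_neuron(inputs, threshold):
    """Two stages: accumulate all time steps, then search for the crossing."""
    # Stage 1: membrane potential at every time step. A prefix sum has no
    # dependence on firing decisions, so hardware can compute it in parallel.
    potentials = list(accumulate(inputs))
    # Stage 2: locate the first time step whose potential exceeds the threshold.
    for t, potential in enumerate(potentials):
        if potential > threshold:
            return t
    return None


# Hypothetical weighted input arriving at one neuron over five time steps.
weighted_inputs = [2, 5, 1, 4, 3]
print(serial_if_neuron(weighted_inputs, threshold=10))
print(unrolled_if_neuron(weighted_inputs, threshold=10))
```

Both functions return the same spike time; the difference is that the second form separates the accumulation from the comparison, which is the kind of decoupling the abstract attributes to the two-stage pipeline.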

Volume 35, Issue 10, pp. 1749-1764

Citations: 0

High-Performance Hardware Acceleration Architecture for Cross-Silo Federated Learning

IF 5.6 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-06-13 DOI: 10.1109/TPDS.2024.3413718

Junxue Zhang;Xiaodian Cheng;Liu Yang;Jinbin Hu;Han Tian;Kai Chen

Cross-silo federated learning (FL) adopts various cryptographic operations to preserve data privacy, which introduces significant performance overhead. In this paper, we identify nine widely-used cryptographic operations and design an efficient hardware architecture to accelerate them. However, directly offloading them on hardware statically leads to (1) inadequate hardware acceleration due to the limited resources allocated to each operation; (2) insufficient resource utilization, since different operations are used at different times. To address these challenges, we propose FLASH, a high-performance hardware acceleration architecture for cross-silo FL systems. At its heart, FLASH extracts two basic operators (modular exponentiation and multiplication) behind the nine cryptographic operations and implements them as highly-performant engines to achieve adequate acceleration. Furthermore, it leverages a dataflow scheduling scheme to dynamically compose different cryptographic operations based on these basic engines to obtain sufficient resource utilization. We have implemented a fully-functional FLASH prototype with Xilinx VU13P FPGA and integrated it with FATE, the most widely-adopted cross-silo FL framework. Experimental results show that, for the nine cryptographic operations, FLASH achieves up to $14.0\times$ and $3.4\times$ acceleration over CPU and GPU, translating to up to $6.8\times$ and $2.0\times$ speedup for realistic FL applications, respectively. We finally evaluate the FLASH design as an ASIC, and it achieves $23.6\times$ performance improvement upon the FPGA prototype.
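As a rough illustration of the idea that a cryptographic operation can be composed from just two basic operators, the sketch below builds textbook Paillier encryption, homomorphic addition, and decryption out of a modular multiplication and a square-and-multiply modular exponentiation. Paillier is chosen here only as an example; the abstract does not name the nine operations, and the function names (`mod_mul`, `mod_exp`, `paillier_encrypt`, and so on) and the tiny primes are assumptions made for this sketch, not FLASH's actual design.

```python
def mod_mul(a: int, b: int, n: int) -> int:
    """Basic operator 1: modular multiplication."""
    return (a * b) % n


def mod_exp(base: int, exp: int, n: int) -> int:
    """Basic operator 2: modular exponentiation via square-and-multiply.

    Built entirely from repeated calls to mod_mul.
    """
    result = 1 % n
    base %= n
    while exp > 0:
        if exp & 1:                      # current exponent bit is set
            result = mod_mul(result, base, n)
        base = mod_mul(base, base, n)    # square for the next bit
        exp >>= 1
    return result


def paillier_encrypt(m: int, r: int, n: int) -> int:
    """Textbook Paillier with generator g = n + 1: c = g^m * r^n mod n^2."""
    n2 = n * n
    return mod_mul(mod_exp(n + 1, m, n2), mod_exp(r, n, n2), n2)


def paillier_add(c1: int, c2: int, n: int) -> int:
    """Homomorphic addition of plaintexts is one modular multiplication."""
    return mod_mul(c1, c2, n * n)


def paillier_decrypt(c: int, p: int, q: int) -> int:
    """Decrypt with lambda = (p-1)(q-1) and mu = lambda^-1 mod n."""
    n = p * q
    lam = (p - 1) * (q - 1)
    mu = pow(lam, -1, n)                 # modular inverse (Python 3.8+)
    u = mod_exp(c, lam, n * n)
    return mod_mul((u - 1) // n, mu, n)


# Toy parameters (far too small for real use): p = 11, q = 13, n = 143.
c1 = paillier_encrypt(5, 2, 143)
c2 = paillier_encrypt(7, 3, 143)
total = paillier_decrypt(paillier_add(c1, c2, 143), 11, 13)
```

In this sketch every higher-level operation reduces to calls to the same two functions, which is the property that would let a small set of shared engines serve many operations at different times.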


Volume 35, Issue 8, Pages 1506-1523

Citations: 0

Joint Participant and Learning Topology Selection for Federated Learning in Edge Clouds

IF 5.6 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems

Pub Date : 2024-06-13 DOI: 10.1109/TPDS.2024.3413751

Xinliang Wei;Kejiang Ye;Xinghua Shi;Cheng-Zhong Xu;Yu Wang

Deploying federated learning (FL) in edge clouds poses challenges, especially when multiple models are concurrently trained in resource-constrained edge environments. Existing research on federated edge learning has predominantly focused on client selection for training a single FL model, typically with a fixed learning topology. Preliminary experiments indicate that FL models with adaptable topologies exhibit lower learning costs compared to those with fixed topologies. This paper delves into the intricacies of jointly selecting participants and learning topologies for multiple FL models simultaneously trained in the edge cloud. The problem is formulated as an integer non-linear programming problem, aiming to minimize total learning costs associated with all FL models while adhering to edge resource constraints. To tackle this challenging optimization problem, we introduce a two-stage algorithm that decouples the original problem into two sub-problems and iteratively addresses them separately with efficient heuristics. Our method enhances resource competition and load balancing in edge clouds by allowing FL models to choose participants and learning topologies independently. Extensive experiments conducted with real-world networks and FL datasets confirm that our algorithm performs better, reducing average total costs by up to 33.5% and 39.6% compared with previous methods designed for multi-model FL.
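The decoupling described above, where participant selection and topology selection are solved alternately, can be sketched on a toy instance. Everything in this example is an assumption made for illustration: the server data, the `cost` function, the two topologies, and the exhaustive search that stands in for the paper's heuristics. It handles a single model with no resource constraints, so it shows only the alternating structure, not the authors' algorithm.

```python
from itertools import combinations

# Hypothetical toy data: server -> (position on a line, compute cost).
SERVERS = {"a": (0.0, 3.0), "b": (0.5, 1.0), "c": (5.0, 1.0), "d": (6.0, 2.0)}
TOPOLOGIES = ("star", "ring")


def cost(participants, topology):
    """Made-up learning cost: compute cost plus topology-dependent communication."""
    compute = sum(SERVERS[s][1] for s in participants)
    pos = sorted(SERVERS[s][0] for s in participants)
    if topology == "star":
        hub = pos[len(pos) // 2]                       # median participant is the hub
        comm = sum(abs(p - hub) for p in pos) + (len(pos) - 1)  # plus hub congestion
    else:                                              # ring: neighbours plus wrap-around
        comm = 2 * (pos[-1] - pos[0])
    return compute + comm


def alternate(k, start_participants, start_topology):
    """Alternate between the two sub-problems until the cost stops improving."""
    parts, topo = tuple(start_participants), start_topology
    best = cost(parts, topo)
    candidates = list(combinations(sorted(SERVERS), k))
    while True:
        # Sub-problem 1: best k participants for the current topology.
        parts = min(candidates, key=lambda p: cost(p, topo))
        # Sub-problem 2: best topology for the chosen participants.
        topo = min(TOPOLOGIES, key=lambda t: cost(parts, t))
        new = cost(parts, topo)
        if new >= best:                                # no further improvement
            return parts, topo, new
        best = new


parts, topo, total = alternate(2, ("a", "d"), "star")
```

Each round can only keep or lower the cost, so the loop terminates, but like any alternating scheme it may stop at a local optimum rather than the global one.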


Volume 35, Issue 8, Pages 1456-1468

Citations: 0