
IEEE Transactions on Parallel and Distributed Systems: Latest Publications

Exploiting Temporal-Unrolled Parallelism for Energy-Efficient SNN Acceleration
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-18 | DOI: 10.1109/TPDS.2024.3415712
Fangxin Liu;Zongwu Wang;Wenbo Zhao;Ning Yang;Yongbiao Chen;Shiyuan Huang;Haomin Li;Tao Yang;Songwen Pei;Xiaoyao Liang;Li Jiang
Event-driven spiking neural networks (SNNs) have demonstrated significant potential for achieving high energy and area efficiency. However, existing SNN accelerators suffer from issues such as high latency and energy consumption due to serial accumulation-comparison operations. This is mainly because SNN neurons integrate spikes, accumulate membrane potential, and generate output spikes when the potential exceeds a threshold. To address this, one approach is to leverage the sparsity of SNN spikes to reduce the number of time steps. However, this method can result in imbalanced workloads among neurons and limit the utilization of processing elements (PEs). In this paper, we present SATO, a temporal-parallel SNN accelerator that enables parallel accumulation of membrane potential for all time steps. SATO adopts a two-stage pipeline methodology, effectively decoupling neuron computations. This not only maintains accuracy but also unveils opportunities for fine-grained parallelism. By dividing the neuron computation into distinct stages, SATO enables the concurrent execution of spike accumulation for each time step, leveraging the parallel processing capabilities of modern hardware architectures. This not only enhances the overall efficiency of the accelerator but also reduces latency by exploiting parallelism at a granular level. The architecture of SATO includes a novel binary adder-search tree for generating the output spike train, effectively decoupling the chronological dependence in the accumulation-comparison operation. Furthermore, SATO employs a bucket-sort-based method to evenly distribute compressed workloads to all PEs, maximizing data locality of input spike trains. Experimental results on various SNN models demonstrate that SATO outperforms the well-known accelerator, the 8-bit version of “Eyeriss”, by $20.7\times$ in terms of speedup and $6.0\times$ in energy saving, on average. Compared to the state-of-the-art SNN accelerator “SpinalFlow”, SATO can also achieve a $4.6\times$ performance gain and $3.1\times$ energy reduction on average, which is quite impressive for inference.
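The decoupling idea can be illustrated with a plain integrate-and-fire neuron: once the per-step input currents are known, the membrane potential at every time step is a prefix sum, so all time steps can be accumulated in parallel and the output spikes recovered afterwards. Below is a minimal NumPy sketch of this principle for a single neuron, assuming non-negative input currents and reset-by-subtraction; it illustrates the general idea only, not SATO's actual binary adder-search tree or pipeline.

```python
import numpy as np

def temporal_parallel_if(weights, spikes, threshold):
    """Illustrative temporal-parallel integrate-and-fire neuron.

    weights:   (n_in,) synaptic weights of one neuron (assumed non-negative)
    spikes:    (T, n_in) binary input spike train over T time steps
    threshold: firing threshold V_th
    """
    currents = spikes @ weights        # per-step input current, all T steps at once
    potential = np.cumsum(currents)    # running membrane potential via prefix sum
    # With reset-by-subtraction, the spike count up to step t equals
    # floor(potential_t / threshold); a step emits a spike when that count grows.
    fired_count = np.floor_divide(potential, threshold)
    out = np.diff(fired_count, prepend=0) > 0
    return out.astype(np.int8)
```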
Citations: 0
High-Performance Hardware Acceleration Architecture for Cross-Silo Federated Learning
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3413718
Junxue Zhang;Xiaodian Cheng;Liu Yang;Jinbin Hu;Han Tian;Kai Chen
Cross-silo federated learning (FL) adopts various cryptographic operations to preserve data privacy, which introduces significant performance overhead. In this paper, we identify nine widely-used cryptographic operations and design an efficient hardware architecture to accelerate them. However, directly offloading them on hardware statically leads to (1) inadequate hardware acceleration due to the limited resources allocated to each operation; (2) insufficient resource utilization, since different operations are used at different times. To address these challenges, we propose FLASH, a high-performance hardware acceleration architecture for cross-silo FL systems. At its heart, FLASH extracts two basic operators—modular exponentiation and multiplication—behind the nine cryptographic operations and implements them as highly-performant engines to achieve adequate acceleration. Furthermore, it leverages a dataflow scheduling scheme to dynamically compose different cryptographic operations based on these basic engines to obtain sufficient resource utilization. We have implemented a fully-functional FLASH prototype with Xilinx VU13P FPGA and integrated it with FATE, the most widely-adopted cross-silo FL framework. Experimental results show that, for the nine cryptographic operations, FLASH achieves up to $14.0\times$ and $3.4\times$ acceleration over CPU and GPU, translating to up to $6.8\times$ and $2.0\times$ speedup for realistic FL applications, respectively. We finally evaluate the FLASH design as an ASIC, and it achieves $23.6\times$ performance improvement upon the FPGA prototype.
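The two-operator decomposition can be pictured with Paillier homomorphic encryption, a cryptosystem commonly used in cross-silo FL frameworks such as FATE: encrypting a message reduces entirely to modular exponentiation and modular multiplication, the two engines a FLASH-style design accelerates. The abstract does not enumerate the nine operations, so the Python sketch below is a plausible illustration rather than the paper's actual operator set.

```python
def mod_exp(base, exp, mod):
    """Square-and-multiply modular exponentiation: the first basic engine."""
    result, base = 1, base % mod
    while exp > 0:
        if exp & 1:
            result = (result * base) % mod  # the second engine: modular multiplication
        base = (base * base) % mod
        exp >>= 1
    return result

def paillier_encrypt(m, r, n):
    """Paillier encryption c = g^m * r^n mod n^2 with g = n + 1,
    expressed purely in terms of the two primitives above."""
    n2 = n * n
    return (mod_exp(n + 1, m, n2) * mod_exp(r, n, n2)) % n2
```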
Citations: 0
Joint Participant and Learning Topology Selection for Federated Learning in Edge Clouds
IF 5.6 | CAS Tier 2, Computer Science | Q1 Computer Science | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3413751
Xinliang Wei;Kejiang Ye;Xinghua Shi;Cheng-Zhong Xu;Yu Wang
Deploying federated learning (FL) in edge clouds poses challenges, especially when multiple models are concurrently trained in resource-constrained edge environments. Existing research on federated edge learning has predominantly focused on client selection for training a single FL model, typically with a fixed learning topology. Preliminary experiments indicate that FL models with adaptable topologies exhibit lower learning costs compared to those with fixed topologies. This paper delves into the intricacies of jointly selecting participants and learning topologies for multiple FL models simultaneously trained in the edge cloud. The problem is formulated as an integer non-linear programming problem, aiming to minimize the total learning costs associated with all FL models while adhering to edge resource constraints. To tackle this challenging optimization problem, we introduce a two-stage algorithm that decouples the original problem into two sub-problems and iteratively addresses them separately with efficient heuristics. Our method mitigates resource competition and improves load balancing in edge clouds by allowing FL models to choose participants and learning topologies independently. Extensive experiments conducted with real-world networks and FL datasets affirm the better performance of our algorithm, demonstrating average total costs up to 33.5% and 39.6% lower than those of previous methods designed for multi-model FL.
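The abstract does not give the exact formulation, but a joint participant-and-topology selection problem of this kind can be sketched in generic notation: let $x_{m,i} \in \{0,1\}$ indicate whether edge client $i$ trains model $m$, let $y_m$ be the learning topology chosen for model $m$ from a candidate set $\mathcal{T}_m$, and let $C_m$ denote the learning cost of model $m$. All symbols here are illustrative assumptions, not the paper's notation.

```latex
\begin{aligned}
\min_{x,\,y}\quad & \sum_{m=1}^{M} C_m(x_m, y_m) \\
\text{s.t.}\quad  & \sum_{m=1}^{M}\sum_{i \in \mathcal{N}_e} x_{m,i}\, r_{m,i} \le R_e
                    \quad \forall\, \text{edge server } e, \\
                  & x_{m,i} \in \{0,1\}, \quad y_m \in \mathcal{T}_m \quad \forall\, m, i,
\end{aligned}
```

where $r_{m,i}$ is the resource demand client $i$ places on its edge server for model $m$ and $R_e$ is server $e$'s capacity; the non-linearity enters through the dependence of $C_m$ on the chosen topology. A natural reading of the two-stage decoupling is to alternate between fixing the topologies $y$ while solving for the participants $x$, and vice versa.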
Citations: 0
Faster-BNI: Fast Parallel Exact Inference on Bayesian Networks
IF 5.6 | CAS Tier 2, Computer Science | Q1 Computer Science | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3414177
Jiantong Jiang;Zeyi Wen;Atif Mansoor;Ajmal Mian
Bayesian networks (BNs) have recently attracted more attention because they are interpretable machine learning models and enable a direct representation of causal relations between variables. However, exact inference on BNs is time-consuming, especially for complex problems, which hinders the widespread adoption of BNs. To improve the efficiency, we propose a fast BN exact inference method named Faster-BNI on multi-core CPUs. Faster-BNI enhances the efficiency of a well-known BN exact inference algorithm, namely the junction tree algorithm, through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. Moreover, we identify that the bottleneck of BN exact inference methods lies in recursively updating the potential tables of the network. To reduce the table update cost, Faster-BNI employs novel optimizations, including the reduction of potential tables and re-organizing the potential table storage, to avoid unnecessary memory consumption and simplify potential table operations. Comprehensive experiments on real-world BNs show that the sequential version of Faster-BNI outperforms the existing sequential implementation by 9 to 22 times, and the parallel version of Faster-BNI achieves up to 11 times faster inference than its parallel counterparts.
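The potential-table bottleneck is visible in the junction tree algorithm's inner loop: message passing repeatedly multiplies a clique's potential table with incoming messages and marginalizes the result, and each product touches every entry of a table whose size grows exponentially with clique width. A minimal NumPy sketch of that table product follows; it illustrates the operation itself, not Faster-BNI's optimized storage layout.

```python
import numpy as np

def factor_product(vars_a, table_a, vars_b, table_b):
    """Multiply two potential tables over (possibly overlapping) variable sets.

    vars_a/vars_b are lists of variable names; table axes follow that order.
    This is the operation junction-tree message passing performs recursively,
    i.e., the cost that Faster-BNI's table optimizations target.
    """
    all_vars = list(dict.fromkeys(vars_a + vars_b))  # ordered union of variables
    def align(vars_x, table_x):
        # Reorder the table's axes to follow all_vars and insert singleton
        # axes for absent variables, so NumPy broadcasting does the product.
        perm = [vars_x.index(v) for v in all_vars if v in vars_x]
        shape = [table_x.shape[vars_x.index(v)] if v in vars_x else 1
                 for v in all_vars]
        return table_x.transpose(perm).reshape(shape)
    return all_vars, align(vars_a, table_a) * align(vars_b, table_b)
```

Marginalizing a variable out of the resulting table is then a single `sum` over its axis, which is where the recursive update cost accumulates.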
Citations: 0
CPLNS: Cooperative Parallel Large Neighborhood Search for Large-Scale Multi-Agent Path Finding
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-11 | DOI: 10.1109/TPDS.2024.3408030
Kai Chen;Qingjun Qu;Feng Zhu;Zhengming Yi;Wenjie Tang
The large-scale Multi-Agent Path Finding (MAPF) problem presents a significant challenge in combinatorial optimization. Currently, one of the advanced, near-optimal algorithms is Large Neighborhood Search (LNS), which can handle instances with thousands of agents. Although a basic portfolio parallel search based on multiple independent LNS solvers enhances speed and robustness, it encounters scalability issues with increasing CPU cores. To address this limitation, we propose the Cooperative Parallel LNS (CPLNS) algorithm, aimed at boosting parallel efficiency. The main challenge in cooperative parallel search lies in designing suitable portfolio and cooperative strategies that balance search diversification and intensification. To address this, we first analyze the characteristics of LNS. We then introduce a flexible group-based cooperative parallel strategy, where the current best solution is shared within each group to aid intensification, while maintaining diversification through independent group computations. Furthermore, we augment search diversification by integrating a simulated annealing-based LNS and bounded suboptimal single-agent pathfinding. We also introduce a rule-based methodology for portfolio construction to simplify parameter settings and improve search efficiency. Finally, we enhance communication and memory efficiency through a shared data filtering technique and optimized data structures. In benchmarks on 33 maps with 825 instances, CPLNS achieved a median speedup of 21.95 on a 32-core machine, solving 96.97% of cases within five minutes and reducing the average suboptimality score from 1.728 to 1.456. Additionally, tests with up to 10,000 agents verify CPLNS's scalability for large-scale MAPF problems.
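The group-based cooperation can be sketched as follows: each worker runs its own destroy-and-repair loop, a simulated-annealing acceptance rule preserves diversification, and improvements are published to (and occasionally re-seeded from) a best solution shared within the worker's group. Everything in the sketch below, including the function names and the re-seeding probability, is an illustrative assumption rather than CPLNS's actual interface.

```python
import math
import random

def lns_worker(shared_best, group_lock, destroy, repair, cost,
               t0=100.0, cooling=0.999, iters=10_000):
    """One solver in a CPLNS-style cooperative group (illustrative sketch)."""
    temp = t0
    current = shared_best["solution"]
    for _ in range(iters):
        candidate = repair(destroy(current))       # large-neighborhood move
        delta = cost(candidate) - cost(current)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate                    # SA acceptance: diversification
        with group_lock:                           # cooperation within the group
            if cost(current) < shared_best["cost"]:
                shared_best["solution"] = current  # publish an improvement
                shared_best["cost"] = cost(current)
            elif random.random() < 0.1:
                current = shared_best["solution"]  # intensify around the group best
        temp *= cooling
```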
Citations: 0
Accelerating Communication-efficient Federated Multi-Task Learning With Personalization and Fairness
IF 5.3 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-10 | DOI: 10.1109/tpds.2024.3411815
Renyou Xie, Chaojie Li, Xiaojun Zhou, Zhaoyang Dong
{"title":"Accelerating Communication-efficient Federated Multi-Task Learning With Personalization and Fairness","authors":"Renyou Xie, Chaojie Li, Xiaojun Zhou, Zhaoyang Dong","doi":"10.1109/tpds.2024.3411815","DOIUrl":"https://doi.org/10.1109/tpds.2024.3411815","url":null,"abstract":"","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
KLNK: Expanding Page Boundaries in a Distributed Shared Memory System
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-05 | DOI: 10.1109/TPDS.2024.3409882
Yi-Wei Ci;Michael R. Lyu;Zhan Zhang;De-Cheng Zuo;Xiao-Zong Yang
Software-based distributed shared memory (DSM) allows multiple processes to access shared data without the need for specialized hardware. However, this flexibility comes at a significant cost due to the need for data synchronization. One approach to mitigate these costs is to relax the consistency model, which can lead to delayed updates to the shared data. This approach typically requires the use of explicit synchronization primitives to regulate access to the shared memory and determine the timing of data synchronization. To circumvent the need for explicit synchronization, an alternative approach is to manage shared memory transparently using the underlying system. While this can simplify programming, it often imposes a fixed granularity for data sharing, which can limit the expansion of the coherence domain and increase the synchronization requirements. To overcome this limitation, we propose an abstraction called the elastic coherence domain, which dynamically adjusts the scope of data synchronization and is supported by the underlying system for transparent management of shared memory. The experimental results show that this approach can improve the efficiency of memory sharing in distributed environments.
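The elastic coherence domain can be pictured with a toy union-find structure: every page starts in its own domain, and when the system observes pages from different domains being accessed together, it merges their domains so one synchronization covers the widened scope. The sketch below is purely conceptual; KLNK's actual kernel-level mechanism is far more involved.

```python
class ElasticCoherence:
    """Toy model of elastic coherence domains (illustrative only)."""

    def __init__(self):
        self.parent = {}  # page -> union-find parent; roots identify domains

    def find(self, page):
        """Return the domain (root) a page currently belongs to."""
        self.parent.setdefault(page, page)
        while self.parent[page] != page:
            self.parent[page] = self.parent[self.parent[page]]  # path halving
            page = self.parent[page]
        return page

    def record_shared_access(self, page_a, page_b):
        """Two pages touched together: expand the boundary by merging domains."""
        root_a, root_b = self.find(page_a), self.find(page_b)
        if root_a != root_b:
            self.parent[root_b] = root_a
```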
Citations: 0
FEUAGame: Fairness-Aware Edge User Allocation for App Vendors
IF 5.6 | CAS Tier 2, Computer Science | Q1 Computer Science | Pub Date: 2024-06-04 | DOI: 10.1109/TPDS.2024.3409548
Jingwen Zhou;Feifei Chen;Guangming Cui;Yong Xiang;Qiang He
Mobile edge computing (MEC) offers a new computing paradigm that moves computing and storage resources to the network edge to provide minimal service latency compared to cloud computing. Many research works have attempted to help app vendors allocate users to appropriate edge servers for high-performance service provisioning. However, existing edge user allocation (EUA) approaches have ignored fairness in users’ data rates caused by interference, which is crucial in service provisioning in the MEC environment. To pursue fairness in EUA, edge users need to be assigned to edge servers so their quality of experience can be ensured at minimum costs without significant service performance differences among them. In this paper, we make the first attempt to address this fair edge user allocation (FEUA) problem. Specifically, we formulate the FEUA problem, prove its $\mathcal{NP}$-hardness, and propose an optimal approach to solve small-scale FEUA problems. To accommodate large-scale FEUA scenarios, we propose a game-theoretic approach called FEUAGame that transforms the FEUA problem into a potential game that admits a Nash equilibrium. FEUAGame employs a decentralized algorithm to find the Nash equilibrium in the potential game as the solution to the FEUA problem. A widely-used real-world data set is utilised to experimentally compare the performance of FEUAGame to four representative approaches. The numerical outcomes show the effectiveness and efficiency of the proposed approaches in solving the FEUA problem.
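Because FEUAGame is a potential game, a decentralized best-response loop converges to a pure Nash equilibrium: each user in turn switches to the server that minimizes its own cost, and the process stops when no user can improve unilaterally. The sketch below assumes an interference-aware cost function `cost(user, server, assign)`; the signature and loop structure are illustrative, not the paper's algorithm verbatim.

```python
def best_response_dynamics(users, servers, cost, max_rounds=1000):
    """Decentralized Nash-equilibrium search for a potential game (sketch)."""
    assign = {u: servers[0] for u in users}  # arbitrary initial allocation
    for _ in range(max_rounds):
        changed = False
        for u in users:                      # each user best-responds in turn
            best = min(servers, key=lambda s: cost(u, s, assign))
            if cost(u, best, assign) < cost(u, assign[u], assign):
                assign[u] = best
                changed = True
        if not changed:  # no unilateral improvement left: Nash equilibrium
            return assign
    return assign
```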
Citations: 0
WASP: Efficient Power Management Enabling Workload-Aware, Self-Powered AIoT Devices
IF 5.3 | CAS Tier 2, Computer Science | Q1 Computer Science | Pub Date: 2024-06-03 | DOI: 10.1109/TPDS.2024.3408167
Xiaofeng Hou;Xuehan Tang;Jiacheng Liu;Chao Li;Luhong Liang;Kwang-Ting Cheng
The wide adoption of edge AI has heightened the demand for various battery-less and maintenance-free smart systems. Nevertheless, emerging Artificial Intelligence of Things (AIoT) applications are complex workloads showing increased power demand, diversified power usage patterns, and unique sensitivity to power management (PM) approaches. Existing AIoT devices cannot select the most appropriate PM tuning knob, and therefore they often make sub-optimal decisions. In addition, these PM solutions always assume a traditional power regulation circuit, which incurs non-negligible power loss and control overhead. This can greatly compromise the efficiency potential of AIoT. In this paper, we explore power management (PM) optimization for emerging self-powered AIoT devices. We propose WASP, a highly efficient power management scheme for workload-aware, self-powered AIoT devices. The novelty of WASP is twofold. First, it combines offline profiling and lightweight online control to select the most appropriate PM tuning knobs for the given DNN models. Second, it is well tailored to a reconfigurable voltage regulation module that can make the best use of the limited power budget. Our results show that WASP allows AIoT devices to accomplish 65.6% more inference tasks under a stringent power budget without any performance degradation compared with other existing approaches.
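The offline-profiling-plus-online-control combination can be imagined as a lookup-and-select step: an offline pass records, per model and per PM knob setting, the power draw and throughput; online, the controller picks the feasible setting with the highest throughput under the currently harvested power budget. The profile format, knob names, and numbers below are invented purely for illustration.

```python
def select_knob(profile, model, power_budget):
    """Pick the PM setting with the best throughput that fits the power budget."""
    feasible = [(cfg, tput) for cfg, (power, tput) in profile[model].items()
                if power <= power_budget]
    if not feasible:
        return "sleep"  # not enough harvested energy: idle until the budget recovers
    return max(feasible, key=lambda item: item[1])[0]

# Hypothetical offline profile: knob setting -> (power in mW, inferences/s)
profile = {"resnet8": {"dvfs_low": (18.0, 3.1),
                       "dvfs_high": (42.0, 6.4),
                       "duty_cycle": (9.5, 1.2)}}
print(select_knob(profile, "resnet8", power_budget=20.0))  # -> 'dvfs_low'
```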
Citations: 0
Communication-Efficient Regret-Optimal Distributed Online Convex Optimization
IF 5.3 | CAS Tier 2, Computer Science | Q1 Computer Science | Pub Date: 2024-05-21 | DOI: 10.1109/tpds.2024.3403883
Jiandong Liu, Lan Zhang, Fengxiang He, Chi Zhang, Shanyang Jiang, Xiang-Yang Li
{"title":"Communication-Efficient Regret-Optimal Distributed Online Convex Optimization","authors":"Jiandong Liu, Lan Zhang, Fengxiang He, Chi Zhang, Shanyang Jiang, Xiang-Yang Li","doi":"10.1109/tpds.2024.3403883","DOIUrl":"https://doi.org/10.1109/tpds.2024.3403883","url":null,"abstract":"","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141151939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0