
Latest Publications from IEEE Transactions on Parallel and Distributed Systems

Toward Materials Genome Big-Data: A Blockchain-Based Secure Storage and Efficient Retrieval Method
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-10 | DOI: 10.1109/TPDS.2024.3426275
Ran Wang;Cheng Xu;Xiaotong Zhang
With the advent of the era of data-driven material R&D, more and more countries have begun to build material Big Data sharing platforms to support the design and R&D of new materials. In the application of material Big Data sharing platforms, storage and retrieval are the basis of resource mining and analysis. However, achieving efficient storage and retrieval is not easy due to the multimodal, heterogeneous, and discrete characteristics of material data. At the same time, due to the lack of security mechanisms, ensuring the integrity and reliability of the original data is also a significant problem faced by researchers. Given these issues, this paper proposes a blockchain-based secure storage and efficient retrieval scheme. By introducing the Improved Merkle Tree (MMT) structure into the block, the transaction data on the chain and the original data in the off-chain cloud are mapped to each other through the material data template. Experimental results show that the proposed MMT structure improves retrieval efficiency with no significant impact on block creation efficiency. At the same time, MMT is superior to state-of-the-art retrieval methods in terms of efficiency, especially for range retrieval. The method proposed in this paper is better suited to the application needs of the material Big Data sharing platform, and its retrieval efficiency is significantly improved.
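The abstract does not detail the MMT layout, but the integrity guarantee it builds on is the standard Merkle-tree pattern: an on-chain root commits to off-chain data blocks, and any single block can be verified against that root with a logarithmic-size proof. A minimal Python sketch of that baseline primitive follows; the record contents and function names are illustrative, not taken from the paper.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Compute the Merkle root of a list of data blocks (bytes)."""
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate last node on odd levels
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Return the sibling hashes needed to verify leaves[index] against the root."""
    level = [sha256(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))   # (hash, sibling_is_left)
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(leaf, proof, root):
    h = sha256(leaf)
    for sibling, sibling_is_left in proof:
        h = sha256(sibling + h) if sibling_is_left else sha256(h + sibling)
    return h == root

# Example: off-chain material records committed to by a single on-chain root.
records = [b"alloy-record-1", b"alloy-record-2", b"alloy-record-3", b"alloy-record-4"]
root = merkle_root(records)
assert verify(records[2], merkle_proof(records, 2), root)
```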
Citations: 0
RADAR: A Skew-Resistant and Hotness-Aware Ordered Index Design for Processing-in-Memory Systems
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-09 | DOI: 10.1109/TPDS.2024.3424853
Yifan Hua;Shengan Zheng;Weihan Kong;Cong Zhou;Kaixin Huang;Ruoyan Ma;Linpeng Huang
Pointer chasing has become the performance bottleneck for today's in-memory indexes due to the memory wall. Emerging processing-in-memory (PIM) technologies are promising for mitigating this bottleneck by enabling low-latency memory access and aggregated memory bandwidth that scales with the number of PIM modules. Prior PIM-based indexes adopt a fixed granularity to partition the key space and maintain static heights of skiplist nodes among PIM modules to accelerate index operations on the skiplist, neglecting changes in the skewness and hotness of data access patterns at runtime. In this article, we present RADAR, an innovative PIM-friendly skiplist that dynamically partitions the key space among PIM modules to adapt to varying skewness. An offline learning-based model is employed to capture hotness changes and adjust the heights of skiplist nodes. On multiple datasets, RADAR achieves up to 198.2x performance improvement and consumes 47.4% less memory than state-of-the-art designs on real PIM hardware.
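For readers unfamiliar with the underlying index, a minimal textbook skiplist is sketched below. RADAR's contribution is to set node heights from a learned hotness model and to repartition the key space across PIM modules; this sketch only shows the classic structure with randomized heights, and all names are illustrative.

```python
import random

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height        # one forward pointer per level

class SkipList:
    MAX_HEIGHT = 16

    def __init__(self):
        self.head = Node(None, self.MAX_HEIGHT)

    def _random_height(self):
        # Geometric heights; RADAR instead derives heights from a hotness model.
        h = 1
        while h < self.MAX_HEIGHT and random.random() < 0.5:
            h += 1
        return h

    def search(self, key):
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]      # pointer chasing: the cost PIM offload targets
        node = node.next[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_HEIGHT
        node = self.head
        for lvl in range(self.MAX_HEIGHT - 1, -1, -1):
            while node.next[lvl] and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node
        new = Node(key, self._random_height())
        for lvl in range(len(new.next)):
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

sl = SkipList()
for k in [30, 10, 20, 50, 40]:
    sl.insert(k)
assert sl.search(20) and not sl.search(25)
```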
Citations: 0
Acceleration of Multi-Body Molecular Dynamics With Customized Parallel Dataflow
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-07-08 | DOI: 10.1109/TPDS.2024.3420441
Quan Deng;Qiang Liu;Ming Yuan;Xiaohui Duan;Lin Gan;Jinzhe Yang;Wenlai Zhao;Zhenxiang Zhang;Guiming Wu;Wayne Luk;Haohuan Fu;Guangwen Yang
FPGAs are drawing increasing attention for solving molecular dynamics (MD) problems and have already been applied to problems such as two-body potentials and force fields composed of these potentials, achieving competitive performance compared with traditional counterparts such as CPUs and GPUs. However, as far as we know, FPGA solutions for more complex, real-world MD problems, such as multi-body potentials, are rarely seen. This work explores the prospects of state-of-the-art FPGAs in accelerating multi-body potentials. An FPGA-based accelerator with customized parallel dataflow that features multi-body potential computation, motion update, and internode communication is designed. Major contributions include: (1) parallelization applied at different levels of the accelerator; (2) an optimized dataflow mixing an atom-level pipeline and a cell-level pipeline to achieve high throughput; (3) a mixed-precision method using different precision at different stages of the simulation; and (4) a communication-efficient method for internode communication. Experiments show that our single-node accelerator is over 2.7× faster than an 8-core CPU design, achieving 20.501 ns/day on a 55,296-atom system for the Tersoff simulation. Regarding power efficiency, our accelerator is 28.9× higher than an I7-11700 and 4.8× higher than an RTX 3090 when running the same test case.
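One standard ingredient behind atom-level and cell-level pipelines in MD codes is the linked-cell (cell list) decomposition, which restricts neighbor searches to adjacent cells. The NumPy sketch below shows that general technique under assumed box and cutoff values; it is not the paper's dataflow and the numbers are illustrative only.

```python
import numpy as np

def build_cell_list(positions, box, cutoff):
    """Assign atoms to cubic cells of edge >= cutoff so neighbor search
    only needs to visit the 27 surrounding cells of each atom."""
    n_cells = np.maximum((box // cutoff).astype(int), 1)
    cell_size = box / n_cells
    idx = (positions // cell_size).astype(int) % n_cells     # periodic wrap
    cells = {}
    for atom, c in enumerate(map(tuple, idx)):
        cells.setdefault(c, []).append(atom)
    return cells, n_cells

def neighbor_pairs(positions, box, cutoff):
    cells, n_cells = build_cell_list(positions, box, cutoff)
    pairs = []
    for (cx, cy, cz), atoms in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nb = ((cx + dx) % n_cells[0], (cy + dy) % n_cells[1], (cz + dz) % n_cells[2])
                    for i in atoms:
                        for j in cells.get(nb, []):
                            if i < j:
                                d = positions[i] - positions[j]
                                d -= box * np.round(d / box)   # minimum-image convention
                                if np.dot(d, d) < cutoff ** 2:
                                    pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
box = np.array([20.0, 20.0, 20.0])
pos = rng.uniform(0, 20.0, size=(200, 3))
print(len(neighbor_pairs(pos, box, cutoff=3.0)), "pairs within cutoff")
```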
Citations: 0
Pyxis: Scheduling Mixed Tasks in Disaggregated Datacenters
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-24 | DOI: 10.1109/TPDS.2024.3418620
Sheng Qi;Chao Jin;Mosharaf Chowdhury;Zhenming Liu;Xuanzhe Liu;Xin Jin
Disaggregating compute from storage is an emerging trend in cloud computing. Effectively utilizing resources in both the compute and storage pools is the key to high performance. The state-of-the-art scheduler provides optimal scheduling decisions for workloads with homogeneous tasks. However, cloud applications often generate a mix of tasks with diverse compute and IO characteristics, resulting in sub-optimal performance for existing solutions. We present Pyxis, a system that provides optimal scheduling decisions for mixed workloads in disaggregated datacenters with theoretical guarantees. Pyxis is capable of maximizing overall throughput while meeting latency SLOs. Pyxis decouples the scheduling of different tasks. Our insight is that the optimal solution has an "all-or-nothing" structure that can be captured by a single turning point in the spectrum of tasks. Based on task characteristics, the turning point partitions the tasks either all to storage nodes or all to compute nodes (none to storage nodes). We theoretically prove that the optimal solution has such a structure and design an online algorithm with sub-second convergence. We implement a prototype of Pyxis. Experiments on CloudLab with various synthetic and application workloads show that Pyxis improves throughput by 3–21× over the state-of-the-art solution.
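To make the turning-point idea concrete, the toy sketch below sorts tasks by an assumed compute-to-IO ratio and sweeps a single split point under a crude makespan model. The function names, cost model, and parameters are hypothetical and far simpler than Pyxis's actual formulation and proof.

```python
from dataclasses import dataclass

@dataclass
class Task:
    compute: float   # CPU work units
    io: float        # bytes moved if executed away from the data

def best_turning_point(tasks, storage_cpu, compute_cpu, net_bw):
    """Sweep a single split point over tasks sorted by compute/IO intensity:
    tasks before the point run on storage nodes, the rest on compute nodes.
    Returns (best_index, best_makespan) under a crude bottleneck model."""
    tasks = sorted(tasks, key=lambda t: t.compute / t.io)
    best = (0, float("inf"))
    for k in range(len(tasks) + 1):
        on_storage, on_compute = tasks[:k], tasks[k:]
        storage_time = sum(t.compute for t in on_storage) / storage_cpu
        compute_time = sum(t.compute for t in on_compute) / compute_cpu
        net_time = sum(t.io for t in on_compute) / net_bw   # only shipped tasks pay IO
        makespan = max(storage_time, compute_time, net_time)
        if makespan < best[1]:
            best = (k, makespan)
    return best

tasks = [Task(compute=c, io=i) for c, i in [(1, 8), (4, 2), (2, 6), (8, 1), (3, 3)]]
k, makespan = best_turning_point(tasks, storage_cpu=2.0, compute_cpu=8.0, net_bw=4.0)
print(f"send first {k} tasks (most IO-intensive) to storage nodes; makespan={makespan:.2f}")
```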
Citations: 0
Exploiting Temporal-Unrolled Parallelism for Energy-Efficient SNN Acceleration
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-18 | DOI: 10.1109/TPDS.2024.3415712
Fangxin Liu;Zongwu Wang;Wenbo Zhao;Ning Yang;Yongbiao Chen;Shiyuan Huang;Haomin Li;Tao Yang;Songwen Pei;Xiaoyao Liang;Li Jiang
Event-driven spiking neural networks (SNNs) have demonstrated significant potential for achieving high energy and area efficiency. However, existing SNN accelerators suffer from issues such as high latency and energy consumption due to serial accumulation-comparison operations. This is mainly because SNN neurons integrate spikes, accumulate membrane potential, and generate output spikes when the potential exceeds a threshold. To address this, one approach is to leverage the sparsity of SNN spikes to reduce the number of time steps. However, this method can result in imbalanced workloads among neurons and limit the utilization of processing elements (PEs). In this paper, we present SATO, a temporal-parallel SNN accelerator that enables parallel accumulation of membrane potential for all time steps. SATO adopts a two-stage pipeline methodology, effectively decoupling neuron computations. This not only maintains accuracy but also unveils opportunities for fine-grained parallelism. By dividing the neuron computation into distinct stages, SATO enables the concurrent execution of spike accumulation for each time step, leveraging the parallel processing capabilities of modern hardware architectures. This not only enhances the overall efficiency of the accelerator but also reduces latency by exploiting parallelism at a granular level. The architecture of SATO includes a novel binary adder-search tree for generating the output spike train, effectively decoupling the chronological dependence in the accumulation-comparison operation. Furthermore, SATO employs a bucket-sort-based method to evenly distribute compressed workloads to all PEs, maximizing data locality of input spike trains. Experimental results on various SNN models demonstrate that SATO outperforms the well-known accelerator, the 8-bit version of "Eyeriss", by 20.7× in terms of speedup and 6.0× in terms of energy saving, on average. Compared to the state-of-the-art SNN accelerator "SpinalFlow", SATO can also achieve a 4.6× performance gain and a 3.1× energy reduction on average, which is quite impressive for inference.
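The temporal parallelism is easiest to see in the simplest case: for an integrate-and-fire neuron without leak or reset, the membrane potential is a prefix sum of weighted input spikes, so accumulation over all time steps can be decoupled from the threshold comparison. The NumPy sketch below shows that decoupling; it deliberately omits the reset handling that SATO's two-stage pipeline and adder-search tree address, and all sizes are illustrative.

```python
import numpy as np

def serial_if(spike_train, weights, threshold):
    """Baseline: accumulate and compare one time step at a time."""
    v, out = 0.0, []
    for t in range(spike_train.shape[0]):
        v += spike_train[t] @ weights          # integrate this step's input spikes
        out.append(v >= threshold)
    return np.array(out)

def temporal_parallel_if(spike_train, weights, threshold):
    """Accumulate all time steps at once (stage 1), then compare (stage 2)."""
    per_step = spike_train @ weights           # weighted input per time step
    potential = np.cumsum(per_step)            # prefix sum over time
    return potential >= threshold

rng = np.random.default_rng(1)
T, n_in = 8, 16
spikes = (rng.random((T, n_in)) < 0.2).astype(float)
w = rng.normal(0, 0.5, n_in)
assert np.array_equal(serial_if(spikes, w, 1.0), temporal_parallel_if(spikes, w, 1.0))
```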
Citations: 0
High-Performance Hardware Acceleration Architecture for Cross-Silo Federated Learning
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3413718
Junxue Zhang;Xiaodian Cheng;Liu Yang;Jinbin Hu;Han Tian;Kai Chen
Cross-silo federated learning (FL) adopts various cryptographic operations to preserve data privacy, which introduces significant performance overhead. In this paper, we identify nine widely-used cryptographic operations and design an efficient hardware architecture to accelerate them. However, directly offloading them on hardware statically leads to (1) inadequate hardware acceleration due to the limited resources allocated to each operation; (2) insufficient resource utilization, since different operations are used at different times. To address these challenges, we propose FLASH, a high-performance hardware acceleration architecture for cross-silo FL systems. At its heart, FLASH extracts two basic operators—modular exponentiation and multiplication—behind the nine cryptographic operations and implements them as highly-performant engines to achieve adequate acceleration. Furthermore, it leverages a dataflow scheduling scheme to dynamically compose different cryptographic operations based on these basic engines to obtain sufficient resource utilization. We have implemented a fully-functional FLASH prototype with Xilinx VU13P FPGA and integrated it with FATE, the most widely-adopted cross-silo FL framework. Experimental results show that, for the nine cryptographic operations, FLASH achieves up to 14.0× and 3.4× acceleration over CPU and GPU, translating to up to 6.8× and 2.0× speedup for realistic FL applications, respectively. We finally evaluate the FLASH design as an ASIC, and it achieves 23.6× performance improvement upon the FPGA prototype.
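As one concrete illustration of how such cryptographic operations reduce to modular exponentiation and multiplication, the sketch below implements textbook Paillier encryption, an additively homomorphic scheme commonly used in cross-silo FL. The key sizes are toy values and this is not FLASH's implementation; it only shows that the encryption kernel is exactly the two operators the abstract names.

```python
import math, random

def paillier_keygen(p, q):
    """Toy Paillier key generation with g = n + 1 (textbook simplification)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    return (n, n + 1), (n, lam)          # public (n, g), private (n, lambda)

def encrypt(pk, m):
    n, g = pk
    n2 = n * n
    while True:                          # pick r coprime to n
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # exactly two modular exponentiations and one modular multiplication
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(sk, c):
    n, lam = sk
    n2 = n * n
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(n + 1, lam, n2)), -1, n)
    return (L(pow(c, lam, n2)) * mu) % n

pk, sk = paillier_keygen(101, 113)       # toy primes; real deployments use 2048+ bit keys
c1, c2 = encrypt(pk, 42), encrypt(pk, 58)
assert decrypt(sk, c1) == 42
# Additive homomorphism: multiplying ciphertexts adds plaintexts (mod n),
# which is what lets an FL server aggregate encrypted model updates.
assert decrypt(sk, (c1 * c2) % (pk[0] ** 2)) == 100
```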
Citations: 0
Joint Participant and Learning Topology Selection for Federated Learning in Edge Clouds
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3413751
Xinliang Wei;Kejiang Ye;Xinghua Shi;Cheng-Zhong Xu;Yu Wang
Deploying federated learning (FL) in edge clouds poses challenges, especially when multiple models are concurrently trained in resource-constrained edge environments. Existing research on federated edge learning has predominantly focused on client selection for training a single FL model, typically with a fixed learning topology. Preliminary experiments indicate that FL models with adaptable topologies exhibit lower learning costs compared to those with fixed topologies. This paper delves into the intricacies of jointly selecting participants and learning topologies for multiple FL models simultaneously trained in the edge cloud. The problem is formulated as an integer non-linear programming problem, aiming to minimize the total learning cost of all FL models while adhering to edge resource constraints. To tackle this challenging optimization problem, we introduce a two-stage algorithm that decouples the original problem into two sub-problems and iteratively addresses them separately with efficient heuristics. Our method enhances resource competition and load balancing in edge clouds by allowing FL models to choose participants and learning topologies independently. Extensive experiments conducted with real-world networks and FL datasets affirm the better performance of our algorithm, demonstrating average total costs up to 33.5% and 39.6% lower than previous methods designed for multi-model FL.
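A toy sketch of the decoupling idea, under an entirely made-up cost model: one sub-problem greedily picks participants with topologies fixed, the other picks each model's topology with participants fixed, and the two alternate until the decision stabilizes. The node names, costs, and topology options are hypothetical and far simpler than the paper's formulation and heuristics.

```python
nodes = ["n1", "n2", "n3", "n4"]
models = ["A", "B"]
need = {"A": 2, "B": 2}                                   # participants per model
comp = {"A": {"n1": 3, "n2": 4, "n3": 6, "n4": 5},        # per-node training cost
        "B": {"n1": 5, "n2": 2, "n3": 3, "n4": 4}}
# Per-node communication cost under each learning topology (made-up numbers).
comm = {"star": {"n1": 1, "n2": 4, "n3": 1, "n4": 3},
        "ring": {"n1": 3, "n2": 1, "n3": 2, "n4": 1}}

def pick_participants(topology):
    """Sub-problem 1: with topologies fixed, greedily assign the cheapest free nodes."""
    taken, chosen = set(), {}
    for m in models:
        free = sorted((n for n in nodes if n not in taken),
                      key=lambda n: comp[m][n] + comm[topology[m]][n])
        chosen[m] = free[:need[m]]
        taken.update(chosen[m])
    return chosen

def pick_topologies(participants):
    """Sub-problem 2: with participants fixed, pick each model's cheapest topology."""
    return {m: min(comm, key=lambda t: sum(comm[t][n] for n in participants[m]))
            for m in models}

participants, topology = {}, {m: "star" for m in models}
for _ in range(10):                       # alternate until the decision stabilizes
    new_p = pick_participants(topology)
    new_t = pick_topologies(new_p)
    if (new_p, new_t) == (participants, topology):
        break
    participants, topology = new_p, new_t
print(participants, topology)
```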
Citations: 0
Faster-BNI: Fast Parallel Exact Inference on Bayesian Networks
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-13 | DOI: 10.1109/TPDS.2024.3414177
Jiantong Jiang;Zeyi Wen;Atif Mansoor;Ajmal Mian
Bayesian networks (BNs) have recently attracted more attention because they are interpretable machine learning models that enable a direct representation of causal relations between variables. However, exact inference on BNs is time-consuming, especially for complex problems, which hinders the widespread adoption of BNs. To improve efficiency, we propose a fast BN exact inference method named Faster-BNI for multi-core CPUs. Faster-BNI enhances the efficiency of a well-known BN exact inference algorithm, the junction tree algorithm, through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. Moreover, we identify that the bottleneck of BN exact inference methods lies in recursively updating the potential tables of the network. To reduce the table update cost, Faster-BNI employs novel optimizations, including reducing potential tables and reorganizing potential table storage, to avoid unnecessary memory consumption and simplify potential table operations. Comprehensive experiments on real-world BNs show that the sequential version of Faster-BNI outperforms existing sequential implementations by 9 to 22 times, and the parallel version of Faster-BNI achieves up to 11 times faster inference than its parallel counterparts.
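The "potential tables" at the heart of junction tree inference are multidimensional arrays that are repeatedly multiplied and marginalized. A minimal NumPy illustration of those two kernel operations follows; the factors are made up, and the paper's storage reorganization and parallel update scheme are not shown.

```python
import numpy as np

# Potential tables over binary variables; axes are labeled by variable name.
# phi1(A, B) and phi2(B, C) are illustrative factors, not from a real network.
phi1 = np.array([[0.9, 0.1],
                 [0.4, 0.6]])            # axes: (A, B)
phi2 = np.array([[0.7, 0.3],
                 [0.2, 0.8]])            # axes: (B, C)

# Factor product: psi(A, B, C) = phi1(A, B) * phi2(B, C)
psi = np.einsum("ab,bc->abc", phi1, phi2)

# Marginalization: message to a neighboring clique, summing out A and C
message_B = psi.sum(axis=(0, 2))

# Updating a clique's potential with an incoming message (broadcast over B's axis)
updated_phi2 = phi2 * message_B[:, None]

print(psi.shape, message_B, updated_phi2)
```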
Citations: 0
CPLNS: Cooperative Parallel Large Neighborhood Search for Large-Scale Multi-Agent Path Finding
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-11 | DOI: 10.1109/TPDS.2024.3408030
Kai Chen;Qingjun Qu;Feng Zhu;Zhengming Yi;Wenjie Tang
The large-scale Multi-Agent Path Finding (MAPF) problem presents a significant challenge in combinatorial optimization. Currently, one of the advanced, near-optimal algorithms is Large Neighborhood Search (LNS), which can handle instances with thousands of agents. Although a basic portfolio parallel search based on multiple independent LNS solvers enhances speed and robustness, it encounters scalability issues with increasing CPU cores. To address this limitation, we propose the Cooperative Parallel LNS (CPLNS) algorithm, aimed at boosting parallel efficiency. The main challenge in cooperative parallel search lies in designing suitable portfolio and cooperative strategies that balance search diversification and intensification. To address this, we first analyze the characteristics of LNS. We then introduce a flexible group-based cooperative parallel strategy, where the current best solution is shared within each group to aid intensification, while maintaining diversification through independent group computations. Furthermore, we augment search diversification by integrating a simulated annealing-based LNS and bounded suboptimal single-agent pathfinding. We also introduce a rule-based methodology for portfolio construction to simplify parameter settings and improve search efficiency. Finally, we enhance communication and memory efficiency through a shared data filtering technique and optimized data structures. In benchmarks on 33 maps with 825 instances, CPLNS achieved a median speedup of 21.95 on a 32-core machine, solving 96.97% of cases within five minutes and reducing the average suboptimality score from 1.728 to 1.456. Additionally, tests with up to 10,000 agents verify CPLNS's scalability for large-scale MAPF problems.
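For context, a generic destroy-and-repair LNS loop with a simulated-annealing acceptance rule is sketched below on a toy path-assignment problem. It illustrates the search pattern the paper builds on, not CPLNS's MAPF-specific neighborhoods, portfolio construction, or cooperative parallel strategy; all numbers and penalties are illustrative.

```python
import math, random

random.seed(0)
AGENTS, PATHS = 12, 5
# Toy instance: cost of each candidate path per agent, plus a penalty when two
# agents pick the same path index (a stand-in for path conflicts in MAPF).
cost = [[random.randint(1, 9) for _ in range(PATHS)] for _ in range(AGENTS)]

def total_cost(sol):
    c = sum(cost[a][sol[a]] for a in range(AGENTS))
    clashes = sum(sol.count(p) - 1 for p in set(sol) if sol.count(p) > 1)
    return c + 10 * clashes

def repair(sol, removed):
    """Greedy repair: reassign each removed agent to its cheapest choice given the rest."""
    for a in removed:
        sol[a] = min(range(PATHS), key=lambda p: total_cost(sol[:a] + [p] + sol[a + 1:]))
    return sol

def lns(iters=300, destroy_size=3, temp=5.0, cooling=0.99):
    cur = [random.randrange(PATHS) for _ in range(AGENTS)]
    cur_cost = total_cost(cur)
    best, best_cost = cur[:], cur_cost
    for _ in range(iters):
        removed = random.sample(range(AGENTS), destroy_size)   # destroy a neighborhood
        cand = repair(cur[:], removed)
        cand_cost = total_cost(cand)
        # Simulated-annealing acceptance: always take improvements, sometimes worse moves.
        if cand_cost < cur_cost or random.random() < math.exp((cur_cost - cand_cost) / temp):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur[:], cur_cost
        temp *= cooling
    return best, best_cost

print(lns())
```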
Citations: 0
Accelerating Communication-Efficient Federated Multi-Task Learning With Personalization and Fairness
IF 5.6 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2024-06-10 | DOI: 10.1109/TPDS.2024.3411815
Renyou Xie;Chaojie Li;Xiaojun Zhou;Zhaoyang Dong
Federated learning techniques provide a promising framework for collaboratively training a machine learning model without sharing users' data, and deliver a security solution that guarantees privacy during model training on IoT devices. Nonetheless, challenges posed by data heterogeneity and communication resource constraints make it difficult to develop federated learning algorithms with fast convergence. This can significantly degrade the quality of service for critical machine learning tasks, e.g., facial recognition, which require an edge-ready, low-power, low-latency training algorithm. To address these challenges, a communication-efficient federated learning approach is proposed in this paper, in which the momentum technique is leveraged to accelerate the convergence rate while largely reducing communication requirements. First, a federated multi-task learning framework in which the learning tasks are reformulated as a multi-objective optimization problem is introduced to address data heterogeneity. The multiple gradient descent algorithm is harnessed to find a common gradient descent direction for all participants, so that common features can be learned without sacrificing any client's performance. Second, to reduce communication costs, a local momentum technique with global information is developed to speed up the convergence rate, and the convergence of the proposed method in the non-convex case is analyzed. It is proved that the proposed local momentum achieves the same acceleration as global momentum, while being more robust than algorithms that rely solely on acceleration by global momentum. Third, the generalization of the proposed acceleration approach is investigated, as demonstrated by an accelerated variant of FedAvg. Finally, the performance of the proposed method in terms of learning model accuracy, convergence rate, and robustness to data heterogeneity is investigated through empirical experiments on four public datasets, and a real-world IoT platform is constructed to demonstrate the communication efficiency of the proposed method.
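For two tasks, the multiple gradient descent algorithm (MGDA) has a well-known closed form for the min-norm convex combination of the task gradients; the sketch below pairs it with a plain heavy-ball momentum step on toy quadratic objectives. The federated orchestration, the global-information term in the paper's local momentum, and the fairness mechanism are not reproduced here, and all numbers are illustrative.

```python
import numpy as np

def common_descent_direction(g1, g2):
    """Min-norm element of the convex hull of two task gradients (two-task MGDA)."""
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:
        gamma = 0.5
    else:
        gamma = float(np.clip(((g2 - g1) @ g2) / denom, 0.0, 1.0))
    return gamma * g1 + (1.0 - gamma) * g2

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Plain heavy-ball momentum; the paper's local momentum also folds in
    global information shared by the server, which is omitted here."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

# Toy setup: two quadratic objectives with different minima, one shared parameter vector.
w = np.array([4.0, -3.0])
v = np.zeros_like(w)
for _ in range(50):
    g1 = 2 * (w - np.array([1.0, 0.0]))    # gradient of ||w - a||^2
    g2 = 2 * (w - np.array([0.0, 1.0]))    # gradient of ||w - b||^2
    d = common_descent_direction(g1, g2)
    w, v = momentum_step(w, d, v)
print(w)   # settles on a Pareto-stationary point between the two minima
```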
Citations: 0