Pub Date: 2025-08-19 | DOI: 10.1016/j.jpdc.2025.105168
Duo Yang , Bing Hu , Yunqi Gao , A-Long Jin , An Liu , Kwan L. Yeung , Yang You
Federated learning enables collaborative model training among numerous clients. However, existing participant/client selection methods fail to fully leverage the advantages of clients with excellent computational or communication capabilities. In this paper, we propose HeaPS, a novel Heterogeneity-aware Participant Selection framework for efficient federated learning. We introduce a finer-grained global selection algorithm to select communication-strong leaders and computation-strong members from candidate clients. The leaders are responsible for communicating with the server to reduce per-round duration, as well as contributing gradients; while the members communicate with the leaders to contribute more gradients obtained from high-utility data to the global model and improve the final model accuracy. Meanwhile, we develop a gradient migration path generation algorithm to match the optimal leader for each member. We also design the client scheduler to facilitate parallel local training of leaders and members based on gradient migration. Experimental results show that, in comparison with state-of-the-art methods, HeaPS achieves a speedup of up to 3.20× in time-to-accuracy performance and improves the final accuracy by up to 3.57%. The code for HeaPS is available at https://github.com/Dora233/HeaPS.
{"title":"HeaPS: Heterogeneity-aware participant selection for efficient federated learning","authors":"Duo Yang , Bing Hu , Yunqi Gao , A-Long Jin , An Liu , Kwan L. Yeung , Yang You","doi":"10.1016/j.jpdc.2025.105168","DOIUrl":"10.1016/j.jpdc.2025.105168","url":null,"abstract":"<div><div>Federated learning enables collaborative model training among numerous clients. However, existing participant/client selection methods fail to fully leverage the advantages of clients with excellent computational or communication capabilities. In this paper, we propose HeaPS, a novel Heterogeneity-aware Participant Selection framework for efficient federated learning. We introduce a finer-grained global selection algorithm to select communication-strong leaders and computation-strong members from candidate clients. The leaders are responsible for communicating with the server to reduce per-round duration, as well as contributing gradients; while the members communicate with the leaders to contribute more gradients obtained from high-utility data to the global model and improve the final model accuracy. Meanwhile, we develop a gradient migration path generation algorithm to match the optimal leader for each member. We also design the client scheduler to facilitate parallel local training of leaders and members based on gradient migration. Experimental results show that, in comparison with state-of-the-art methods, HeaPS achieves a speedup of up to 3.20× in time-to-accuracy performance and improves the final accuracy by up to 3.57%. 
The code for HeaPS is available at <span><span>https://github.com/Dora233/HeaPS</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"206 ","pages":"Article 105168"},"PeriodicalIF":4.0,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144887402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
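The two-stage selection described above can be sketched at a high level: pick communication-strong clients as leaders, computation-strong clients with high-utility data as members, and match each member to a cheap migration path. This is a minimal, hypothetical Python sketch; the scoring formulas, field names, and greedy matching are illustrative assumptions, not the paper's algorithm.

```python
# Hypothetical sketch of heterogeneity-aware selection in the spirit of
# HeaPS. All names and scoring rules are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Client:
    cid: int
    bandwidth: float      # MB/s to the server (communication strength)
    compute: float        # samples/s of local training (computation strength)
    data_utility: float   # statistical utility of the local data

def select_participants(clients, n_leaders, n_members):
    # Leaders: strongest uplink to the server, to shorten per-round duration.
    leaders = sorted(clients, key=lambda c: c.bandwidth, reverse=True)[:n_leaders]
    rest = [c for c in clients if c not in leaders]
    # Members: strong compute, weighted by data utility, to contribute
    # more high-utility gradients.
    members = sorted(rest, key=lambda c: c.compute * c.data_utility,
                     reverse=True)[:n_members]
    return leaders, members

def match_members(leaders, members, link_cost):
    # Greedy migration-path generation: each member forwards gradients to
    # the leader reachable over the cheapest member->leader link.
    return {m.cid: min(leaders, key=lambda l: link_cost(m, l)).cid
            for m in members}
```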
Pub Date: 2025-08-18 | DOI: 10.1016/j.jpdc.2025.105170
Maxime Gonthier , Loris Marchal , Samuel Thibault
Hardware accelerators like GPUs now provide a large part of the computational power used for scientific simulations. Despite their efficacy, GPUs possess limited memory and are connected to the main memory of the machine via a bandwidth-limited bus. Scientific simulations often operate on very large data that surpass the GPU's memory capacity. Therefore, one has to turn to out-of-core computing: data is kept in a remote, slower memory (CPU memory) and moved back and forth from/to the device memory (GPU memory), a process also present for multicore CPUs with limited memory. In both cases, data movement quickly becomes a performance bottleneck. Task-based runtime schedulers have emerged as a convenient and efficient way to manage large applications on such heterogeneous platforms. We propose a scheduler for task-based runtimes that improves data locality for out-of-core linear algebra computations, reducing data movement. We design a data-aware strategy for both task scheduling and data eviction from limited memories. We compare this scheduler to existing schedulers in runtime systems. Using StarPU, we show that our new scheduling strategy achieves comparable performance when memory is not a constraint, and significantly better performance when application input data exceeds memory, on both GPUs and CPU cores.
{"title":"A scheduler to foster data locality for GPU and out-of-core task-based linear algebra applications","authors":"Maxime Gonthier , Loris Marchal , Samuel Thibault","doi":"10.1016/j.jpdc.2025.105170","DOIUrl":"10.1016/j.jpdc.2025.105170","url":null,"abstract":"<div><div>Hardware accelerators like GPUs now provide a large part of the computational power used for scientific simulations. Despite their efficacy, GPUs possess limited memory and are connected to the main memory of the machine via a bandwidth limited bus. Scientific simulations often operate on very large data, that surpasses the GPU's memory capacity. Therefore, one has to turn to <strong>out-of-core</strong> computing: data is kept in a remote, slower memory (CPU memory), and moved back and forth from/to the device memory (GPU memory), a process also present for multicore CPUs with limited memory. In both cases, data movement quickly becomes a performance bottleneck. Task-based runtime schedulers have emerged as a convenient and efficient way to manage large applications on such heterogeneous platforms. <strong>We propose a scheduler for task-based runtimes</strong> that improves <strong>data locality</strong> for out-of-core linear algebra computations, to reduce data movement. We design a data-aware strategy for both task scheduling and data eviction from limited memories. We compare this scheduler to existing schedulers in runtime systems. 
Using <span>StarPU</span>, we show that our new scheduling strategy achieves comparable performance when memory is not a constraint, and significantly better performance when application input data exceeds memory, on both GPUs and CPU cores.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"206 ","pages":"Article 105170"},"PeriodicalIF":4.0,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144866099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
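The data-aware eviction idea can be illustrated with a minimal sketch: when device memory is full, evict the resident block whose next use by a pending task is furthest in the future. This is a generic Belady-style illustration under assumed data structures, not StarPU's actual policy or API.

```python
# Illustrative data-aware eviction for out-of-core task scheduling.
# Each queued task is modeled as the set of block ids it touches.
def next_use(block, task_queue):
    """Index of the first queued task that touches `block` (inf if none)."""
    for i, task in enumerate(task_queue):
        if block in task:
            return i
    return float("inf")

def choose_victim(resident_blocks, task_queue):
    # Prefer blocks no pending task needs; otherwise pick the block whose
    # next use is most distant, minimizing re-fetches over the slow bus.
    return max(resident_blocks, key=lambda b: next_use(b, task_queue))
```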
Pub Date: 2025-08-14 | DOI: 10.1016/S0743-7315(25)00131-5
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(25)00131-5","DOIUrl":"10.1016/S0743-7315(25)00131-5","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105164"},"PeriodicalIF":4.0,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144842214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-08-05 | DOI: 10.1016/j.jpdc.2025.105160
Bin Yu , Lei Chen , He Zhao , Zhiyu Ma , Haotian Cheng , Xiaoting Zhang , Liang Sun , Tong Zhou , Nianzu Sheng
While the industrial use of blockchain technology is becoming more widespread, blockchain scalability remains one of the primary challenges in large-scale practical applications. Many blockchain projects are introducing separation schemes to solve their scalability problems. In this paper, we propose a comprehensive separation scheme, SoRCS, which separates the node role, the chain, and the data storage. It makes full use of each node's resources, reduces the load on the nodes, and improves the degree of decentralization. Ordering of verified transactions, execution of ordered transactions, and confirmation of ordering and execution blocks run concurrently within different sub-networks to improve blockchain performance. Based on the results of the block consensus, we provide a three-phase response: documented, executed, and confirmed.
Based on the SoRCS architecture, we also implement a prototype system of 1200 nodes to evaluate our separation schemes. Its peak throughput is 14.7 Ktps and its latency is around 0.5 s. The three-phase response mitigates this latency: the first response arrives in around 0.15 s.
{"title":"SoRCS: A scalable blockchain model with separation of role, chain and storage","authors":"Bin Yu , Lei Chen , He Zhao , Zhiyu Ma , Haotian Cheng , Xiaoting Zhang , Liang Sun , Tong Zhou , Nianzu Sheng","doi":"10.1016/j.jpdc.2025.105160","DOIUrl":"10.1016/j.jpdc.2025.105160","url":null,"abstract":"<div><div>The industrial use of blockchain technology is becoming more widespread, the scalability of blockchain is still one of the primary challenges in large-scale practical applications. Separation schemes are being introduced by many blockchain projects to solve their scalability problems. In this paper, we propose a comprehensive separation scheme SoRCS, which separates the node role, the chain, and the data storage. It makes full use of the resources of each node, reduces the load on the nodes, and improves the degree of decentralization. Ordering of verified transactions, execution of ordered transactions, confirmation of ordering and execution blocks run concurrently within different sub-networks to improve blockchain performance. Based on the results of the block consensus, we provide a three-phase response: documented, executed, and confirmed.</div><div>Based on the SoRCS architecture, we also implement a prototype system that consists of 1200 nodes to evaluate our separation schemes. Its peak throughput is 14.7 Ktps and its latency is around 0.5 s. 
We use the three-phase response time to avoid the issue of higher latency, and the first response time is around 0.15 s.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"206 ","pages":"Article 105160"},"PeriodicalIF":4.0,"publicationDate":"2025-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144772622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
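The three-phase response can be pictured as a small transaction state machine: a client acts on the earliest phase it trusts, trading certainty for latency. The phase names follow the abstract; the class and method names below are assumptions for illustration.

```python
# Toy state machine for SoRCS's three-phase transaction response.
# documented (ordered into a block) -> executed -> confirmed.
PHASES = ["documented", "executed", "confirmed"]

class TxStatus:
    def __init__(self):
        self.phase = None

    def advance(self):
        # Move to the next phase; stay at "confirmed" once reached.
        i = -1 if self.phase is None else PHASES.index(self.phase)
        if i + 1 < len(PHASES):
            self.phase = PHASES[i + 1]
        return self.phase
```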
Pub Date: 2025-07-31 | DOI: 10.1016/j.jpdc.2025.105159
Pingyang Huang, Chenxi Liu, Jilong Yang, Ting Chen, Zhiyuan Cheng
NAND-Flash-based mobile devices have gained increasing popularity. However, fragmentation in flash storage significantly impedes the I/O performance of the system, leading to a poor user experience. Currently, logical fragmentation is decoupled from physical fragmentation, and garbage collection is typically controlled by the Flash Translation Layer (FTL), degrading garbage collection efficiency. In this paper, a novel fragmentation handling strategy, the File System controlled Coupled Defragmenter (FSCD), is proposed, in which the file system controls garbage collection and couples logical and physical fragmentation, synchronizing logical and physical defragmentation. As a result, FSCD can significantly reduce fragmentation and improve system performance in the fragmented state. Experimental results show that Sequential Read/Sequential Write and Random Read/Random Write performance improved by 393.7%, 356.2% and 126.0%, 296.0% over the conventional FTL, respectively. FSCD alleviates fragmentation, improves I/O performance, and enables a better user experience, providing a solution for the next generation of NAND-flash-based mobile storage systems.
FSCD: File system controlled coupled defragmenter for mobile storage systems. Journal of Parallel and Distributed Computing, vol. 206, Article 105159.
Pub Date: 2025-07-29 | DOI: 10.1016/j.jpdc.2025.105161
Saugat Sharma, Grzegorz Chmaj, Henry Selvaraj
In IoT systems, issues like data loss can severely impact application performance, often resulting in missed deadlines in real-time operating systems (RTOS). Missed deadlines can lead to incomplete or delayed data, compromising system reliability and user safety. To address these challenges, this paper proposes a solution that reduces missed deadlines and effectively manages data gaps when they occur. It presents a novel approach, FP-DVFS-CC, which combines Fixed Priority (FP) scheduling with Dynamic Voltage and Frequency Scaling (DVFS) and Cycle Conserving (CC) to dynamically adjust CPU speeds in RTOS-based IoT systems. This method optimizes deadline adherence by adapting processing power to task priority and system load, significantly reducing missed deadlines and energy consumption in real-time applications. Additionally, it proposes an MQTT-based cache to handle the delivery of missed messages carrying sensor data and suggests using machine learning to provide synthetic data to the system when data is absent. Further, to address data gaps due to sensor malfunctions or delays, we introduce FeatureSync, an imputation technique that integrates only highly correlated features with machine learning algorithms to generate synthetic data. Experimental validation demonstrates enhanced performance with machine learning integration. Treating missing data as zero instead of imputing it lowers accuracy and increases prediction error, as seen in the case analysis; data imputation reduces these errors. Testing on datasets with varied missing data reaffirms the approach's effectiveness.
{"title":"Sensor failure mitigation in RTOS based internet of things (IoT) systems using machine learning","authors":"Saugat Sharma, Grzegorz Chmaj, Henry Selvaraj","doi":"10.1016/j.jpdc.2025.105161","DOIUrl":"10.1016/j.jpdc.2025.105161","url":null,"abstract":"<div><div>In IoT systems, issues like data loss can severely impact application performance, often resulting in missed deadlines in real-time operating systems (RTOS). Missed deadlines can lead to incomplete or delayed data, compromising system reliability and user safety. To address these challenges, this paper proposes a solution that reduces missed deadlines and effectively manages data gaps when they occur. This paper presents a novel approach, FP-DVFS-CC, which combines Fixed Priority (FP) scheduling with Dynamic Voltage and Frequency Scaling (DVFS) and Cycle Conserving (CC) to dynamically adjust CPU speeds in RTOS-based IoT systems. This method optimizes deadline adherence by adapting processing power based on task priority and system load, significantly reducing missed deadlines and energy consumption in real-time applications. Additionally, it proposes a MQTT-based cache to handle the delivery of missed messages with sensor data and suggests using machine learning to provide synthetic data to the system in case of data absence. Further, for addressing data gaps due to sensor malfunctions or delays, we introduce FeatureSync, an imputation technique that utilizes only highly correlated features integrating with machine learning algorithms to generate synthetic data. Experimental validation demonstrates enhanced performance with machine learning integration. If missing data is treated as zero instead of performing data imputation, it will lower accuracy, as seen in the case analysis, and increase prediction error. However, using data imputation can reduce the errors caused by treating missing values as zero. 
Testing on datasets with varied missing data reaffirms the approach's effectiveness.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"206 ","pages":"Article 105161"},"PeriodicalIF":4.0,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144779604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
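The Cycle Conserving component can be illustrated with the classic utilization-based rule from cycle-conserving DVFS: run at the lowest available frequency whose capacity covers the current task utilization, and recompute (lowering speed) when a task finishes early and its actual cycles replace the worst case. The frequency set and task representation below are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the cycle-conserving speed selection behind FP-DVFS-CC.
FREQS = [0.5, 0.75, 1.0]   # normalized available CPU speeds (assumed)

def utilization(tasks):
    # tasks: list of (exec_time, period); exec_time is the worst case,
    # replaced by the actual time once a job completes early.
    return sum(c / p for c, p in tasks)

def pick_speed(tasks):
    u = utilization(tasks)
    # Lowest frequency whose capacity still covers the demanded utilization.
    for f in FREQS:
        if u <= f:
            return f
    return FREQS[-1]
```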
Pub Date: 2025-07-28 | DOI: 10.1016/j.jpdc.2025.105157
Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Thomas Morin, Samuel Thibault, Pierre-André Wacrenier
Task-based programming models are currently a prominent approach for productively leveraging heterogeneous parallel systems (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows expressing task graphs naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs whose task sizes are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across processing units, so a single task size would not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. Once tasks have been selected for transformation, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming to minimize execution time and maximize resource utilization. This results in a dynamic, well-balanced set comprising both small tasks to fill multiple CPU cores and large tasks for efficient execution on accelerators such as GPU devices. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.
{"title":"Optimizing parallel heterogeneous system efficiency: Dynamic task graph adaptation with recursive tasks","authors":"Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Thomas Morin, Samuel Thibault, Pierre-André Wacrenier","doi":"10.1016/j.jpdc.2025.105157","DOIUrl":"10.1016/j.jpdc.2025.105157","url":null,"abstract":"<div><div>Task-based programming models are currently an ample trend to leverage heterogeneous parallel systems in a productive way (OpenACC, Kokkos, Legion, OmpSs, <span>PaRSEC</span>, <span>StarPU</span>, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (<span>PaRSEC</span>'s DTD, OmpSs, <span>StarPU</span>) since it allows to express task graphs naturally through a sequential-looking submission of tasks, and tasks dependencies are inferred automatically. However, STF is limited to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across different processing units, so a single task size would not fit all units. <span>StarPU</span>'s recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is decided by a <span>StarPU</span> component called the Splitter. After deciding to transform some tasks, classical scheduling approaches are used, making this component generic, and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, which is designed for heterogeneous platforms, that relies on linear programming aimed at minimizing execution time and maximizing resource utilization. This results in a dynamic well-balanced set comprising both small tasks to fill multiple CPU cores, and large tasks for efficient execution on accelerators like GPU devices. 
We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105157"},"PeriodicalIF":4.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144749390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
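The Splitter's optimization can be pictured with a toy makespan balance: decide what fraction of n identical tasks to split into CPU sub-tasks versus keep large for the GPU so both sides finish at roughly the same time. The real policy solves a linear program; this closed-form balance with assumed timing parameters is an illustrative simplification only.

```python
# Toy granularity decision in the spirit of the Splitter policy.
def split_fraction(t_gpu, t_cpu_subtask, n_cpu_cores, subtasks_per_task):
    """Fraction of tasks to hand to CPU cores as small sub-tasks.

    t_gpu:          time for one large task on the GPU
    t_cpu_subtask:  time for one sub-task on one CPU core
    """
    # Effective time per original task when spread over the CPU cores.
    t_cpu_task = t_cpu_subtask * subtasks_per_task / n_cpu_cores
    # Balance both sides' finish times: x * t_cpu_task == (1 - x) * t_gpu.
    return t_gpu / (t_gpu + t_cpu_task)
```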
Pub Date: 2025-07-25 | DOI: 10.1016/j.jpdc.2025.105156
Roberto Rocco, Elisabetta Boella, Daniele Gregori, Gianluca Palermo
With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This trend is especially true in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we show an alternative through fault resilience, enabled by the features provided by the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, we continue executing only the non-failed processes, sacrificing result accuracy for faster fault recovery. Our experiments on representative stencil applications show that, despite the fault's visible impact on the result, the computation produces meaningful values usable for scientific research, demonstrating the viability of a fault resilience approach in a stencil scenario.
To repair or not to repair: Assessing fault resilience in MPI stencil applications. Journal of Parallel and Distributed Computing, vol. 205, Article 105156.
Pub Date: 2025-07-25 | DOI: 10.1016/j.jpdc.2025.105155
Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen
Traditional Federated Learning (FL) predominantly focuses on task-consistent scenarios, assuming clients possess identical tasks or task sets. However, in multi-task scenarios, client task sets can vary greatly due to their operating environments, available resources, and hardware configurations. Conventional task-consistent FL cannot address such heterogeneity effectively. We define this statistical heterogeneity of task sets, where each client performs a unique subset of server tasks, as cross-device task heterogeneity. In this work, we propose a novel Federated Partial Multi-task (FedPMT) method, allowing clients with diverse task sets to collaborate and train comprehensive models suitable for any task subset. Specifically, clients deploy partial multi-task models tailored to their localized task sets, while the server utilizes single-task models as an intermediate stage to address the model heterogeneity arising from differing task sets. Collaborative training is facilitated through bidirectional transformations between them. To alleviate the negative transfer caused by task set disparities, we introduce task attenuation factors to modulate the influence of different tasks. This adjustment enhances the performance and task generalization ability of cloud models, promoting models to converge towards a shared optimum across all task subsets. Extensive experiments conducted on the NYUD-v2, PASCAL Context and Cityscapes datasets validate the effectiveness and superiority of FedPMT.
{"title":"Federated multi-task learning with cross-device heterogeneous task subsets","authors":"Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen","doi":"10.1016/j.jpdc.2025.105155","DOIUrl":"10.1016/j.jpdc.2025.105155","url":null,"abstract":"<div><div>Traditional Federated Learning (FL) predominantly focuses on task-consistent scenarios, assuming clients possess identical tasks or task sets. However, in multi-task scenarios, client task sets can vary greatly due to their operating environments, available resources, and hardware configurations. Conventional task-consistent FL cannot address such heterogeneity effectively. We define this statistical heterogeneity of task sets, where each client performs a unique subset of server tasks, as cross-device task heterogeneity. In this work, we propose a novel Federated Partial Multi-task (FedPMT) method, allowing clients with diverse task sets to collaborate and train comprehensive models suitable for any task subset. Specifically, clients deploy partial multi-task models tailored to their localized task sets, while the server utilizes single-task models as an intermediate stage to address the model heterogeneity arising from differing task sets. Collaborative training is facilitated through bidirectional transformations between them. To alleviate the negative transfer caused by task set disparities, we introduce task attenuation factors to modulate the influence of different tasks. This adjustment enhances the performance and task generalization ability of cloud models, promoting models to converge towards a shared optimum across all task subsets. 
Extensive experiments conducted on the NYUD-v2, PASCAL Context and Cityscapes datasets validate the effectiveness and superiority of FedPMT.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"205 ","pages":"Article 105155"},"PeriodicalIF":4.0,"publicationDate":"2025-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144720966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
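The task attenuation idea can be sketched as a weighted per-task aggregation: updates for a task are averaged with an attenuation factor that down-weights tasks prone to negative transfer. The dictionary structure, scalar "updates", and factor values below are illustrative assumptions, not FedPMT's exact formulation.

```python
# Minimal sketch of attenuation-weighted per-task aggregation in the
# spirit of FedPMT. Updates are scalars here for clarity; in practice
# they would be model-parameter tensors.
def aggregate(updates, attenuation):
    """updates: {client: {task: delta}}; attenuation: {task: weight in (0, 1]}."""
    agg, norm = {}, {}
    for client_updates in updates.values():
        for task, delta in client_updates.items():
            w = attenuation.get(task, 1.0)
            agg[task] = agg.get(task, 0.0) + w * delta
            norm[task] = norm.get(task, 0.0) + w
    # Weighted average per task, over only the clients that hold that task.
    return {t: agg[t] / norm[t] for t in agg}
```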
Pub Date: 2025-07-11 | DOI: 10.1016/S0743-7315(25)00116-9
{"title":"Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)","authors":"","doi":"10.1016/S0743-7315(25)00116-9","DOIUrl":"10.1016/S0743-7315(25)00116-9","url":null,"abstract":"","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"204 ","pages":"Article 105149"},"PeriodicalIF":3.4,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144604858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}