
Journal of Parallel and Distributed Computing: Latest publications

HeaPS: Heterogeneity-aware participant selection for efficient federated learning
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-08-19 | DOI: 10.1016/j.jpdc.2025.105168
Duo Yang , Bing Hu , Yunqi Gao , A-Long Jin , An Liu , Kwan L. Yeung , Yang You
Federated learning enables collaborative model training among numerous clients. However, existing participant/client selection methods fail to fully leverage the advantages of clients with excellent computational or communication capabilities. In this paper, we propose HeaPS, a novel Heterogeneity-aware Participant Selection framework for efficient federated learning. We introduce a finer-grained global selection algorithm to select communication-strong leaders and computation-strong members from candidate clients. The leaders are responsible for communicating with the server to reduce per-round duration and for contributing gradients, while the members communicate with the leaders to contribute more gradients obtained from high-utility data to the global model and improve the final model accuracy. Meanwhile, we develop a gradient migration path generation algorithm to match the optimal leader for each member. We also design the client scheduler to facilitate parallel local training of leaders and members based on gradient migration. Experimental results show that, in comparison with state-of-the-art methods, HeaPS achieves a speedup of up to 3.20× in time-to-accuracy performance and improves the final accuracy by up to 3.57%. The code for HeaPS is available at https://github.com/Dora233/HeaPS.
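The leader/member split described in the abstract can be sketched as a toy selection routine. The field names, scoring rules, and greedy path matching below are illustrative assumptions, not the authors' actual HeaPS algorithm:

```python
# Hypothetical sketch of heterogeneity-aware participant selection:
# leaders are chosen for communication strength, members for computation
# strength weighted by data utility, and each member is matched to a
# leader along a "gradient migration path". All scoring rules are
# illustrative assumptions.

def select_participants(clients, n_leaders, n_members):
    """clients: list of dicts with 'id', 'bandwidth', 'compute', 'utility'."""
    # Leaders: strongest uplink to the server.
    leaders = sorted(clients, key=lambda c: c["bandwidth"], reverse=True)[:n_leaders]
    leader_ids = {c["id"] for c in leaders}
    # Members: strongest compute among the rest, weighted by data utility.
    rest = [c for c in clients if c["id"] not in leader_ids]
    members = sorted(rest, key=lambda c: c["compute"] * c["utility"], reverse=True)[:n_members]
    # Greedy matching: each member migrates gradients via the leader whose
    # shared link (bottleneck of the two bandwidths) is widest.
    paths = {m["id"]: max(leaders, key=lambda l: min(m["bandwidth"], l["bandwidth"]))["id"]
             for m in members}
    return leaders, members, paths
```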
Citations: 0
A scheduler to foster data locality for GPU and out-of-core task-based linear algebra applications
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-08-18 | DOI: 10.1016/j.jpdc.2025.105170
Maxime Gonthier , Loris Marchal , Samuel Thibault
Hardware accelerators like GPUs now provide a large part of the computational power used for scientific simulations. Despite their efficacy, GPUs possess limited memory and are connected to the main memory of the machine via a bandwidth-limited bus. Scientific simulations often operate on very large data that surpasses the GPU's memory capacity. Therefore, one has to turn to out-of-core computing: data is kept in a remote, slower memory (CPU memory), and moved back and forth from/to the device memory (GPU memory), a process also present for multicore CPUs with limited memory. In both cases, data movement quickly becomes a performance bottleneck. Task-based runtime schedulers have emerged as a convenient and efficient way to manage large applications on such heterogeneous platforms. We propose a scheduler for task-based runtimes that improves data locality for out-of-core linear algebra computations, to reduce data movement. We design a data-aware strategy for both task scheduling and data eviction from limited memories. We compare this scheduler to existing schedulers in runtime systems. Using StarPU, we show that our new scheduling strategy achieves comparable performance when memory is not a constraint, and significantly better performance when application input data exceeds memory, on both GPUs and CPU cores.
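One way to picture a data-aware eviction strategy is a Belady-style heuristic: when device memory is full, evict the resident block whose next use by a queued task lies furthest in the future. This is a minimal illustrative sketch, not StarPU's actual policy:

```python
# Illustrative data-aware eviction rule (assumed, not StarPU's policy):
# evict the resident block that queued tasks will need latest, so blocks
# needed soon stay on the device and PCIe traffic is reduced.

def pick_eviction_victim(resident, task_queue):
    """resident: set of block ids on the device; task_queue: list of
    lists of block ids, in scheduled order. Returns the block to evict."""
    def next_use(block):
        for t, needed in enumerate(task_queue):
            if block in needed:
                return t
        return float("inf")  # never reused by any queued task: best candidate
    return max(resident, key=next_use)
```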
Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-08-14 | DOI: 10.1016/S0743-7315(25)00131-5
Citations: 0
SoRCS: A scalable blockchain model with separation of role, chain and storage
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-08-05 | DOI: 10.1016/j.jpdc.2025.105160
Bin Yu , Lei Chen , He Zhao , Zhiyu Ma , Haotian Cheng , Xiaoting Zhang , Liang Sun , Tong Zhou , Nianzu Sheng
The industrial use of blockchain technology is becoming more widespread, yet scalability remains one of the primary challenges in large-scale practical applications. Separation schemes are being introduced by many blockchain projects to solve their scalability problems. In this paper, we propose a comprehensive separation scheme, SoRCS, which separates the node role, the chain, and the data storage. It makes full use of the resources of each node, reduces the load on the nodes, and improves the degree of decentralization. Ordering of verified transactions, execution of ordered transactions, and confirmation of ordering and execution blocks run concurrently within different sub-networks to improve blockchain performance. Based on the results of the block consensus, we provide a three-phase response: documented, executed, and confirmed.
Based on the SoRCS architecture, we also implement a prototype system of 1200 nodes to evaluate our separation schemes. Its peak throughput is 14.7 Ktps and its latency is around 0.5 s. The three-phase response mitigates this latency: the first response arrives in around 0.15 s.
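The three-phase response amounts to advancing each transaction through a fixed sequence of acknowledgements, where the early "documented" phase gives the client its fast first response. A minimal sketch of that state machine, with names assumed from the abstract rather than taken from the SoRCS implementation:

```python
# Toy tracker for the three-phase response described above. The phase
# names come from the abstract; the API is an illustrative assumption.

PHASES = ("documented", "executed", "confirmed")

class TxTracker:
    def __init__(self):
        self.status = {}  # tx id -> index into PHASES

    def advance(self, tx):
        """Move tx to its next phase and return the phase just reached."""
        i = self.status.get(tx, -1) + 1
        if i >= len(PHASES):
            raise ValueError(f"{tx} already confirmed")
        self.status[tx] = i
        return PHASES[i]

    def first_response_ready(self, tx):
        """True once the fast 'documented' acknowledgement has been sent."""
        return self.status.get(tx, -1) >= 0
```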
Citations: 0
FSCD: File system controlled coupled defragmenter for mobile storage systems
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-31 | DOI: 10.1016/j.jpdc.2025.105159
Pingyang Huang, Chenxi Liu, Jilong Yang, Ting Chen, Zhiyuan Cheng
NAND-Flash-based mobile devices have gained increasing popularity. However, fragmentation in flash storage significantly impedes the I/O performance of the system, which leads to a poor user experience. Currently, the logical fragmentation is decoupled from the physical fragmentation, and garbage collection is typically controlled by the Flash Translation Layer (FTL), degrading the garbage collection efficiency. In this paper, a novel fragmentation handling strategy, namely the File System controlled Coupled Defragmenter (FSCD), is proposed in which the file system is used to control the garbage collection and couple the logical and physical fragmentations, synchronizing logical and physical defragmentation. As a result, FSCD can significantly reduce fragmentation and improve system performance in the fragmented state. Experimental results showed that, compared with the conventional FTL, sequential read and sequential write performance improved by 393.7% and 356.2%, and random read and random write performance by 126.0% and 296.0%, respectively. FSCD alleviates fragmentation, improves I/O performance, and enables a better user experience, which provides a solution for the next generation of the NAND-flash-based mobile storage system.
Citations: 0
Sensor failure mitigation in RTOS based internet of things (IoT) systems using machine learning
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-29 | DOI: 10.1016/j.jpdc.2025.105161
Saugat Sharma, Grzegorz Chmaj, Henry Selvaraj
In IoT systems, issues like data loss can severely impact application performance, often resulting in missed deadlines in real-time operating systems (RTOS). Missed deadlines can lead to incomplete or delayed data, compromising system reliability and user safety. To address these challenges, this paper proposes a solution that reduces missed deadlines and effectively manages data gaps when they occur. This paper presents a novel approach, FP-DVFS-CC, which combines Fixed Priority (FP) scheduling with Dynamic Voltage and Frequency Scaling (DVFS) and Cycle Conserving (CC) to dynamically adjust CPU speeds in RTOS-based IoT systems. This method optimizes deadline adherence by adapting processing power based on task priority and system load, significantly reducing missed deadlines and energy consumption in real-time applications. Additionally, it proposes an MQTT-based cache to handle the delivery of missed messages with sensor data and suggests using machine learning to provide synthetic data to the system in case of data absence. Further, to address data gaps due to sensor malfunctions or delays, we introduce FeatureSync, an imputation technique that uses only highly correlated features, integrated with machine learning algorithms, to generate synthetic data. Experimental validation demonstrates enhanced performance with machine learning integration. If missing data is treated as zero instead of being imputed, accuracy drops, as seen in the case analysis, and prediction error increases. Using data imputation, however, reduces the errors caused by treating missing values as zero. Testing on datasets with varied missing data reaffirms the approach's effectiveness.
Citations: 0
Optimizing parallel heterogeneous system efficiency: Dynamic task graph adaptation with recursive tasks
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-28 | DOI: 10.1016/j.jpdc.2025.105157
Nathalie Furmento, Abdou Guermouche, Gwenolé Lucas, Thomas Morin, Samuel Thibault, Pierre-André Wacrenier
Task-based programming models are currently a prominent way to leverage heterogeneous parallel systems productively (OpenACC, Kokkos, Legion, OmpSs, PaRSEC, StarPU, XKaapi, ...). Among these models, the Sequential Task Flow (STF) model is widely embraced (PaRSEC's DTD, OmpSs, StarPU) since it allows task graphs to be expressed naturally through a sequential-looking submission of tasks, with task dependencies inferred automatically. However, STF is limited to task graphs with task sizes that are fixed at submission, posing a challenge in determining the optimal task granularity. Notably, in heterogeneous systems, the optimal task size varies across different processing units, so a single task size would not fit all units. StarPU's recursive tasks allow graphs with several task granularities by turning some tasks into sub-graphs dynamically at runtime. The decision to transform these tasks into sub-graphs is made by a StarPU component called the Splitter. After deciding to transform some tasks, classical scheduling approaches are used, making this component generic and orthogonal to the scheduler. In this paper, we propose a new policy for the Splitter, designed for heterogeneous platforms, that relies on linear programming to minimize execution time and maximize resource utilization. This results in a dynamic well-balanced set comprising both small tasks to fill multiple CPU cores, and large tasks for efficient execution on accelerators like GPU devices. We then present an experimental evaluation showing that just-in-time adaptations of the task graph lead to improved performance across various dense linear algebra algorithms.
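A much-simplified version of the decision the Splitter faces can be written as a cost comparison: run the task whole on an accelerator, or split it into sub-tasks spread over CPU cores. The cost model, parameter names, and fixed splitting overhead below are illustrative assumptions, not the paper's linear program:

```python
# Toy split-or-not decision (assumed cost model, not StarPU's Splitter):
# compare the predicted finish time of the whole task on a GPU against
# splitting it into one sub-task per CPU core, plus a fixed split overhead.

def should_split(task_size, gpu_rate, cpu_rate, n_cpu_cores, split_overhead=0.1):
    """Rates are in work units per second; returns True if splitting wins."""
    whole_on_gpu = task_size / gpu_rate
    # Sub-tasks run in parallel across the CPU cores.
    split_on_cpus = task_size / (cpu_rate * n_cpu_cores) + split_overhead
    return split_on_cpus < whole_on_gpu
```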
Citations: 0
To repair or not to repair: Assessing fault resilience in MPI stencil applications
IF 4 | CAS Tier 3, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-25 | DOI: 10.1016/j.jpdc.2025.105156
Roberto Rocco , Elisabetta Boella , Daniele Gregori , Gianluca Palermo
With the increasing size of HPC computations, faults are becoming increasingly relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This is especially true in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we show an alternative through fault resilience, enabled by the features provided by the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, we continue executing only the non-failed processes, thus sacrificing result accuracy for faster fault recovery. Our experiments on some specimen stencil applications show that, despite the fault impact visible in the result, we produced meaningful values usable for scientific research, proving the possibilities of a fault resilience approach in a stencil scenario.
引用次数: 0
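The shrink-and-continue idea the abstract describes can be illustrated with a toy simulation. This is a minimal sketch in plain Python — simulated "ranks" standing in for real MPI processes, no actual ULFM or Legio calls, and every name here is hypothetical: after a failure, the surviving neighbours stitch around the hole and keep iterating, trading some result accuracy for avoiding a checkpoint restart.

```python
def jacobi_with_failures(grid, iters, fail_rank=None, fail_at=None):
    """1-D Jacobi smoothing where each cell plays the role of one MPI rank.

    If fail_rank fails at iteration fail_at, its cell is simply dropped and
    the surviving cells continue on the shrunken domain (shrink-and-continue),
    instead of rolling everything back to a checkpoint.
    """
    data = list(grid)
    for it in range(iters):
        if it == fail_at and fail_rank is not None and len(data) > 2:
            del data[fail_rank]  # survivors continue without the failed rank
        new = data[:]
        for i in range(1, len(data) - 1):
            new[i] = (data[i - 1] + data[i] + data[i + 1]) / 3.0
        data = new
    return data

# Fault-free run vs. a run that loses rank 8 at iteration 10.
fault_free = jacobi_with_failures([0.0] * 8 + [8.0] * 8, iters=50)
resilient = jacobi_with_failures([0.0] * 8 + [8.0] * 8, iters=50,
                                 fail_rank=8, fail_at=10)
```

The resilient run finishes on 15 cells instead of 16 and still converges to a smooth, monotone profile — degraded but scientifically interpretable, which is the trade-off the paper evaluates on real stencil codes.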
Federated multi-task learning with cross-device heterogeneous task subsets
IF 4 CAS Tier 3 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-07-25 DOI: 10.1016/j.jpdc.2025.105155
Zewei Xin, Qinya Li, Chaoyue Niu, Fan Wu, Guihai Chen
Traditional Federated Learning (FL) predominantly focuses on task-consistent scenarios, assuming clients possess identical tasks or task sets. However, in multi-task scenarios, client task sets can vary greatly due to their operating environments, available resources, and hardware configurations. Conventional task-consistent FL cannot address such heterogeneity effectively. We define this statistical heterogeneity of task sets, where each client performs a unique subset of server tasks, as cross-device task heterogeneity. In this work, we propose a novel Federated Partial Multi-task (FedPMT) method, allowing clients with diverse task sets to collaborate and train comprehensive models suitable for any task subset. Specifically, clients deploy partial multi-task models tailored to their localized task sets, while the server utilizes single-task models as an intermediate stage to address the model heterogeneity arising from differing task sets. Collaborative training is facilitated through bidirectional transformations between them. To alleviate the negative transfer caused by task set disparities, we introduce task attenuation factors to modulate the influence of different tasks. This adjustment enhances the performance and task generalization ability of cloud models, promoting models to converge towards a shared optimum across all task subsets. Extensive experiments conducted on the NYUD-v2, PASCAL Context and Cityscapes datasets validate the effectiveness and superiority of FedPMT.
Citations: 0
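The per-task aggregation with attenuation factors that the abstract outlines can be sketched as follows. This is a toy illustration in plain Python — the function, the factor k/n, and all names are assumptions for exposition, not the paper's actual FedPMT update rule: each task head is averaged only over the clients that hold that task, and its update is scaled by how widely the task is shared, damping negative transfer from rarely held tasks.

```python
def aggregate_per_task(global_heads, client_updates, total_clients):
    """Average per-task parameters over only the clients holding each task.

    client_updates: list of dicts {task_name: parameter_value}; each client
    reports only the tasks in its local subset (cross-device heterogeneity).
    A hypothetical attenuation factor k / n (k = clients holding the task,
    n = total clients) shrinks updates for tasks few clients can vouch for.
    """
    new_heads = dict(global_heads)
    tasks = {t for upd in client_updates for t in upd}
    for task in tasks:
        contribs = [upd[task] for upd in client_updates if task in upd]
        avg = sum(contribs) / len(contribs)
        attenuation = len(contribs) / total_clients
        old = global_heads.get(task, 0.0)
        new_heads[task] = old + attenuation * (avg - old)
    return new_heads

# Three clients, two tasks; no client holds the full task set.
heads = aggregate_per_task(
    {"depth": 0.0, "seg": 0.0},
    [{"depth": 1.0, "seg": 2.0}, {"depth": 3.0}, {"seg": 4.0}],
    total_clients=3,
)
```

Here each scalar stands in for a whole task-specific parameter tensor; both tasks are held by 2 of 3 clients, so each head moves only two-thirds of the way toward the clients' average.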
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.4 CAS Tier 3 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-07-11 DOI: 10.1016/S0743-7315(25)00116-9
Citations: 0