
Latest publications from the Journal of Parallel and Distributed Computing

OptimES: Optimizing federated learning using remote embeddings for graph neural networks
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-22 | DOI: 10.1016/j.jpdc.2026.105227 | Volume 211, Article 105227
Pranjal Naman, Yogesh Simmhan
Graph Neural Networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. However, in most real-world settings, such as financial transaction networks and healthcare networks, this data is localized to different data owners and cannot be aggregated due to privacy concerns. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model that iteratively aggregates local models trained on decentralized data. This addresses privacy concerns while leveraging parallelism. State-of-the-art methods enhance the privacy-respecting convergence accuracy of federated GNN training by sharing remote embeddings of boundary vertices through a server (EmbC). However, they are limited by diminished performance due to large communication costs. In this article, we propose OptimES, an optimized federated GNN training framework that employs remote neighbourhood pruning, overlapping the push of embeddings to the server with local training, and dynamic pulling of embeddings to reduce network costs and training time. We perform a rigorous evaluation of these strategies on four common graph datasets with up to 111M vertices and 1.6B edges. We see that a modest drop in per-round accuracy due to the preemptive push of embeddings is outstripped by the reduction in per-round training time for large and dense graphs like Reddit and Products, converging up to ≈3.5× faster than EmbC and giving up to ≈16% better accuracy than the default federated GNN learning. While accuracy improvements over default federated GNNs are modest for sparser graphs like Arxiv and Papers, they achieve the target accuracy about 11× faster than EmbC.
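The key systems idea, overlapping the network-bound push of boundary-vertex embeddings with compute-bound local training, can be sketched with a background thread. This is a minimal illustration, not the authors' implementation; `push_fn`, `local_step_fn`, and the embedding payload are hypothetical stand-ins.

```python
import threading
import queue

def train_round(push_fn, local_step_fn, num_steps, embeddings):
    """Toy sketch of one training round: push embeddings to the server
    in a background thread while local training steps run."""
    done = queue.Queue()

    def pusher():
        push_fn(embeddings)          # network-bound: runs concurrently
        done.put(True)

    t = threading.Thread(target=pusher)
    t.start()
    # Compute-bound local training proceeds while the push is in flight.
    losses = [local_step_fn(i) for i in range(num_steps)]
    t.join()
    return losses, done.get()

# Toy usage with stand-in functions.
losses, pushed = train_round(
    push_fn=lambda e: None,
    local_step_fn=lambda i: 1.0 / (i + 1),
    num_steps=3,
    embeddings={"v1": [0.1, 0.2]},
)
```

In a real trainer the push would serialize tensors over the network while the GPU runs forward and backward passes, so the two phases genuinely overlap.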
Citations: 0
HyBMSearch: A fast multi-Level search algorithm delivering order-of-Magnitude speedups on multi-Billion datasets
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-11 | DOI: 10.1016/j.jpdc.2026.105226 | Volume 211, Article 105226
Shashank Raj, Kalyanmoy Deb
We present HyBMSearch (Hybrid Bayesian Multi-Level Search), a Python-based algorithm that redefines how we handle extremely large, sorted datasets. By combining classic methods (binary and interpolation search) with a multi-level chunking approach, this technique achieves significant speedups on arrays ranging from 100 million to 10 billion (tested) elements. At the core of our approach is the integration of a hybrid, custom genetic algorithm with Bayesian optimization, enabling automatic parameter tuning. This eliminates the guesswork of manual tuning while maintaining solid performance across a variety of scenarios. Although NumPy's searchsorted is highly optimized C code, HyBMSearch (written in Python) still delivers dramatic speed gains in multi-threaded tests. It processes 10 million lookups on a 100-million-element dataset in just 0.0244 seconds (versus 23.67 seconds for searchsorted), handles 100 million lookups on a 1-billion-element array in 0.393 seconds (versus 184.89 seconds), performs 500 million lookups on 5 billion elements in 59.00 seconds (versus 979.73 seconds), and resolves 1 billion lookups on 10 billion elements in 119.68 seconds (versus 2071.84 seconds). These results set a new milestone for high-performance search methods in parallel and distributed settings, demonstrating the capability of our proposed approach to optimize the search process.
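The two-level idea, a coarse search over chunk boundaries followed by interpolation search inside the selected chunk, can be illustrated as follows. This is a toy sketch under an assumed fixed chunking, not the HyBMSearch implementation, and it omits the genetic-algorithm and Bayesian parameter tuning entirely.

```python
import bisect

def chunked_hybrid_search(arr, target, chunk_size=4):
    """Two-level search over a sorted list: binary search (via bisect)
    over chunk minima picks a chunk, then interpolation search scans
    within it. Returns the index of target, or -1 if absent."""
    # Level 1: binary search over the first element of each chunk.
    mins = [arr[i] for i in range(0, len(arr), chunk_size)]
    c = bisect.bisect_right(mins, target) - 1
    if c < 0:
        return -1
    lo = c * chunk_size
    hi = min((c + 1) * chunk_size, len(arr)) - 1
    # Level 2: interpolation search inside the chosen chunk.
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:
            pos = lo
        else:
            pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

data = list(range(0, 100, 5))   # sorted: 0, 5, ..., 95
assert chunked_hybrid_search(data, 35) == 7
assert chunked_hybrid_search(data, 36) == -1
```

The chunk minima act as a cheap index: the level-1 step narrows 20 candidates to 4, and interpolation typically finishes in one or two probes on uniformly spaced data.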
Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-10 | DOI: 10.1016/S0743-7315(25)00186-8 | Volume 209, Article 105219
Citations: 0
AFS-GNN: Adaptive and fast scheduling system for distributed GNN training
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-08 | DOI: 10.1016/j.jpdc.2026.105225 | Volume 211, Article 105225
Yuting Gao, Yongqiang Gao, Yongmei Liu
Graph Neural Networks (GNNs) have become core models for learning from relational data in domains such as transportation, social networks, and recommender systems. However, distributed GNN training on large graphs suffers from severe GPU workload imbalance and high communication cost caused by dynamic mini-batch sampling and large structural differences among nodes. To address these challenges, we propose AFS-GNN, a scheduling-aware adaptive framework that achieves fine-grained workload balancing in distributed GNN training. AFS-GNN continuously monitors per-GPU mini-batch execution time through lightweight runtime agents and employs Kalman filtering to suppress transient fluctuations and detect persistent imbalance trends. Upon imbalance detection, it constructs a Hierarchical Dependency Graph (HDG) that explicitly captures multi-hop aggregation dependencies and node-level computational costs. Guided by a heuristic load estimator, AFS-GNN applies cost-aware spectral bipartitioning via the Fiedler vector to select structurally coherent migration blocks that minimize inter-GPU communication while maintaining computational consistency. Selected blocks are migrated asynchronously across devices using intra-node or inter-process communication, ensuring non-blocking execution. Extensive experiments on the large-scale benchmarks ogbn-products and ogbn-papers100M demonstrate that AFS-GNN achieves up to 21.7% acceleration over Euler, 15% over DistDGL, and 13.7% over FlexGraph, while maintaining stable convergence and scalability across diverse batch sizes and partition configurations.
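The runtime smoothing step can be illustrated with a scalar Kalman filter over per-GPU mini-batch times: a transient spike barely moves the estimate, while a sustained shift is tracked. This is a generic filter for intuition only; the noise parameters `q` and `r` are assumed values, not from the paper.

```python
def kalman_smooth(measurements, q=1e-3, r=0.05):
    """Minimal scalar Kalman filter under a random-walk state model:
    predict (inflate variance by q), then correct toward the new
    measurement with gain p / (p + r)."""
    x, p = measurements[0], 1.0   # state estimate and its variance
    out = [x]
    for z in measurements[1:]:
        p = p + q                  # predict
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # correct with the new measurement
        p = (1 - k) * p
        out.append(x)
    return out

# A one-step spike in per-batch time is damped rather than followed.
times = [1.0, 1.0, 3.0, 1.0, 1.0]
smoothed = kalman_smooth(times)
```

A persistent shift (e.g. several consecutive 3.0 s batches) would pull the estimate up over a few steps, which is the imbalance-trend signal the scheduler reacts to.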
Citations: 0
Mobility-aware server placement and power allocation for randomly walking mobile users
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2026-01-06 | DOI: 10.1016/j.jpdc.2025.105216 | Volume 210, Article 105216
Keqin Li
We systematically, quantitatively, and mathematically address the problems of optimal mobility-aware server placement and optimal mobility-aware power allocation in mobile edge computing environments with randomly walking mobile users. The new contributions of the paper are highlighted below. We establish a single-server M/G/1 queueing system for mobile user equipment and a multiserver M/G/k queueing system for mobile edge clouds. We consider both the synchronous mobility model and the asynchronous mobility model, which are described by discrete-time Markov chains and continuous-time Markov chains, respectively. We discuss two task offloading strategies for user equipment in the same service area, i.e., the equal-response-time method and the equal-load-fraction method. We formally and rigorously define the optimal mobility-aware server placement problem and the optimal mobility-aware power allocation problem. We develop optimization algorithms to solve both problems. We demonstrate numerical data for optimal mobility-aware server placement and optimal mobility-aware power allocation with two mobility models, two task offloading strategies, and two power consumption models. The significance of the paper lies in the fact that such an analytical and algorithmic treatment of optimal mobility-aware server placement and power allocation for mobile edge computing environments with randomly walking mobile users has rarely appeared in the existing literature.
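For the single-server M/G/1 model above, the mean response time is given by the standard Pollaczek-Khinchine formula. The numerical check below is independent of the paper's specific parameterization and uses an exponential service distribution, where M/G/1 reduces to M/M/1 and the answer is known in closed form.

```python
def mg1_mean_response_time(lam, mean_s, second_moment_s):
    """Pollaczek-Khinchine mean response time for an M/G/1 queue:
        T = E[S] + lam * E[S^2] / (2 * (1 - rho)),  rho = lam * E[S].
    Requires rho < 1 for stability."""
    rho = lam * mean_s
    assert rho < 1, "queue must be stable (rho < 1)"
    mean_wait = lam * second_moment_s / (2.0 * (1.0 - rho))
    return mean_s + mean_wait

# Sanity check: exponential service with mean 1 (so E[S^2] = 2) at
# arrival rate 0.5 is an M/M/1 queue with T = 1 / (mu - lam) = 2.
t = mg1_mean_response_time(0.5, 1.0, 2.0)
```

The second moment term is what distinguishes M/G/1 from M/M/1: more variable service (larger E[S^2] at the same mean) lengthens the queue even at identical utilization.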
Citations: 0
Distributed quadratic interpolation estimation for large-scale quantile regression
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-20 | DOI: 10.1016/j.jpdc.2025.105214 | Volume 210, Article 105214
Ziqian Qin, Yue Chao, Xuejun Ma
A number of statistical learning approaches for large-scale quantile regression (QR) have been rapidly developed to address the optimization issues arising from massive data computations. However, the principal idea behind most distributed QR estimation procedures for solving the nondifferentiable quantile loss problem is to approximate the check function using kernel-based smoothing approaches with a bandwidth. In this article, we develop a new communication-efficient distributed QR estimation procedure, called the Distributed Quadratic Interpolation estimation strategy for QR (DQIQR), to tackle the limited-memory constraint of a single machine. Specifically, we implement a quadratic function in a small neighborhood around the origin, which transforms the nondifferentiable check function into a convex and smooth quadratic loss function without using kernel-based methods. The minimizer, named the DQIQR estimator, is obtained through an approximate multi-round reweighted least squares aggregation procedure under the divide-and-conquer (DC) framework. Theoretically, we establish the asymptotic normality of the DQIQR estimator and show that it achieves the same efficiency as the QR estimator computed on the entire data. Furthermore, a regularized version of DQIQR (DRQIQR) for the distributed variable selection procedure is also investigated. Finally, synthetic and real datasets are used to evaluate the effectiveness of the proposed approaches.
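The smoothing idea can be made concrete: replace the nondifferentiable check (pinball) loss on a small interval [-delta, delta] with the unique quadratic that matches both value and slope of the two linear pieces at the endpoints. This is one natural C^1 interpolant shown for illustration; the paper's exact construction may differ.

```python
def smoothed_check_loss(u, tau, delta=0.1):
    """Quadratic-interpolation smoothing of the quantile check loss
    rho_tau(u) = u * (tau - 1{u < 0}).

    Outside [-delta, delta] this is the usual pinball loss; inside it
    is the unique quadratic joining both pieces C^1-smoothly:
        u^2 / (4 * delta) + (tau - 1/2) * u + delta / 4
    (slopes tau at +delta and tau - 1 at -delta, values matching)."""
    if u > delta:
        return tau * u
    if u < -delta:
        return (tau - 1.0) * u
    return u * u / (4.0 * delta) + (tau - 0.5) * u + delta / 4.0

# At the joins the quadratic agrees exactly with the pinball loss.
tau, delta = 0.7, 0.1
```

The smoothed loss is convex and differentiable everywhere, which is what makes the multi-round reweighted least squares aggregation applicable.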
Citations: 0
Optimal schedule for periodic jobs with discretely controllable processing times on two machines
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-10 | DOI: 10.1016/j.jpdc.2025.105204 | Volume 210, Article 105204
Zizhao Wang, Wei Bao, Ruoyu Wu, Dong Yuan, Albert Y. Zomaya
In many real-world situations, the processing time of computational jobs can be shortened by lowering the processing quality. This is referred to as discretely controllable processing time, where the original processing time can be shortened to a number of levels with lower processing qualities. In this paper, we study the scheduling problem of periodic jobs with discretely controllable processing times on two machines. The problem is NP-hard, and directly solving it through dynamic programming leads to exponential computational complexity, because we need to memorise a set of processed jobs to avoid reprocessing. In order to address this issue, we prove the Ordered Scheduling Structure (OSS) Property and the Consecutive Decision Making (CDM) Property. The OSS Property allows us to search for an optimal solution in which jobs on the same machine are started in order. The CDM Property allows us to memorise only two jobs, completely avoiding job reprocessing. These two properties greatly decrease the search space, and the resultant dynamic programming solution finds an optimal schedule with pseudo-polynomial computational complexity.
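A toy brute-force version makes the setting concrete: each job offers discrete (processing-time, quality) levels, and we choose a level and one of two machines per job to minimise makespan subject to a total-quality floor. The paper's contribution is a pseudo-polynomial DP exploiting the OSS and CDM properties; the exhaustive search below only illustrates the problem itself, with made-up data.

```python
from itertools import product

def best_schedule(jobs, min_quality):
    """Exhaustively pick one (time, quality) level per job and a
    machine (0 or 1) per job; return the minimum makespan among
    choices whose total quality meets min_quality, or None."""
    best = None
    n = len(jobs)
    for levels in product(*jobs):                   # one level per job
        if sum(q for _, q in levels) < min_quality:
            continue                                # quality floor violated
        for assign in product((0, 1), repeat=n):    # machine per job
            load = [0, 0]
            for machine, (t, _) in zip(assign, levels):
                load[machine] += t
            makespan = max(load)
            if best is None or makespan < best:
                best = makespan
    return best

# Two jobs, each with a full-quality level and a shorter degraded one.
jobs = [[(4, 10), (2, 6)], [(3, 10), (1, 5)]]
```

Relaxing the quality floor lets both jobs run degraded on separate machines (makespan 2), while demanding full quality forces the long levels (makespan 4), showing the time-quality trade-off the paper schedules optimally.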
Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-07 | DOI: 10.1016/S0743-7315(25)00174-1 | Volume 208, Article 105207
Citations: 0
Optimistic execution in byzantine broadcast protocols that tolerate malicious majority
IF 4.0 | CAS Tier 3 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-29 | DOI: 10.1016/j.jpdc.2025.105203 | Volume 209, Article 105203
Ruomu Hou, Haifeng Yu
We consider the classic byzantine broadcast problem in distributed computing, in the context of a system with n nodes and at most f_max byzantine failures, under the standard synchronous timing model. Let f be the actual number of byzantine failures in a given execution, where f ≤ f_max. Our goal in this work is to optimize the performance of byzantine broadcast protocols in the common case where f is relatively small. To this end, we propose a novel framework, called FlintBB, for adding an optimistic track into existing byzantine broadcast protocols. Using this framework, we show that we can achieve an exponential improvement in several existing byzantine broadcast protocols when f is relatively small. At the same time, our approach does not sacrifice performance when f is not small.
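The optimistic-track pattern can be caricatured in a few lines: attempt a cheap protocol whose output can be certified (here, by unanimous endorsement), and fall back to the expensive f_max-tolerant protocol otherwise. This is a generic simplification for intuition, not FlintBB; the certification rule and all names are invented.

```python
def broadcast_with_optimistic_track(fast_path, slow_path, votes, n):
    """Run the cheap fast-path protocol; accept its result only if all
    n received votes endorse the same value, otherwise fall back to
    the fully fault-tolerant slow path. Returns (value, track)."""
    value = fast_path()
    if sum(1 for v in votes if v == value) == n:
        return value, "fast"          # certified: few/no faults observed
    return slow_path(), "slow"        # uncertified: pay the full cost

# With all nodes honest, the fast track suffices.
v, track = broadcast_with_optimistic_track(
    fast_path=lambda: "msg", slow_path=lambda: "msg",
    votes=["msg"] * 4, n=4)
```

The point of the pattern is that the common, mostly-honest execution pays only the fast path's cost, while safety in the worst case is still inherited from the slow path.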
Citations: 0
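The optimistic-track idea from the abstract above can be illustrated with a control-flow skeleton. This is a hypothetical sketch: the function names, the unanimity check, and the fallback interface are all invented here, and FlintBB's actual detection and agreement rules are specified in the paper — in particular, a real protocol also needs the nodes to agree on whether to fall back.

```python
# Hypothetical sketch of an optimistic fast path in front of a byzantine
# broadcast (BB) protocol. It only illustrates the control flow; it is NOT
# a correct BB protocol on its own.
def broadcast_with_optimistic_track(optimistic_view, full_protocol):
    """optimistic_view: values this node collected in a few cheap
    optimistic rounds (None marks a missing or ill-formed message).
    full_protocol: callable running an unmodified BB protocol (fallback)."""
    values = {v for v in optimistic_view if v is not None}
    if len(values) == 1 and None not in optimistic_view:
        # Fast path: the view is unanimous, which is the common case when
        # the actual number of failures f is small; commit immediately.
        return next(iter(values))
    # Equivocation or omission observed: run the full protocol, so the
    # worst-case guarantee for f up to f_max is preserved.
    return full_protocol()
```

For example, `broadcast_with_optimistic_track(["v", "v", "v"], run_full_bb)` commits "v" after only the optimistic rounds, while any conflicting or missing value routes through the unmodified fallback.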
PAARD: Proximity-aware all-reduce communication for dragonfly networks
IF 4 3区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-11-19 DOI: 10.1016/j.jpdc.2025.105201
Junchao Ma, Dezun Dong, Fei Lei, Liquan Xiao
The All-Reduce operation is one of the most widely used collective communication operations, appearing throughout research and engineering in high-performance computing (HPC) and distributed machine learning (DML). Previous optimization work on All-Reduce designs new algorithms only for different message sizes and processor counts, ignoring the gains that can be achieved by considering the topology. Dragonfly is a popular topology for current and future high-speed interconnection networks. The hierarchical characteristics of a dragonfly network can be exploited to effectively reduce hardware overhead while ensuring low end-to-end transmission latency. This paper offers a first attempt to design an efficient All-Reduce algorithm on dragonfly networks, referred to as PAARD. Based on the hierarchical characteristics of the dragonfly network, PAARD first proposes an end-to-end solution to alleviate congestion, which can remarkably boost performance. We carefully design the PAARD algorithm to ensure desirable performance with acceptable overhead and to guarantee generality in marginal cases. Then, to illustrate the effectiveness of PAARD, we compare its performance with the state-of-the-art Halving-Doubling (HD) and Ring algorithms. The simulation results demonstrate that with our design, execution time improves by 3× over HD and 4.19× over Ring on 256 nodes of a 342-node dragonfly with minimal routing.
{"title":"PAARD: Proximity-aware all-reduce communication for dragonfly networks","authors":"Junchao Ma,&nbsp;Dezun Dong,&nbsp;Fei Lei,&nbsp;Liquan Xiao","doi":"10.1016/j.jpdc.2025.105201","DOIUrl":"10.1016/j.jpdc.2025.105201","url":null,"abstract":"<div><div>The All-Reduce operation is one of the most widely used collective communication operations, and it is widely used in the research and engineering fields of high-performance computing(HPC) and distributed machine learning(DML). Previous optimization work for All-Reduce operation is to design new algorithms only for different message size and different number of processors, and ignores the optimization that can be achieved by considering the topology. Dragonfly is a popular topology for current and future high-speed interconnection networks. The hierarchical characteristics of dragonfly network can be utilized to effectively reduce hardware overhead while ensuring low end-to-end transmission latency. This paper offers a first attempt to design an efficient All-Reduce algorithm on dragonfly networks, referenced as PAARD. Based on the hierarchical characteristics of dragonfly network, PAARD first proposes an end-to-end solution to alleviate congestion that could remarkably boost performance. We carefully design the algorithm of PAARD to ensure desirable performance with acceptable overhead and guarantee the generality when met marginal cases. Then, to illustrate the effectiveness of PAARD, we analyze the performance of PAARD with the state-of-the-art algorithm, Halving-doubling(HD) algorithm and Ring algorithm. The simulation results demonstrate that in our design the execution time can be improved by 3X for HD and 4.19x for Ring on 256 nodes of a 342-node dragonfly with minimal routing.</div></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":"209 ","pages":"Article 105201"},"PeriodicalIF":4.0,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145685145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
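For reference, the Ring baseline that PAARD is evaluated against can be simulated in a few lines. This is a single-process sketch under the standard reduce-scatter + all-gather formulation of Ring all-reduce; the function name and data layout are chosen here for illustration and are not PAARD's implementation, which runs on a dragonfly network simulator.

```python
# Single-process simulation of Ring all-reduce over p nodes, each holding a
# vector split into p chunks: a reduce-scatter phase (p-1 steps) followed by
# an all-gather phase (p-1 steps), each node talking only to its ring neighbor.
def ring_allreduce(node_chunks):
    """node_chunks[i][c] is chunk c on node i; reduces (sums) in place."""
    p = len(node_chunks)
    # Reduce-scatter: in step s, node i forwards its running partial sum of
    # chunk (i - s) % p to node (i + 1) % p; after p-1 steps, node i owns the
    # full sum of chunk (i + 1) % p. Within a step each node sends and
    # receives different chunk indices, so this sequential in-place loop
    # matches the parallel execution.
    for s in range(p - 1):
        for i in range(p):
            c = (i - s) % p
            node_chunks[(i + 1) % p][c] += node_chunks[i][c]
    # All-gather: circulate each fully reduced chunk once around the ring.
    for s in range(p - 1):
        for i in range(p):
            c = (i + 1 - s) % p
            node_chunks[(i + 1) % p][c] = node_chunks[i][c]

data = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
ring_allreduce(data)
# every node now holds the element-wise sum [111, 222, 333]
```

Each phase takes p-1 neighbor-to-neighbor steps, which is why Ring has higher latency than Halving-Doubling on large node counts but moves the bandwidth-optimal amount of data per node.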
Journal
Journal of Parallel and Distributed Computing