
Journal of Parallel and Distributed Computing: Latest Articles

Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-06-03 · DOI: 10.1016/S0743-7315(24)00094-7
Citations: 0
Optimizing CNN inference speed over big social data through efficient model parallelism for sustainable web of things
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-31 · DOI: 10.1016/j.jpdc.2024.104927
Yuhao Hu, Xiaolong Xu, Muhammad Bilal, Weiyi Zhong, Yuwen Liu, Huaizhen Kou, Lingzhen Kong

The rapid development of artificial intelligence and networking technologies has catalyzed the popularity of intelligent services based on deep learning in recent years, which in turn fosters the advancement of the Web of Things (WoT). Big social data (BSD) plays an important role in the processing of intelligent services in WoT. However, intelligent BSD services are computationally intensive and require ultra-low latency, which end or edge devices with limited computing power cannot deliver on their own. Distributed inference of deep neural networks (DNNs), which allocates the computing load of a DNN to several devices, is considered a feasible solution. In this work, an efficient model parallelism method that couples convolution layer (Conv) splitting with resource allocation is proposed. First, given a random computing resource allocation strategy, the Conv split decision is made through a mathematical analysis method to realize parallel inference of convolutional neural networks (CNNs). Next, Deep Reinforcement Learning is used to obtain the optimal computing resource allocation strategy, maximizing the resource utilization rate and minimizing CNN inference latency. Finally, simulation results show that our approach outperforms the baselines and is applicable to BSD services in WoT under high workloads.
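The paper's split decision comes from its mathematical analysis and DRL allocator, which are not reproduced here; the following minimal NumPy sketch only illustrates why a Conv layer admits such a split: each (simulated) device receives a contiguous chunk of the input plus a halo of K−1 samples, computes its slice of the output independently, and the concatenated slices equal the full result.

```python
import numpy as np

def conv1d(x, k):
    # "valid" 1-D convolution (cross-correlation), kernel length K
    K = len(k)
    return np.array([np.dot(x[i:i + K], k) for i in range(len(x) - K + 1)])

def split_conv1d(x, k, n_devices):
    # Give each device a contiguous input chunk extended by a halo of K-1
    # samples, so its share of the output needs no communication.
    K = len(k)
    out_len = len(x) - K + 1
    bounds = np.linspace(0, out_len, n_devices + 1, dtype=int)
    parts = [conv1d(x[lo:hi + K - 1], k)
             for lo, hi in zip(bounds[:-1], bounds[1:])]
    return np.concatenate(parts)
```

In the paper the per-device chunk sizes are chosen jointly with the resource allocation; here the split is simply uniform.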

Citations: 0
Topo: Towards a fine-grained topological data processing framework on Tianhe-3 supercomputer
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-31 · DOI: 10.1016/j.jpdc.2024.104926
Nan Hu, Yutong Lu, Zhuo Tang, Zhiyong Liu, Dan Huang, Zhiguang Chen

Big data frameworks are widely deployed on supercomputers for analyzing large-scale datasets. Topological data processing is an emerging approach that focuses on analyzing the topological structures in high-dimensional scientific data. However, incorporating topological data processing into current big data frameworks presents three main challenges: (1) frequent data exchange strains traditional coarse-grained parallelism; (2) spatial topology makes parallel programming with oversimplified MapReduce APIs harder; (3) massive intermediate data and NUMA architectures hinder resource utilization and scalability on novel supercomputers and many-core processors.

In this paper, we present Topo, a generic distributed framework that enhances topological data processing on many-core supercomputers. Topo relies on three concepts: (1) it employs fine-grained parallelism, aware of the topological structures in datasets, to support interactions among collaborating workers before each shuffle phase; (2) it provides intuitive APIs for topological data operations; (3) it implements efficient collective I/O and NUMA-aware dynamic task scheduling to improve multi-threading and load balancing. We evaluate Topo's performance on the Tianhe-3 supercomputer, which uses state-of-the-art ARM many-core processors. Execution-time results show that, compared to popular frameworks, Topo achieves average speedups of 5.3× and 6.3×, with maximum speedups of 8.4× and 20×, on HPC workloads and big data benchmarks, respectively. Topo further reduces total execution time on skewed datasets by 41%.
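Topo's actual scheduler is not published in the abstract; the toy below only illustrates the general NUMA-aware dynamic scheduling idea it names (an assumed design, not Topo's code): each NUMA node keeps its own queue, workers prefer tasks whose data is node-local, and they steal from the fullest remote queue only when idle.

```python
from collections import deque

def numa_schedule(tasks, n_nodes):
    # tasks: list of (home_node, task_id) pairs.
    # Returns an execution log of (worker_node, task_id, was_local) triples.
    queues = [deque() for _ in range(n_nodes)]
    for home, tid in tasks:
        queues[home].append((home, tid))
    log = []
    while any(queues):
        for w in range(n_nodes):
            if queues[w]:
                _, tid = queues[w].popleft()          # local, fast path
                log.append((w, tid, True))
            else:
                victim = max(range(n_nodes), key=lambda v: len(queues[v]))
                if queues[victim]:
                    _, tid = queues[victim].pop()     # steal from the back
                    log.append((w, tid, False))
    return log
```

Stealing from the back of the victim's queue is a common convention that reduces contention with the victim's own front-of-queue pops.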

Citations: 0
Routing and wavelength assignment for folded hypercube in linear array WDM optical networks
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-28 · DOI: 10.1016/j.jpdc.2024.104924
V. Vinitha Navis, A. Berin Greeni

The folded hypercube is one of the hypercube variants and is of great significance in the study of interconnection networks. In a folded hypercube, information can be broadcast using efficient distributed algorithms. In the context of parallel computing, the folded hypercube has been studied as a possible network topology alternative to the hypercube. The routing and wavelength assignment (RWA) problem is significant because it improves the performance of wavelength-routed all-optical networks constructed using the wavelength division multiplexing approach. Given the physical network topology, the aim of the RWA problem is to establish routes for the connection requests and assign the fewest possible wavelengths subject to the wavelength continuity and distinct wavelength constraints. This paper addresses the RWA problem in a linear array for the folded hypercube communication pattern using the congestion technique.
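The congestion technique lower-bounds the wavelength count by the maximum link load of the embedding, since all requests crossing one fiber link need distinct wavelengths. The sketch below (an illustration, not the paper's construction) lays the folded hypercube FQ_n on a path in binary node order, routes every edge over the path segment between its endpoints, and reports the maximum congestion.

```python
def folded_hypercube_edges(n):
    # FQ_n = hypercube Q_n plus an edge from each node to its bitwise complement
    N = 1 << n
    edges = set()
    for u in range(N):
        for b in range(n):
            v = u ^ (1 << b)                 # hypercube edge (flip one bit)
            edges.add((min(u, v), max(u, v)))
        c = u ^ (N - 1)                      # complement edge
        edges.add((min(u, c), max(u, c)))
    return edges

def linear_array_congestion(n):
    # Lay the 2^n nodes on a path in binary order; route edge (u, v) over
    # path links u..v-1. Max link load lower-bounds the wavelengths needed.
    N = 1 << n
    load = [0] * (N - 1)
    for u, v in folded_hypercube_edges(n):
        for i in range(u, v):
            load[i] += 1
    return max(load)
```

For FQ_2 (which is K4) this layout gives congestion 4, and for FQ_3 congestion 8; the paper's contribution is an assignment matching such bounds.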

Citations: 0
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-27 · DOI: 10.1016/j.jpdc.2024.104925
Gourab Panigrahi, Nikhil Kodali, Debashis Panda, Phani Motamarri

Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on a very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we develop strategies to overlap compute with data movement, achieving efficient pipelining and reduced data accesses through the use of GPU shared memory, constant memory, and kernel fusion. Our implementation outperforms the baselines for the Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively.
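Even-odd decomposition, one of the CPU techniques named above, exploits the central symmetry A[i, j] = A[m−1−i, n−1−j] of 1-D FE interpolation matrices: splitting a vector into even- and odd-symmetric halves replaces one full multiply with two half-size multiplies, roughly halving the flops. A minimal NumPy sketch (assuming even dimensions for brevity; not the paper's implementation):

```python
import numpy as np

def even_odd_matvec(A, x):
    # Assumes A[i, j] == A[m-1-i, n-1-j] and even m, n.
    m, n = A.shape
    h, w = m // 2, n // 2
    xe = 0.5 * (x[:w] + x[::-1][:w])          # even-symmetric half of x
    xo = 0.5 * (x[:w] - x[::-1][:w])          # odd-symmetric half of x
    Be = A[:h, :w] + A[:h, ::-1][:, :w]       # folded half-size operators
    Bo = A[:h, :w] - A[:h, ::-1][:, :w]
    ye, yo = Be @ xe, Bo @ xo                 # two (m/2)x(n/2) multiplies
    y = np.empty(m)
    y[:h] = ye + yo                           # top rows directly
    y[h:] = (ye - yo)[::-1]                   # bottom rows by symmetry
    return y
```

The two half-size products cost mn/2 multiply-adds instead of mn, which is where the flop saving over a plain `A @ x` comes from.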

Citations: 0
PerfTop: Towards performance prediction of distributed learning over general topology
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-24 · DOI: 10.1016/j.jpdc.2024.104922
Changzhi Yan, Zehan Zhu, Youcheng Niu, Cong Wang, Cheng Zhuo, Jinming Xu

Distributed learning with multiple GPUs has been widely adopted to accelerate the training process of large-scale deep neural networks. However, misconfiguration of GPU clusters with various communication primitives and topologies can diminish the gains of parallel computation and significantly degrade training efficiency. Predicting the performance of distributed learning enables service providers to identify potential bottlenecks beforehand. In this work, we propose a Performance prediction framework over General Topologies, called PerfTop, for accurate estimation of per-iteration execution time. The main strategy is to integrate computation time prediction with an analytical model that captures the nonlinearity in communication and fine-grained computation-communication patterns. This enables accurate prediction for a variety of neural network models over general topologies, such as tree, hierarchical, and exponential. Our extensive experiments show that PerfTop outperforms existing methods in estimating both computation and communication time, surpassing them by over 45% for communication in particular. Meanwhile, it achieves above 85% accuracy in predicting execution time over general topologies, whereas previous works handled only simple topologies such as star and ring.
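PerfTop's own analytical model is not given in the abstract; as an illustration of the analytical-model idea it builds on, here is the classic alpha-beta cost model for a ring all-reduce combined with compute time and an overlap factor. All parameters (alpha, beta, overlap) are hypothetical inputs, not values from the paper.

```python
def allreduce_time(msg_bytes, n_workers, alpha, beta):
    # Ring all-reduce: 2(p-1) steps, each moving msg/p bytes per link.
    # alpha = per-message latency (s), beta = per-byte transfer time (s/B).
    p = n_workers
    return 2 * (p - 1) * (alpha + (msg_bytes / p) * beta)

def iteration_time(compute_s, msg_bytes, n_workers, alpha, beta, overlap=0.0):
    # overlap in [0, 1]: fraction of communication hidden behind compute.
    comm = allreduce_time(msg_bytes, n_workers, alpha, beta)
    return compute_s + (1.0 - overlap) * comm
```

A purely linear model like this misses the nonlinearity PerfTop targets; its role here is only to show what "integrating computation time with an analytical communication term" means.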

Citations: 0
Local outlier factor for anomaly detection in HPCC systems
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-23 · DOI: 10.1016/j.jpdc.2024.104923
Arya Adesh, Shobha G, Jyoti Shetty, Lili Xu

Local Outlier Factor (LOF) is an unsupervised anomaly detection algorithm that finds anomalies by assessing the local density of a data point relative to its neighborhood. Anomaly detection is the process of finding anomalies in datasets; anomalies in real-time datasets may indicate critical events like bank fraud, data compromise, or network threats. This paper deals with the implementation of the LOF algorithm on the HPCC Systems platform, an open-source distributed computing platform for big data analytics. An improved LOF is also proposed that efficiently detects anomalies in datasets rich in duplicates. The impact of varying hyperparameters on the performance of LOF in HPCC Systems is examined. The paper compares the performance of LOF with other algorithms such as COF, LoOP, and kNN over several datasets in HPCC Systems. Additionally, the efficacy of LOF is evaluated across big-data frameworks such as Spark, Hadoop, and HPCC Systems by comparing their runtime performance.
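For reference, the classic LOF of Breunig et al. that the article implements on HPCC Systems can be sketched in a few lines of NumPy (a single-machine illustration, not the paper's distributed ECL implementation): scores near 1 mark inliers, scores well above 1 mark outliers.

```python
import numpy as np

def lof_scores(X, k):
    # X: (n, d) points; returns one LOF score per point.
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)
    knn = np.argsort(D, axis=1)[:, :k]            # indices of k nearest neighbours
    kdist = D[np.arange(n), knn[:, -1]]           # distance to the k-th neighbour
    # reachability distance of p from o: max(k-dist(o), d(p, o))
    reach = np.maximum(kdist[knn], D[np.arange(n)[:, None], knn])
    lrd = k / reach.sum(axis=1)                   # local reachability density
    return lrd[knn].mean(axis=1) / lrd            # LOF = mean neighbour lrd / own lrd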

Citations: 0
GraMeR: Graph Meta Reinforcement learning for multi-objective influence maximization
IF 3.8 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, THEORY & METHODS · Pub Date: 2024-05-23 · DOI: 10.1016/j.jpdc.2024.104900
Sai Munikoti, Balasubramaniam Natarajan, Mahantesh Halappanavar

Influence maximization (IM) is the combinatorial problem of identifying a subset of seed nodes in a network (graph) which, when activated, provides a maximal spread of influence in the network for a given diffusion model and a budget on the seed set size. IM has numerous applications such as viral marketing, epidemic control, sensor placement, and other network-related tasks. However, its practical use is limited by the computational complexity of current algorithms. Recently, deep reinforcement learning has been leveraged to solve IM in order to ease the computational burden. However, current approaches have serious limitations: narrow IM formulations that consider influence only via spread and ignore self-activation, low scalability to large graphs, and a lack of generalizability across graph families, leading to a large running time for every test network. In this work, we address these limitations through a unique approach that involves: (1) formulating a generic IM problem as a Markov decision process that handles both intrinsic and influence activations; (2) incorporating generalizability via meta-learning across graph families. Previous works have combined deep reinforcement learning with graph neural networks, but this work solves a more realistic IM problem and incorporates generalizability across graphs via meta reinforcement learning. Extensive experiments on various standard networks validate the performance of the proposed Graph Meta Reinforcement learning (GraMeR) framework. The results indicate that GraMeR is orders of magnitude faster and more generic than conventional approaches when applied to small- to medium-scale graphs.
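GraMeR itself learns a seed-selection policy with GNN-based meta-RL; the computationally heavy conventional approach it accelerates is the Monte Carlo greedy baseline, sketched here under the Independent Cascade diffusion model (self-activation, which GraMeR also models, is omitted for brevity).

```python
import random

def ic_spread(adj, seeds, p, trials, rng):
    # Monte Carlo estimate of expected spread under Independent Cascade:
    # each newly activated node activates each inactive neighbour w.p. p.
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            u = frontier.pop()
            for v in adj.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / trials

def greedy_im(adj, nodes, budget, p=0.1, trials=200, seed=0):
    # Greedy: repeatedly add the node with the largest estimated marginal gain.
    rng = random.Random(seed)
    chosen = []
    for _ in range(budget):
        best = max((v for v in nodes if v not in chosen),
                   key=lambda v: ic_spread(adj, chosen + [v], p, trials, rng))
        chosen.append(best)
    return chosen
```

Each greedy step re-simulates the cascade hundreds of times per candidate, which is exactly the per-network cost that a learned, generalizable policy avoids.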

Citations: 0
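The abstract above contrasts GraMeR with conventional combinatorial approaches. Those baselines are typically variants of the classical greedy algorithm under the independent-cascade diffusion model, which can be sketched as follows (a minimal illustration, not the paper's code; all names and parameters are made up):

```python
import random


def simulate_ic(graph, seeds, p=0.1, rng=random):
    """One run of the independent-cascade diffusion model.

    graph: dict mapping node -> list of neighbor nodes.
    Returns the set of nodes activated starting from `seeds`."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets one chance to activate
                # each inactive neighbor, with probability p.
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active


def greedy_im(graph, k, p=0.1, runs=200, seed=0):
    """Greedy seed selection: repeatedly add the node with the largest
    marginal gain in expected spread, estimated by Monte Carlo.
    Assumes k <= number of nodes in the graph."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for cand in graph:
            if cand in seeds:
                continue
            gain = sum(len(simulate_ic(graph, seeds + [cand], p, rng))
                       for _ in range(runs)) / runs
            if gain > best_gain:
                best, best_gain = cand, gain
        seeds.append(best)
    return seeds
```

The nested Monte-Carlo loop is exactly the computational burden the abstract refers to: every marginal-gain estimate needs many full diffusion simulations, which is what learned policies such as GraMeR aim to avoid at test time.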
Large-scale and cooperative graybox parallel optimization on the supercomputer Fugaku
IF 3.8 CAS Tier 3 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-05-22 DOI: 10.1016/j.jpdc.2024.104921
Lorenzo Canonne , Bilel Derbel , Miwako Tsuji , Mitsuhisa Sato

We design, develop, and analyze parallel variants of a state-of-the-art graybox optimization algorithm, namely Drils (Deterministic Recombination and Iterated Local Search), for attacking large-scale pseudo-boolean optimization problems on top of the large-scale computing facilities offered by the supercomputer Fugaku. We first adopt a Master/Worker design coupled with a fully distributed island-based model, ending up with a number of hybrid OpenMP/MPI implementations of high-level parallel Drils versions. We show that such a design, although effective, can be substantially improved by enabling a more focused iteration-level cooperation mechanism between the core graybox components of the original serial Drils algorithm. Extensive experiments are conducted in order to provide a systematic analysis of the impact of the designed parallel algorithms on search behavior, and of their ability to compute high-quality solutions using an increasing number of CPU cores. Results using up to 1024 × 12-core NUMA nodes and NK-landscapes with up to 10,000 binary variables are reported, providing evidence of the relative strength of the designed hybrid cooperative graybox parallel search.

Citations: 0
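Drils is built around iterated local search over pseudo-boolean functions. A minimal sketch of that serial skeleton is shown below, with onemax standing in for the paper's NK-landscape objectives; function names and parameters are hypothetical, not taken from the Drils source:

```python
import random


def onemax(bits):
    """Stand-in pseudo-boolean objective (the paper uses NK-landscapes)."""
    return sum(bits)


def hill_climb(bits, f):
    """First-improvement local search over single-bit flips."""
    current = f(bits)
    improved = True
    while improved:
        improved = False
        for i in range(len(bits)):
            bits[i] ^= 1          # try flipping bit i
            v = f(bits)
            if v > current:
                current = v       # keep the improving flip
                improved = True
            else:
                bits[i] ^= 1      # revert a non-improving flip
    return bits, current


def perturb(bits, strength, rng):
    """Flip a few random bits to escape the current local optimum."""
    out = bits[:]
    for i in rng.sample(range(len(out)), strength):
        out[i] ^= 1
    return out


def ils(n, f, iters=50, strength=3, seed=0):
    """Iterated local search: climb, perturb, re-climb, keep the better."""
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(n)]
    best, best_v = hill_climb(best, f)
    for _ in range(iters):
        cand, cand_v = hill_climb(perturb(best, strength, rng), f)
        if cand_v >= best_v:
            best, best_v = cand, cand_v
    return best, best_v
```

In essence, the island-based parallel variants described in the abstract run many searches of this kind concurrently and let them cooperate by exchanging incumbents, rather than iterating a single serial loop.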
HBPB, applying reuse distance to improve cache efficiency proactively
IF 3.8 CAS Tier 3 (Computer Science) Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date: 2024-05-20 DOI: 10.1016/j.jpdc.2024.104919
Arthur M. Krause, Paulo C. Santos, Arthur F. Lorenzon, Philippe O.A. Navaux

Cache memories play a significant role in the performance, area, and energy consumption of modern processors, and this impact is expected to grow as on-die memories become larger. While caches are highly effective for cache-friendly access patterns, they introduce unnecessary delays and energy wastage when they fail to serve the required data. Hence, cache bypassing techniques have been proposed to optimize the latency of cache-unfriendly memory accesses. In this scenario, we discuss HBPB, a history-based preemptive bypassing technique that accelerates cache-unfriendly access through the reduced latency of bypassing the caches. By extensively evaluating different real-world applications and hardware cache configurations, we show that HBPB yields energy reductions of up to 75% and performance improvements of up to 50% compared to a version that does not apply cache bypassing. More importantly, we demonstrate that HBPB does not affect the performance of applications with cache-friendly access patterns.

Citations: 0
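The key metric in the title, reuse distance, counts how many distinct addresses are touched between two consecutive accesses to the same address. A minimal sketch of the metric and of a toy history-based bypass predicate follows (illustrative only, assuming a simple address trace; this is not the HBPB implementation):

```python
def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched since the previous access to the same address
    (infinity for a first access)."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses between the two accesses to `addr`.
            distinct = len(set(trace[last_seen[addr] + 1:i]))
            distances.append(distinct)
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances


def should_bypass(history, cache_capacity):
    """Toy history-based bypass predicate: if even the smallest observed
    reuse distance for an address exceeds the cache capacity, caching it
    is unlikely to produce hits, so the access can skip the cache."""
    finite = [d for d in history if d != float("inf")]
    return bool(finite) and min(finite) >= cache_capacity
```

A line whose reuse distances are consistently larger than the cache capacity would be evicted before its next use anyway; skipping the fill for such lines is the latency and energy saving the abstract describes, while short-distance (cache-friendly) lines are left untouched.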