
Latest articles: Journal of Parallel and Distributed Computing

A multi-level parallel approach to increase the computation efficiency of a global ocean temperature dataset reconstruction
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104938
Huifeng Yuan , Lijing Cheng , Yuying Pan , Zhetao Tan , Qian Liu , Zhong Jin

There is an increasing need to provide real-time datasets for climate monitoring and applications. However, the current data products from all international groups are released with a delay of at least a month. One reason for this delay is the long computing time of the global reconstruction algorithm (the so-called mapping approach). To tackle this issue, this paper proposes a multi-level parallel computing model that improves the efficiency of data reconstruction through parallelization of computation, reduction of branch-prediction overhead, optimization of data spatial locality and cache utilization, and other measures. This model has been applied to the mapping approach proposed by the Institute of Atmospheric Physics (IAP), which underlies one of the world's most widely used data products in the ocean and climate field. Compared with the traditional serial MATLAB-based scheme on a single node, the reconstruction after parallel optimization is ∼4.7 times faster. A large-scale parallel experiment on a long-term (∼1000 months) gridded dataset utilizing over 16,000 processor cores demonstrates the model's scalability, achieving a speedup of ∼1200×. In summary, this new model represents another example of the application of high-performance computing in oceanography and climatology.
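The first level of such a decomposition can be illustrated by block-partitioning the ∼1000 reconstruction months across workers. This is a generic sketch; the function and its partitioning rule are our illustration, not the paper's actual scheme:

```c
#include <assert.h>

/* Illustrative first level of a multi-level decomposition: assign each of
 * nworkers workers a contiguous block of months, spreading the remainder
 * over the first workers so per-worker loads differ by at most one month. */
static void month_block(int nmonths, int nworkers, int rank,
                        int *start, int *count) {
    int base = nmonths / nworkers;
    int rem  = nmonths % nworkers;
    *count = base + (rank < rem ? 1 : 0);
    *start = rank * base + (rank < rem ? rank : rem);
}
```

Each worker then reconstructs only its own block of months, which is what lets the experiment scale to thousands of cores.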

Citations: 0
Using hardware-transactional-memory support to implement speculative task execution
IF 3.4 | CAS Tier 3 (Computer Science) | Q1 Computer Science, Theory & Methods | Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104939
Juan Salamanca , Alexandro Baldassin

Loops take up most of the execution time of computer programs, so optimizing them to run as fast as possible is an ongoing effort. This effort is far from trivial; on the contrary, it is an open area of research, since many irregular loops are hard to parallelize. Generally, these loops have loop-carried (DOACROSS) dependencies, and whether a dependence actually manifests can depend on the runtime context. Many techniques have been studied to parallelize these loops efficiently; the OpenMP standard, however, still offers no efficient way to do so. This article presents Speculative Task Execution (STE), a technique that executes OpenMP tasks speculatively to accelerate certain hot-code regions (such as loops) marked by OpenMP directives. It also presents a detailed analysis of using Hardware Transactional Memory (HTM) support to execute tasks speculatively and describes a careful evaluation of an STE implementation using HTM on modern machines. In particular, we consider the scenario in which speculative tasks are generated by the OpenMP taskloop construct (Speculative Taskloop (STL)). As a result, it provides evidence to support several important claims about the performance of STE over HTM in modern processor architectures. Experimental results reveal that: (a) by implementing STL on top of HTM for hot-code regions, speed-ups of up to 5.39× can be obtained on IBM POWER8 and up to 2.41× on Intel processors using 4 cores; and (b) STL-ROT, a variant of STL using rollback-only transactions (ROTs), achieves speed-ups of up to 17.70× on the IBM POWER9 processor using 20 cores.
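For context, here is a minimal sketch of the standard (non-speculative) OpenMP taskloop construct that STL extends. The loop body below is deliberately dependence-free, whereas STL targets loops whose potential dependencies require running each generated task as an HTM transaction; the function name is ours:

```c
#include <assert.h>

/* Baseline taskloop: OpenMP chunks the loop into tasks executed by the
 * team. Iterations here are independent, so no speculation is needed.
 * Compiled without OpenMP support the pragmas are ignored and the loop
 * runs serially, producing the same result. */
static void squares(int n, long *out) {
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(16)
    for (int i = 0; i < n; i++)
        out[i] = (long)i * i;
}
```

STL's contribution is precisely the case this sketch avoids: loops where a dependence *may* exist, so tasks run speculatively and roll back on conflict.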

Citations: 0
PA-SPS: A predictive adaptive approach for an elastic stream processing system
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104940
Daniel Wladdimiro , Luciana Arantes , Pierre Sens , Nicolás Hidalgo

Stream Processing Systems (SPSs) dynamically process input events. Since the input is usually not a constant flow but exhibits rate fluctuations, many works in the literature propose to dynamically replicate SPS operators, aiming to reduce the processing bottleneck induced by such fluctuations. However, these SPSs do not consider the problem of load balancing among the replicas, or the cost of reconfiguring the system whenever the number of replicas changes. We present in this paper a predictive model which, based on input-rate variation, operator execution time, and queued events, dynamically determines the number of replicas each operator currently needs. A predictor composed of different models (mathematical and Machine Learning ones) forecasts the input rate. We also propose a Storm-based SPS, named PA-SPS, which uses this predictive model and requires no reconfiguration reboot when the number of operator replicas changes. PA-SPS also implements a load balancer that distributes incoming events evenly among the replicas of an operator. We have conducted experiments on Google Cloud Platform (GCP) to evaluate PA-SPS using real traffic traces of different applications, and have also compared it with Storm and other existing SPSs.
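A replica count driven by the predicted input rate, the operator's execution time, and the current queue length might be computed as follows. This is an illustrative formula under simple throughput assumptions, not PA-SPS's exact model:

```c
#include <assert.h>

/* Illustrative replica sizing: each replica processes 1/exec_time events
 * per second, and the operator must absorb the predicted arrival rate plus
 * drain its backlog within the next adaptation window. */
static int replicas_needed(double predicted_rate,  /* events per second */
                           double exec_time,       /* seconds per event */
                           double queued,          /* backlogged events */
                           double window) {        /* seconds */
    double demand = predicted_rate + queued / window;  /* events/s to serve */
    double x = demand * exec_time;                     /* replicas, as real */
    int r = (int)x;
    if ((double)r < x) r++;                            /* ceiling */
    return r < 1 ? 1 : r;                              /* keep one replica */
}
```

Rounding up rather than down biases the system toward over-provisioning, which trades a little resource usage for avoiding queue growth during rate spikes.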

Citations: 0
Meta-Fed IDS: Meta-learning and Federated learning based fog-cloud approach to detect known and zero-day cyber attacks in IoMT networks
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-06-05 | DOI: 10.1016/j.jpdc.2024.104934
Umer Zukaib , Xiaohui Cui , Chengliang Zheng , Dong Liang , Salah Ud Din

The Internet of Medical Things (IoMT) is a transformative fusion of medical sensors, equipment, and the Internet of Things, poised to reshape healthcare. However, security and privacy concerns hinder widespread IoMT adoption, a problem intensified by the scarcity of high-quality datasets for developing effective security solutions. To address these challenges, we propose a novel framework for cyberattack detection in dynamic IoMT networks. This framework integrates Federated Learning with Meta-learning, employing a multi-phase architecture to identify known attacks, and incorporates advanced clustering and biased classifiers to address zero-day attacks. Its deployment adapts to dynamic and diverse environments, utilizing an Infrastructure-as-a-Service (IaaS) model on the cloud and a Software-as-a-Service (SaaS) model at the fog end. To reflect real-world scenarios, we introduce a specialized IoMT dataset. Our experimental results indicate high accuracy and low misclassification rates, demonstrating the framework's capability to detect cyber threats in complex IoMT environments. This approach shows significant promise for bolstering cybersecurity in advanced healthcare technologies.

Citations: 0
DRACO: Distributed Resource-aware Admission Control for large-scale, multi-tier systems
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-06-04 | DOI: 10.1016/j.jpdc.2024.104935
Domenico Cotroneo, Roberto Natella, Stefano Rosiello

Modern distributed systems are designed to manage overload conditions by throttling, through overload control techniques, the excess traffic that cannot be served. However, the adoption of large-scale NoSQL datastores makes systems vulnerable to unbalanced overloads, in which specific datastore nodes are overloaded because of hot-spot resources and hogs. In this paper, we propose DRACO, a novel overload control solution that is aware of data dependencies between the application and datastore tiers. DRACO performs selective admission control of application requests, dropping only those that map to resources on overloaded datastore nodes, while achieving high resource utilization on non-overloaded datastore nodes. We evaluate DRACO on two case studies with high availability and performance requirements: a virtualized IP Multimedia Subsystem and a distributed fileserver. Results show that the solution achieves high performance and resource utilization even under extreme overload conditions, up to 100x the engineered capacity.
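Selective admission control of this kind can be sketched as follows. This is a toy model assuming a simple modulo key-to-node mapping and a per-node load threshold; DRACO's actual dependency tracking between tiers is richer:

```c
#include <assert.h>

#define NODES 4

/* Dependency-aware admission sketch: a request is admitted only if the
 * datastore node its key maps to is below its load threshold, so traffic
 * destined for healthy nodes is never throttled. */
static int node_load[NODES];                          /* current load */
static int node_capacity[NODES] = {2, 2, 2, 2};      /* thresholds */

static int key_to_node(unsigned key) { return (int)(key % NODES); }

static int admit(unsigned key) {
    int n = key_to_node(key);
    if (node_load[n] >= node_capacity[n])
        return 0;          /* reject: this request's target is overloaded */
    node_load[n]++;        /* accept and account for the induced work */
    return 1;
}
```

The key point the sketch preserves is selectivity: a hot-spot on one node rejects only requests mapping to that node, instead of shedding traffic uniformly.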

Citations: 0
Front Matter 1 - Full Title Page (regular issues)/Special Issue Title page (special issues)
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-06-03 | DOI: 10.1016/S0743-7315(24)00094-7
Citations: 0
Optimizing CNN inference speed over big social data through efficient model parallelism for sustainable web of things
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-05-31 | DOI: 10.1016/j.jpdc.2024.104927
Yuhao Hu , Xiaolong Xu , Muhammad Bilal , Weiyi Zhong , Yuwen Liu , Huaizhen Kou , Lingzhen Kong

The rapid development of artificial intelligence and networking technologies has catalyzed the popularity of intelligent services based on deep learning in recent years, which in turn fosters the advancement of the Web of Things (WoT). Big social data (BSD) plays an important role in the processing of intelligent services in WoT. However, intelligent BSD services are computationally intensive and require ultra-low latency, which end or edge devices with limited computing power cannot deliver. Distributed inference of deep neural networks (DNNs), which allocates the computing load of a DNN to several devices, is considered a feasible solution. In this work, an efficient model parallelism method that couples convolution layer (Conv) splitting with resource allocation is proposed. First, given a random computing-resource allocation strategy, the Conv split decision is made through a mathematical analysis method to realize parallel inference of convolutional neural networks (CNNs). Next, Deep Reinforcement Learning is used to obtain the optimal computing-resource allocation strategy, maximizing the resource utilization rate and minimizing the CNN inference latency. Finally, simulation results show that our approach outperforms the baselines and is applicable to BSD services in WoT under high workloads.
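Splitting a convolution's output across devices can be illustrated with a 1D valid convolution, where each "device" computes a contiguous slice of the output so the slices run in parallel and concatenate to the full result. This is our toy sketch, not the paper's Conv-split analysis:

```c
#include <assert.h>

/* One device's share of a 1D valid convolution: it computes out_count
 * output elements starting at out_start. The caller guarantees
 * in_len >= out_start + out_count + k_len - 1, i.e. the input slice
 * (with kernel-width overlap) is available on this device. */
static void conv1d_slice(const float *in, int in_len,
                         const float *k, int k_len,
                         int out_start, int out_count, float *out) {
    (void)in_len;  /* bound checked by the caller in this sketch */
    for (int o = 0; o < out_count; o++) {
        float acc = 0.0f;
        for (int j = 0; j < k_len; j++)
            acc += in[out_start + o + j] * k[j];
        out[o] = acc;
    }
}
```

Note the overlap of k_len - 1 input elements between neighboring slices: that halo is the communication cost any Conv-split scheme has to weigh against the parallel speedup.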

Citations: 0
Topo: Towards a fine-grained topological data processing framework on Tianhe-3 supercomputer
IF 3.8 | CAS Tier 3 (Computer Science) | Q1 Mathematics | Pub Date: 2024-05-31 | DOI: 10.1016/j.jpdc.2024.104926
Nan Hu , Yutong Lu , Zhuo Tang , Zhiyong Liu , Dan Huang , Zhiguang Chen

Big data frameworks are widely deployed in supercomputers for analyzing large-scale datasets. Topological data processing is an emerging approach that focuses on analyzing the topological structures in high-dimensional scientific data. However, incorporating topological data processing into current big data frameworks presents three main challenges: (1) The frequent data exchange poses challenges to the traditional coarse-grained parallelism. (2) The spatial topology makes parallel programming harder using oversimplified MapReduce APIs. (3) The massive intermediate data and NUMA architecture hinder resource utilization and scalability on novel supercomputers and many-core processors.

In this paper, we present Topo, a generic distributed framework that enhances topological data processing on many-core supercomputers. Topo relies on three concepts. (1) It employs fine-grained parallelism, with awareness of topological structures in datasets, to support interactions among collaborative workers before each shuffle phase. (2) It provides intuitive APIs for topological data operations. (3) It implements efficient collective I/O and NUMA-aware dynamic task scheduling to improve multi-threading and load balancing. We evaluate Topo's performance on the Tianhe-3 supercomputer, which utilizes state-of-the-art ARM many-core processors. Experimental results of execution time show that compared to popular frameworks, Topo achieves an average speedup of 5.3× and 6.3×, with a maximum speedup of 8.4× and 20×, on HPC workloads and big data benchmarks, respectively. Topo further reduces total execution time on processing skewed datasets by 41%.

引用次数: 0
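The NUMA-aware dynamic task scheduling described in the Topo abstract can be illustrated with a toy sketch: each NUMA domain keeps its own queue, a worker drains its local queue first and steals from remote domains only when the local one is empty. The class name and stealing policy here are assumptions for illustration, not Topo's actual implementation.

```python
from collections import deque

class NumaAwareScheduler:
    """Toy sketch of NUMA-aware dynamic task scheduling:
    one queue per NUMA domain, local-first dispatch with
    remote stealing as a fallback (illustrative only)."""

    def __init__(self, num_domains):
        self.queues = [deque() for _ in range(num_domains)]

    def submit(self, task, domain):
        # Enqueue the task on the domain that holds its data,
        # so a local worker usually picks it up.
        self.queues[domain].append(task)

    def next_task(self, domain):
        # Prefer NUMA-local work to avoid remote memory traffic.
        if self.queues[domain]:
            return self.queues[domain].popleft()
        # Local queue empty: steal from any non-empty remote queue
        # to keep load balanced.
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```

A worker on domain 0 would loop on `next_task(0)` until it returns `None`; tasks submitted to domain 1 are only taken once domain 0's queue is drained.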
Routing and wavelength assignment for folded hypercube in linear array WDM optical networks 线性阵列波分复用光学网络中折叠超立方体的路由和波长分配
IF 3.8 3区 计算机科学 Q1 Mathematics Pub Date : 2024-05-28 DOI: 10.1016/j.jpdc.2024.104924
V. Vinitha Navis, A. Berin Greeni

The folded hypercube is one of the hypercube variants and is of great significance in the study of interconnection networks. In a folded hypercube, information can be broadcast using efficient distributed algorithms. In the context of parallel computing, the folded hypercube has been studied as a possible network topology and an alternative to the hypercube. The routing and wavelength assignment (RWA) problem is significant, since solving it efficiently improves the performance of wavelength-routed all-optical networks constructed using the wavelength division multiplexing (WDM) approach. Given the physical network topology, the aim of the RWA problem is to establish routes for the connection requests and to assign the fewest possible wavelengths in accordance with the wavelength continuity and distinct wavelength constraints. This paper discusses the RWA problem in a linear array for the folded hypercube communication pattern by using the congestion technique.

折叠超立方体是超立方体的变体之一,在互连网络研究中具有重要意义。在折叠超立方体中,可以使用高效的分布式算法进行信息广播。在并行计算的背景下,折叠超立方体作为超立方体的一种可能的网络拓扑结构得到了研究。路由和波长分配(RWA)问题非常重要,因为它能提高使用波分复用方法构建的波长路由全光网络的性能。鉴于物理网络拓扑结构,RWA 问题的目的是为连接请求建立路由,并根据波长连续性和不同波长约束条件分配尽可能少的波长。本文利用拥塞技术讨论了折叠超立方通信模式线性阵列中的 RWA 问题。
{"title":"Routing and wavelength assignment for folded hypercube in linear array WDM optical networks","authors":"V. Vinitha Navis,&nbsp;A. Berin Greeni","doi":"10.1016/j.jpdc.2024.104924","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104924","url":null,"abstract":"<div><p>The folded hypercube is one of the hypercube variants and is of great significance in the study of interconnection networks. In a folded hypercube, information can be broadcast using efficient distributed algorithms. In the context of parallel computing, folded hypercube has been studied as a possible network topology as an alternative to the hypercube. The routing and wavelength assignment (RWA) problem is significant, since it improves the performance of wavelength-routed all-optical networks constructed using wavelength division multiplexing approach. Given the physical network topology, the aim of the RWA problem is to establish routes for the connection requests and assign the fewest possible wavelengths in accordance with the wavelength continuity and distinct wavelength constraints. This paper discusses the RWA problem in a linear array for the folded hypercube communication pattern by using the congestion technique.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141243573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
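The folded hypercube FQ_n discussed above is the n-dimensional hypercube augmented with one extra edge between each vertex and its bitwise complement. A minimal sketch of its adjacency, with a plain BFS route between two vertices (illustrative only; the paper's routing is based on the congestion technique, which is not reproduced here):

```python
from collections import deque

def folded_hypercube_neighbors(v: int, n: int) -> list[int]:
    """Neighbors of vertex v in FQ_n: the n hypercube neighbors
    (flip one bit) plus the complementary vertex (flip all bits)."""
    mask = (1 << n) - 1
    nbrs = [v ^ (1 << i) for i in range(n)]  # ordinary hypercube edges
    nbrs.append(v ^ mask)                    # the extra "folding" edge
    return nbrs

def shortest_route(src: int, dst: int, n: int) -> list[int]:
    """BFS shortest path in FQ_n, returned as a vertex list."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for w in folded_hypercube_neighbors(u, n):
            if w not in prev:
                prev[w] = u
                q.append(w)
    # Reconstruct the path by walking the predecessor chain backwards.
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]
```

The folding edge is what shrinks the diameter: in FQ_3, vertex 0 reaches its complement 7 in a single hop instead of three.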
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems 分布式系统上高阶有限元离散矩阵多向量积的快速硬件感知无矩阵算法
IF 3.8 3区 计算机科学 Q1 Mathematics Pub Date : 2024-05-27 DOI: 10.1016/j.jpdc.2024.104925
Gourab Panigrahi , Nikhil Kodali , Debashis Panda , Phani Motamarri

Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on a very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to underlying hardware architectures, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we develop strategies to overlap compute with data movement for achieving efficient pipelining and reduced data accesses through the use of GPU shared memory, constant memory and kernel fusion. Our implementation outperforms the baselines for Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. 

与传统的稀疏矩阵方法相比,最近用于高阶有限元(FE)离散矩阵-矢量乘法的硬件感知无矩阵算法减少了浮点运算和数据访问成本。在这项工作中,我们解决了现有无矩阵实现中的一个关键问题,即现有无矩阵实现不太适合 FE 离散矩阵对大量向量的作用。特别是,我们提出了在多节点 CPU 和 GPU 架构上评估 FE 离散矩阵-多向量乘积的高效无矩阵算法。为此,我们采用分批评估策略,根据底层硬件架构调整批量大小,从而获得更好的数据局部性,并进一步实现并行化。在 CPU 上,我们采用偶数分解、SIMD 矢量化以及重叠计算和通信策略。在 GPU 上,我们开发了计算与数据移动重叠策略,通过使用 GPU 共享内存、常量内存和内核融合,实现高效流水线和减少数据访问。对于 1024 向量上的亥姆霍兹算子动作,我们的实现优于基线,在一个 CPU 节点上实现了高达 1.4 倍的改进,在一个 GPU 节点上实现了高达 2.8 倍的改进,而在 CPU(3072 个内核)和 GPU(24 个 GPU)的多个节点上分别实现了高达 4.4 倍和 1.5 倍的改进。我们还采用切比雪夫过滤子空间迭代法,进一步对所提出的实现方法在求解 1024 个最小特征值-特征向量对的模型特征值问题时的性能进行了基准测试,结果表明,在一个 CPU 节点上,性能提高了 1.5 倍;在一个 GPU 节点上,性能提高了 2.2 倍;而在多节点 CPU(3072 个内核)和 GPU(24 个 GPU)上,性能分别提高了 3.0 倍和 1.4 倍。
{"title":"Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems","authors":"Gourab Panigrahi ,&nbsp;Nikhil Kodali ,&nbsp;Debashis Panda ,&nbsp;Phani Motamarri","doi":"10.1016/j.jpdc.2024.104925","DOIUrl":"https://doi.org/10.1016/j.jpdc.2024.104925","url":null,"abstract":"<div><p>Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating point operations and data access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations which are not well suited for the action of FE discretized matrices on very large number of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batchsize tailored to underlying hardware architectures, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and overlapping computation and communication strategies. On GPUs, we develop strategies to overlap compute with data movement for achieving efficient pipelining and reduced data accesses through the use of GPU-shared memory, constant memory and kernel fusion. Our implementation outperforms the baselines for Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. 
We further benchmark the performance of the proposed implementation for solving a model eigenvalue problem for 1024 smallest eigenvalue-eigenvector pairs by employing the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node while reaching up to 3.0x and 1.4x improvement on multi-node CPUs (3072 cores) and GPUs (24 GPUs), respectively.</p></div>","PeriodicalId":54775,"journal":{"name":"Journal of Parallel and Distributed Computing","volume":null,"pages":null},"PeriodicalIF":3.8,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141291809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
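The core idea of a matrix-free operator action on a multivector, looping over elements and scatter-adding local contributions instead of assembling a global sparse matrix, can be sketched for a 1D linear-FE stiffness operator. This is a toy illustration under simplifying assumptions (uniform mesh, no boundary conditions); the paper's batched, SIMD-vectorized kernels are far more involved.

```python
def matfree_apply(multivec, h=1.0):
    """Apply the 1D linear-FE stiffness operator to a batch of
    vectors without assembling the global matrix.  Each element
    contributes its local 2x2 stiffness [[1,-1],[-1,1]]/h via
    scatter-add; the outer loop is the batch over vectors."""
    n = len(multivec[0])                      # nodes per vector
    out = [[0.0] * n for _ in multivec]
    for b, x in enumerate(multivec):          # batch loop over vectors
        for e in range(n - 1):                # loop over elements (cells)
            xl, xr = x[e], x[e + 1]
            out[b][e]     += ( xl - xr) / h   # local stiffness row 0
            out[b][e + 1] += (-xl + xr) / h   # local stiffness row 1
    return out
```

Applied to a linear nodal field the interior entries cancel to zero, which is a quick sanity check that the element-level scatter-add reproduces the assembled operator's action.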