A multi-level parallel approach to increase the computation efficiency of a global ocean temperature dataset reconstruction
Huifeng Yuan, Lijing Cheng, Yuying Pan, Zhetao Tan, Qian Liu, Zhong Jin
Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104938
There is an increasing need to provide real-time datasets for climate monitoring and applications. However, the data products currently released by international groups are delayed by at least a month. One reason for this delay is the long computing time of the global reconstruction algorithm (the so-called mapping approach). To tackle this issue, this paper proposes a multi-level parallel computing model that improves the efficiency of data construction through parallelization of the computation, reduced branch-prediction overhead, improved data spatial locality and cache utilization, and other measures. The model has been applied to the mapping approach proposed by the Institute of Atmospheric Physics (IAP), which underlies one of the world's most widely used data products in the ocean and climate field. Compared with the traditional serial MATLAB-based construction on a single node, the parallel-optimized construction is ∼4.7 times faster. A large-scale parallel experiment on a long-term (∼1000 months) gridded dataset using over 16,000 processor cores demonstrates the model's scalability, with a speedup of ∼1200 times. In summary, this new model represents another example of the application of high-performance computing in oceanography and climatology.
Using hardware-transactional-memory support to implement speculative task execution
Juan Salamanca, Alexandro Baldassin
Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104939
Loops account for most of the execution time of computer programs, so optimizing them to run as fast as possible is an ongoing task. This task is far from trivial; on the contrary, it remains an open area of research, since many irregular loops are hard to parallelize. Such loops generally have loop-carried (DOACROSS) dependencies, and whether a dependence actually manifests may depend on the execution context. Many techniques have been studied to parallelize these loops efficiently; the OpenMP standard, for example, still offers no efficient way to do so. This article presents Speculative Task Execution (STE), a technique that executes OpenMP tasks speculatively to accelerate certain hot-code regions (such as loops) marked by OpenMP directives. It also presents a detailed analysis of using Hardware Transactional Memory (HTM) support to execute tasks speculatively and a careful evaluation of the STE implementation over HTM on modern machines. In particular, we consider the scenario in which speculative tasks are generated by the OpenMP taskloop construct (Speculative Taskloop, STL). As a result, it provides evidence to support several important claims about the performance of STE over HTM on modern processor architectures. Experimental results reveal that: (a) by implementing STL on top of HTM for hot-code regions, speed-ups of up to 5.39× can be obtained on IBM POWER8 and up to 2.41× on Intel processors using 4 cores; and (b) STL-ROT, a variant of STL using rollback-only transactions (ROTs), achieves speed-ups of up to 17.70× on an IBM POWER9 processor using 20 cores.
PA-SPS: A predictive adaptive approach for an elastic stream processing system
Daniel Wladdimiro, Luciana Arantes, Pierre Sens, Nicolás Hidalgo
Pub Date: 2024-06-14 | DOI: 10.1016/j.jpdc.2024.104940
Stream Processing Systems (SPSs) dynamically process input events. Since the input is usually not a constant flow and presents rate fluctuations, many works in the literature propose to dynamically replicate SPS operators, aiming to reduce the processing bottleneck induced by such fluctuations. However, these SPSs do not consider the problem of load balancing across replicas or the cost of reconfiguring the system whenever the number of replicas changes. In this paper we present a predictive model which, based on input rate variation, operator execution times, and queued events, dynamically determines the number of replicas each operator currently needs. A predictor composed of different models (mathematical and Machine Learning ones) predicts the input rate. We also propose a Storm-based SPS, named PA-SPS, which uses this predictive model and does not require a restart when the number of operator replicas changes. PA-SPS also implements a load balancer that distributes incoming events evenly among the replicas of an operator. We have conducted experiments on Google Cloud Platform (GCP) to evaluate PA-SPS using real traffic traces of different applications, and compared it with Storm and other existing SPSs.
Meta-Fed IDS: Meta-learning and Federated learning based fog-cloud approach to detect known and zero-day cyber attacks in IoMT networks
Umer Zukaib, Xiaohui Cui, Chengliang Zheng, Dong Liang, Salah Ud Din
Pub Date: 2024-06-05 | DOI: 10.1016/j.jpdc.2024.104934
The Internet of Medical Things (IoMT) is a transformative fusion of medical sensors, equipment, and the Internet of Things, positioned to transform healthcare. However, security and privacy concerns hinder widespread IoMT adoption, a problem intensified by the scarcity of high-quality datasets for developing effective security solutions. Addressing these challenges, we propose a novel framework for cyberattack detection in dynamic IoMT networks. The framework integrates Federated Learning with Meta-learning, employing a multi-phase architecture to identify known attacks, and incorporates advanced clustering and biased classifiers to address zero-day attacks. Its deployment is adaptable to dynamic and diverse environments, using an Infrastructure-as-a-Service (IaaS) model on the cloud and a Software-as-a-Service (SaaS) model at the fog end. To reflect real-world scenarios, we introduce a specialized IoMT dataset. Our experimental results indicate high accuracy and low misclassification rates, demonstrating the framework's capability to detect cyber threats in complex IoMT environments. This approach shows significant promise in bolstering cybersecurity in advanced healthcare technologies.
DRACO: Distributed Resource-aware Admission Control for large-scale, multi-tier systems
Domenico Cotroneo, Roberto Natella, Stefano Rosiello
Pub Date: 2024-06-04 | DOI: 10.1016/j.jpdc.2024.104935
Modern distributed systems are designed to manage overload conditions by throttling, through overload control techniques, the excess traffic that cannot be served. However, the adoption of large-scale NoSQL datastores makes systems vulnerable to unbalanced overloads, where specific datastore nodes are overloaded because of hot-spot resources and hogs. In this paper, we propose DRACO, a novel overload control solution that is aware of data dependencies between the application and datastore tiers. DRACO performs selective admission control of application requests, dropping only those that map to resources on overloaded datastore nodes, while achieving high resource utilization on non-overloaded datastore nodes. We evaluate DRACO on two case studies with high availability and performance requirements: a virtualized IP Multimedia Subsystem and a distributed fileserver. Results show that the solution can achieve high performance and resource utilization even under extreme overload conditions, up to 100x the engineered capacity.
Optimizing CNN inference speed over big social data through efficient model parallelism for sustainable web of things
Yuhao Hu, Xiaolong Xu, Muhammad Bilal, Weiyi Zhong, Yuwen Liu, Huaizhen Kou, Lingzhen Kong
Pub Date: 2024-05-31 | DOI: 10.1016/j.jpdc.2024.104927
The rapid development of artificial intelligence and networking technologies has catalyzed the popularity of intelligent services based on deep learning in recent years, which in turn fosters the advancement of the Web of Things (WoT). Big social data (BSD) plays an important role in the processing of intelligent services in WoT. However, intelligent BSD services are computationally intensive and require ultra-low latency, and end or edge devices with limited computing power cannot achieve the extremely low response latency those services demand. Distributed inference, which allocates the computing load of a deep neural network (DNN) across several devices, is considered a feasible solution. In this work, an efficient model parallelism method that couples convolution layer (Conv) splitting with resource allocation is proposed. First, given a random computing resource allocation strategy, the Conv split decision is made through a mathematical analysis method to realize parallel inference of convolutional neural networks (CNNs). Next, Deep Reinforcement Learning is used to obtain the optimal computing resource allocation strategy that maximizes the resource utilization rate and minimizes the CNN inference latency. Finally, simulation results show that our approach performs better than the baselines and is applicable to BSD services in WoT with a high workload.
Topo: Towards a fine-grained topological data processing framework on Tianhe-3 supercomputer
Nan Hu, Yutong Lu, Zhuo Tang, Zhiyong Liu, Dan Huang, Zhiguang Chen
Pub Date: 2024-05-31 | DOI: 10.1016/j.jpdc.2024.104926
Big data frameworks are widely deployed on supercomputers for analyzing large-scale datasets. Topological data processing is an emerging approach that focuses on analyzing the topological structures in high-dimensional scientific data. However, incorporating topological data processing into current big data frameworks presents three main challenges: (1) frequent data exchange challenges traditional coarse-grained parallelism; (2) spatial topology makes parallel programming harder with oversimplified MapReduce APIs; and (3) massive intermediate data and the NUMA architecture hinder resource utilization and scalability on novel supercomputers and many-core processors.
In this paper, we present Topo, a generic distributed framework that enhances topological data processing on many-core supercomputers. Topo relies on three concepts. (1) It employs fine-grained parallelism, with awareness of the topological structures in datasets, to support interactions among collaborative workers before each shuffle phase. (2) It provides intuitive APIs for topological data operations. (3) It implements efficient collective I/O and NUMA-aware dynamic task scheduling to improve multi-threading and load balancing. We evaluate Topo's performance on the Tianhe-3 supercomputer, which uses state-of-the-art ARM many-core processors. Execution-time results show that, compared to popular frameworks, Topo achieves average speedups of 5.3× and 6.3×, with maximum speedups of 8.4× and 20×, on HPC workloads and big data benchmarks, respectively. Topo further reduces the total execution time for processing skewed datasets by 41%.
Routing and wavelength assignment for folded hypercube in linear array WDM optical networks
V. Vinitha Navis, A. Berin Greeni
Pub Date: 2024-05-28 | DOI: 10.1016/j.jpdc.2024.104924
The folded hypercube is one of the hypercube variants and is of great significance in the study of interconnection networks. In a folded hypercube, information can be broadcast using efficient distributed algorithms. In the context of parallel computing, the folded hypercube has been studied as a possible network topology and as an alternative to the hypercube. The routing and wavelength assignment (RWA) problem is significant because its solution determines the performance of wavelength-routed all-optical networks built with the wavelength-division multiplexing approach. Given the physical network topology, the aim of the RWA problem is to establish routes for the connection requests and assign the fewest possible wavelengths while respecting the wavelength-continuity and distinct-wavelength constraints. This paper addresses the RWA problem in a linear array for the folded hypercube communication pattern using the congestion technique.
Fast hardware-aware matrix-free algorithms for higher-order finite-element discretized matrix multivector products on distributed systems
Gourab Panigrahi, Nikhil Kodali, Debashis Panda, Phani Motamarri
Pub Date: 2024-05-27 | DOI: 10.1016/j.jpdc.2024.104925
Recent hardware-aware matrix-free algorithms for higher-order finite-element (FE) discretized matrix-vector multiplications reduce floating-point operations and data-access costs compared to traditional sparse matrix approaches. In this work, we address a critical gap in existing matrix-free implementations, which are not well suited for the action of FE discretized matrices on very large numbers of vectors. In particular, we propose efficient matrix-free algorithms for evaluating FE discretized matrix-multivector products on both multi-node CPU and GPU architectures. To this end, we employ batched evaluation strategies, with the batch size tailored to the underlying hardware architecture, leading to better data locality and enabling further parallelization. On CPUs, we utilize even-odd decomposition, SIMD vectorization, and strategies for overlapping computation and communication. On GPUs, we develop strategies to overlap compute with data movement, achieving efficient pipelining and reduced data accesses through the use of GPU shared memory, constant memory, and kernel fusion. Our implementation outperforms the baselines for the Helmholtz operator action on 1024 vectors, achieving up to 1.4x improvement on one CPU node and up to 2.8x on one GPU node, while reaching up to 4.4x and 1.5x improvement on multiple nodes for CPUs (3072 cores) and GPUs (24 GPUs), respectively. We further benchmark the proposed implementation for solving a model eigenvalue problem for the 1024 smallest eigenvalue-eigenvector pairs using the Chebyshev Filtered Subspace Iteration method, achieving up to 1.5x improvement on one CPU node and up to 2.2x on one GPU node, while reaching up to 3.0x and 1.4x improvement on multi-node CPUs (3072 cores) and GPUs (24 GPUs), respectively.