Pub Date: 2025-02-10 DOI: 10.1016/j.parco.2025.103126
Jaroslav Olha, Jana Hozzová, Matej Antol, Jiří Filipovič
Many state-of-the-art HPC applications rely on autotuning to maintain peak performance. Autotuning allows a program to be re-optimized for new hardware, settings, or input — even during execution. However, the approach has an inherent problem that has yet to be properly addressed: since the autotuning process itself requires computational resources, it is also subject to optimization. In other words, while autotuning aims to decrease a program’s run time by improving its efficiency, it also introduces additional overhead that can extend the overall run time. To achieve optimal performance, both the application and the autotuning process should be optimized together, treating them as a single optimization criterion. This framing allows us to determine a reasonable tuning budget to avoid both undertuning, where insufficient autotuning leads to suboptimal performance, and overtuning, where excessive autotuning imposes overhead that outweighs the benefits of program optimization.
In this paper, we explore the tuning budget optimization problem in detail, highlighting its interesting properties and implications, which have largely been overlooked in the literature. Additionally, we present several viable solutions for tuning budget optimization and evaluate their efficiency across a range of commonly used HPC kernels.
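The trade-off between undertuning and overtuning described above can be illustrated with a toy cost model. The sketch below is an assumption made for illustration, not the authors' method: it supposes the application runs a kernel a fixed number of times, that each tuning trial executes one of those invocations with a candidate configuration plus a fixed overhead, and that the candidate run times (hypothetical numbers) are known in hindsight.

```python
def total_runtime(candidate_times, budget, invocations, overhead_per_trial=0.0):
    """Total time when the first `budget` candidates are tried, then the best one is reused."""
    tried = candidate_times[:budget]
    tuning_cost = sum(tried) + overhead_per_trial * budget
    best = min(tried)
    # The remaining invocations all run with the best configuration found so far.
    return tuning_cost + (invocations - budget) * best


def best_budget(candidate_times, invocations, overhead_per_trial=0.0):
    """Hindsight-optimal number of trials (at least 1: the default configuration)."""
    max_budget = min(len(candidate_times), invocations)
    return min(
        range(1, max_budget + 1),
        key=lambda b: total_runtime(candidate_times, b, invocations, overhead_per_trial),
    )


# Hypothetical kernel times (seconds) in the order a tuner would try them.
times = [10.0, 12.0, 6.0, 9.0, 5.5, 7.0]
long_run = best_budget(times, invocations=20, overhead_per_trial=1.0)
short_run = best_budget(times, invocations=8, overhead_per_trial=1.0)
print(long_run, short_run)
```

In this toy model a longer-running application justifies a larger budget than a short one, because the cost of each extra trial is amortized over more invocations. A real tuner does not know the candidate times in advance, which is what makes estimating the budget a hard problem.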
Title: Estimating resource budgets to ensure autotuning efficiency. Journal: Parallel Computing, vol. 123, Article 103126.
Pub Date: 2025-01-20 DOI: 10.1016/j.parco.2025.103125
Henri Casanova, Arnaud Giersch, Arnaud Legrand, Martin Quinson, Frédéric Suter
Researchers in parallel and distributed computing (PDC) often resort to simulation because experiments conducted using a simulator can cover arbitrary experimental scenarios, are less resource-, labor-, and time-consuming than their real-world counterparts, and are perfectly repeatable and observable. Many frameworks have been developed to ease the development of PDC simulators, and these frameworks provide different levels of accuracy, scalability, versatility, extensibility, and usability. The SimGrid framework has been used by many PDC researchers to produce a wide range of simulators for over two decades. Its popularity is due to a strong emphasis on accuracy, scalability, and versatility, and has come in spite of shortcomings in extensibility and usability. Although SimGrid provided sensible simulation models for the common case, it was difficult for users to extend these models to meet domain-specific needs. Furthermore, SimGrid only provided relatively low-level simulation abstractions, making the implementation of a simulator of a complex system a labor-intensive undertaking. In this work we describe developments in the last decade that have vastly improved extensibility and usability, thus lowering or removing entry barriers for users to develop custom SimGrid simulators.
Title: Lowering entry barriers to developing custom simulators of distributed applications and platforms with SimGrid. Journal: Parallel Computing, vol. 123, Article 103125.
Pub Date: 2024-12-20 DOI: 10.1016/j.parco.2024.103124
Xiran Gao, Li Chen, Haoyu Wang, Huimin Cui, Xiaobing Feng
The sequential task flow (STF) model introduces implicit data dependences to exploit task-based parallelism, simplifying programming but also introducing non-negligible runtime overhead. On emerging cache-less, explicit inter-core message passing (EMP) architectures, the long latency of memory access further amplifies the runtime overhead of the traditional STF model, resulting in unsatisfactory performance.
This paper addresses two main components of the STF tasking runtime. We uncover abundant concurrency in the task dependence graph (TDG) building process through three sufficient conditions and put forward PBH, a parallelized TDG building algorithm with helpers, which mixes pipeline parallelism and data parallelism to overcome the TDG building bottleneck for fine-grained tasks. We also introduce a centralized, lock-less task scheduler, EMP-C, based on the EMP interface, and propose three optimizations. Both techniques are implemented and evaluated on a production processor with EMP support, the SW26010. Experimental results show that, compared to traditional techniques, PBH achieves an average speedup of 1.55 for fine-grained task workloads, and the EMP-C scheduler brings speedups as high as 1.52 and 2.38 for fine-grained and coarse-grained task workloads, respectively. Combined, the two techniques significantly improve the granularity scalability of the runtime, reducing the minimum effective task granularity (METG) to 0.1 ms and achieving an order-of-magnitude decrease in some cases.
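To make the notion of implicit data dependences concrete, here is a minimal serial sketch of how an STF runtime could derive a TDG from the read and write sets of tasks submitted in program order. It is the textbook sequential scheme, not the PBH algorithm from the paper; the task tuple layout and the function name `build_tdg` are assumptions made for this example.

```python
from collections import defaultdict


def build_tdg(tasks):
    """Build dependence edges from tasks given as (name, reads, writes) in submission order."""
    last_writer = {}                          # data item -> task that last wrote it
    readers_since_write = defaultdict(list)   # data item -> readers since that write
    edges = set()
    for name, reads, writes in tasks:
        for item in reads:
            if item in last_writer:
                edges.add((last_writer[item], name))   # read-after-write
        for item in writes:
            if item in last_writer:
                edges.add((last_writer[item], name))   # write-after-write
            for reader in readers_since_write[item]:
                if reader != name:
                    edges.add((reader, name))          # write-after-read
        # Update the bookkeeping only after this task's dependences are known.
        for item in reads:
            readers_since_write[item].append(name)
        for item in writes:
            last_writer[item] = name
            readers_since_write[item] = []
    return edges


# Hypothetical task stream: A produces x, B and C read x, D combines their outputs into x.
tasks = [
    ("A", [], ["x"]),
    ("B", ["x"], ["y"]),
    ("C", ["x"], ["z"]),
    ("D", ["y", "z"], ["x"]),
]
tdg = sorted(build_tdg(tasks))
print(tdg)
```

Every submitted task passes through this single loop, which is why TDG construction becomes a bottleneck when tasks are fine-grained and why parallelizing the builder is attractive.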
Title: Scalable tasking runtime with parallelized builders for explicit message passing architectures. Journal: Parallel Computing, vol. 123, Article 103124.
Pub Date: 2024-12-06 DOI: 10.1016/j.parco.2024.103123
Kasia Świrydowicz, Nicholson Koukpaizan, Maksudul Alam, Shaked Regev, Michael Saunders, Slaven Peleš
Linear solvers are major computational bottlenecks in a wide range of decision support and optimization computations. The challenges become even more pronounced on heterogeneous hardware, where traditional sparse numerical linear algebra methods are often inefficient. For example, methods for solving ill-conditioned linear systems have relied on conditional branching, which degrades performance on hardware accelerators such as graphics processing units (GPUs). To improve the efficiency of solving ill-conditioned systems, our computational strategy separates computations that are efficient on GPUs from those that need to run on traditional central processing units (CPUs). Our strategy maximizes the reuse of expensive CPU computations. Iterative methods, which thus far have not been broadly used for ill-conditioned linear systems, play an important role in our approach. In particular, we extend ideas from Arioli et al. (2007) to implement iterative refinement using inexact LU factors and flexible generalized minimal residual (FGMRES), with the aim of efficient performance on GPUs. We focus on solutions that are effective within broader application contexts, and discuss how early performance tests could be improved to be more predictive of the performance in a realistic environment.
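The idea of reusing inexact LU factors inside an iterative correction loop can be sketched in a few lines. The example below is plain stationary iterative refinement in pure Python, not the FGMRES-based scheme of the paper; the small matrix, the lack of pivoting, and the rounding of the factors to one decimal (a stand-in for factors that are stale or computed cheaply) are all assumptions chosen for illustration.

```python
def lu_nopivot(a):
    """Doolittle LU without pivoting (assumes nonzero pivots, e.g. diagonal dominance)."""
    n = len(a)
    lower = [[float(i == j) for j in range(n)] for i in range(n)]
    upper = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lower[i][k] = upper[i][k] / upper[k][k]
            for j in range(k, n):
                upper[i][j] -= lower[i][k] * upper[k][j]
    return lower, upper


def lu_solve(lower, upper, b):
    """Forward then backward substitution with the given triangular factors."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(lower[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(upper[i][j] * x[j] for j in range(i + 1, n))) / upper[i][i]
    return x


def residual(a, x, b):
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, lower, upper, tol=1e-12, max_iter=50):
    """Iterative refinement: repeatedly correct x using the (inexact) factors."""
    x = lu_solve(lower, upper, b)
    for iteration in range(max_iter):
        r = residual(a, x, b)
        if max(abs(v) for v in r) < tol:
            return x, iteration
        correction = lu_solve(lower, upper, r)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x, max_iter


A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
L, U = lu_nopivot(A)
# Deliberately degrade the factors to mimic an inexact factorization.
L_inexact = [[round(v, 1) for v in row] for row in L]
U_inexact = [[round(v, 1) for v in row] for row in U]
x, iterations = refine(A, b, L_inexact, U_inexact)
print(iterations, max(abs(v) for v in residual(A, x, b)))
```

The expensive factorization happens once, while each refinement step needs only triangular solves and a matrix-vector product, which are the kinds of operations that map well to GPUs.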
Title: Iterative methods in GPU-resident linear solvers for nonlinear constrained optimization. Journal: Parallel Computing, vol. 123, Article 103123.
Pub Date: 2024-11-13 DOI: 10.1016/j.parco.2024.103122
Zheng Miao, Jon C. Calhoun, Rong Ge
Exascale computing must simultaneously address both energy efficiency and resilience, as power limits impact scalability and faults become more common. Unfortunately, energy efficiency and resilience have traditionally been studied in isolation, and optimizing one typically harms the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize the costs of common resilience techniques, including checkpoint-restart and forward recovery. We focus on sparse linear solvers, as they are fundamental kernels in many scientific applications. In particular, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes on computer clusters, and we develop and prototype performance optimization and power management strategies to improve energy efficiency. Moreover, we take a deep dive into forward recovery, which has recently started to draw attention from researchers, and propose a practical matrix-aware optimization technique to reduce its recovery time. This work shows that while the time and energy costs of various resilience techniques differ, they share common components and can be quantitatively evaluated with a generalized framework. This analysis framework can be used to guide the design of performance and energy optimization technologies. While each resilience technique has its advantages depending on the fault rate, system size, and power budget, forward recovery can further benefit from matrix-aware optimizations for large-scale computing.
Title: Towards resilient and energy efficient scalable Krylov solvers. Journal: Parallel Computing, vol. 123, Article 103122.
Pub Date: 2024-11-08 DOI: 10.1016/j.parco.2024.103121
Xiaofeng Zou, Yuanxi Peng, Tuo Li, Lingjun Kong, Lu Zhang
The ML-KEM standard, based on the Kyber algorithm, is one of the post-quantum cryptography (PQC) standards released by the National Institute of Standards and Technology (NIST) to withstand quantum attacks. To increase throughput and reduce the execution time, which is limited by the high computational complexity of the Kyber algorithm, a RISC-V-based processor, Seesaw, is designed to accelerate the Kyber algorithm. Its 32 specialized extension instructions, derived from a thorough analysis of the algorithm's characteristics, are mainly designed to enhance the parallel computing ability of the processor and accelerate all the processes of the Kyber algorithm. Microarchitectural support for the extension instructions is then achieved by carefully designing hardware such as poly vector registers and algorithm execution units on the RISC-V processor. Seesaw supports 4096-bit vector calculations through its poly vector registers and execution unit to meet high-throughput requirements and is implemented on a field-programmable gate array (FPGA). In addition, we modify the compiler to support the instruction extensions and execution on Seesaw. Experimental results indicate that the processor achieves speed-ups of 432× and 18864× for hash and NTT, respectively, compared with execution without the extension instructions, and a speed-up of 5.6× for the execution of the Kyber algorithm compared with the advanced hardware design.
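As background for why the NTT is a prime acceleration target, the sketch below shows the operation it speeds up: multiplication of polynomials with 256 coefficients modulo q = 3329 and modulo x^256 + 1, the ring used by Kyber. This is the quadratic-time schoolbook method written for clarity, not the NTT itself and not anything taken from the Seesaw design; the function name is an assumption made for this example.

```python
Q = 3329   # Kyber / ML-KEM modulus
N = 256    # number of coefficients per polynomial


def polymul_negacyclic(a, b, n=N, q=Q):
    """Schoolbook product of a and b in Z_q[x] / (x^n + 1), using O(n^2) multiplications."""
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # x^n is congruent to -1, so wrapped terms are subtracted.
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


# x^255 * x = x^256, which equals -1 in this ring, i.e. q - 1 modulo q.
x_255 = [0] * N
x_255[255] = 1
x_1 = [0] * N
x_1[1] = 1
wrapped = polymul_negacyclic(x_255, x_1)
print(wrapped[0])
```

The NTT replaces this double loop with an O(n log n) transform followed by coefficient-wise products, and its regular butterfly structure is what wide vector hardware can exploit.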
Title: Seesaw: A 4096-bit vector processor for accelerating Kyber based on RISC-V ISA extensions. Journal: Parallel Computing, vol. 123, Article 103121.
Pub Date: 2024-10-10 DOI: 10.1016/j.parco.2024.103114
Fenglong Cai, Dong Yuan, Zhe Yang, Yonghui Xu, Wei He, Wei Guo, Lizhen Cui
Pre-trained models (PTMs) have demonstrated great success in a variety of NLP and CV tasks and have become a significant development in the field of deep learning. However, the large memory and high computational requirements associated with PTMs can increase the cost and time of inference, limiting their service provisioning in practical applications. To improve the Quality of Service (QoS) of PTM applications by reducing waiting and response times, we propose the FastPTM framework. This general framework aims to accelerate PTM inference services in a multi-tenant environment by reducing model loading time and switching overhead on GPUs. The framework utilizes a fast weights loading method based on weights and model separation of PTMs to efficiently accelerate parallel inference services in resource-constrained environments. Furthermore, an online scheduling algorithm is designed to reduce the inference service time. The results of the experiments indicate that FastPTM can improve the throughput of inference services by an average of 4x and up to 8.2x, while reducing the number of switches by 4.7x and the number of overtimes by 15.3x.
Title: FastPTM: Fast weights loading of pre-trained models for parallel inference service provisioning. Journal: Parallel Computing, vol. 122, Article 103114.
Pub Date: 2024-10-05 DOI: 10.1016/j.parco.2024.103113
Rahim Alizadeh, Shahriar Bijani, Fatemeh Shakeri
This paper presents an algorithm to solve the problem of estimating the largest eigenvalue and its corresponding eigenvector for irreducible matrices in a distributed manner. The proposed algorithm utilizes a network of computational nodes that interact with each other, forming a strongly connected digraph where each node handles one row of the matrix, without the need for centralized storage or knowledge of the entire matrix. Each node possesses a solution space, and the intersection of all these solution spaces contains the leading eigenvector of the matrix. Initially, each node selects a random vector from its solution space, and then, while interacting with its neighbors, updates the vector at each step by solving a quadratically constrained linear program (QCLP). The updates are done so that the nodes reach a consensus on the leading eigenvector of the matrix. The numerical outcomes demonstrate the effectiveness of our proposed method.
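For comparison, the quantity being estimated can be computed centrally with ordinary power iteration. The sketch below is that classical baseline, not the consensus-based QCLP algorithm of the paper: it keeps the row-per-node data layout (each list entry is one row), but every step uses the full current vector, which is exactly the global knowledge the distributed method avoids. It assumes a primitive non-negative matrix, for example one with all entries positive, so that the iteration converges.

```python
def power_iteration(rows, iters=200):
    """Estimate the leading eigenvalue and eigenvector of a primitive non-negative matrix."""
    n = len(rows)
    x = [1.0 / n] * n                      # positive start vector with entries summing to 1
    eigenvalue = 0.0
    for _ in range(iters):
        # Each "node" i would compute its own component from its row.
        y = [sum(a * v for a, v in zip(row, x)) for row in rows]
        eigenvalue = sum(y)                # equals the eigenvalue once x has converged
        x = [v / eigenvalue for v in y]    # renormalize so the entries sum to 1
    return eigenvalue, x


# Small positive matrix with eigenvalues 4 and -1; the leading eigenvector is (0.4, 0.6).
matrix = [[1.0, 2.0], [3.0, 2.0]]
leading_value, leading_vector = power_iteration(matrix)
print(leading_value, leading_vector)
```

Because the vector is normalized to sum to one, the sum of the product vector gives the eigenvalue estimate directly once the iteration has settled.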
Title: Distributed consensus-based estimation of the leading eigenvalue of a non-negative irreducible matrix. Journal: Parallel Computing, vol. 122, Article 103113.
Pub Date: 2024-09-21 DOI: 10.1016/j.parco.2024.103112
Adrian Schmitz, Semih Burak, Julian Miller, Matthias S. Müller
High-performance computing (HPC) systems enable scientific advances through simulation and data processing. The heterogeneity in HPC hardware and software increases the application complexity and reduces its maintainability and productivity. This work proposes a prototype implementation for a parallel pattern-based source-to-source compiler to address these challenges. The prototype limits the complexity of parallelism and heterogeneous architectures to parallel patterns that are optimized towards a given target architecture. By applying high-level optimizations and a mapping between parallel patterns and execution units during compile time, portability between systems is achieved. The compiler can address architectures with shared memory, distributed memory, and accelerator offloading.
The approach shows speedups for seven of the nine supported Rodinia benchmarks, reaching speedups of up to twelve times. Porting LULESH to the Parallel Pattern Language (PPL) shows a compression of code size by 65% (3.4 thousand lines of code) through a more concise expression and a higher level of abstraction. The tool’s limitations include dynamic algorithms that are challenging to analyze statically and overheads during the compile time optimization. This paper is an extended version of a previous PMAM publication (Schmitz et al., 2024).
Title: Parallel Pattern Compiler for Automatic Global Optimizations. Journal: Parallel Computing, vol. 122, Article 103112.
Pub Date : 2024-09-17 DOI: 10.1016/j.parco.2024.103111
Xingwang Huang, Min Xie, Dong An, Shubin Su, Zongliang Zhang
Task scheduling in cloud computing remains challenging in terms of performance. Several evolutionary-derived algorithms have been proposed to solve or alleviate this problem. Although evolutionary algorithms have good exploration ability, their performance drops significantly in high dimensions. To address this issue, and considering the characteristics of task scheduling in cloud computing (i.e., all task-VM mappings are 1-dimensional and have the same search range), we propose GWOEM, a task scheduling algorithm based on grey wolf optimization with a new encoding mechanism. Through this encoding mechanism, greedy and evolutionary algorithms are rationally integrated in GWOEM. In addition, the mechanism reduces the dimension of the search space to 1 and eliminates the key parameter (i.e., the population size). We apply GWOEM to the Google Cloud Jobs dataset (GoCJ) and demonstrate better performance than the prior state of the art in terms of makespan.
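To make the objective concrete, the following is a minimal sketch of how the makespan of a task-VM mapping can be evaluated, together with a simple greedy baseline. It is not the GWOEM algorithm: the abstract does not describe the encoding mechanism or the grey wolf update rules, so none of that is reproduced here. The function names, the task lengths, and the VM speeds are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def makespan(mapping: Sequence[int],
             task_lengths: Sequence[float],
             vm_speeds: Sequence[float]) -> float:
    """Return the finish time of the busiest VM for a task-to-VM mapping.

    mapping[i] is the index of the VM that runs task i (hypothetical encoding).
    """
    finish = [0.0] * len(vm_speeds)
    for task, vm in enumerate(mapping):
        finish[vm] += task_lengths[task] / vm_speeds[vm]
    return max(finish)


def greedy_mapping(task_lengths: Sequence[float],
                   vm_speeds: Sequence[float]) -> List[int]:
    """Greedy baseline: take tasks longest first and place each one on the
    VM that would complete it earliest given the load assigned so far."""
    finish = [0.0] * len(vm_speeds)
    mapping = [0] * len(task_lengths)
    order = sorted(range(len(task_lengths)),
                   key=lambda t: task_lengths[t], reverse=True)
    for task in order:
        vm = min(range(len(vm_speeds)),
                 key=lambda v: finish[v] + task_lengths[task] / vm_speeds[v])
        finish[vm] += task_lengths[task] / vm_speeds[vm]
        mapping[task] = vm
    return mapping


if __name__ == "__main__":
    lengths = [8.0, 4.0, 4.0, 2.0]   # illustrative task sizes
    speeds = [2.0, 1.0]              # illustrative VM speeds
    plan = greedy_mapping(lengths, speeds)
    print(plan, makespan(plan, lengths, speeds))
```

In this toy instance, placing every task on the faster VM gives a makespan of 9.0, while the greedy plan balances both VMs at 6.0. A metaheuristic such as the one proposed in the paper would search over mappings using an objective of this kind.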
Title: Task scheduling in cloud computing based on grey wolf optimization with a new encoding mechanism. Authors: Xingwang Huang, Min Xie, Dong An, Shubin Su, Zongliang Zhang. Journal: Parallel Computing, vol. 122, Article 103111. Published: 2024-09-17. DOI: 10.1016/j.parco.2024.103111