Task scheduling in cloud computing remains challenging in terms of performance. Several evolutionary algorithms have been proposed to solve or alleviate this problem; however, although evolutionary algorithms have good exploration ability, their performance drops significantly in high dimensions. To address this issue, and exploiting a characteristic of task scheduling in cloud computing (i.e., all task-VM mappings are one-dimensional and share the same search range), we propose a task scheduling algorithm based on grey wolf optimization with a new encoding mechanism (GWOEM). Through this new encoding mechanism, greedy and evolutionary algorithms are rationally integrated in GWOEM. Moreover, the new mechanism reduces the dimension of the search space to one and eliminates a key parameter (the population size). We apply the proposed GWOEM to the Google Cloud Jobs dataset (GoCJ) and demonstrate better performance than the prior state of the art in terms of makespan.
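As a rough illustration of the idea of searching a one-dimensional space with grey wolf optimization, the sketch below runs the canonical GWO position update on a single scalar and decodes that scalar into a task-VM assignment through a hypothetical greedy rule. The decode function, the fitness model, and all names (`makespan_from_scalar`, the load-vs-speed trade-off) are illustrative assumptions, not the paper's actual GWOEM encoding.

```cpp
// Minimal sketch: canonical grey wolf optimization (GWO) on a 1-D variable.
// The decode step is a HYPOTHETICAL greedy placeholder, not GWOEM's actual
// mechanism; it only shows how one scalar can parameterize a schedule.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Hypothetical decode: x in [0, 1] biases a greedy assignment of task
// runtimes onto VMs; the resulting makespan is the fitness to minimize.
double makespan_from_scalar(double x, const std::vector<double>& task_len,
                            const std::vector<double>& vm_speed) {
    std::vector<double> finish(vm_speed.size(), 0.0);
    for (double len : task_len) {
        std::size_t best = 0;
        for (std::size_t v = 1; v < vm_speed.size(); ++v) {
            // x trades off "least loaded" against "fastest VM" (illustrative).
            if (finish[v] + x * len / vm_speed[v] <
                finish[best] + x * len / vm_speed[best]) best = v;
        }
        finish[best] += len / vm_speed[best];
    }
    return *std::max_element(finish.begin(), finish.end());
}

double gwo_1d(const std::vector<double>& task_len,
              const std::vector<double>& vm_speed,
              int wolves = 8, int iters = 100) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    std::vector<double> pos(wolves);
    for (double& p : pos) p = u01(rng);

    auto fit = [&](double x) { return makespan_from_scalar(x, task_len, vm_speed); };
    for (int t = 0; t < iters; ++t) {
        // Rank wolves so pos[0..2] are alpha, beta, delta (the three best).
        std::sort(pos.begin(), pos.end(),
                  [&](double a, double b) { return fit(a) < fit(b); });
        double alpha = pos[0], beta = pos[1], delta = pos[2];
        double a = 2.0 - 2.0 * t / iters;  // control parameter decays 2 -> 0
        for (double& x : pos) {
            double next = 0.0;
            for (double leader : {alpha, beta, delta}) {
                double A = 2.0 * a * u01(rng) - a, C = 2.0 * u01(rng);
                next += leader - A * std::fabs(C * leader - x);  // GWO update
            }
            x = std::clamp(next / 3.0, 0.0, 1.0);
        }
    }
    return fit(pos[0]);  // best makespan found
}
```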
Performance optimization continues to be a challenge in modern HPC software. Existing performance optimization techniques, including profiling-based and auto-tuning techniques, fail to indicate program modifications at the source level, thus preventing portability across compilers. This paper describes Muppet, a new approach that identifies program modifications, called mutations, aimed at improving program performance. Muppet’s mutations help developers reason about performance defects and missed opportunities to improve performance at the source-code level. In contrast to compiler techniques that optimize code at the intermediate representation (IR) level, Muppet uses the idea of source-level mutation testing to relax correctness constraints and automatically discover optimization opportunities that are otherwise not feasible at the IR. We demonstrate Muppet’s concept in the OpenMP programming model. Muppet generates a list of OpenMP mutations that alter the program’s parallelism in various ways, and it can run a variety of optimization algorithms, such as delta debugging, Bayesian optimization, and decision-tree optimization, to find a subset of mutations which, when applied to the original program, causes the most speedup while maintaining program correctness. When Muppet is evaluated against a diverse set of benchmark programs and proxy applications, it finds sets of mutations that induce speedup in 75.9% of the evaluated programs.
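For context, an OpenMP mutation in this sense is a small source-level change to a parallelism directive. The fragment below shows a hypothetical candidate of the kind such a search could explore; the specific clause chosen here is an illustrative example, not one taken from the paper.

```cpp
#include <cstddef>
#include <vector>

// Original loop: default (typically static) schedule.
void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}

// One candidate mutation: the same loop with an altered scheduling clause.
// A search procedure (e.g., delta debugging or Bayesian optimization over
// the mutation set) would time both variants and keep the mutation only if
// it speeds the program up while the output stays correct.
void saxpy_mutated(std::vector<float>& y, const std::vector<float>& x, float a) {
    #pragma omp parallel for schedule(dynamic, 1024)
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}
```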
In many computational problems, memory throughput is the performance bottleneck, especially in the domain of parallel computing. Software needs to be attuned to hardware features such as cache architectures or concurrent memory banks to reach a decent level of performance efficiency. This can be achieved by selecting the right memory layouts for data structures or by changing the order in which data structures are traversed. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can easily be optimized for memory performance and employ semi-automated parallelization or autotuning without altering their internal code. We also add an abstraction for autotuning that allows tuning parameters to be defined in one place and removes boilerplate code. The proposed solution is implemented as an extension of the Noarr library, which simplifies the layout-agnostic design of regular data structures. It is implemented entirely in C++ template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC and Intel DPC++. We evaluate the performance and expressiveness of our approach on the Polybench-C benchmarks.
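The core idea of layout-agnostic design can be sketched without Noarr itself: the kernel is written against an abstract index mapping, and the concrete memory layout is supplied as a template parameter that can be swapped or autotuned. The sketch below is a generic C++ illustration of that design principle, not Noarr's actual API.

```cpp
#include <cstddef>
#include <vector>

// Two interchangeable layouts mapping a 2-D index to a linear offset.
struct RowMajor {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t i, std::size_t j) const { return i * cols + j; }
};
struct ColMajor {
    std::size_t rows, cols;
    std::size_t operator()(std::size_t i, std::size_t j) const { return j * rows + i; }
};

// A layout-agnostic kernel: data placement is chosen (or autotuned) by
// swapping the Layout parameter; the kernel body never changes.
template <typename Layout>
void scale(std::vector<float>& data, const Layout& at, float factor) {
    for (std::size_t i = 0; i < at.rows; ++i)
        for (std::size_t j = 0; j < at.cols; ++j)
            data[at(i, j)] *= factor;
}

int main() {
    std::vector<float> a(64 * 64, 1.0f);
    scale(a, RowMajor{64, 64}, 2.0f);  // same call site, different layout:
    scale(a, ColMajor{64, 64}, 2.0f);  // only the template argument changes
}
```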
Users’ limited understanding of the storage system architecture prevents them from fully utilizing the parallel I/O capability of the storage system, negatively impacting the overall performance of supercomputers. Exploring effective strategies for utilizing parallel I/O capabilities is therefore imperative. To this end, we analyze the workload on the Object Storage Targets (OSTs) of two production supercomputers and study potentially inefficient I/O patterns of high-performance computing jobs. Our findings indicate that under the traditional stripe settings most supercomputers use to ensure stability, the real-time load on OSTs is severely imbalanced. This imbalance results in I/O requests that fail to fully utilize the available OSTs. To tackle this issue, we propose a job-aware optimization approach comprising static and dynamic file striping. Static file striping optimizes all user jobs, whereas dynamic file striping clusters job names and job paths to extract similarities among jobs and predict partially stripe-optimizable jobs for users. Additionally, a stripe recovery mechanism mitigates the negative impact of stripe misconfigurations. The approach adjusts the file stripe layout to the job’s I/O pattern, mobilizing underutilized OSTs to enhance parallel I/O capability. Experiments verify that the approach increases the number of OSTs a job can use, effectively improving the job’s parallel I/O performance without significantly affecting operational stability.
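To make the stripe-layout adjustment concrete: on Lustre, a file's stripe count determines how many OSTs its data is spread across, and it can be set per file or per directory before data is written. The snippet below is a minimal illustration that shells out to the standard `lfs setstripe` command; the stripe count and stripe size are arbitrary example values, and the paper's job-aware logic for choosing them is not shown.

```cpp
#include <cstdlib>
#include <string>

// Illustrative only: set the stripe layout of a directory so that files
// created inside it are spread over more OSTs. `-c` requests the stripe
// count (number of OSTs) and `-S 4m` a 4 MiB stripe size; real values
// would be derived from the job's predicted I/O pattern.
int set_stripe(const std::string& dir, int stripe_count) {
    std::string cmd = "lfs setstripe -c " + std::to_string(stripe_count) +
                      " -S 4m " + dir;
    return std::system(cmd.c_str());
}
```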
Deadlock-free routing is a significant challenge in Network-on-Chip (NoC) design, as it affects the network’s latency, power consumption, and load balance, and thereby the performance of multi-processor systems-on-chip. However, achieving deadlock freedom routinely incurs expensive overhead: previous solutions either sacrifice performance or power efficiency to proactively avoid deadlocks, or impose high hardware complexity to resolve deadlocks reactively when they occur. Exploiting the particular characteristics of a NoC to implement deadlock-free routing can be significantly more cost-effective, with less impact on performance. This paper proposes a relay routing algorithm (NxtSPR) with a shortest-path property and a deadlock prevention mechanism based on a synchronized Hamiltonian ring. The proposal is based on an in-depth study of the characteristics of the Triplet-Based many-core Architecture (TriBA) NoC. We establish several important topology-related theorems and formally verify them with proofs. By utilizing the critical subgraph and apex of TriBA, NxtSPR pre-calculates downstream nodes’ forwarding ports for packets using a concise judgment strategy. This significantly reduces the computational overhead of data transmission while optimizing the router pipeline to decrease packet transmission latency and power consumption compared to other TriBA routing algorithms. We group data transmissions according to the level of the maximum Hamiltonian edge a packet traverses during its transmission life cycle. Data transmissions in different groups proceed independently, avoiding mutual interference and resource competition and thereby eliminating potential deadlocks. Gem5 simulation results show that, under synthetic traffic patterns, compared to a representative routing algorithm (Table) and an up-to-date one (SPR4T), NxtSPR reduces average packet latency by 20.19% and 14.76%, respectively, and per-packet power consumption by 5.54% and 4.66%. Moreover, it improves throughput by an average of 18.50% and 4.34% over them. PARSEC benchmark results show that NxtSPR reduces application runtime by up to 22.30% and 12.82% compared to Table and SPR4T, and running the same applications on TriBA yields a maximum runtime reduction of 10.77% compared to a 2D-Mesh.
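As a generic illustration of the pre-calculation idea (the actual TriBA topology rules and the Hamiltonian-ring grouping are specific to the paper and not reproduced here), the sketch below models a router whose forwarding decision is a table lookup filled offline, in contrast to computing the route at every hop; the fill rule shown is a dummy placeholder.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative router model: each node holds a table, computed offline,
// that maps a destination node id to the local output port. The fill rule
// below is a placeholder; NxtSPR derives the real entries from TriBA's
// topology properties (critical subgraph and apex).
struct Router {
    std::vector<std::uint8_t> next_port;  // indexed by destination node id

    explicit Router(std::size_t num_nodes) : next_port(num_nodes) {
        for (std::size_t dst = 0; dst < num_nodes; ++dst)
            next_port[dst] = static_cast<std::uint8_t>(dst % 3);  // placeholder rule
    }

    // O(1) forwarding decision on the critical path of packet transmission,
    // instead of a per-hop route computation.
    std::uint8_t forward(std::size_t dst) const { return next_port[dst]; }
};
```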