Markov modulated discrete arrival processes have an extensive literature, including parameter estimation methods based on expectation–maximization (EM). In this paper, we investigate the adaptation of these EM-based methods to Markov modulated fluid arrival processes (MMFAPs) and conclude that only the generator matrix of the modulating Markov chain of an MMFAP can be approximated by EM-based methods. For the remaining parameters, the fluid rates and the fluid variances, we investigate the efficiency of numerical likelihood maximization.
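As a rough illustration of this numerical likelihood maximization step, the sketch below fits fluid rates and variances by minimizing a negative log-likelihood with a derivative-free optimizer, holding the generator fixed (e.g., taken from an EM-based estimate). The stand-in likelihood, a Gaussian mixture weighted by the stationary distribution of the generator, is a deliberate simplification for illustration, not the MMFAP likelihood of the paper.

```python
# Minimal sketch (not the paper's method): numerically maximize a simplified
# stand-in likelihood over fluid rates r_i and variances s_i, with the
# generator Q held fixed. Observed fluid increments are crudely modeled as a
# mixture of Gaussians N(r_i, s_i) weighted by the stationary distribution.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def stationary_distribution(Q):
    # Solve pi Q = 0, pi 1 = 1 via a least-squares formulation.
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

def neg_log_lik(theta, Q, increments):
    n = Q.shape[0]
    rates, variances = theta[:n], np.exp(theta[n:])  # exp keeps s_i > 0
    pi = stationary_distribution(Q)
    dens = sum(pi[i] * norm.pdf(increments, rates[i], np.sqrt(variances[i]))
               for i in range(n))
    return -np.sum(np.log(dens + 1e-300))

# Toy usage with a two-state modulating chain and synthetic increments.
Q = np.array([[-1.0, 1.0], [2.0, -2.0]])
rng = np.random.default_rng(0)
increments = np.concatenate([rng.normal(0.5, 0.2, 300),
                             rng.normal(2.0, 0.4, 200)])
theta0 = np.concatenate([[0.0, 1.0], np.log([1.0, 1.0])])
res = minimize(neg_log_lik, theta0, args=(Q, increments), method="Nelder-Mead")
print("estimated rates:", res.x[:2], "variances:", np.exp(res.x[2:]))
```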
To reduce the computational complexity of the likelihood computation, we accelerate the numerical inverse Laplace transformation step of the procedure with function fitting.
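The acceleration idea can be sketched as follows, under the assumption that the function fitting replaces repeated numerical inversions by a fitted surrogate: invert the transform once on a coarse grid, fit a smooth function to the samples, and evaluate the cheap fit inside the likelihood loop. The transform F below is purely an illustrative example.

```python
# Sketch of the acceleration idea: the likelihood repeatedly needs
# f(t) = L^{-1}{F}(t); invert numerically on a coarse grid once, fit a
# cubic spline, and evaluate the surrogate thereafter.
import numpy as np
from mpmath import invertlaplace
from scipy.interpolate import CubicSpline

def F(p):
    return 1.0 / (p + 1.0) ** 2      # example transform; f(t) = t * exp(-t)

t_grid = np.linspace(0.05, 5.0, 25)  # coarse grid: 25 expensive inversions
f_grid = [float(invertlaplace(F, t, method="talbot")) for t in t_grid]

f_fit = CubicSpline(t_grid, f_grid)  # cheap surrogate for f(t)

# Inside the likelihood loop, evaluate the fit instead of re-inverting:
t_query = np.linspace(0.1, 4.9, 1000)
max_err = np.max(np.abs(f_fit(t_query) - t_query * np.exp(-t_query)))
print("max abs error of the fitted surrogate:", max_err)
```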
The popularity of machine learning technologies and frameworks has led to an increasingly large number of machine learning workloads running on high-performance computing (HPC) clusters. ML workflows are being adopted in diverse computational fields such as biology, physics, materials science, and computer science. The I/O behavior of these emerging ML workloads differs markedly from that of traditional HPC workloads, such as simulation or checkpoint/restart I/O. ML workloads have also pushed for the use of GPUs, or combinations of CPUs and GPUs, in addition to CPUs alone, for computational tasks. The diverse and complex I/O behavior of ML workloads requires extensive study and is critical for the efficient performance of the various layers of the I/O stack and of HPC workloads overall. This work aims to fill the gap in understanding the I/O behavior of emerging ML workloads by providing an in-depth analysis of ML jobs running on large-scale leadership HPC systems. In particular, we analyze job behavior by job scale, science domain, and the processing units used. The analysis was performed on 23,000 ML jobs collected from one year of Darshan logs on Summit, one of the fastest supercomputers. We also collect the CPU and GPU usage of 15,165 ML jobs by merging the Darshan dataset with the power usage of the processing units on Summit. This paper thus provides a systematic I/O characterization of ML workloads on a leadership-scale HPC machine, showing how I/O behavior differs across science domains, workload scales, and processing units, and analyzes the usage of the parallel file system and burst buffer by ML I/O workloads. We make several observations regarding I/O performance and access patterns through various analytical studies and discuss the important lessons learned from the perspectives of an ML user and a storage architect for emerging ML workloads running on large-scale supercomputers.
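A schematic of the dataset merge described above, with hypothetical column names and toy values (the real Darshan-derived fields and power telemetry differ): per-job I/O statistics are joined with per-job power draw, and GPU power is used as a simple heuristic to classify the processing unit.

```python
# Illustrative join of two hypothetical per-job tables: Darshan-derived I/O
# statistics and power telemetry. Column names are assumptions, not the
# actual schema used in the study.
import pandas as pd

darshan_jobs = pd.DataFrame({
    "job_id":        [101, 102, 103],
    "bytes_read":    [4.0e9, 1.2e10, 7.5e8],
    "bytes_written": [2.0e9, 3.0e9, 9.0e8],
})
power = pd.DataFrame({
    "job_id":      [101, 102, 103],
    "cpu_power_w": [240.0, 180.0, 260.0],
    "gpu_power_w": [0.0, 1100.0, 950.0],
})

jobs = darshan_jobs.merge(power, on="job_id", how="inner")

# Simple heuristic: a job that drew GPU power is counted as a GPU job.
jobs["unit"] = jobs["gpu_power_w"].gt(0).map({True: "gpu", False: "cpu"})
print(jobs.groupby("unit")[["bytes_read", "bytes_written"]].sum())
```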
Distributed scheduling algorithms based on carrier sense multiple access (CSMA) are optimal in terms of throughput and steady-state queue lengths. However, they take a prohibitively long time to reach the steady state, often exponential in the network size. Therefore, for large networks that operate over a finite time horizon, performance guarantees on the short-term (i.e., transient) queueing behaviour are required in addition to guarantees on the steady-state queue lengths. To that end, we propose distributed scheduling algorithms that are guaranteed to have O(1) expected queue lengths not just in the steady state but at every time instant, where the O(·) is with respect to the network size. Further, our algorithms have O(1) complexity and support a constant fraction of the maximum throughput for typical wireless topologies. The central idea of our algorithms is to resolve collisions among pairs of conflicting nodes by assigning a master–follower hierarchy. The master–follower hierarchy can either be chosen randomly or based on the topology of the conflict graph, leading to different performance guarantees.
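A toy rendering of the master–follower idea (the random-hierarchy variant), purely for illustration and not the paper's exact algorithm: each conflict-graph edge is randomly oriented to name a master, and when two conflicting nodes attempt in the same slot, the follower yields.

```python
# Toy slotted simulation: pairwise collisions on a conflict graph are
# resolved by a randomly chosen master-follower hierarchy (the follower
# backs off whenever its master also attempts).
import random

def simulate(conflict_edges, n_nodes, p_attempt=0.4, slots=10_000, seed=1):
    rng = random.Random(seed)
    # Edges are (u, v) with u < v; master[e] is the master endpoint of e.
    master = {e: rng.choice(e) for e in conflict_edges}
    neighbours = {u: set() for u in range(n_nodes)}
    for u, v in conflict_edges:
        neighbours[u].add(v); neighbours[v].add(u)

    successes = [0] * n_nodes
    for _ in range(slots):
        attempts = {u for u in range(n_nodes) if rng.random() < p_attempt}
        for u in attempts:
            # u transmits iff it is the master of every attempted conflict.
            rivals = neighbours[u] & attempts
            if all(master[(min(u, v), max(u, v))] == u for v in rivals):
                successes[u] += 1
    return [s / slots for s in successes]

# 4-cycle conflict graph: 0-1-2-3-0.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(simulate(edges, 4))
```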
In addition to these hierarchical collision resolution algorithms, which are primarily designed for the conflict graph-based interference model, we also propose an Aloha-based algorithm for the k-neighbour collision tolerance interference model, a generalization of the conflict graph model. We show that the proposed Aloha-based algorithm supports a constant fraction of the maximum throughput for typical wireless topologies.
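A companion sketch for the collision tolerance model, under the assumption (the tolerance parameter's symbol was lost in the text above, so k is used here) that an Aloha attempt succeeds whenever at most k conflicting neighbours attempt in the same slot; k = 0 recovers the plain conflict-graph model.

```python
# Toy Aloha under an assumed k-neighbour collision tolerance semantics:
# every node attempts independently with probability p, and an attempt
# succeeds if at most k of its conflicting neighbours also attempt.
import random

def aloha_throughput(neighbours, k=1, p=0.3, slots=10_000, seed=2):
    rng = random.Random(seed)
    nodes = list(neighbours)
    successes = {u: 0 for u in nodes}
    for _ in range(slots):
        attempts = {u for u in nodes if rng.random() < p}
        for u in attempts:
            if len(neighbours[u] & attempts) <= k:   # tolerate up to k rivals
                successes[u] += 1
    return {u: successes[u] / slots for u in nodes}

# Same 4-cycle topology as in the previous sketch.
nbrs = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(aloha_throughput(nbrs, k=1))
```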
IoT networks carry incoming packets from large numbers of IoT Devices (IoTDs) to IoT Gateways. This can lead to the IoT Massive Access Problem, which causes buffer overflow, large end-to-end delays, and missed deadlines. This paper analyzes a novel traffic shaping method named the Quasi-Deterministic Traffic Policy (QDTP) that mitigates this problem by shaping the incoming traffic without increasing the end-to-end delay or dropping packets. Using queueing-theoretic techniques and extensive data-driven simulations with real IoT datasets, QDTP is shown to considerably reduce congestion at the Gateway and significantly improve the IoT network's overall performance.
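A minimal sketch of a quasi-deterministic shaper of this kind, assuming the standard rule that packet i is released at max(a_i, d_{i-1} + T), i.e., departures are spaced at least T apart without reordering or dropping packets; the spacing T and the Poisson arrival process below are illustrative choices, not parameters taken from the paper.

```python
# Quasi-deterministic shaping sketch: enforce a minimum inter-departure
# spacing T on the packet stream, preserving order and dropping nothing.
import random

def qdtp_release_times(arrivals, T):
    releases, last = [], float("-inf")
    for a in arrivals:
        last = max(a, last + T)        # release when spacing T has elapsed
        releases.append(last)
    return releases

# Toy usage: Poisson arrivals at rate 1, spacing T = 0.8 (load 0.8).
rng = random.Random(3)
t, arrivals = 0.0, []
for _ in range(10_000):
    t += rng.expovariate(1.0)
    arrivals.append(t)

releases = qdtp_release_times(arrivals, T=0.8)
shaping_delay = sum(r - a for r, a in zip(releases, arrivals)) / len(arrivals)
print("mean delay added by the shaper:", shaping_delay)
```

The delay added at the shaper is the quantity traded against the queueing delay avoided downstream at the Gateway.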
The Join-the-Shortest-Queue (JSQ) load-balancing scheme is known to minimise the average delay of jobs in homogeneous systems consisting of identical servers. However, it performs poorly in heterogeneous systems where servers have different processing rates. Finding a delay-optimal scheme remains an open problem for heterogeneous systems. In this paper, we consider a speed-aware version of the JSQ scheme for heterogeneous systems and show that it achieves delay optimality in the fluid limit. One of the key issues in establishing this optimality result for heterogeneous systems is to show that the sequence of steady-state distributions indexed by the system size is tight in an appropriately defined space. The usual technique for showing tightness, coupling with a suitably defined dominant system, does not work for heterogeneous systems. To prove tightness, we devise a new technique that uses the drift of exponential Lyapunov functions. Using the non-negativity of the drift, we show that the stationary queue length distribution has an exponentially decaying tail, a fact we use to prove tightness. Another technical difficulty arises due to the complexity of the underlying state space and the separation of two time scales in the fluid limit. Due to these factors, the fluid limit turns out to be a function of the invariant distribution of a multi-dimensional Markov chain, which is hard to characterise. Using some properties of this invariant distribution and the monotonicity of the system, we show that the fluid limit has a unique and globally attractive fixed point.
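One natural reading of a speed-aware JSQ rule, stated here as an assumption rather than the paper's exact policy, is to route each arrival to the server minimizing (q_i + 1)/mu_i, a naive estimate of the new job's completion time. The toy event-driven simulation below implements that rule for exponential services.

```python
# Toy event-driven simulation of an assumed speed-aware JSQ rule:
# dispatch each arrival to argmin_i (q_i + 1) / mu_i.
import random

def speed_aware_jsq(mus, lam, n_arrivals=100_000, seed=4):
    rng = random.Random(seed)
    n = len(mus)
    q = [0] * n
    next_dep = [float("inf")] * n          # next departure epoch per server
    t, area, last_event = 0.0, 0.0, 0.0    # area tracks time-integrated queue
    next_arr = rng.expovariate(lam)
    served = 0
    while served < n_arrivals:
        i = min(range(n), key=lambda s: next_dep[s])
        if next_arr <= next_dep[i]:        # arrival happens first
            t = next_arr
            area += sum(q) * (t - last_event); last_event = t
            j = min(range(n), key=lambda s: (q[s] + 1) / mus[s])
            q[j] += 1
            if q[j] == 1:
                next_dep[j] = t + rng.expovariate(mus[j])
            next_arr = t + rng.expovariate(lam)
        else:                              # departure at server i
            t = next_dep[i]
            area += sum(q) * (t - last_event); last_event = t
            q[i] -= 1; served += 1
            next_dep[i] = t + rng.expovariate(mus[i]) if q[i] else float("inf")
    return area / t                        # time-average number in system

# Two fast and two slow servers (total rate 3), arrival rate 2.4 (load 0.8).
print("avg jobs in system:", speed_aware_jsq([1.0, 1.0, 0.5, 0.5], lam=2.4))
```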
Magnetic tape provides a cost-effective way to retain the exponentially increasing volumes of data being produced. The low cost per gigabyte and the low energy consumption render tape a preferred option over hard disk drives and flash for infrequently accessed data. Assessing the performance of tape library systems is central to achieving appropriate storage provisioning and dimensioning. Performance is affected by the number and the operational characteristics of the tape drives and the robotic arms, and the mount and unmount policies deployed. In this paper, we develop a novel analytical model that accurately captures the principal aspects of tape library operation. Several relevant performance measures including the mean waiting time and the mount/unmount rates are derived. The model provides useful insights into the behavior of the tape libraries and yields results that enable a better understanding of the design tradeoffs. The validity of the model developed is confirmed by demonstrating a good agreement of the predicted performance with that obtained by simulation across various configurations. To mitigate the burden on the robotic mechanism, a scheme of accumulating multiple requests before sending them to the tape library is proposed and studied.
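The accumulation scheme can be illustrated with a toy simulation, under assumed Poisson arrivals and a fixed per-mount robot time (both illustrative assumptions): requests are buffered until a batch of size B has formed, and the batch is then dispatched to the single robotic arm, which serves the mounts one at a time.

```python
# Toy model of request accumulation before dispatch to the tape library:
# buffer requests into batches of size B, then serve each batch serially
# on a single robot with a fixed per-mount time.
import random

def batched_waiting_time(lam=1.0, mount_time=0.5, B=4, n_reqs=100_000, seed=5):
    rng = random.Random(seed)
    t, robot_free = 0.0, 0.0
    batch, total_wait = [], 0.0
    for _ in range(n_reqs):
        t += rng.expovariate(lam)
        batch.append(t)
        if len(batch) == B:               # batch complete: dispatch to robot
            start = max(t, robot_free)
            for k, arr in enumerate(batch):
                done = start + (k + 1) * mount_time
                total_wait += done - arr  # waiting time incl. robot service
            robot_free = start + B * mount_time
            batch = []
    return total_wait / n_reqs

for B in (1, 2, 4, 8):
    print(f"B={B}: mean request time {batched_waiting_time(B=B):.2f}")
```

Sweeping B exposes the trade-off the paper studies: larger batches reduce the load on the robotic mechanism but add accumulation delay for early requests in each batch.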