Keynote: Future Workloads Drive the Need for High Performant and Adaptive Computing Hardware
Pub Date: 2023-05-01, DOI: 10.1109/ipdps54959.2023.00074
Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding
Qi Yu, Lin Wang, Yuchong Hu, Yumeng Xu, D. Feng, Jie Fu, Xia Zhu, Zhen Yao, Wenjia Wei
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00036
Cloud storage systems commonly use erasure coding, which encodes data into stripes of blocks, as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding, which increases the stripe size, has recently been proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems. However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to a single new node, which repairs all failed blocks in a centralized way and may therefore become the bottleneck; recent multi-block repair methods simply build on pipelined single-block repair methods in an independent way, which may make surviving nodes with limited bandwidth the bottlenecks. In this paper, we first analyze the effects of both the centralized and the independent approach on multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to trade off the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing multi-block repair performance. We further extend HMBR to hierarchical network topologies and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.
{"title":"Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding","authors":"Qi Yu, Lin Wang, Yuchong Hu, Yumeng Xu, D. Feng, Jie Fu, Xia Zhu, Zhen Yao, Wenjia Wei","doi":"10.1109/IPDPS54959.2023.00036","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00036","url":null,"abstract":"Cloud storage systems have commonly used erasure coding that encodes data in stripes of blocks as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding that increases the stripe size has been recently proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems.However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to one single new node to repair all failed blocks in a centralized way, which may cause the new node to be the bottleneck; recent multi-block repair methods follow pipelined single-block repair methods and the former are simply built on the latter in an independent way, which may cause the surviving nodes with limited bandwidth to be bottlenecks.In this paper, we first analyze the effects of both centralized and independent ways on the multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to tradeoff the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing the multi-block repair performance. We further extend HMBR for hierarchical network topology and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123867330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accurate and Efficient Distributed COVID-19 Spread Prediction based on a Large-Scale Time-Varying People Mobility Graph
S. Shubha, Shohaib Mahmud, Haiying Shen, Geoffrey Fox, M. Marathe
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00016
Compared to previous epidemics, COVID-19 spreads much faster in gatherings of people. Thus, we need not only more accurate epidemic spread prediction that accounts for such gatherings but also more time-efficient prediction so that actions (e.g., allocating medical equipment) can be taken in time. Motivated by this, we analyzed a time-varying people mobility graph of the United States (US) for one year and the effectiveness of previous methods in handling time-varying graphs. We identified several factors that influence COVID-19 spread and observed that some graph changes are transient, which degrades the effectiveness of previous graph repartitioning and replication methods in distributed graph processing since they incur more time overhead than they save. Based on this analysis, we propose an accurate and time-efficient Distributed Epidemic Spread Prediction system (DESP). First, DESP incorporates the identified factors into a previous prediction model to increase prediction accuracy. Second, DESP conducts repartitioning and replication only when a graph change is stable for a certain time period (predicted using machine learning) to ensure the operation actually improves time efficiency. We conducted extensive experiments on Amazon AWS based on real people-movement datasets. Experimental results show DESP reduces communication time by up to 52% while enhancing accuracy by up to 24% compared to existing methods.
{"title":"Accurate and Efficient Distributed COVID-19 Spread Prediction based on a Large-Scale Time-Varying People Mobility Graph","authors":"S. Shubha, Shohaib Mahmud, Haiying Shen, Geoffrey Fox, M. Marathe","doi":"10.1109/IPDPS54959.2023.00016","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00016","url":null,"abstract":"Compared to previous epidemics, COVID-19 spreads much faster in people gatherings. Thus, we need not only more accurate epidemic spread prediction considering the people gatherings but also more time-efficient prediction for taking actions (e.g., allocating medical equipments) in time. Motivated by this, we analyzed a time-varying people mobility graph of the United States (US) for one year and the effectiveness of previous methods in handling time-varying graphs. We identified several factors that influence COVID-19 spread and observed that some graph changes are transient, which degrades the effectiveness of the previous graph repartitioning and replication methods in distributed graph processing since they generate more time overhead than saved time. Based on the analysis, we propose an accurate and time-efficient Distributed Epidemic Spread Prediction system (DESP). First, DESP incorporates the factors into a previous prediction model to increase the prediction accuracy. Second, DESP conducts repartitioning and replication only when a graph change is stable for a certain time period (predicted using machine learning) to ensure the operation improves time-efficiency. We conducted extensive experiments on Amazon AWS based on real people movement datasets. Experimental results show DESP reduces communication time by up to 52%, while enhancing accuracy by up to 24% compared to existing methods.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124892544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural Network Compiler for Parallel High-Throughput Simulation of Digital Circuits
Ignacio Gavier, Joshua Russell, Devdhar Patel, E. Rietman, H. Siegelmann
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00067
Register Transfer Level (RTL) simulation and verification of Digital Circuits are extremely important and costly tasks in the Integrated Circuits industry. While some simulators exploit the parallelism in the structure of Digital Circuits to run on multi-core CPUs, the maximum throughput they achieve quickly reaches a plateau, as described by Amdahl’s Law. Recent research from Nvidia has obtained much higher throughput in simulations using GPUs, highlighting the potential of these devices for Digital Circuit simulation; however, that work required sophisticated algorithms to support GPU simulation. In addition, the unbalanced structure of real-life Digital Circuits poses difficulties for processing on multi-threaded devices. In this paper, we present a Digital Circuit compiler that utilizes Neural Networks to exploit the various forms of parallelism in RTL simulation, making use of PyTorch, a widely used Neural Network framework that facilitates simulation on GPUs. Using properties of Boolean Functions, we developed a novel algorithm that converts any Digital Circuit into a Neural Network, along with optimization techniques that help push thread computational capability to the limit. The results show three orders of magnitude higher throughput than the Verilator RTL simulator and an improvement of one order of magnitude over the state-of-the-art GPU techniques from Nvidia. We believe that the use of Neural Networks not only provides a significant improvement in simulation and verification tasks in the Integrated Circuits industry, but also opens a line of research for simulators at the logic and physical gate level.
{"title":"Neural Network Compiler for Parallel High-Throughput Simulation of Digital Circuits","authors":"Ignacio Gavier, Joshua Russell, Devdhar Patel, E. Rietman, H. Siegelmann","doi":"10.1109/IPDPS54959.2023.00067","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00067","url":null,"abstract":"Register Transfer Level (RTL) simulation and verification of Digital Circuits are extremely important and costly tasks in the Integrated Circuits industry. While some simulators have incorporated the exploitation of parallelism in the structure of Digital Circuits to run on multi-core CPUs, the maximum throughput they achieve quickly reaches a plateau, as described by Amdahl’s Law. Recent research from Nvidia has obtained much higher throughput in simulations using GPUs, highlighting the potential of these devices for Digital Circuit simulation. However, they were required to incorporate sophisticated algorithms to support GPU simulation. In addition, the unbalanced structure of real-life Digital Circuits provides difficulties for processing on multi-threaded devices. In this paper, we present a Digital Circuit compiler that utilizes Neural Networks to exploit the various parallelisms in RTL simulation, making use of PyTorch, a widely-used Neural Network framework that facilitate their simulation on GPUs. By using properties of Boolean Functions, we developed a novel algorithm that converts any Digital Circuit into a Neural Network, and optimization techniques that help in pushing the thread computational capability to the limit. The results show three orders of magnitude higher throughput than Verilator RTL simulator, an improvement of one order of magnitude compared to the state-of-the-art GPU techniques from Nvidia. We believe that the use of Neural Networks not only provides a significant improvement in simulation and verification tasks in the Integrated Circuits industry, but also opens a line of research for simulators at the logic and physical gate level.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127877602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PRF: A Fast Parallel Relaxed Flooding Algorithm for Voronoi Diagram Generation on GPU
Jue Wang, Fumihiko Ino, Jing Ke
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00077
This paper introduces a novel parallel relaxed flooding (PRF) algorithm for Voronoi diagram generation. The algorithm takes a set of reference points extracted from an image as input and assigns each GPU thread a partition of the image domain to perform parallel flooding computation. Our PRF algorithm has three advantages as follows. (1) The PRF algorithm divides an image domain into subregions for concurrent flooding computation. To achieve high parallelism, a point selection method is incorporated to remove dependencies between different subregions. (2) We exploit the sparsity of the input point data with a k-d tree. With the k-d tree data structure, the point selection step achieves high efficiency, and the amount of CPU-GPU data transfer is reduced. (3) We propose a relaxed flooding method, which achieves more accurate results and decreases memory traffic compared to the traditional flooding method. In addition to these advantages, we provide an empirical method to determine the appropriate parameter in the point selection step for high performance, given an expected error rate. We evaluated the performance of our method on multiple datasets. Compared with the state-of-the-art parallel banding algorithm, our method achieved an average speed-up of 4.6× on the randomly generated datasets with a point density of 0.01%, and 6.8× on nuclei segmentation datasets. The code of the PRF algorithm is publicly available*.
{"title":"PRF: A Fast Parallel Relaxed Flooding Algorithm for Voronoi Diagram Generation on GPU","authors":"Jue Wang, Fumihiko Ino, Jing Ke","doi":"10.1109/IPDPS54959.2023.00077","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00077","url":null,"abstract":"This paper introduces a novel parallel relaxed flooding (PRF) algorithm for Voronoi diagram generation. The algorithm takes a set of reference points extracted from an image as input and assigns each GPU thread a partition of the image domain to perform parallel flooding computation. Our PRF algorithm has three advantages as follows. (1) The PRF algorithm divides an image domain into subregions for concurrent flooding computation. To achieve high parallelism, a point selection method is incorporated to remove dependencies between different subregions. (2) We exploit the sparsity of the input point data with a k-d tree. With the k-d tree data structure, the point selection step achieves high efficiency, and the amount of CPU-GPU data transfer is reduced. (3) We propose a relaxed flooding method, which achieves more accurate results and decreases memory traffic compared to the traditional flooding method. In addition to these advantages, we provide an empirical method to determine the appropriate parameter in the point selection step for high performance, given an expected error rate. We evaluated the performance of our method on multiple datasets. Compared with the state-of-the-art parallel banding algorithm, our method achieved an average speed-up of 4.6× on the randomly generated datasets with a point density of 0.01%, and 6.8× on nuclei segmentation datasets. The code of the PRF algorithm is publicly available*.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117197025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predictive Analysis of Code Optimisations on Large-Scale Coupled CFD-Combustion Simulations using the CPX Mini-App
A. Powell, G. Mudalige
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00064
As the complexity of multi-physics simulations increases, there is a need for efficient flow of information between components. Discrete ‘coupler’ codes can abstract away this process, improving solver interoperability. One such multi-physics problem is modelling a gas turbine aero engine, where instances of rotor/stator CFD and combustion simulations are coupled. Allocating resources correctly and efficiently during production simulations is a significant challenge due to the large HPC resources required and the varying scalability of specific components, a result of differences between solver physics. In this research, we develop a coupled mini-app simulation and an accompanying performance model to help support this process. We integrate an existing Particle-In-Cell mini-app, SIMPIC, as a ‘performance proxy’ for production combustion codes in industry, into a coupled mini-app CFD simulation using the CPX mini-coupler. The bottlenecks of the workload are examined, and its performance behavior is replicated using the mini-app. A selection of optimizations is examined, allowing us to estimate the workload’s theoretical performance. The coupling of mini-apps is supported by an empirical performance model, which is then used to load-balance and predict the speedup of a full-scale compressor-combustor-turbine simulation of 1.2 billion cells, a production-representative problem size. The model is validated on 40K cores of an HPE Cray EX system, predicting the runtime of the mini-app workflow with over 75% accuracy. The combination of coupled mini-apps and the empirical model demonstrates how rapid design-space and run-time setup exploration studies can be carried out to obtain the best performance from full-scale coupled Combustion-CFD simulations.
{"title":"Predictive Analysis of Code Optimisations on Large-Scale Coupled CFD-Combustion Simulations using the CPX Mini-App","authors":"A. Powell, G. Mudalige","doi":"10.1109/IPDPS54959.2023.00064","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00064","url":null,"abstract":"As the complexity of multi-physics simulations increases, there is a need for efficient flow of information between components. Discrete ‘coupler’ codes can abstract away this process, improving solver interoperability. One such multi-physics problem is modelling a gas turbine aero engine, where instances of rotor/stator CFD and combustion simulations are coupled. Allocating resources correctly and efficiently during production simulations is a significant challenge due to the large HPC resources required and the varying scalability of specific components, a result of differences between solver physics. In this research, we develop a coupled mini-app simulation and an accompanying performance model to help support this process. We integrate an existing Particle-In-Cell mini-app, SIMPIC, as a ‘performance proxy’ for production combustion codes in industry, into a coupled mini-app CFD simulation using the CPX mini-coupler. The bottlenecks of the workload are examined, and the performance behavior are replicated using the mini-app. A selection of optimizations are examined, allowing us to estimate the workload’s theoretical performance. The coupling of mini-apps is supported by an empirical performance model which is then used to load balance and predict the speedup of a full-scale compressor-combustor-turbine simulation of 1.2Bn cells, a production representative problem size. The model is validated on 40K-cores of an HPE-Cray EX system, predicting the runtime of the mini-app work-flow with over 75% accuracy. The developed coupled mini-apps and empirical model combination demonstrates how rapid design space and run-time setup exploration studies can be carried out to obtain the best performance from full-scale Combustion-CFD coupled simulations.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134007012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keynote: Fifty Years of Parallel Programming: Ieri, Oggi, Domani or Yesterday, Today, Tomorrow
Pub Date: 2023-05-01, DOI: 10.1109/ipdps54959.2023.00010
Distributing Simplex-Shaped Nested for-Loops to Identify Carcinogenic Gene Combinations
Sajal Dash, Mohammad Alaul Haque Monil, Junqi Yin, R. Anandakrishnan, Feiyi Wang
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00101
Cancer is a leading cause of death in the US, and it results from a combination of two to nine genetic mutations. Identifying five-hit combinations responsible for several cancer types is computationally intractable even with the fastest supercomputers in the US. Iterating through the nested loops required by the process presents a simplex-shaped workload with irregular memory access patterns. Distributing this workload efficiently across thousands of GPUs poses the challenge of dividing a simplex-shaped (triangular/tetrahedral) workload into similar shapes of equal volume, and the irregular memory access patterns create imbalanced compute utilization across nodes. We developed a generalized solution for distributing a simplex-shaped workload by partially coalescing the nested for-loops, minimizing memory access overhead by efficiently utilizing limited shared memory, a dynamic scheduler, and loop tiling. For 4-hit combinations, we achieved 90-100% strong-scaling efficiency on up to 3594 V100 GPUs on the Summit supercomputer. Finally, we designed and implemented a distributed algorithm to identify 5-hit combinations for four different cancer types; the identified combinations can differentiate between cancer and normal samples with 86.59-88.79% precision and 84.42-90.91% recall. We also demonstrated the robustness of our solution by porting the code to another leadership-class computing platform, Crusher, a testbed for the fastest supercomputer, Frontier. On Crusher, we achieved 98% strong-scaling efficiency on 50 nodes (400 AMD MI250X GCDs) and demonstrated the computational readiness of Frontier for scientific applications.
{"title":"Distributing Simplex-Shaped Nested for-Loops to Identify Carcinogenic Gene Combinations","authors":"Sajal Dash, Mohammad Alaul Haque Monil, Junqi Yin, R. Anandakrishnan, Feiyi Wang","doi":"10.1109/IPDPS54959.2023.00101","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00101","url":null,"abstract":"Cancer is a leading cause of death in the US, and it results from a combination of two-nine genetic mutations. Identifying five-hit combinations responsible for several cancer types is computationally intractable even with the fastest super-computers in the USA. Iterating through nested loops required by the process presents a simplex-shaped workload with irregular memory access patterns. Distributing this workload efficiently across thousands of GPUs offers a challenge in dividing simplex-shaped (triangular/tetrahedral) workload into similar shapes with equal volume. Irregular memory access patterns create imbalanced compute utilization across nodes. We developed a generalized solution for distributing a simplex-shaped workload by partially coalescing the nested for-loops, minimizing the memory access overhead by efficiently utilizing limited shared memory, a dynamic scheduler, and loop tiling. For 4-hit combinations, we achieved a 90% − 100% strong scaling efficiency for up to 3594 V100 GPUs on the Summit supercomputer. Finally, we designed and implemented a distributed algorithm to identify 5-hit combinations for four different cancer types, and the identified combinations can differentiate between cancer and normal samples with 86.59−88.79% precision and 84.42 − 90.91% recall. We also demonstrated the robustness of our solution by porting the code to another leadership class computing platform Crusher, a testbed for the fastest supercomputer Frontier. On Crusher, we achieved 98% strong scaling efficiency on 50 nodes (400 AMD MI250X GCDs) and demonstrated the computational readiness of Frontier for scientific applications.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132681284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chic-sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints
L. Schares, A. Tantawi, P. Maniotis, Ming-Hung Chen, Claudia Misale, Seetharami R. Seelam, Hao Yu
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00050
Efficient placement of advanced HPC and AI workloads with application constraints raises challenges for resource schedulers on shared infrastructures such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched, for short). Our heuristics-based algorithm enables placement across multiple levels of a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VMs). We introduce a new and simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud and, comparing against bestFit, show that it achieves 8% lower placement failure rates and more than 40% better placement locality. Finally, to quantify the goodness of constraints-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler would return a good, but not perfect, placement.
{"title":"Chic-sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints","authors":"L. Schares, A. Tantawi, P. Maniotis, Ming-Hung Chen, Claudia Misale, Seetharami R. Seelam, Hao Yu","doi":"10.1109/IPDPS54959.2023.00050","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00050","url":null,"abstract":"Efficient placement of advanced HPC and AI workloads with application constraints is raising challenges for resource schedulers on shared infrastructures, such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched, for short). Our heuristics-based algorithm enables placement across multiple levels in a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VM). We introduce a new and simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud, and, comparing against bestFit, we show that it achieves 8% lower placement failure rates and more than 40% better placement locality. Finally, to quantify the goodness of constraints-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler would return a good, but not perfect, placement.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123581732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PAQR: Pivoting Avoiding QR factorization
Wissam M. Sid-Lakhdar, S. Cayrols, Daniel Bielich, A. Abdelfattah, P. Luszczek, M. Gates, S. Tomov, H. Johansen, David B. Williams-Young, T. Davis, J. Dongarra, H. Anzt
Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00040
The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is rank deficient. On the other hand, rank-revealing QR (RRQR) is able to produce smaller forward errors on rank-deficient matrices, but its cost is prohibitive compared to QR due to memory-inefficient operations. The aim of this paper is to propose PAQR as an alternative method for the solution of rank-deficient linear least-squares problems. It has the same (or smaller) cost as QR and is as accurate as QR with column pivoting in many practical cases. In addition to presenting the algorithm and its implementations on different hardware architectures, we compare its accuracy and performance results on a variety of application-derived problems.
{"title":"PAQR: Pivoting Avoiding QR factorization","authors":"Wissam M. Sid-Lakhdar, S. Cayrols, Daniel Bielich, A. Abdelfattah, P. Luszczek, M. Gates, S. Tomov, H. Johansen, David B. Williams-Young, T. Davis, J. Dongarra, H. Anzt","doi":"10.1109/IPDPS54959.2023.00040","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00040","url":null,"abstract":"The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is rank deficient. On the other hand, rank-revealing QR (RRQR) is able to produce smaller forward errors on rank deficient matrices, but its cost is prohibitive compared to QR due to memory-inefficient operations. The aim of this paper is to propose PAQR for the solution of rank-deficient linear least-squares problems as an alternative solution method. It has the same (or smaller) cost as QR and is as accurate as QR with column pivoting in many practical cases. In addition to presenting the algorithm and its implementations on different hardware architectures, we compare its accuracy and performance results on a variety of application-derived problems.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123723380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}