
Latest publications: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Keynote: Future Workloads Drive the Need for High Performant and Adaptive Computing Hardware
Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00074
Citations: 0
Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00036
Qi Yu, Lin Wang, Yuchong Hu, Yumeng Xu, D. Feng, Jie Fu, Xia Zhu, Zhen Yao, Wenjia Wei
Cloud storage systems commonly use erasure coding, which encodes data in stripes of blocks, as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding, which increases the stripe size, has recently been proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems. However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to a single new node to repair all failed blocks in a centralized way, which may make the new node the bottleneck; recent multi-block repair methods simply build on pipelined single-block repair methods in an independent way, which may make surviving nodes with limited bandwidth the bottlenecks. In this paper, we first analyze the effects of both the centralized and the independent approach on multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to trade off the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing multi-block repair performance. We further extend HMBR to hierarchical network topologies and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.
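The centralized-vs-independent trade-off the abstract describes can be sketched with a toy bandwidth model. This is our own simplification of the bottleneck argument, not the HMBR algorithm: all function names and formulas below are illustrative assumptions.

```python
# Toy bandwidth model for repairing f failed blocks of a (k, m) stripe.
# Illustrative only: a simplified reading of the bottleneck argument in
# the abstract, not the actual HMBR mechanism.

def centralized_repair_time(f, k, block, bw):
    """One new node downloads k blocks per failed block; its single
    ingress link (bandwidth bw) becomes the bottleneck."""
    return f * k * block / bw

def independent_pipelined_time(f, k, block, bw, helpers):
    """Each failed block is repaired through a pipeline over surviving
    'helper' nodes; pipelines contend for the helpers' egress links."""
    per_pipeline = k * block / bw       # pipelined single-block repair
    overlap = max(1, helpers // k)      # pipelines that fit without sharing
    waves = -(-f // overlap)            # ceil(f / overlap)
    return waves * per_pipeline

def hybrid_time(f, k, block, bw, helpers):
    """Split the failed blocks between the two schemes (run in parallel)
    and take the best split -- the 'hybrid' idea in the abstract."""
    best = float("inf")
    for c in range(f + 1):              # c blocks repaired centrally
        t = max(centralized_repair_time(c, k, block, bw),
                independent_pipelined_time(f - c, k, block, bw, helpers))
        best = min(best, t)
    return best
```

By construction the hybrid is never worse than either pure scheme, which is the intuition (not a result) behind combining them.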
Citations: 0
Accurate and Efficient Distributed COVID-19 Spread Prediction based on a Large-Scale Time-Varying People Mobility Graph
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00016
S. Shubha, Shohaib Mahmud, Haiying Shen, Geoffrey Fox, M. Marathe
Compared to previous epidemics, COVID-19 spreads much faster in people gatherings. Thus, we need not only more accurate epidemic spread prediction that accounts for people gatherings but also more time-efficient prediction, so that actions (e.g., allocating medical equipment) can be taken in time. Motivated by this, we analyzed a time-varying people mobility graph of the United States (US) over one year, along with the effectiveness of previous methods in handling time-varying graphs. We identified several factors that influence COVID-19 spread and observed that some graph changes are transient, which degrades the effectiveness of previous graph repartitioning and replication methods in distributed graph processing, since they generate more time overhead than the time they save. Based on this analysis, we propose an accurate and time-efficient Distributed Epidemic Spread Prediction system (DESP). First, DESP incorporates these factors into a previous prediction model to increase prediction accuracy. Second, DESP conducts repartitioning and replication only when a graph change is stable for a certain time period (predicted using machine learning), ensuring the operation improves time efficiency. We conducted extensive experiments on Amazon AWS based on real people-movement datasets. Experimental results show DESP reduces communication time by up to 52% while enhancing accuracy by up to 24% compared to existing methods.
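The "repartition only when the change is stable" idea can be sketched as a guard predicate. The streak counter below is a trivial stand-in for the machine-learning stability predictor the abstract mentions; the function name and threshold are our own illustrative assumptions.

```python
# Sketch of DESP's guard idea: only pay the cost of repartitioning and
# replication when a graph change has persisted, i.e. is not transient.
# The streak counter is a stand-in for the ML predictor in the paper.

def should_repartition(change_history, min_stable_steps=3):
    """change_history: per-snapshot flags (1 = region of the mobility
    graph changed in that snapshot). Returns True when the most recent
    change has persisted for at least `min_stable_steps` consecutive
    snapshots, so the repartitioning overhead is likely to pay off."""
    streak = 0
    for changed in reversed(change_history):
        if changed:
            streak += 1
        else:
            break
    return streak >= min_stable_steps
```

A transient one-snapshot change never triggers the expensive operation, which is exactly the failure mode the abstract attributes to earlier repartitioning methods.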
Citations: 0
Neural Network Compiler for Parallel High-Throughput Simulation of Digital Circuits
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00067
Ignacio Gavier, Joshua Russell, Devdhar Patel, E. Rietman, H. Siegelmann
Register Transfer Level (RTL) simulation and verification of Digital Circuits are extremely important and costly tasks in the Integrated Circuits industry. While some simulators have exploited the parallelism in the structure of Digital Circuits to run on multi-core CPUs, the maximum throughput they achieve quickly reaches a plateau, as described by Amdahl's Law. Recent research from Nvidia has obtained much higher simulation throughput using GPUs, highlighting the potential of these devices for Digital Circuit simulation. However, sophisticated algorithms had to be incorporated to support GPU simulation. In addition, the unbalanced structure of real-life Digital Circuits poses difficulties for processing on multi-threaded devices. In this paper, we present a Digital Circuit compiler that utilizes Neural Networks to exploit the various forms of parallelism in RTL simulation, making use of PyTorch, a widely used Neural Network framework that facilitates simulation on GPUs. Using properties of Boolean Functions, we developed a novel algorithm that converts any Digital Circuit into a Neural Network, along with optimization techniques that help push thread computational capability to the limit. The results show three orders of magnitude higher throughput than the Verilator RTL simulator, an improvement of one order of magnitude over the state-of-the-art GPU techniques from Nvidia. We believe that the use of Neural Networks not only provides a significant improvement for simulation and verification tasks in the Integrated Circuits industry, but also opens a line of research for simulators at the logic and physical gate level.
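The core premise, logic gates expressed as neural units so a tensor framework can batch-evaluate a circuit, can be illustrated with a minimal mapping. This is our own toy gate-to-neuron encoding, not the paper's conversion algorithm; NumPy stands in for PyTorch to keep the sketch small.

```python
import numpy as np

# Toy illustration of compiling logic gates into threshold neurons, the
# basic idea behind turning a Digital Circuit into a Neural Network that
# a framework like PyTorch can evaluate in parallel on a GPU. Our own
# mapping, not the algorithm from the paper.

def neuron(x, w, b):
    """Heaviside threshold unit: fires iff w . x + b > 0."""
    return (x @ w + b > 0).astype(np.int8)

def xor(a, b):
    """XOR as a 2-layer network: hidden OR and NAND, output AND.
    Inputs are 0/1 arrays, so many input vectors evaluate at once."""
    x = np.stack([a, b], axis=-1)
    hidden = np.stack([
        neuron(x, np.array([1, 1]), -0.5),    # OR gate
        neuron(x, np.array([-1, -1]), 1.5),   # NAND gate
    ], axis=-1)
    return neuron(hidden, np.array([1, 1]), -1.5)  # AND gate
```

Batching input vectors along the leading axis is what yields the throughput: one matrix operation simulates the gate for every test vector simultaneously.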
Citations: 0
PRF: A Fast Parallel Relaxed Flooding Algorithm for Voronoi Diagram Generation on GPU
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00077
Jue Wang, Fumihiko Ino, Jing Ke
This paper introduces a novel parallel relaxed flooding (PRF) algorithm for Voronoi diagram generation. The algorithm takes a set of reference points extracted from an image as input and assigns each GPU thread a partition of the image domain on which to perform parallel flooding computation. Our PRF algorithm has three advantages. (1) The PRF algorithm divides an image domain into subregions for concurrent flooding computation. To achieve high parallelism, a point selection method is incorporated to remove dependencies between different subregions. (2) We exploit the sparsity of the input point data with a k-d tree. With the k-d tree data structure, the point selection step achieves high efficiency, and the amount of CPU-GPU data transfer is reduced. (3) We propose a relaxed flooding method, which achieves more accurate results and decreases memory traffic compared to the traditional flooding method. In addition to these advantages, we provide an empirical method to determine the appropriate parameter in the point selection step for high performance, given an expected error rate. We evaluated the performance of our method on multiple datasets. Compared with the state-of-the-art parallel banding algorithm, our method achieved an average speed-up of 4.6× on randomly generated datasets with a point density of 0.01%, and 6.8× on nuclei segmentation datasets. The code of the PRF algorithm is publicly available.
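The flooding primitive that PRF parallelizes can be shown as a serial baseline: a multi-source BFS that labels each grid cell with its nearest seed. This sketch is only the baseline flooding step (with a Manhattan metric for simplicity), not PRF's relaxed, per-subregion GPU variant.

```python
from collections import deque

# Serial flooding baseline for discrete Voronoi labeling on a grid: all
# seeds flood outward simultaneously and each cell keeps the label of
# the seed that reaches it first. PRF distributes this work across GPU
# threads per subregion; here we only show the underlying primitive.

def flood_voronoi(width, height, seeds):
    """Multi-source BFS over a width x height grid. Returns a dict
    (x, y) -> index of the nearest seed (4-neighbour / Manhattan
    distance; ties resolved by BFS visit order)."""
    label = {}
    q = deque()
    for i, (sx, sy) in enumerate(seeds):
        label[(sx, sy)] = i
        q.append((sx, sy))
    while q:
        x, y = q.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in label:
                label[(nx, ny)] = label[(x, y)]   # inherit flooding seed
                q.append((nx, ny))
    return label
```

Splitting the grid into subregions and flooding them concurrently is where the dependency problem arises that PRF's point selection step resolves.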
Citations: 0
Predictive Analysis of Code Optimisations on Large-Scale Coupled CFD-Combustion Simulations using the CPX Mini-App
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00064
A. Powell, G. Mudalige
As the complexity of multi-physics simulations increases, there is a need for an efficient flow of information between components. Discrete ‘coupler’ codes can abstract away this process, improving solver interoperability. One such multi-physics problem is modelling a gas turbine aero engine, where instances of rotor/stator CFD and combustion simulations are coupled. Allocating resources correctly and efficiently during production simulations is a significant challenge due to the large HPC resources required and the varying scalability of specific components, a result of differences between solver physics. In this research, we develop a coupled mini-app simulation and an accompanying performance model to help support this process. We integrate an existing Particle-In-Cell mini-app, SIMPIC, as a ‘performance proxy’ for production combustion codes in industry into a coupled mini-app CFD simulation using the CPX mini-coupler. The bottlenecks of the workload are examined, and its performance behavior is replicated using the mini-app. A selection of optimizations is examined, allowing us to estimate the workload’s theoretical performance. The coupling of mini-apps is supported by an empirical performance model, which is then used to load balance and predict the speedup of a full-scale compressor-combustor-turbine simulation of 1.2Bn cells, a production-representative problem size. The model is validated on 40K cores of an HPE-Cray EX system, predicting the runtime of the mini-app workflow with over 75% accuracy. The combination of the developed coupled mini-apps and the empirical model demonstrates how rapid design-space and run-time setup exploration studies can be carried out to obtain the best performance from full-scale Combustion-CFD coupled simulations.
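The role an empirical performance model plays in load balancing coupled solvers can be sketched with a deliberately simple model form. The Amdahl-style fit t(N) = a + b/N, the calibration numbers, and both function names are our illustrative assumptions, not the model from the paper.

```python
# Sketch of using an empirical performance model to load balance two
# coupled components (e.g. CFD and combustion) that scale differently.
# The model form t(N) = a + b/N and all numbers are illustrative.

def fit_amdahl(samples):
    """Least-squares fit of t = a + b/N from (nodes, runtime) pairs."""
    xs = [1.0 / n for n, _ in samples]
    ys = [t for _, t in samples]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def balance_nodes(model_cfd, model_comb, total_nodes):
    """Brute-force split of nodes between the two components so that
    the slower one (which gates the coupled step) is as fast as
    possible. Returns the node count given to the CFD component."""
    predict = lambda m, n: m[0] + m[1] / n
    return min(range(1, total_nodes),
               key=lambda n: max(predict(model_cfd, n),
                                 predict(model_comb, total_nodes - n)))
```

Because coupled components advance in lockstep, minimizing the maximum of the two predicted runtimes is the natural balancing objective.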
Citations: 0
Keynote: Fifty Years of Parallel Programming: Ieri, Oggi, Domani or Yesterday, Today, Tomorrow
Pub Date : 2023-05-01 DOI: 10.1109/ipdps54959.2023.00010
Citations: 0
Distributing Simplex-Shaped Nested for-Loops to Identify Carcinogenic Gene Combinations
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00101
Sajal Dash, Mohammad Alaul Haque Monil, Junqi Yin, R. Anandakrishnan, Feiyi Wang
Cancer is a leading cause of death in the US, and it results from a combination of two to nine genetic mutations. Identifying the five-hit combinations responsible for several cancer types is computationally intractable even with the fastest supercomputers in the USA. Iterating through the nested loops required by the process presents a simplex-shaped workload with irregular memory access patterns. Distributing this workload efficiently across thousands of GPUs poses the challenge of dividing a simplex-shaped (triangular/tetrahedral) iteration space into similar shapes of equal volume. Irregular memory access patterns create imbalanced compute utilization across nodes. We developed a generalized solution for distributing a simplex-shaped workload by partially coalescing the nested for-loops, minimizing memory access overhead by efficiently utilizing the limited shared memory, a dynamic scheduler, and loop tiling. For 4-hit combinations, we achieved 90%−100% strong scaling efficiency on up to 3594 V100 GPUs on the Summit supercomputer. Finally, we designed and implemented a distributed algorithm to identify 5-hit combinations for four different cancer types; the identified combinations can differentiate between cancer and normal samples with 86.59−88.79% precision and 84.42−90.91% recall. We also demonstrated the robustness of our solution by porting the code to another leadership-class computing platform, Crusher, a testbed for the fastest supercomputer, Frontier. On Crusher, we achieved 98% strong scaling efficiency on 50 nodes (400 AMD MI250X GCDs) and demonstrated the computational readiness of Frontier for scientific applications.
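The "partially coalescing the nested for-loops" idea can be illustrated by flattening the triangular iteration space over combinations i < j < k to a single index, so equal-sized index ranges can be handed to threads. The unranking below is the standard combinatorial-number-system trick, shown as our illustration of the load-balancing problem rather than the paper's exact scheme.

```python
from math import comb

# Sketch of coalescing a simplex-shaped nested loop: iterating over all
# r-combinations (i < j < ... ) of n genes is a triangular/tetrahedral
# space, so naive per-i chunks have wildly unequal sizes. Flattening to
# a single index t in [0, C(n, r)) lets a scheduler hand out equal
# chunks. Standard lexicographic unranking, not the paper's algorithm.

def unrank_combination(t, n, r):
    """Map flat index t in [0, C(n, r)) to the t-th r-combination of
    range(n) in lexicographic order."""
    combo = []
    x = 0                                  # smallest candidate element
    for slots in range(r, 0, -1):
        # skip whole sub-simplices that start with elements < x
        while comb(n - x - 1, slots - 1) <= t:
            t -= comb(n - x - 1, slots - 1)
            x += 1
        combo.append(x)
        x += 1
    return tuple(combo)
```

Each GPU thread can then reconstruct its own (i, j, k, ...) tuples from a contiguous slice of flat indices, giving equal-volume work without communicating loop bounds.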
Citations: 0
Chic-sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00050
L. Schares, A. Tantawi, P. Maniotis, Ming-Hung Chen, Claudia Misale, Seetharami R. Seelam, Hao Yu
Efficient placement of advanced HPC and AI workloads with application constraints raises challenges for resource schedulers on shared infrastructures such as the Cloud. In this work, we propose a novel Constraints- and Heuristics-based scheduler on HIerarchical Topologies for High-Performance Computing workloads in the Cloud (chic-sched for short). Our heuristics-based algorithm enables placement across multiple levels of a network hierarchy with loosely specified constraints, and it works without retries by providing suboptimal placements to minimize placement failures. This allows for fast scheduling at scale, and the O(N log N) complexity enables placement decisions within tens of milliseconds for groups of hundreds of virtual machines (VMs). We introduce a new, simple metric to quantify the goodness of group placements. With this metric, in terms of deviation from ideal placements, we show that chic-sched is 20-50% better than the common bestFit or worstFit algorithms in all scenarios of two-level placements with spreading and packing constraints. We evaluate chic-sched with publicly available VM-request traces from a production Cloud and, comparing against bestFit, show that it achieves an 8% lower placement failure rate and more than 40% better placement locality. Finally, to quantify the goodness of constraint-based placements, we conduct experiments with a realistic MPI workload on synthetically allocated VM clusters in a public cloud. We measure a 9% performance improvement over an adverse placement in a scenario where our heuristics-based scheduler returns a good, but not perfect, placement.
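The "suboptimal placement instead of a retry/failure" behaviour can be sketched for one hierarchy level: place a VM group across racks under a spread cap, and relax the cap only when the strict placement cannot fit. The cap-relaxation fallback and all names here are our own illustrative choices, not chic-sched's actual policy.

```python
# Toy placement-group heuristic in the spirit of the abstract: satisfy a
# loosely specified "spread" constraint (at most max_per_rack VMs per
# rack) when possible, and degrade gracefully to a feasible suboptimal
# placement rather than failing and retrying. Illustrative only.

def place_group(free_slots, group, max_per_rack):
    """free_slots: rack name -> free VM slots.
    Returns a dict rack -> VMs placed, or None if the group cannot fit
    at all. The per-rack cap is relaxed one step at a time, so a strict
    placement is returned whenever one exists."""
    for cap in range(max_per_rack, group + 1):
        placement, remaining = {}, group
        # spread: fill racks with the most free capacity first
        for rack in sorted(free_slots, key=free_slots.get, reverse=True):
            take = min(cap, free_slots[rack], remaining)
            if take:
                placement[rack] = take
                remaining -= take
        if remaining == 0:
            return placement    # cap == max_per_rack => strict placement
    return None                 # hard failure: not enough slots anywhere
```

Scanning caps from strict to loose is what makes the heuristic retry-free: a single pass either honours the constraint or returns the least-relaxed feasible alternative.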
Citations: 0
PAQR: Pivoting Avoiding QR factorization
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00040
Wissam M. Sid-Lakhdar, S. Cayrols, Daniel Bielich, A. Abdelfattah, P. Luszczek, M. Gates, S. Tomov, H. Johansen, David B. Williams-Young, T. Davis, J. Dongarra, H. Anzt
The solution of linear least-squares problems is at the heart of many scientific and engineering applications. While any method able to minimize the backward error of such problems is considered numerically stable, the theory states that the forward error depends on the condition number of the matrix in the system of equations. On the one hand, the QR factorization is an efficient method to solve such problems, but the solutions it produces may have large forward errors when the matrix is rank deficient. On the other hand, rank-revealing QR (RRQR) is able to produce smaller forward errors on rank deficient matrices, but its cost is prohibitive compared to QR due to memory-inefficient operations. The aim of this paper is to propose PAQR for the solution of rank-deficient linear least-squares problems as an alternative solution method. It has the same (or smaller) cost as QR and is as accurate as QR with column pivoting in many practical cases. In addition to presenting the algorithm and its implementations on different hardware architectures, we compare its accuracy and performance results on a variety of application-derived problems.
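The background step the abstract relies on, solving a least-squares problem through an unpivoted QR factorization, can be sketched as follows. This is a minimal illustration of that baseline on a well-conditioned matrix, not an implementation of PAQR itself; the matrix sizes and random data are our own assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's PAQR algorithm): solving
# min ||Ax - b||_2 via an unpivoted QR factorization -- the fast
# path whose cost PAQR aims to keep while matching the accuracy
# of column-pivoted QR on rank-deficient inputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))   # full column rank here
b = rng.standard_normal(20)

Q, R = np.linalg.qr(A)             # A = QR, R upper triangular (5x5)
x = np.linalg.solve(R, Q.T @ b)    # back-substitute R x = Q^T b

# Agrees with NumPy's reference least-squares solver on this
# well-conditioned problem.
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, x_ref))
```

When A is rank deficient, R becomes (numerically) singular and this plain QR path can return solutions with large forward error; that is the failure mode rank-revealing pivoted QR avoids, at the memory-traffic cost the paper seeks to eliminate.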
Citations: 0