Pub Date : 2023-07-17DOI: 10.48550/arXiv.2307.08829
Andreas Irmler, Raghavendra Kanakagiri, S. Ohlmann, Edgar Solomonik, A. Grüneis
We propose an algorithm that aims at minimizing the inter-node communication volume for distributed and memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm into the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves a significantly improved performance for matrix-matrix-multiplication and tensor-contractions on up to several hundreds modern compute nodes compared to conventional implementations without using node-aware processor grids. Our implementation shows good performance when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to the discussion of the performance for matrix-matrix-multiplication, we also investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm is also able to improve the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
{"title":"Optimizing Distributed Tensor Contractions using Node-Aware Processor Grids","authors":"Andreas Irmler, Raghavendra Kanakagiri, S. Ohlmann, Edgar Solomonik, A. Grüneis","doi":"10.48550/arXiv.2307.08829","DOIUrl":"https://doi.org/10.48550/arXiv.2307.08829","url":null,"abstract":"We propose an algorithm that aims at minimizing the inter-node communication volume for distributed and memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm into the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves a significantly improved performance for matrix-matrix-multiplication and tensor-contractions on up to several hundreds modern compute nodes compared to conventional implementations without using node-aware processor grids. Our implementation shows good performance when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to the discussion of the performance for matrix-matrix-multiplication, we also investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm is also able to improve the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126766665","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-05-08DOI: 10.48550/arXiv.2305.04635
Felix Liu, A. Fredriksson, S. Markidis
Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is numerical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example where this is the case, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where numerical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our performance measurements show a performance that is on par or better than Intel MKL (up to ~26%) for a wide range of matrix bandwidths on two different Intel CPU systems.
{"title":"Parallel Cholesky Factorization for Banded Matrices using OpenMP Tasks","authors":"Felix Liu, A. Fredriksson, S. Markidis","doi":"10.48550/arXiv.2305.04635","DOIUrl":"https://doi.org/10.48550/arXiv.2305.04635","url":null,"abstract":"Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is numerical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example where this is the case, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where numerical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our performance measurements show a performance that is on par or better than Intel MKL (up to ~26%) for a wide range of matrix bandwidths on two different Intel CPU systems.","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128953894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-03-21DOI: 10.48550/arXiv.2303.11733
Karthick Panner Selvam, M. Brorsson
Deep Learning (DL) has developed to become a corner-stone in many everyday applications that we are now relying on. However, making sure that the DL model uses the underlying hardware efficiently takes a lot of effort. Knowledge about inference characteristics can help to find the right match so that enough resources are given to the model, but not too much. We have developed a DL Inference Performance Predictive Model (DIPPM) that predicts the inference latency, energy, and memory usage of a given input DL model on the NVIDIA A100 GPU. We also devised an algorithm to suggest the appropriate A100 Multi-Instance GPU profile from the output of DIPPM. We developed a methodology to convert DL models expressed in multiple frameworks to a generalized graph structure that is used in DIPPM. It means DIPPM can parse input DL models from various frameworks. Our DIPPM can be used not only helps to find suitable hardware configurations but also helps to perform rapid design-space exploration for the inference performance of a model. We constructed a graph multi-regression dataset consisting of 10,508 different DL models to train and evaluate the performance of DIPPM, and reached a resulting Mean Absolute Percentage Error (MAPE) as low as 1.9%.
{"title":"DIPPM: a Deep Learning Inference Performance Predictive Model using Graph Neural Networks","authors":"Karthick Panner Selvam, M. Brorsson","doi":"10.48550/arXiv.2303.11733","DOIUrl":"https://doi.org/10.48550/arXiv.2303.11733","url":null,"abstract":"Deep Learning (DL) has developed to become a corner-stone in many everyday applications that we are now relying on. However, making sure that the DL model uses the underlying hardware efficiently takes a lot of effort. Knowledge about inference characteristics can help to find the right match so that enough resources are given to the model, but not too much. We have developed a DL Inference Performance Predictive Model (DIPPM) that predicts the inference latency, energy, and memory usage of a given input DL model on the NVIDIA A100 GPU. We also devised an algorithm to suggest the appropriate A100 Multi-Instance GPU profile from the output of DIPPM. We developed a methodology to convert DL models expressed in multiple frameworks to a generalized graph structure that is used in DIPPM. It means DIPPM can parse input DL models from various frameworks. Our DIPPM can be used not only helps to find suitable hardware configurations but also helps to perform rapid design-space exploration for the inference performance of a model. We constructed a graph multi-regression dataset consisting of 10,508 different DL models to train and evaluate the performance of DIPPM, and reached a resulting Mean Absolute Percentage Error (MAPE) as low as 1.9%.","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130217576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2023-03-07DOI: 10.48550/arXiv.2303.03848
A. Ibrahim, Sebastian Götschel, D. Ruprecht
Parallel-in-time algorithms provide an additional layer of concurrency for the numerical integration of models based on time-dependent differential equations. Methods like Parareal, which parallelize across multiple time steps, rely on a computationally cheap and coarse integrator to propagate information forward in time, while a parallelizable expensive fine propagator provides accuracy. Typically, the coarse method is a numerical integrator using lower resolution, reduced order or a simplified model. Our paper proposes to use a physics-informed neural network (PINN) instead. We demonstrate for the Black-Scholes equation, a partial differential equation from computational finance, that Parareal with a PINN coarse propagator provides better speedup than a numerical coarse propagator. Training and evaluating a neural network are both tasks whose computing patterns are well suited for GPUs. By contrast, mesh-based algorithms with their low computational intensity struggle to perform well. We show that moving the coarse propagator PINN to a GPU while running the numerical fine propagator on the CPU further improves Parareal's single-node performance. This suggests that integrating machine learning techniques into parallel-in-time integration methods and exploiting their differences in computing patterns might offer a way to better utilize heterogeneous architectures.
{"title":"Parareal with a physics-informed neural network as coarse propagator","authors":"A. Ibrahim, Sebastian Götschel, D. Ruprecht","doi":"10.48550/arXiv.2303.03848","DOIUrl":"https://doi.org/10.48550/arXiv.2303.03848","url":null,"abstract":"Parallel-in-time algorithms provide an additional layer of concurrency for the numerical integration of models based on time-dependent differential equations. Methods like Parareal, which parallelize across multiple time steps, rely on a computationally cheap and coarse integrator to propagate information forward in time, while a parallelizable expensive fine propagator provides accuracy. Typically, the coarse method is a numerical integrator using lower resolution, reduced order or a simplified model. Our paper proposes to use a physics-informed neural network (PINN) instead. We demonstrate for the Black-Scholes equation, a partial differential equation from computational finance, that Parareal with a PINN coarse propagator provides better speedup than a numerical coarse propagator. Training and evaluating a neural network are both tasks whose computing patterns are well suited for GPUs. By contrast, mesh-based algorithms with their low computational intensity struggle to perform well. We show that moving the coarse propagator PINN to a GPU while running the numerical fine propagator on the CPU further improves Parareal's single-node performance. This suggests that integrating machine learning techniques into parallel-in-time integration methods and exploiting their differences in computing patterns might offer a way to better utilize heterogeneous architectures.","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115048897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-09-05DOI: 10.1007/978-3-031-39698-4_4
Roberto Rocco, G. Palermo
{"title":"Fault-Aware Group-Collective Communication Creation and Repair in MPI","authors":"Roberto Rocco, G. Palermo","doi":"10.1007/978-3-031-39698-4_4","DOIUrl":"https://doi.org/10.1007/978-3-031-39698-4_4","url":null,"abstract":"","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130680116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-05-05DOI: 10.1007/978-3-031-12597-3_14
Philipp Wiesner, Dominik Scheinert, Thorsten Wittkopp, L. Thamsen, O. Kao
{"title":"Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads","authors":"Philipp Wiesner, Dominik Scheinert, Thorsten Wittkopp, L. Thamsen, O. Kao","doi":"10.1007/978-3-031-12597-3_14","DOIUrl":"https://doi.org/10.1007/978-3-031-12597-3_14","url":null,"abstract":"","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127813699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2022-04-06DOI: 10.48550/arXiv.2204.02869
M. Madon, Georges Da Costa, J. Pierson
Digital technologies are becoming ubiquitous while their impact increases. A growing part of this impact happens far away from the end users, in networks or data centers, contributing to a rebound effect. A solution for a more responsible use is therefore to involve the user. As a first step in this quest, this work considers the users of a data center and characterizes their contribution to curtail the computing load for a short period of time by solely changing their job submission behavior.The contributions are: (i) an open-source plugin for the simulator Batsim to simulate users based on real data; (ii) the exploration of four types of user behaviors to curtail the load during a time window namely delaying, degrading, reconfiguring or renouncing to their job submissions. We study the impact of these behaviors on four different metrics: the energy consumed during and after the time window, the mean waiting time and the mean slowdown. We also characterize the conditions under which the involvement of users is the most beneficial.
{"title":"Characterization of different user behaviors for demand response in data centers","authors":"M. Madon, Georges Da Costa, J. Pierson","doi":"10.48550/arXiv.2204.02869","DOIUrl":"https://doi.org/10.48550/arXiv.2204.02869","url":null,"abstract":"Digital technologies are becoming ubiquitous while their impact increases. A growing part of this impact happens far away from the end users, in networks or data centers, contributing to a rebound effect. A solution for a more responsible use is therefore to involve the user. As a first step in this quest, this work considers the users of a data center and characterizes their contribution to curtail the computing load for a short period of time by solely changing their job submission behavior.The contributions are: (i) an open-source plugin for the simulator Batsim to simulate users based on real data; (ii) the exploration of four types of user behaviors to curtail the load during a time window namely delaying, degrading, reconfiguring or renouncing to their job submissions. We study the impact of these behaviors on four different metrics: the energy consumed during and after the time window, the mean waiting time and the mean slowdown. We also characterize the conditions under which the involvement of users is the most beneficial.","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124432534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2021-09-01DOI: 10.1007/978-3-030-85665-6_5
Burak Aksar, B. Schwaller, O. Aaziz, V. Leung, J. Brandt, Manuel Egele, A. Coskun
{"title":"E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems","authors":"Burak Aksar, B. Schwaller, O. Aaziz, V. Leung, J. Brandt, Manuel Egele, A. Coskun","doi":"10.1007/978-3-030-85665-6_5","DOIUrl":"https://doi.org/10.1007/978-3-030-85665-6_5","url":null,"abstract":"","PeriodicalId":383993,"journal":{"name":"European Conference on Parallel Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131869177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}