European Conference on Parallel Processing: Latest Papers


Optimizing Distributed Tensor Contractions using Node-Aware Processor Grids

European Conference on Parallel Processing

Pub Date : 2023-07-17 DOI: 10.48550/arXiv.2307.08829

Andreas Irmler, Raghavendra Kanakagiri, S. Ohlmann, Edgar Solomonik, A. Grüneis

We propose an algorithm that aims to minimize the inter-node communication volume for distributed and memory-efficient tensor contraction schemes on modern multi-core compute nodes. The key idea is to define processor grids that optimize intra-/inter-node communication volume in the employed contraction algorithms. We present an implementation of the proposed node-aware communication algorithm in the Cyclops Tensor Framework (CTF). We demonstrate that this implementation achieves significantly improved performance for matrix-matrix multiplication and tensor contractions on up to several hundred modern compute nodes compared to conventional implementations that do not use node-aware processor grids. Our implementation shows good performance when compared with existing state-of-the-art parallel matrix multiplication libraries (COSMA and ScaLAPACK). In addition to discussing the performance of matrix-matrix multiplication, we investigate the performance of our node-aware communication algorithm for tensor contractions as they occur in quantum chemical coupled-cluster methods. To this end, we employ a modified version of CTF in combination with a coupled-cluster code (Cc4s). Our findings show that the node-aware communication algorithm also improves the performance of coupled-cluster theory calculations for real-world problems running on tens to hundreds of compute nodes.
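To make the notion of a node-aware processor grid concrete, the following minimal sketch counts how many compute nodes each row and column of a 2D process grid spans. It is an illustration only, not the algorithm from the paper or any CTF API: the function name `nodes_spanned`, the row-major rank placement, and the assumption that consecutive ranks share a node are all hypothetical choices made for this example.

```python
def nodes_spanned(pr, pc, ranks_per_node):
    """Return (max nodes touched by any grid row, max nodes touched by any grid column).

    Assumptions (hypothetical, for illustration): ranks are laid out row-major
    on a pr x pc grid, and ranks_per_node consecutive ranks live on one node.
    """
    def node_of(rank):
        return rank // ranks_per_node

    rows = [len({node_of(r * pc + c) for c in range(pc)}) for r in range(pr)]
    cols = [len({node_of(r * pc + c) for r in range(pr)}) for c in range(pc)]
    return max(rows), max(cols)


if __name__ == "__main__":
    # 16 ranks, 4 ranks per node: compare three grid shapes.
    for shape in [(4, 4), (2, 8), (8, 2)]:
        print(shape, nodes_spanned(*shape, ranks_per_node=4))
```

A value of 1 means that communication along that grid dimension stays inside a single node. Under these assumptions, the grid shape alone changes how much traffic crosses node boundaries, which is the kind of effect a node-aware grid choice tries to exploit.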



Citations: 0

Parallel Cholesky Factorization for Banded Matrices using OpenMP Tasks

European Conference on Parallel Processing

Pub Date : 2023-05-08 DOI: 10.48550/arXiv.2305.04635

Felix Liu, A. Fredriksson, S. Markidis

Cholesky factorization is a widely used method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is numerical optimization, where direct methods for solving linear systems are widely used and are often a significant performance bottleneck. An example of this, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where numerical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our measurements show performance that is on par with or better than Intel MKL (up to ~26%) for a wide range of matrix bandwidths on two different Intel CPU systems.
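As a point of reference for what is being parallelized, here is a minimal serial sketch of Cholesky factorization that exploits the band structure by skipping entries outside the band. It is not the task-based OpenMP method of the paper; the function name `banded_cholesky` and the dense list-of-lists storage are assumptions chosen for readability.

```python
import math


def banded_cholesky(a, bandwidth):
    """Return lower-triangular L with A = L * L^T.

    `a` is a symmetric positive-definite matrix stored densely as a list of
    lists, with a[i][j] == 0 whenever |i - j| > bandwidth. Dense storage is an
    assumption for clarity; only entries inside the band are read or written.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        lo = max(0, j - bandwidth)  # first column inside the band of row j
        diag = a[j][j] - sum(L[j][k] ** 2 for k in range(lo, j))
        L[j][j] = math.sqrt(diag)
        # Only rows within `bandwidth` of the diagonal have nonzeros in column j.
        for i in range(j + 1, min(n, j + bandwidth + 1)):
            lo_i = max(0, i - bandwidth)
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(lo_i, j))
            L[i][j] = s / L[j][j]
    return L


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    for row in banded_cholesky(A, bandwidth=1):
        print(row)
```

The factor keeps the bandwidth of the input, so the work per column depends on the bandwidth rather than on the matrix size. Production libraries typically use packed band storage and blocked kernels instead of the dense layout assumed here.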


Citations: 0

DIPPM: a Deep Learning Inference Performance Predictive Model using Graph Neural Networks

European Conference on Parallel Processing

Pub Date : 2023-03-21 DOI: 10.48550/arXiv.2303.11733

Karthick Panner Selvam, M. Brorsson

Deep Learning (DL) has become a cornerstone of many everyday applications that we now rely on. However, making sure that the DL model uses the underlying hardware efficiently takes a lot of effort. Knowledge about inference characteristics can help to find the right match so that enough resources are given to the model, but not too much. We have developed a DL Inference Performance Predictive Model (DIPPM) that predicts the inference latency, energy, and memory usage of a given input DL model on the NVIDIA A100 GPU. We also devised an algorithm to suggest the appropriate A100 Multi-Instance GPU profile from the output of DIPPM. We developed a methodology to convert DL models expressed in multiple frameworks to a generalized graph structure that is used in DIPPM. This means that DIPPM can parse input DL models from various frameworks. DIPPM not only helps to find suitable hardware configurations but also helps to perform rapid design-space exploration for the inference performance of a model. We constructed a graph multi-regression dataset consisting of 10,508 different DL models to train and evaluate the performance of DIPPM, and reached a Mean Absolute Percentage Error (MAPE) as low as 1.9%.
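For readers unfamiliar with the reported metric, the sketch below computes a Mean Absolute Percentage Error using its standard definition. The function name `mape` and the sample latencies are made up for illustration and are not data from the paper.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Standard definition: 100/n * sum(|a_i - p_i| / |a_i|).
    Every actual value must be nonzero.
    """
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Hypothetical measured vs. predicted latencies in milliseconds.
    measured = [100.0, 200.0]
    predicted = [110.0, 190.0]
    print(mape(measured, predicted))
```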



Citations: 1

Parareal with a physics-informed neural network as coarse propagator

European Conference on Parallel Processing

Pub Date : 2023-03-07 DOI: 10.48550/arXiv.2303.03848

A. Ibrahim, Sebastian Götschel, D. Ruprecht

Parallel-in-time algorithms provide an additional layer of concurrency for the numerical integration of models based on time-dependent differential equations. Methods like Parareal, which parallelize across multiple time steps, rely on a computationally cheap and coarse integrator to propagate information forward in time, while a parallelizable expensive fine propagator provides accuracy. Typically, the coarse method is a numerical integrator using lower resolution, reduced order or a simplified model. Our paper proposes to use a physics-informed neural network (PINN) instead. We demonstrate for the Black-Scholes equation, a partial differential equation from computational finance, that Parareal with a PINN coarse propagator provides better speedup than a numerical coarse propagator. Training and evaluating a neural network are both tasks whose computing patterns are well suited for GPUs. By contrast, mesh-based algorithms with their low computational intensity struggle to perform well. We show that moving the coarse propagator PINN to a GPU while running the numerical fine propagator on the CPU further improves Parareal's single-node performance. This suggests that integrating machine learning techniques into parallel-in-time integration methods and exploiting their differences in computing patterns might offer a way to better utilize heterogeneous architectures.
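The Parareal iteration described above can be sketched for a scalar test problem. This is a minimal illustration under stated assumptions, not the paper's setup: the test equation y' = lam * y, the explicit Euler propagators, and the names `fine`, `coarse`, and `parareal` are all chosen for this example, and the coarse propagator here is a single Euler step rather than a PINN. The fine solves are written as a plain loop; in a real implementation they would run concurrently, one per time slice.

```python
def fine(y, t0, t1, lam=-1.0, steps=100):
    """Expensive propagator: many small explicit Euler steps for y' = lam * y."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        y = y + h * lam * y
    return y


def coarse(y, t0, t1, lam=-1.0):
    """Cheap propagator: one explicit Euler step over the whole slice."""
    return y + (t1 - t0) * lam * y


def parareal(y0, t_end, n_slices, n_iter):
    """Return the solution values at the slice boundaries after n_iter iterations."""
    ts = [t_end * i / n_slices for i in range(n_slices + 1)]
    # Initial guess: serial sweep with the coarse propagator.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], ts[n], ts[n + 1]))
    for _ in range(n_iter):
        # These fine solves are independent of each other (the parallel part).
        f_old = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        g_old = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        # Serial correction sweep: new coarse value plus fine-minus-coarse defect.
        new = [y0]
        for n in range(n_slices):
            g_new = coarse(new[n], ts[n], ts[n + 1])
            new.append(g_new + f_old[n] - g_old[n])
        u = new
    return u


if __name__ == "__main__":
    print(parareal(1.0, 1.0, n_slices=4, n_iter=2))
```

After as many iterations as there are time slices, the result matches a serial run of the fine propagator up to rounding. Speedup therefore depends on converging in far fewer iterations, which is why the cost and accuracy of the coarse propagator matter.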



Citations: 0

FedCML: Federated Clustering Mutual Learning with non-IID Data

European Conference on Parallel Processing

Pub Date : 2023-01-01 DOI: 10.1007/978-3-031-39698-4_42

Zekai Chen, Fuyi Wang, Shengxing Yu, Ximeng Liu, Zhiwei Zheng

Citations: 0

Fault-Aware Group-Collective Communication Creation and Repair in MPI

European Conference on Parallel Processing

Pub Date : 2022-09-05 DOI: 10.1007/978-3-031-39698-4_4

Roberto Rocco, G. Palermo

Citations: 2

Cucumber: Renewable-Aware Admission Control for Delay-Tolerant Cloud and Edge Workloads

European Conference on Parallel Processing

Pub Date : 2022-05-05 DOI: 10.1007/978-3-031-12597-3_14

Philipp Wiesner, Dominik Scheinert, Thorsten Wittkopp, L. Thamsen, O. Kao

Citations: 5

Characterization of different user behaviors for demand response in data centers

European Conference on Parallel Processing

Pub Date : 2022-04-06 DOI: 10.48550/arXiv.2204.02869

M. Madon, Georges Da Costa, J. Pierson

Digital technologies are becoming ubiquitous while their impact increases. A growing part of this impact happens far away from the end users, in networks or data centers, contributing to a rebound effect. A solution for more responsible use is therefore to involve the user. As a first step in this quest, this work considers the users of a data center and characterizes their contribution to curtailing the computing load for a short period of time solely by changing their job submission behavior. The contributions are: (i) an open-source plugin for the simulator Batsim to simulate users based on real data; (ii) the exploration of four types of user behaviors to curtail the load during a time window, namely delaying, degrading, reconfiguring, or renouncing their job submissions. We study the impact of these behaviors on four different metrics: the energy consumed during and after the time window, the mean waiting time, and the mean slowdown. We also characterize the conditions under which the involvement of users is the most beneficial.
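Two of the scheduling metrics named above can be sketched as follows. The abstract does not define them, so this assumes the definitions commonly used in job scheduling: waiting time is start minus submission, and slowdown is turnaround time divided by run time. The function name and the `(submit, start, finish)` tuple layout are hypothetical and unrelated to Batsim's actual output format.

```python
def mean_waiting_and_slowdown(jobs):
    """Return (mean waiting time, mean slowdown) for a list of jobs.

    Each job is a (submit, start, finish) tuple of times in seconds with
    submit <= start < finish. Assumed definitions:
      waiting time = start - submit
      slowdown     = (finish - submit) / (finish - start)
    """
    if not jobs:
        raise ValueError("need at least one job")
    waits = [start - submit for submit, start, _ in jobs]
    slowdowns = [(finish - submit) / (finish - start) for submit, start, finish in jobs]
    return sum(waits) / len(jobs), sum(slowdowns) / len(jobs)


if __name__ == "__main__":
    # Two hypothetical jobs submitted at t=0; the second waits for the first.
    print(mean_waiting_and_slowdown([(0, 0, 10), (0, 10, 20)]))
```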


Citations: 0

Deterministic Parallel Hypergraph Partitioning

European Conference on Parallel Processing

Pub Date : 2021-12-23 DOI: 10.1007/978-3-031-12597-3_19

Lars Gottesbüren, M. Hamann

Citations: 5

E2EWatch: An End-to-End Anomaly Diagnosis Framework for Production HPC Systems

European Conference on Parallel Processing

Pub Date : 2021-09-01 DOI: 10.1007/978-3-030-85665-6_5

Burak Aksar, B. Schwaller, O. Aaziz, V. Leung, J. Brandt, Manuel Egele, A. Coskun

Citations: 5
