Latest publications: 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

DeepThermo: Deep Learning Accelerated Parallel Monte Carlo Sampling for Thermodynamics Evaluation of High Entropy Alloys
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00041
Junqi Yin, Feiyi Wang, M. Shankar
Since the introduction of Metropolis Monte Carlo (MC) sampling, it and its variants have become standard tools used for thermodynamics evaluations of physical systems. However, a long-standing problem that hinders the effectiveness and efficiency of MC sampling is the lack of a generic method (a.k.a. MC proposal) to update the system configurations. Consequently, current practices are not scalable. Here we propose a parallel MC sampling framework for thermodynamics evaluation, DeepThermo. By using deep learning–based MC proposals that can globally update the system configurations, we show that DeepThermo can effectively evaluate the phase transition behaviors of high entropy alloys, which have an astronomical configuration space. For the first time, we directly evaluate a density of states spanning a range of ~e^10,000 for a real material. We also demonstrate DeepThermo's performance and scalability up to 3,000 GPUs on both NVIDIA V100 and AMD MI250X-based supercomputers.
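The Metropolis accept/reject rule at the core of such a framework can be sketched in a few lines. This is a minimal single-process illustration: the site-swap proposal and toy pair-counting energy below are hypothetical stand-ins for DeepThermo's learned, global proposals and real alloy Hamiltonians.

```python
import math
import random

def metropolis_step(config, energy_fn, propose, beta, rng):
    """One Metropolis MC step: propose a new configuration and
    accept it with the Boltzmann probability min(1, exp(-beta*dE))."""
    new_config = propose(config, rng)
    d_e = energy_fn(new_config) - energy_fn(config)
    if d_e <= 0 or rng.random() < math.exp(-beta * d_e):
        return new_config, True
    return config, False

def swap_proposal(config, rng):
    """Placeholder local proposal: swap the species on two lattice
    sites (a learned proposal would instead update the whole
    configuration globally)."""
    i, j = rng.sample(range(len(config)), 2)
    new = list(config)
    new[i], new[j] = new[j], new[i]
    return new

def energy(config):
    """Toy energy for a binary alloy on a 1-D 'lattice': count
    unlike nearest-neighbour pairs."""
    return sum(a != b for a, b in zip(config, config[1:]))

rng = random.Random(0)
config = [0, 1] * 8  # 16-site alternating binary configuration
for _ in range(100):
    config, _ = metropolis_step(config, energy, swap_proposal, beta=2.0, rng=rng)
```

Note that the swap proposal conserves composition, so the sampler explores only configurations with the same species counts, which is the usual setting for alloy thermodynamics.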
Citations: 0
QoS-Aware and Cost-Efficient Dynamic Resource Allocation for Serverless ML Workflows
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00093
Hao Wu, Junxiao Deng, Haoqiang Fan, Shadi Ibrahim, Song Wu, Hai Jin
Machine Learning (ML) workflows are increasingly deployed on serverless computing platforms to benefit from their elasticity and fine-grain pricing. Proper resource allocation is crucial to achieve fast and cost-efficient execution of serverless ML workflows (especially for hyperparameter tuning and model training). Unfortunately, existing resource allocation methods are static, treat functions equally, and rely on offline prediction, which limits their efficiency. In this paper, we introduce CE-scaling, a Cost-Efficient autoscaling framework for serverless ML workflows. During hyperparameter tuning, CE-scaling partitions resources across stages according to their exact usage to minimize resource waste. Moreover, it incorporates an online prediction method to dynamically adjust resources during model training. We implement and evaluate CE-scaling on AWS Lambda using various ML models. Evaluation results show that compared to state-of-the-art static resource allocation methods, CE-scaling can reduce the job completion time and the monetary cost by up to 63% and 41% for hyperparameter tuning, respectively, and by up to 58% and 38% for model training.
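The idea of partitioning resources across workflow stages "according to their exact usage" can be sketched as a usage-proportional split. The stage names, memory figures, and proportional policy below are illustrative assumptions, not CE-scaling's actual algorithm:

```python
def partition_resources(stage_usage_mb, total_mb):
    """Toy usage-proportional partitioning: give each stage a share of
    the platform memory budget proportional to its measured usage."""
    total_usage = sum(stage_usage_mb.values())
    return {stage: round(total_mb * usage / total_usage)
            for stage, usage in stage_usage_mb.items()}

# Hypothetical per-stage memory usage for one tuning trial.
alloc = partition_resources(
    {"preprocess": 256, "train": 1024, "eval": 256}, total_mb=3008)
```

A real system would refresh these measurements online (as the abstract describes for model training) rather than fixing them up front.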
Citations: 0
Distributed Sparse Random Projection Trees for Constructing K-Nearest Neighbor Graphs
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00014
Isuru Ranawaka, Md. Khaledur Rahman, A. Azad
A random projection tree that partitions data points by projecting them onto random vectors is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm for constructing sparse random projection trees and then running a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph constructions to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://github.com/HipGraph/DRPT.
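The basic random projection tree construction that DRPT builds on can be sketched as follows. This is a dense, single-process sketch for intuition; DRPT itself uses sparse random vectors, sparse matrix operations, and distributed memory:

```python
import random

def build_rp_tree(points, leaf_size, rng):
    """Sketch of a random projection tree: project all points onto a
    random Gaussian direction and split at the median projection.
    Leaves are lists of points; internal nodes are (left, right) tuples."""
    if len(points) <= leaf_size:
        return points  # leaf: members are KNN candidates for each other
    dim = len(points[0])
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    proj = [sum(p[d] * direction[d] for d in range(dim)) for p in points]
    median = sorted(proj)[len(proj) // 2]
    left = [p for p, v in zip(points, proj) if v < median]
    right = [p for p, v in zip(points, proj) if v >= median]
    if not left or not right:  # degenerate split: stop recursing
        return points
    return (build_rp_tree(left, leaf_size, rng),
            build_rp_tree(right, leaf_size, rng))

rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(32)]
tree = build_rp_tree(pts, leaf_size=4, rng=rng)
```

An approximate KNN graph is then formed by connecting points that land in the same leaf, typically over a forest of such trees to boost recall.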
Citations: 0
Mimir: Extending I/O Interfaces to Express User Intent for Complex Workloads in HPC
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00027
H. Devarajan, K. Mohror
The complexity of data management in HPC systems stems from the diversity in I/O behavior exhibited by new workloads, multistage workflows, and the presence of multitiered storage systems. This complexity is managed by the storage systems, which provide user-level configurations to allow the tuning of workload I/O within the system. However, these configurations are difficult to set by users who lack expertise in I/O subsystems. We propose a paradigm change in which users specify the intent of I/O operations and storage systems automatically set various configurations based on the supplied intent. To this end, we developed the Mimir infrastructure to assist users in passing I/O intent to the underlying storage system. We demonstrate several use cases that map user-defined intents to storage configurations that lead to optimized I/O. In this study, we make three observations. First, I/O intents should be applied to each level of the I/O storage stack, from HDF5 to MPI-IO to POSIX, and integrated using lightweight adaptors in the existing stack. Second, the Mimir infrastructure supports up to 400M Ops/sec throughput of intents in the system, with a low memory overhead of 6.85KB per node. Third, intents assist in configuring a hierarchical cache to preload I/O, buffer in a node-local device, and store data in a global cache, optimizing I/O workloads by 2.33×, 4×, and 2.1×, respectively. By using automatically derived I/O intents, Mimir improves I/O performance for complex large-scale workflows by up to 4× on the Lassen supercomputer.
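The core idea, declared intents mapped to storage configuration knobs, can be sketched as a rule table plus a merge step. The intent names and configuration keys below are hypothetical illustrations, not Mimir's actual interface, which spans the HDF5/MPI-IO/POSIX layers:

```python
# Hypothetical intent -> configuration rules for illustration only.
INTENT_RULES = {
    "write_once_read_many": {"cache": "global", "prefetch": True},
    "checkpoint": {"cache": "node_local", "prefetch": False},
    "streaming_read": {"cache": "hierarchical", "prefetch": True},
}

def configure_io(intents):
    """Merge the configurations implied by each declared intent;
    later intents override earlier ones on conflicting keys."""
    config = {}
    for intent in intents:
        config.update(INTENT_RULES[intent])
    return config
```

The point of the paradigm is that users state *what* the I/O is for, and a mapping like this chooses *how* each storage tier is configured.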
Citations: 0
H-Cache: Traffic-Aware Hybrid Rule-Caching in Software-Defined Networks
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00017
Zeyu Luan, Qing Li, Yi Wang, Yong Jiang
Ternary Content Addressable Memory (TCAM) is an essential hardware component in SDN-enabled switches, which supports fast lookup speed and flexible matching patterns. However, TCAM's limited storage capacity has long been a scalability challenge to enforce fine-grained forwarding policies in SDN. Based on the observation of traffic locality, the rule-caching mechanism employs a combination of TCAM and Random Access Memory (RAM) to maintain the forwarding rules of large and small flows, respectively. However, previous works cannot identify large flows in a timely and accurate manner, and suffer from high computational complexity when addressing rule dependencies in TCAM. Worse still, TCAM only caches the forwarding rules of large flows but ignores the latency requirements of small flows. Small flows encounter cache misses in TCAM and are then diverted to RAM, where they have to experience slow lookup processes. To jointly optimize the performance of both high-throughput large flows and latency-sensitive small flows, we propose a hybrid rule-caching framework, H-Cache, to scale traffic-aware forwarding policies in SDN. H-Cache identifies large flows through a collaboration of learning-based and threshold-based methods to achieve early detection and high accuracy, and proposes a time-efficient greedy heuristic to address rule dependencies. For small flows, H-Cache establishes default paths in TCAM to speed up their lookup processes, and also reduces their TCAM occupancy through label switching and region partitioning. Experiments with both real-world and synthetic datasets demonstrate that H-Cache increases TCAM utilization by an average of 11% and reduces the average completion time of small flows by almost 70%.
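The threshold-based half of the large-flow detection can be sketched as a byte counter per flow with promotion into TCAM once a threshold is crossed. The class, slot count, and byte threshold below are toy assumptions; H-Cache additionally uses a learned classifier and handles rule dependencies, both omitted here:

```python
from collections import defaultdict

class HybridRuleCache:
    """Toy sketch of threshold-based large-flow promotion: flows whose
    cumulative bytes cross a threshold get their rule cached in TCAM
    (fast path); all other lookups fall back to RAM (slow path)."""

    def __init__(self, tcam_slots, byte_threshold):
        self.tcam = set()              # flow ids whose rules are in TCAM
        self.tcam_slots = tcam_slots   # capacity limit of the TCAM
        self.byte_threshold = byte_threshold
        self.counters = defaultdict(int)

    def on_packet(self, flow_id, nbytes):
        """Account for a packet and report where its lookup is served."""
        self.counters[flow_id] += nbytes
        if (flow_id not in self.tcam
                and self.counters[flow_id] >= self.byte_threshold
                and len(self.tcam) < self.tcam_slots):
            self.tcam.add(flow_id)     # promote the detected large flow
        return "TCAM" if flow_id in self.tcam else "RAM"

cache = HybridRuleCache(tcam_slots=2, byte_threshold=1000)
```

A pure threshold detector like this reacts late by construction, which is exactly why the paper pairs it with a learning-based early detector.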
Citations: 0
Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc*
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00070
Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda
MPI Neighborhood collectives are used for non-traditional collective operations involving uneven distribution of communication amongst processes, such as sparse communication patterns. They provide flexibility to define the communication pattern involved when a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, used extensively by scientific applications to provide scalable solutions through routines modeled by partial differential equations, utilizes neighborhood communication patterns to define various structures and routines. We propose GPU-aware MPI Neighborhood collective operations with support for AMD and NVIDIA GPU backends and propose optimized designs to provide scalable performance for various communication routines. We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, scattering from a sequential vector to a parallel vector, and scattering from a parallel vector to a sequential vector using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with InfiniBand networking, demonstrating 30.90% improvement against a GPU implementation utilizing CPU-staging techniques, and 8.25% improvement against GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with Slingshot networking and present 39.52% improvement against the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and 33.25% improvement against a GPU-aware point-to-point implementation of the routine.
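The data movement of a neighborhood alltoallv over a star forest can be illustrated without MPI: each rank sends a distinct buffer to each out-neighbor of a directed graph. This pure-Python simulation only shows the communication pattern; the actual designs issue nonblocking MPI neighborhood collectives on GPU buffers:

```python
def neighborhood_alltoallv(graph, sendbufs):
    """Simulate a neighborhood alltoallv on a directed graph.
    graph[r] lists rank r's out-neighbours; sendbufs[r][d] is the
    payload rank r sends to destination d. Returns recvbufs, where
    recvbufs[r][s] is what rank r received from source s."""
    recvbufs = {r: {} for r in graph}
    for src, dests in graph.items():
        for dest in dests:
            recvbufs[dest][src] = sendbufs[src][dest]
    return recvbufs

# A star forest: root rank 0 scatters distinct payloads to leaves 1 and 2.
graph = {0: [1, 2], 1: [], 2: []}
sendbufs = {0: {1: [10], 2: [20]}, 1: {}, 2: {}}
recv = neighborhood_alltoallv(graph, sendbufs)
```

In MPI terms, the per-destination payload lists here correspond to the sendcounts/displacements arrays of `MPI_Ineighbor_alltoallv` over a distributed graph communicator.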
Citations: 0
A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00059
Yufan Xia, M. D. L. Pierre, A. Barnard, Giuseppe M. J. Barca
The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS for GEMM invocations with memory usage within 100 MB.
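The selection problem can be illustrated with a deliberately simple stand-in for the learned model: look up the nearest measured GEMM shape and return its fastest thread count. The measurement table and nearest-by-FLOP-count heuristic are hypothetical, not ADSALA's actual model:

```python
# Hypothetical measurements: (m, n, k) GEMM shape -> {threads: seconds}.
MEASUREMENTS = {
    (256, 256, 256): {1: 0.004, 8: 0.002, 32: 0.003},
    (2048, 2048, 2048): {1: 1.9, 8: 0.31, 32: 0.12},
}

def predict_threads(shape):
    """Stand-in for a learned runtime model: find the nearest measured
    shape by total FLOP count (2*m*n*k) and return the thread count
    that minimised its measured runtime."""
    flops = lambda s: 2 * s[0] * s[1] * s[2]
    nearest = min(MEASUREMENTS, key=lambda s: abs(flops(s) - flops(shape)))
    timings = MEASUREMENTS[nearest]
    return min(timings, key=timings.get)
```

The table captures the key observation motivating the work: for small GEMMs the oversubscription overhead makes the maximum thread count slower than a moderate one.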
Citations: 0
Lossy Scientific Data Compression With SPERR
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00104
Shaomeng Li, P. Lindstrom, J. Clyne
As the need for data reduction in high-performance computing (HPC) continues to grow, we introduce a new and highly effective tool to help achieve this goal—SPERR. SPERR is a versatile lossy compressor for structured scientific data; it is built on top of an advanced wavelet compression algorithm, SPECK, and provides additional capabilities valued in HPC environments. These capabilities include parallel execution for large volumes and a compression mode that satisfies a maximum point-wise error tolerance. Evaluation shows that in most settings SPERR achieves the best rate-distortion trade-off among current popular lossy scientific data compressors.
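The "maximum point-wise error tolerance" mode has a simple contract that can be demonstrated with a toy uniform quantizer. This is not SPERR's SPECK wavelet coder; it only shows what an error-bounded compression guarantee means:

```python
def compress(data, tol):
    """Toy error-bounded quantizer: map each value to an integer bin of
    width 2*tol, so the reconstruction error is at most tol per point."""
    return [round(x / (2 * tol)) for x in data]

def decompress(bins, tol):
    """Reconstruct each value as the centre of its bin."""
    return [b * 2 * tol for b in bins]

data = [0.1 * i for i in range(100)]
recon = decompress(compress(data, 0.05), 0.05)
```

A wavelet coder like SPECK achieves far better rate-distortion than per-point quantization by exploiting smoothness across the volume, but the user-visible guarantee in this mode is the same point-wise bound.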
Citations: 5
Traversing Large Compressed Graphs on GPUs
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00013
Prasun Gera, Hyesoon Kim
GPUs can be used effectively for accelerating graph analytics, provided the datasets fit in GPU memory. This is often not the case for large real-world datasets such as social, web, or biological graphs. We propose a graph compression format for static unweighted graphs based on Elias-Fano encoding that is amenable to run-time decompression on massively parallel architectures such as GPUs. We show that we can compress a variety of large graphs by a factor of 1.55x over the commonly used compressed sparse row (CSR) representation. The scheme is particularly beneficial for cases where conventional CSR based approaches do not work at all due to memory capacity constraints, or incur a significant penalty for out-of-core processing. We implement GPU accelerated breadth first search for this graph representation and show that the runtime performance for in-memory compressed graphs is 3.8x-6.5x better than out-of-core implementations for CSR graphs. Further, our implementation is also 1.45x-2x faster than the current state of the art in GPU based compressed graph traversals while maintaining a competitive compression ratio. We also extend our work to other analytics applications such as single source shortest paths and PageRank. Finally, we explore the interplay between graph reordering, graph compression, and performance.
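Elias-Fano encoding of a sorted integer sequence (e.g. a CSR adjacency list) splits each value into low bits stored verbatim and high parts stored in unary. A minimal CPU-side sketch, assuming a simple low-bit-width choice and bit-lists in place of packed bit-vectors:

```python
def elias_fano_encode(values, universe):
    """Encode a sorted list of integers in [0, universe). Low bits are
    stored verbatim; the gaps between successive high parts are stored
    in unary (zeros for the gap, then a one per value)."""
    n = len(values)
    low_bits = max(1, (universe // n).bit_length() - 1)
    lows = [v & ((1 << low_bits) - 1) for v in values]
    highs = []
    prev = 0
    for v in values:
        high = v >> low_bits
        highs.extend([0] * (high - prev) + [1])
        prev = high
    return low_bits, lows, highs

def elias_fano_decode(low_bits, lows, highs):
    """Invert the encoding by scanning the unary high-bit stream."""
    out, high, i = [], 0, 0
    for bit in highs:
        if bit == 0:
            high += 1
        else:
            out.append((high << low_bits) | lows[i])
            i += 1
    return out

vals = [3, 4, 7, 13, 14, 15, 21, 43]
lb, lows, highs = elias_fano_encode(vals, universe=44)
```

The appeal for GPUs is that the unary stream supports parallel rank/select-style decoding, so neighbor lists can be decompressed on the fly during traversal.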
{"title":"Traversing Large Compressed Graphs on GPUs","authors":"Prasun Gera, Hyesoon Kim","doi":"10.1109/IPDPS54959.2023.00013","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00013","url":null,"abstract":"GPUs can be used effectively for accelerating graph analytics, provided the datasets fit in GPU memory. This is often not the case for large real-world datasets such as social, web, or biological graphs. We propose a graph compression format for static unweighted graphs based on Elias-Fano encoding that is amenable to run-time decompression on massively parallel architectures such as GPUs. We show that we can compress a variety of large graphs by a factor of 1.55x over the commonly used compressed sparse row (CSR) representation. The scheme is particularly beneficial for cases where conventional CSR based approaches do not work at all due to memory capacity constraints, or incur a significant penalty for out-of-core processing. We implement GPU accelerated breadth first search for this graph representation and show that the runtime performance for in-memory compressed graphs is 3.8x-6.5x better than out-of-core implementations for CSR graphs. Further, our implementation is also 1.45x-2x faster than the current state of the art in GPU based compressed graph traversals while maintaining a competitive compression ratio. We also extend our work to other analytics applications such as single source shortest paths and PageRank. 
Finally, we explore the interplay between graph reordering, graph compression, and performance.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130301512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
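The Elias-Fano encoding underlying the compression format above can be sketched for a single sorted integer sequence, such as one adjacency list. This is the textbook scheme, not the paper's GPU-oriented layout: each value's low l bits are stored verbatim, and the high bits are stored in unary so that the i-th set bit of the high array recovers the i-th element:

```python
import math

def ef_encode(values, universe):
    """Elias-Fano encode a sorted list of non-negative ints < universe.

    Returns (low, high, l): the l low-order bits of each value, a
    unary-coded high-bit array, and the split point l.
    """
    n = len(values)
    l = max(0, int(math.floor(math.log2(universe / n))))
    low = [v & ((1 << l) - 1) for v in values]
    high = [0] * (n + (universe >> l) + 1)
    for i, v in enumerate(values):
        high[(v >> l) + i] = 1   # i-th one sits at (v >> l) + i
    return low, high, l

def ef_decode(low, high, l):
    """Recover the original sorted values from an Elias-Fano encoding."""
    out, i, zeros = [], 0, 0
    for bit in high:
        if bit:                   # high part = number of zeros seen so far
            out.append((zeros << l) | low[i])
            i += 1
        else:
            zeros += 1
    return out

# One adjacency list with node IDs drawn from a 16-node graph.
low, high, l = ef_encode([3, 4, 7, 13, 14, 15], 16)
assert ef_decode(low, high, l) == [3, 4, 7, 13, 14, 15]
```

The space cost is n·l bits for the low parts plus at most 2n + 1 bits of unary high parts, which is what makes the format competitive with CSR; fast GPU traversal additionally requires a parallel select/rank structure over the high bits, which this sketch omits.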
Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams
Pub Date : 2023-05-01 DOI: 10.1109/IPDPS54959.2023.00048
Yongseok Soh, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi
Streaming tensor factorization is an effective tool for unsupervised analysis of time-evolving sparse data, which emerge in many critical domains such as cybersecurity and trend analysis. In contrast to traditional tensors, time-evolving tensors demonstrate extreme sparsity and sparsity variation over time, resulting in irregular memory access and inefficient use of parallel computing resources. Additionally, due to the prohibitive cost of dynamically generating compressed sparse tensor formats, the state-of-the-art approaches process streaming tensors in a raw form that fails to capture data locality and suffers from high synchronization cost. To address these challenges, we propose a new dynamic tensor linearization framework that quickly encodes streaming multi-dimensional data on-the-fly in a compact representation, which has substantially lower memory usage and higher data reuse and parallelism than the original raw data. This is achieved by using a spatial sketching algorithm that keeps all incoming nonzero elements but remaps them into a tensor sketch with considerably reduced multi-dimensional image space. Moreover, we present a dynamic time slicing mechanism that uses variable-width time slices (instead of the traditional fixed-width) to balance the frequency of factor updates and the utilization of computing resources. We demonstrate the efficacy of our framework by accelerating two high-performance streaming tensor algorithms, namely, CP-stream and spCP-stream, and significantly improve their performance for a range of real-world streaming tensors. On a modern 56-core CPU, our framework achieves 10.3 − 11× and 6.4 − 7.2× geometric-mean speedup for the CP-stream and spCP-stream algorithms, respectively.
{"title":"Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams","authors":"Yongseok Soh, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi","doi":"10.1109/IPDPS54959.2023.00048","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00048","url":null,"abstract":"Streaming tensor factorization is an effective tool for unsupervised analysis of time-evolving sparse data, which emerge in many critical domains such as cybersecurity and trend analysis. In contrast to traditional tensors, time-evolving tensors demonstrate extreme sparsity and sparsity variation over time, resulting in irregular memory access and inefficient use of parallel computing resources. Additionally, due to the prohibitive cost of dynamically generating compressed sparse tensor formats, the state-of-the-art approaches process streaming tensors in a raw form that fails to capture data locality and suffers from high synchronization cost. To address these challenges, we propose a new dynamic tensor linearization framework that quickly encodes streaming multi-dimensional data on-the-fly in a compact representation, which has substantially lower memory usage and higher data reuse and parallelism than the original raw data. This is achieved by using a spatial sketching algorithm that keeps all incoming nonzero elements but remaps them into a tensor sketch with considerably reduced multi-dimensional image space. Moreover, we present a dynamic time slicing mechanism that uses variable-width time slices (instead of the traditional fixed-width) to balance the frequency of factor updates and the utilization of computing resources. We demonstrate the efficacy of our framework by accelerating two high-performance streaming tensor algorithms, namely, CP-stream and spCP-stream, and significantly improve their performance for a range of real-world streaming tensors. 
On a modern 56-core CPU, our framework achieves 10.3 − 11× and 6.4 − 7.2× geometric-mean speedup for the CP-stream and spCP-stream algorithms, respectively.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132934399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
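The variable-width time slicing idea can be sketched as follows: instead of cutting the stream at fixed time intervals, a slice closes once it has accumulated roughly a target number of nonzeros, so sparse and dense periods both yield slices of comparable work. This is a hypothetical illustration of the policy, not the paper's implementation; the function name and budget rule are assumptions:

```python
def slice_stream(events, nnz_budget):
    """Group (timestamp, coords, value) events into variable-width time
    slices, closing a slice once it holds >= nnz_budget nonzeros.
    Events sharing a timestamp are never split across slices."""
    slices, current, last_t = [], [], None
    for t, coords, val in events:
        if len(current) >= nnz_budget and t != last_t:
            slices.append(current)
            current = []
        current.append((t, coords, val))
        last_t = t
    if current:
        slices.append(current)
    return slices

# A bursty stream: three nonzeros at t=0, then a sparser tail.
events = [(0, (0, 0), 1.0), (0, (1, 2), 1.0), (0, (2, 1), 1.0),
          (1, (0, 1), 1.0), (1, (3, 3), 1.0), (2, (2, 2), 1.0),
          (3, (1, 1), 1.0)]
slices = slice_stream(events, nnz_budget=3)
assert [len(s) for s in slices] == [3, 3, 1]
```

With a fixed-width policy the dense burst at t=0 and the sparse tail would produce wildly unbalanced slices; balancing by nonzero count keeps per-slice factor-update cost roughly constant.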
Journal
2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)