DeepThermo: Deep Learning Accelerated Parallel Monte Carlo Sampling for Thermodynamics Evaluation of High Entropy Alloys
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00041
Junqi Yin, Feiyi Wang, M. Shankar
Since the introduction of Metropolis Monte Carlo (MC) sampling, it and its variants have become standard tools used for thermodynamics evaluations of physical systems. However, a long-standing problem that hinders the effectiveness and efficiency of MC sampling is the lack of a generic method (a.k.a. MC proposal) to update the system configurations. Consequently, current practices are not scalable. Here we propose a parallel MC sampling framework for thermodynamics evaluation—DeepThermo. By using deep learning–based MC proposals that can globally update the system configurations, we show that DeepThermo can effectively evaluate the phase transition behaviors of high entropy alloys, which have an astronomical configuration space. For the first time, we directly evaluate a density of states expanding over a range of ~e^10,000 for a real material. We also demonstrate DeepThermo’s performance and scalability up to 3,000 GPUs on both NVIDIA V100 and AMD MI250X-based supercomputers.
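The learned proposal networks themselves are beyond the scope of this listing, but the Metropolis-Hastings acceptance step that any such proposal must plug into is standard. Below is a minimal Python sketch with a pluggable proposal function; the toy single-spin-flip proposal and the chain energy are illustrative stand-ins, not the paper's alloy Hamiltonian or its deep-learning proposal.

```python
import math
import random

def metropolis_step(state, energy, propose, beta):
    """One Metropolis-Hastings step with a pluggable proposal.

    propose(state) returns (new_state, log_q_ratio), where log_q_ratio is
    log q(state | new_state) - log q(new_state | state); it corrects the
    acceptance test when the proposal (e.g., a learned one) is asymmetric.
    """
    new_state, log_q_ratio = propose(state)
    delta_e = energy(new_state) - energy(state)
    log_alpha = min(0.0, -beta * delta_e + log_q_ratio)
    if random.random() < math.exp(log_alpha):
        return new_state, True
    return state, False

# Toy symmetric proposal (log_q_ratio = 0): flip one spin of a 1-D chain.
# A global, learned proposal would instead update many sites per step.
def flip_one(spins):
    i = random.randrange(len(spins))
    flipped = list(spins)
    flipped[i] = -flipped[i]
    return flipped, 0.0

def chain_energy(spins):
    return -sum(a * b for a, b in zip(spins, spins[1:]))

spins = [random.choice((-1, 1)) for _ in range(64)]
for _ in range(1000):
    spins, _ = metropolis_step(spins, chain_energy, flip_one, beta=0.5)
```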
{"title":"DeepThermo: Deep Learning Accelerated Parallel Monte Carlo Sampling for Thermodynamics Evaluation of High Entropy Alloys","authors":"Junqi Yin, Feiyi Wang, M. Shankar","doi":"10.1109/IPDPS54959.2023.00041","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00041","url":null,"abstract":"Since the introduction of Metropolis Monte Carlo (MC) sampling, it and its variants have become standard tools used for thermodynamics evaluations of physical systems. However, a long-standing problem that hinders the effectiveness and efficiency of MC sampling is the lack of a generic method (a.k.a. MC proposal) to update the system configurations. Consequently, current practices are not scalable. Here we propose a parallel MC sampling framework for thermodynamics evaluation—DeepThermo. By using deep learning–based MC proposals that can globally update the system configurations, we show that DeepThermo can effectively evaluate the phase transition behaviors of high entropy alloys, which have an astronomical configuration space. For the first time, we directly evaluate a density of states expanding over a range of ~e10,000 for a real material. We also demonstrate DeepThermo’s performance and scalability up to 3,000 GPUs on both NVIDIA V100 and AMD MI250X-based supercomputers.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124830709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QoS-Aware and Cost-Efficient Dynamic Resource Allocation for Serverless ML Workflows
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00093
Hao Wu, Junxiao Deng, Haoqiang Fan, Shadi Ibrahim, Song Wu, Hai Jin
Machine Learning (ML) workflows are increasingly deployed on serverless computing platforms to benefit from their elasticity and fine-grain pricing. Proper resource allocation is crucial to achieve fast and cost-efficient execution of serverless ML workflows (especially for hyperparameter tuning and model training). Unfortunately, existing resource allocation methods are static, treat functions equally, and rely on offline prediction, which limits their efficiency. In this paper, we introduce CE-scaling – a Cost-Efficient autoscaling framework for serverless ML workflows. During hyperparameter tuning, CE-scaling partitions resources across stages according to their exact usage to minimize resource waste. Moreover, it incorporates an online prediction method to dynamically adjust resources during model training. We implement and evaluate CE-scaling on AWS Lambda using various ML models. Evaluation results show that compared to state-of-the-art static resource allocation methods, CE-scaling can reduce the job completion time and the monetary cost by up to 63% and 41% for hyperparameter tuning, respectively; and by up to 58% and 38% for model training.
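The CE-scaling policies themselves are not reproduced here; as a rough, hypothetical sketch of the cost/deadline trade-off such an autoscaler navigates on AWS Lambda, the snippet below picks the cheapest memory size whose estimated runtime meets a per-stage deadline. The runtime model and candidate sizes are placeholders, not measurements; only the per-GB-second price is the published AWS figure.

```python
# Hypothetical sketch: choose the cheapest Lambda memory size that meets a
# per-stage deadline, given an assumed runtime model (not measured data).
PRICE_PER_GB_SECOND = 0.0000166667  # published AWS Lambda x86 price (USD)

def estimate_runtime_s(mem_mb, base_s=60.0, ref_mb=1024):
    # Assume runtime scales roughly inversely with memory (CPU share),
    # flattening once the function is no longer CPU-bound.
    return max(base_s * ref_mb / mem_mb, 5.0)

def pick_memory(deadline_s, sizes_mb=(512, 1024, 2048, 4096, 8192)):
    feasible = []
    for mem in sizes_mb:
        t = estimate_runtime_s(mem)
        cost = t * (mem / 1024.0) * PRICE_PER_GB_SECOND
        if t <= deadline_s:
            feasible.append((cost, mem, t))
    if not feasible:
        return max(sizes_mb)      # best effort if no size meets the deadline
    return min(feasible)[1]       # cheapest feasible configuration

print(pick_memory(deadline_s=40.0))
```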
{"title":"QoS-Aware and Cost-Efficient Dynamic Resource Allocation for Serverless ML Workflows","authors":"Hao Wu, Junxiao Deng, Haoqiang Fan, Shadi Ibrahim, Song Wu, Hai Jin","doi":"10.1109/IPDPS54959.2023.00093","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00093","url":null,"abstract":"Machine Learning (ML) workflows are increasingly deployed on serverless computing platforms to benefit from their elasticity and fine-grain pricing. Proper resource allocation is crucial to achieve fast and cost-efficient execution of serverless ML workflows (specially for hyperparameter tuning and model training). Unfortunately, existing resource allocation methods are static, treat functions equally, and rely on offline prediction, which limit their efficiency. In this paper, we introduce CE-scaling – a Cost-Efficient autoscaling framework for serverless ML work-flows. During the hyperparameter tuning, CE-scaling partitions resources across stages according to their exact usage to minimize resource waste. Moreover, it incorporates an online prediction method to dynamically adjust resources during model training. We implement and evaluate CE-scaling on AWS Lambda using various ML models. Evaluation results show that compared to state-of-the-art static resource allocation methods, CE-scaling can reduce the job completion time and the monetary cost by up to 63% and 41% for hyperparameter tuning, respectively; and by up to 58% and 38% for model training.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114390239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributed Sparse Random Projection Trees for Constructing K-Nearest Neighbor Graphs
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00014
Isuru Ranawaka, Md. Khaledur Rahman, A. Azad
A random projection tree that partitions data points by projecting them onto random vectors is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm for constructing sparse random projection trees and then running a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph constructions to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://github.com/HipGraph/DRPT.
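Neither the distributed-memory machinery nor the sparse projections of DRPT are shown in the abstract; the sketch below only illustrates, in single-node Python, the random-projection split that such trees are built from, with hypothetical parameter choices.

```python
import numpy as np

def build_rp_tree(points, indices=None, leaf_size=32, rng=None):
    """Recursively split points by their projection onto a random vector.

    Returns nested (left, right) tuples with index arrays at the leaves. This is
    only the core idea; DRPT additionally uses sparse projections, a forest of
    trees, and distributed sparse-matrix operations.
    """
    rng = np.random.default_rng() if rng is None else rng
    indices = np.arange(len(points)) if indices is None else indices
    if len(indices) <= leaf_size:
        return indices
    direction = rng.standard_normal(points.shape[1])
    projections = points[indices] @ direction
    median = np.median(projections)
    left, right = indices[projections <= median], indices[projections > median]
    if len(left) == 0 or len(right) == 0:          # degenerate split, stop early
        return indices
    return (build_rp_tree(points, left, leaf_size, rng),
            build_rp_tree(points, right, leaf_size, rng))

# Points that share a leaf (across the trees of a forest) form the candidate set
# for exact k-nearest-neighbor distance computations.
pts = np.random.default_rng(0).standard_normal((10_000, 64))
tree = build_rp_tree(pts)
```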
{"title":"Distributed Sparse Random Projection Trees for Constructing K-Nearest Neighbor Graphs","authors":"Isuru Ranawaka, Md. Khaledur Rahman, A. Azad","doi":"10.1109/IPDPS54959.2023.00014","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00014","url":null,"abstract":"A random projection tree that partitions data points by projecting them onto random vectors is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm for constructing sparse random projection trees and then running a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph constructions to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://github.com/HipGraph/DRPT.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122880941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mimir: Extending I/O Interfaces to Express User Intent for Complex Workloads in HPC
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00027
H. Devarajan, K. Mohror
The complexity of data management in HPC systems stems from the diversity in I/O behavior exhibited by new workloads, multistage workflows, and the presence of multitiered storage systems. This complexity is managed by the storage systems, which provide user-level configurations to allow the tuning of workload I/O within the system. However, these configurations are difficult to set by users who lack expertise in I/O subsystems. We propose a paradigm change in which users specify the intent of I/O operations and storage systems automatically set various configurations based on the supplied intent. To this end, we developed the Mimir infrastructure to assist users in passing I/O intent to the underlying storage system. We demonstrate several use cases that map user-defined intents to storage configurations that lead to optimized I/O. In this study, we make three observations. First, I/O intents should be applied to each level of the I/O storage stack, from HDF5 to MPI-IO to POSIX, and integrated using lightweight adaptors in the existing stack. Second, the Mimir infrastructure supports up to 400M Ops/sec throughput of intents in the system, with a low memory overhead of 6.85 KB per node. Third, intents assist in configuring a hierarchical cache that preloads I/O, buffers data in a node-local device, and stores data in a global cache, optimizing I/O workloads by 2.33×, 4×, and 2.1×, respectively. Our Mimir infrastructure optimizes complex large-scale workflows, achieving up to 4× better I/O performance on the Lassen supercomputer through automatically derived I/O intents.
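Mimir's actual interfaces, intent vocabulary, and configuration keys are not given in the abstract; the snippet below is a purely hypothetical illustration of the "declare an intent, derive storage settings" idea, with invented field and key names.

```python
# Hypothetical sketch of the intent -> configuration idea; Mimir's real API,
# intent vocabulary, and configuration keys are not reproduced here.
from dataclasses import dataclass

@dataclass(frozen=True)
class IOIntent:
    access_pattern: str   # e.g. "write_once_read_many", "checkpoint", "shuffle"
    reuse: str            # e.g. "high", "low"
    latency_sensitive: bool

def derive_storage_config(intent: IOIntent) -> dict:
    """Map a declared intent to tunables a storage stack might expose."""
    config = {"collective_io": False, "cache_tier": "global", "prefetch": False}
    if intent.access_pattern == "checkpoint":
        config.update(cache_tier="node_local_burst_buffer", collective_io=True)
    if intent.reuse == "high":
        config["prefetch"] = True
    if intent.latency_sensitive:
        config["cache_tier"] = "node_local_burst_buffer"
    return config

print(derive_storage_config(IOIntent("checkpoint", "low", False)))
```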
{"title":"Mimir: Extending I/O Interfaces to Express User Intent for Complex Workloads in HPC","authors":"H. Devarajan, K. Mohror","doi":"10.1109/IPDPS54959.2023.00027","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00027","url":null,"abstract":"The complexity of data management in HPC systems stems from the diversity in I/O behavior exhibited by new workloads, multistage workflows, and the presence of multitiered storage systems. This complexity is managed by the storage systems, which provide user-level configurations to allow the tuning of workload I/O within the system. However, these configurations are difficult to set by users who lack expertise in I/O subsystems. We propose a paradigm change in which users specify the intent of I/O operations and storage systems automatically set various configurations based on the supplied intent. To this end, we developed the Mimir infrastructure to assist users in passing I/O intent to the underlying storage system. We demonstrate several use cases that map user-defined intents to storage configurations that lead to optimized I/O. In this study, we make three observations. First, I/O intents should be applied to each level of the I/O storage stack, from HDF5 to MPI-IO to POSIX, and integrated using lightweight adaptors in the existing stack. Second, the Mimir infrastructure supports up to 400M Ops/sec throughput of intents in the system, with a low memory overhead of 6.85KB per node. Third, intents assist in configuring a hierarchical cache to preload I/O, buffer in a node-local device, and store data in a global cache to optimize I/O workloads by 2.33×, 4×, and 2.1×, respectively. Our Mimir infrastructure optimizes complex large-scale workflows by up to 4× better I/O performance on the Lassen supercomputer by using automatically derived I/O intents.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125917028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
H-Cache: Traffic-Aware Hybrid Rule-Caching in Software-Defined Networks
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00017
Zeyu Luan, Qing Li, Yi Wang, Yong Jiang
Ternary Content Addressable Memory (TCAM) is an essential hardware component in SDN-enabled switches, which supports fast lookup speed and flexible matching patterns. However, TCAM’s limited storage capacity has long posed a scalability challenge for enforcing fine-grained forwarding policies in SDN. Based on the observation of traffic locality, the rule-caching mechanism employs a combination of TCAM and Random Access Memory (RAM) to maintain the forwarding rules of large and small flows, respectively. However, previous works cannot identify large flows in a timely and accurate manner, and they suffer from high computational complexity when addressing rule dependencies in TCAM. Worse still, TCAM only caches the forwarding rules of large flows but ignores the latency requirements of small flows. Small flows encounter cache misses in TCAM and are diverted to RAM, where they experience slow lookups. To jointly optimize the performance of both high-throughput large flows and latency-sensitive small flows, we propose a hybrid rule-caching framework, H-Cache, to scale traffic-aware forwarding policies in SDN. H-Cache identifies large flows through a collaboration of learning-based and threshold-based methods to achieve early detection and high accuracy, and proposes a time-efficient greedy heuristic to address rule dependencies. For small flows, H-Cache establishes default paths in TCAM to speed up their lookup processes, and also reduces their TCAM occupancy through label switching and region partitioning. Experiments with both real-world and synthetic datasets demonstrate that H-Cache increases TCAM utilization by an average of 11% and reduces the average completion time of small flows by almost 70%.
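H-Cache's learning-based early detector and its dependency heuristic are not reproduced here; the sketch below illustrates only the threshold half of large-flow identification, using a count-min sketch with placeholder sizes and threshold values.

```python
# Minimal threshold-based large-flow ("elephant") detector using a count-min
# sketch; sizes and the threshold are placeholders, not H-Cache's settings.
import random

class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=0):
        rnd = random.Random(seed)
        self.width, self.depth = width, depth
        self.salts = [rnd.getrandbits(64) for _ in range(depth)]
        self.tables = [[0] * width for _ in range(depth)]

    def add(self, key, count=1):
        est = float("inf")
        for salt, table in zip(self.salts, self.tables):
            slot = hash((salt, key)) % self.width
            table[slot] += count
            est = min(est, table[slot])
        return est   # estimated flow volume; never an underestimate

ELEPHANT_THRESHOLD = 10_000   # bytes; placeholder value
sketch = CountMinSketch()

def on_packet(flow_key, nbytes, tcam_rules):
    if sketch.add(flow_key, nbytes) >= ELEPHANT_THRESHOLD:
        tcam_rules.add(flow_key)   # promote this flow's rule into TCAM

rules = set()
for _ in range(8):
    on_packet("10.0.0.1->10.0.0.2", 1500, rules)
print(rules)   # flow promoted once its estimated volume crosses the threshold
```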
{"title":"H-Cache: Traffic-Aware Hybrid Rule-Caching in Software-Defined Networks","authors":"Zeyu Luan, Qing Li, Yi Wang, Yong Jiang","doi":"10.1109/IPDPS54959.2023.00017","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00017","url":null,"abstract":"Ternary Content Addressable Memory (TCAM) is an essential hardware component in SDN-enabled switches, which supports fast lookup speed and flexible matching patterns. However, TCAM’s limited storage capacity has long been a scalability challenge to enforce fine-grained forwarding policies in SDN. Based on the observation of traffic locality, the rule-caching mechanism employs a combination of TCAM and Random Access Memory (RAM) to maintain the forwarding rules of large and small flows, respectively. However, previous works cannot identify large flows timely and accurately, and suffer from high computational complexity when addressing rule dependencies in TCAM. Worse still, TCAM only caches the forwarding rules of large flows but ignores the latency requirements of small flows. Small flows encounter cache-miss in TCAM and then will be diverted to RAM, where they have to experience slow lookup processes. To jointly optimize the performance of both high-throughput large flows and latency-sensitive small flows, we propose a hybrid rule-caching framework, H-Cache, to scale traffic-aware forwarding policies in SDN. H-Cache identifies large flows through a collaboration of learning-based and threshold-based methods to achieve early detection and high accuracy, and proposes a time-efficient greedy heuristic to address rule dependencies. For small flows, H-Cache establishes default paths in TCAM to speed up their lookup processes, and also reduces their TCAM occupancy through label switching and region partitioning. Experiments with both real-world and synthetic datasets demonstrate that H-Cache increases TCAM utilization by an average of 11% and reduces the average completion time of small flows by almost 70%.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114692561","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00070
Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda
MPI Neighborhood collectives are used for non-traditional collective operations involving uneven distribution of communication amongst processes such as sparse communication patterns. They provide flexibility to define the communication pattern involved when a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, used extensively with scientific applications to provide scalable solutions through routines modeled by partial differential equations, utilizes neighborhood communication patterns to define various structures and routines. We propose GPU-aware MPI Neighborhood collective operations with support for AMD and NVIDIA GPU backends and propose optimized designs to provide scalable performance for various communication routines. We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, scattering from a sequential vector to a parallel vector, and scattering from a parallel vector to a sequential vector using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with InfiniBand networking, demonstrating 30.90% improvement against a GPU implementation utilizing CPU-staging techniques, and 8.25% improvement against GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with Slingshot networking and present 39.52% improvement against the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and 33.25% improvement against the GPU-aware point-to-point implementation of the routine.
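The paper's GPU-aware designs and PETSc's star-forest scatters are not reproduced here; as a minimal illustration of the MPI-3 primitive they build on, the mpi4py sketch below posts a nonblocking neighborhood all-to-all on a ring-shaped distributed graph topology (host buffers only; the GPU-aware path is the paper's contribution, and the ring neighborhood is an assumption for the example).

```python
# Run with, e.g.: mpirun -n 4 python neighbor_demo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Ring neighborhood: each rank exchanges data with its left and right neighbor.
neighbors = [(rank - 1) % size, (rank + 1) % size]
graph = comm.Create_dist_graph_adjacent(sources=neighbors, destinations=neighbors)

sendbuf = np.full(2, rank, dtype=np.int32)      # one value per neighbor
recvbuf = np.empty(2, dtype=np.int32)

req = graph.Ineighbor_alltoall(sendbuf, recvbuf)   # nonblocking neighborhood collective
# ... independent computation could overlap with communication here ...
req.Wait()
print(f"rank {rank} received {recvbuf.tolist()} from {neighbors}")
graph.Free()
```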
{"title":"Designing and Optimizing GPU-aware Nonblocking MPI Neighborhood Collective Communication for PETSc*","authors":"Kawthar Shafie Khorassani, Chen-Chun Chen, H. Subramoni, D. Panda","doi":"10.1109/IPDPS54959.2023.00070","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00070","url":null,"abstract":"MPI Neighborhood collectives are used for non-traditional collective operations involving uneven distribution of communication amongst processes such as sparse communication patterns. They provide flexibility to define the communication pattern involved when a neighborhood relationship can be defined. PETSc, the Portable, Extensible Toolkit for Scientific Computation, used extensively with scientific applications to provide scalable solutions through routines modeled by partial differential equations, utilizes neighborhood communication patterns to define various structures and routines.We propose GPU-aware MPI Neighborhood collective operations with support for AMD and NVIDIA GPU backends and propose optimized designs to provide scalable performance for various communication routines. We evaluate our designs using PETSc structures for scattering from a parallel vector to a parallel vector, scattering from a sequential vector to a parallel vector, and scattering from a parallel vector to a sequential vector using a star forest graph representation implemented with nonblocking MPI neighborhood alltoallv collective operations. We evaluate our neighborhood designs on 64 NVIDIA GPUs on the Lassen system with Infiniband networking, demonstrating30.90% improvement against a GPU implementation utilizing CPU-staging techniques, and 8.25% improvement against GPU-aware point-to-point implementations of the communication pattern. We also evaluate on 64 AMD GPUs on the Spock system with slingshot networking and present 39.52% improvement against the CPU-staging implementation of a neighborhood GPU vector type in PETSc, and 33.25% improvement against GPU-aware point-to-point implementation of the routine.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"252 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129981478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00059
Yufan Xia, M. D. L. Pierre, A. Barnard, Giuseppe M. J. Barca
The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime. We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS for GEMM operations with memory usage within 100 MB.
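The ADSALA feature set, model, and training pipeline are not described in the abstract; the sketch below is a toy stand-in for the general idea of learning a thread-count selector for GEMM, using threadpoolctl to cap BLAS threads and a small random-forest regressor. All shapes, candidate thread counts, and model choices are illustrative assumptions.

```python
import time
import numpy as np
from threadpoolctl import threadpool_limits
from sklearn.ensemble import RandomForestRegressor

CANDIDATES = (1, 2, 4, 8, 16)

def time_gemm(m, n, k, threads):
    a, b = np.random.rand(m, k), np.random.rand(k, n)
    with threadpool_limits(limits=threads):   # cap BLAS threads for this call
        t0 = time.perf_counter()
        a @ b
        return time.perf_counter() - t0

# Small training pass: measure a few shapes at each candidate thread count.
X, y = [], []
for m, n, k in [(256, 256, 256), (512, 512, 512), (1024, 1024, 1024)]:
    for t in CANDIDATES:
        X.append([m, n, k, t])
        y.append(time_gemm(m, n, k, t))
model = RandomForestRegressor(n_estimators=50).fit(X, y)

def best_threads(m, n, k):
    """Predict the runtime of each candidate thread count and pick the fastest."""
    preds = model.predict([[m, n, k, t] for t in CANDIDATES])
    return CANDIDATES[int(np.argmin(preds))]

print(best_threads(800, 800, 800))
```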
{"title":"A Machine Learning Approach Towards Runtime Optimisation of Matrix Multiplication","authors":"Yufan Xia, M. D. L. Pierre, A. Barnard, Giuseppe M. J. Barca","doi":"10.1109/IPDPS54959.2023.00059","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00059","url":null,"abstract":"The GEneral Matrix Multiplication (GEMM) is one of the essential algorithms in scientific computing. Single-thread GEMM implementations are well-optimised with techniques like blocking and autotuning. However, due to the complexity of modern multi-core shared memory systems, it is challenging to determine the number of threads that minimises the multi-thread GEMM runtime.We present a proof-of-concept approach to building an Architecture and Data-Structure Aware Linear Algebra (ADSALA) software library that uses machine learning to optimise the runtime performance of BLAS routines. More specifically, our method uses a machine learning model on-the-fly to automatically select the optimal number of threads for a given GEMM task based on the collected training data. Test results on two different HPC node architectures, one based on a two-socket Intel Cascade Lake and the other on a two-socket AMD Zen 3, revealed a 25 to 40 per cent speedup compared to traditional GEMM implementations in BLAS when using GEMM of memory usage within 100 MB.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130035096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lossy Scientific Data Compression With SPERR
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00104
Shaomeng Li, P. Lindstrom, J. Clyne
As the need for data reduction in high-performance computing (HPC) continues to grow, we introduce a new and highly effective tool to help achieve this goal—SPERR. SPERR is a versatile lossy compressor for structured scientific data; it is built on top of an advanced wavelet compression algorithm, SPECK, and provides additional capabilities valued in HPC environments. These capabilities include parallel execution for large volumes and a compression mode that satisfies a maximum point-wise error tolerance. Evaluation shows that in most settings SPERR achieves the best rate-distortion trade-off among current popular lossy scientific data compressors.
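SPERR's wavelet/SPECK pipeline is not reproduced here; the snippet below only illustrates, with plain uniform quantization, the kind of guarantee a maximum point-wise error mode provides: every reconstructed value stays within a user-chosen tolerance of the original.

```python
# Generic illustration of an error-bounded lossy scheme: uniform scalar
# quantization with step 2*tol keeps |x - decode(encode(x))| <= tol.
# This is NOT SPERR's algorithm; it only demonstrates the guarantee.
import numpy as np

def encode(data, tol):
    # Integer indices; a real compressor would entropy-code these.
    return np.round(data / (2.0 * tol)).astype(np.int64)

def decode(indices, tol):
    return indices * (2.0 * tol)

data = np.random.rand(1_000_000)
tol = 1e-3
recon = decode(encode(data, tol), tol)
print("max point-wise error:", np.max(np.abs(data - recon)))  # <= tol (up to float rounding)
```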
{"title":"Lossy Scientific Data Compression With SPERR","authors":"Shaomeng Li, P. Lindstrom, J. Clyne","doi":"10.1109/IPDPS54959.2023.00104","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00104","url":null,"abstract":"As the need for data reduction in high-performance computing (HPC) continues to grow, we introduce a new and highly effective tool to help achieve this goal—SPERR. SPERR is a versatile lossy compressor for structured scientific data; it is built on top of an advanced wavelet compression algorithm, SPECK, and provides additional capabilities valued in HPC environments. These capabilities include parallel execution for large volumes and a compression mode that satisfies a maximum point-wise error tolerance. Evaluation shows that in most settings SPERR achieves the best rate-distortion trade-off among current popular lossy scientific data compressors.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133285034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traversing Large Compressed Graphs on GPUs
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00013
Prasun Gera, Hyesoon Kim
GPUs can be used effectively for accelerating graph analytics, provided the datasets fit in GPU memory. This is often not the case for large real-world datasets such as social, web, or biological graphs. We propose a graph compression format for static unweighted graphs based on Elias-Fano encoding that is amenable to run-time decompression on massively parallel architectures such as GPUs. We show that we can compress a variety of large graphs by a factor of 1.55x over the commonly used compressed sparse row (CSR) representation. The scheme is particularly beneficial for cases where conventional CSR based approaches do not work at all due to memory capacity constraints, or incur a significant penalty for out-of-core processing. We implement GPU accelerated breadth first search for this graph representation and show that the runtime performance for in-memory compressed graphs is 3.8x-6.5x better than out-of-core implementations for CSR graphs. Further, our implementation is also 1.45x-2x faster than the current state of the art in GPU based compressed graph traversals while maintaining a competitive compression ratio. We also extend our work to other analytics applications such as single source shortest paths and PageRank. Finally, we explore the interplay between graph reordering, graph compression, and performance.
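The GPU decompression kernels are the paper's contribution and are not shown here; the sketch below is a minimal, CPU-side Elias-Fano encoder/decoder for a single sorted neighbor list, to make the underlying encoding concrete.

```python
# Minimal Elias-Fano encoder/decoder for one sorted adjacency list; the paper's
# GPU-side layout, chunking, and parallel decoding are not reproduced here.
import math

def ef_encode(sorted_values, universe):
    n = len(sorted_values)
    l = max(0, int(math.floor(math.log2(universe / n)))) if n else 0
    lows = [v & ((1 << l) - 1) for v in sorted_values]          # packed low bits
    highs = [0] * (n + (sorted_values[-1] >> l) + 1) if n else []
    for i, v in enumerate(sorted_values):
        highs[(v >> l) + i] = 1                                 # unary-coded upper bits
    return l, lows, highs

def ef_decode(l, lows, highs):
    values, i = [], 0
    for pos, bit in enumerate(highs):
        if bit:
            values.append(((pos - i) << l) | lows[i])
            i += 1
    return values

neighbors = [3, 7, 7, 18, 42, 999]          # one vertex's sorted neighbor list
enc = ef_encode(neighbors, universe=1024)   # universe = number of vertices
assert ef_decode(*enc) == neighbors
```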
{"title":"Traversing Large Compressed Graphs on GPUs","authors":"Prasun Gera, Hyesoon Kim","doi":"10.1109/IPDPS54959.2023.00013","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00013","url":null,"abstract":"GPUs can be used effectively for accelerating graph analytics, provided the datasets fit in GPU memory. This is often not the case for large real-world datasets such as social, web, or biological graphs. We propose a graph compression format for static unweighted graphs based on Elias-Fano encoding that is amenable to run-time decompression on massively parallel architectures such as GPUs. We show that we can compress a variety of large graphs by a factor of 1.55x over the commonly used compressed sparse row (CSR) representation. The scheme is particularly beneficial for cases where conventional CSR based approaches do not work at all due to memory capacity constraints, or incur a significant penalty for out-of-core processing. We implement GPU accelerated breadth first search for this graph representation and show that the runtime performance for in-memory compressed graphs is 3.8x-6.5x better than out-of-core implementations for CSR graphs. Further, our implementation is also 1.45x-2x faster than the current state of the art in GPU based compressed graph traversals while maintaining a competitive compression ratio. We also extend our work to other analytics applications such as single source shortest paths and PageRank. Finally, we explore the interplay between graph reordering, graph compression, and performance.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130301512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams
Pub Date: 2023-05-01 | DOI: 10.1109/IPDPS54959.2023.00048
Yongseok Soh, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi
Streaming tensor factorization is an effective tool for unsupervised analysis of time-evolving sparse data, which emerge in many critical domains such as cybersecurity and trend analysis. In contrast to traditional tensors, time-evolving tensors demonstrate extreme sparsity and sparsity variation over time, resulting in irregular memory access and inefficient use of parallel computing resources. Additionally, due to the prohibitive cost of dynamically generating compressed sparse tensor formats, the state-of-the-art approaches process streaming tensors in a raw form that fails to capture data locality and suffers from high synchronization cost. To address these challenges, we propose a new dynamic tensor linearization framework that quickly encodes streaming multi-dimensional data on-the-fly in a compact representation, which has substantially lower memory usage and higher data reuse and parallelism than the original raw data. This is achieved by using a spatial sketching algorithm that keeps all incoming nonzero elements but remaps them into a tensor sketch with considerably reduced multi-dimensional image space. Moreover, we present a dynamic time slicing mechanism that uses variable-width time slices (instead of the traditional fixed-width) to balance the frequency of factor updates and the utilization of computing resources. We demonstrate the efficacy of our framework by accelerating two high-performance streaming tensor algorithms, namely, CP-stream and spCP-stream, and significantly improve their performance for a range of real-world streaming tensors. On a modern 56-core CPU, our framework achieves 10.3–11× and 6.4–7.2× geometric-mean speedup for the CP-stream and spCP-stream algorithms, respectively.
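The framework's spatial sketching algorithm is only named in the abstract; the snippet below is a hypothetical illustration of the general idea of remapping streaming nonzero coordinates into a much smaller index space while keeping every nonzero. The hash function, reduced mode sizes, and names are invented for the example.

```python
# Illustrative sketch of remapping streaming nonzero coordinates into a smaller
# index space while keeping every nonzero; the paper's actual spatial sketching
# algorithm and its accuracy guarantees are not reproduced here.
import hashlib
from collections import defaultdict

SKETCH_DIMS = (1024, 1024, 64)   # reduced mode sizes (placeholder values)

def remap(index, mode):
    """Deterministically map one original coordinate into the sketched mode."""
    h = hashlib.blake2b(f"{mode}:{index}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "little") % SKETCH_DIMS[mode]

def sketch_slice(nonzeros):
    """Accumulate a time slice of (i, j, k, value) nonzeros into the sketch."""
    sketched = defaultdict(float)
    for i, j, k, val in nonzeros:
        key = (remap(i, 0), remap(j, 1), remap(k, 2))
        sketched[key] += val               # collisions accumulate, count-sketch style
    return sketched

slice_nnz = [(12, 70_000, 3, 1.0), (12, 70_001, 3, 2.5), (99_999, 5, 7, -1.0)]
print(len(sketch_slice(slice_nnz)), "sketched nonzeros")
```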
{"title":"Dynamic Tensor Linearization and Time Slicing for Efficient Factorization of Infinite Data Streams","authors":"Yongseok Soh, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Teresa M. Ranadive, F. Petrini, Jeewhan Choi","doi":"10.1109/IPDPS54959.2023.00048","DOIUrl":"https://doi.org/10.1109/IPDPS54959.2023.00048","url":null,"abstract":"Streaming tensor factorization is an effective tool for unsupervised analysis of time-evolving sparse data, which emerge in many critical domains such as cybersecurity and trend analysis. In contrast to traditional tensors, time-evolving tensors demonstrate extreme sparsity and sparsity variation over time, resulting in irregular memory access and inefficient use of parallel computing resources. Additionally, due to the prohibitive cost of dynamically generating compressed sparse tensor formats, the state-of-the-art approaches process streaming tensors in a raw form that fails to capture data locality and suffers from high synchronization cost. To address these challenges, we propose a new dynamic tensor linearization framework that quickly encodes streaming multi-dimensional data on-the-fly in a compact representation, which has substantially lower memory usage and higher data reuse and parallelism than the original raw data. This is achieved by using a spatial sketching algorithm that keeps all incoming nonzero elements but remaps them into a tensor sketch with considerably reduced multi-dimensional image space. Moreover, we present a dynamic time slicing mechanism that uses variable-width time slices (instead of the traditional fixed-width) to balance the frequency of factor updates and the utilization of computing resources. We demonstrate the efficacy of our framework by accelerating two high-performance streaming tensor algorithms, namely, CP-stream and spCP-stream, and significantly improve their performance for a range of real-world streaming tensors. On a modern 56-core CPU, our framework achieves 10.3 − 11× and 6.4 − 7.2× geometric-mean speedup for the CP-stream and spCP-stream algorithms, respectively.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132934399","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}