Rohan Basu Roy, Tirthak Patel, V. Gadepally, Devesh Tiwari
This work introduces Mashup, a novel strategy that leverages the serverless computing model to execute scientific workflows in a hybrid fashion, taking advantage of both the traditional VM-based cloud computing platform and the emerging serverless platform. Mashup outperforms state-of-the-art workflow execution engines for widely-used HPC workflows on the Amazon Cloud platform (EC2 and Lambda), reducing execution time by an average of 34% and cost by an average of 43%.
{"title":"Mashup: making serverless computing useful for HPC workflows via hybrid execution","authors":"Rohan Basu Roy, Tirthak Patel, V. Gadepally, Devesh Tiwari","doi":"10.1145/3503221.3508407","DOIUrl":"https://doi.org/10.1145/3503221.3508407","url":null,"abstract":"This work introduces Mashup, a novel strategy that leverages the serverless computing model to execute scientific workflows in a hybrid fashion, taking advantage of both the traditional VM-based cloud computing platform and the emerging serverless platform. Mashup outperforms state-of-the-art workflow execution engines for widely-used HPC workflows on the Amazon Cloud platform (EC2 and Lambda), reducing execution time by an average of 34% and cost by an average of 43%.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132071241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
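The hybrid VM/serverless placement idea in the abstract above can be sketched with a toy per-task cost model: short, bursty tasks favor pay-per-use serverless, while long-running tasks favor a provisioned VM. All prices, the cold-start figure, and the task shapes below are invented for illustration and are not Mashup's actual model.

```python
# Toy hybrid placement: choose the cheaper platform per workflow task.
# Illustrative (made-up) prices, not real AWS rates.
VM_PRICE_PER_SEC = 0.0000133          # assumed VM price per second
LAMBDA_PRICE_PER_GB_SEC = 0.0000166667  # assumed serverless price per GB-second
LAMBDA_COLD_START_SEC = 0.5           # assumed invocation overhead

def place_task(runtime_sec, mem_gb):
    """Return ('vm' | 'serverless', cost) for one workflow task."""
    vm_cost = runtime_sec * VM_PRICE_PER_SEC
    sls_cost = (runtime_sec + LAMBDA_COLD_START_SEC) * mem_gb * LAMBDA_PRICE_PER_GB_SEC
    return ('vm', vm_cost) if vm_cost <= sls_cost else ('serverless', sls_cost)

def place_workflow(tasks):
    """tasks: list of (name, runtime_sec, mem_gb); returns a placement dict."""
    return {name: place_task(rt, mem)[0] for name, rt, mem in tasks}
```

Under this sketch, a one-hour, 4 GB task lands on the VM while a one-second, 128 MB task lands on serverless; the real system additionally accounts for data transfer between the two platforms.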
Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck
Triangle Counting (TC) is a basic graph mining problem with numerous applications. However, the large size of real-world graphs has a severe effect on TC performance. This paper studies the TC algorithm from the perspective of memory utilization. We investigate the implications of the skewed degree distribution of real-world graphs on TC and make novel observations on how memory locality is negatively affected. Based on this, we introduce the LOTUS algorithm as a structure-aware TC that optimizes locality. The evaluation on 14 real-world graphs with up to 162 billion edges and on 3 different processor architectures of up to 128 cores shows that Lotus is 2.2--5.5X faster than previous works.
{"title":"LOTUS: locality optimizing triangle counting","authors":"Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck","doi":"10.1145/3503221.3508402","DOIUrl":"https://doi.org/10.1145/3503221.3508402","url":null,"abstract":"Triangle Counting (TC) is a basic graph mining problem with numerous applications. However, the large size of real-world graphs has a severe effect on TC performance. This paper studies the TC algorithm from the perspective of memory utilization. We investigate the implications of the skewed degree distribution of real-world graphs on TC and make novel observations on how memory locality is negatively affected. Based on this, we introduce the LOTUS algorithm as a structure-aware TC that optimizes locality. The evaluation on 14 real-world graphs with up to 162 billion edges and on 3 different processor architectures of up to 128 cores shows that Lotus is 2.2--5.5X faster than previous works.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131074279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
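For context on the kernel that LOTUS optimizes, here is the standard "forward" triangle-counting baseline: orient each edge from the lower-ranked to the higher-ranked endpoint (rank by degree), then intersect out-neighbor sets. This is only the textbook algorithm, not LOTUS's structure-aware, locality-optimized variant.

```python
def count_triangles(adj):
    """Baseline 'forward' triangle counting on an adjacency-list graph.
    Each triangle is counted exactly once because its three vertices are
    visited in a fixed degree-based order."""
    n = len(adj)
    rank = sorted(range(n), key=lambda v: (len(adj[v]), v))
    pos = {v: i for i, v in enumerate(rank)}
    # Keep only edges oriented from lower to higher rank.
    out = [set() for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            if pos[u] < pos[v]:
                out[u].add(v)
    # A triangle (u, v, w) is found as w common to out[u] and out[v].
    return sum(len(out[u] & out[v]) for u in range(n) for v in out[u])
```

The skewed degree distributions mentioned in the abstract make these set intersections cache-unfriendly on real-world graphs, which is the locality problem LOTUS targets.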
Da Yan, Wei Wang, X. Chu
We present GASS, an LLVM-based open-source compiler for NVIDIA GPU's SASS machine assembly. GASS is the first open-source compiler targeting SASS, and it provides a unified toolchain for currently fragmented low-level performance research on NVIDIA GPUs. GASS supports all recent architectures, including Volta, Turing, and Ampere. Our evaluation shows that our specialized optimizations deliver significant speedup over LLVM's algorithms.
{"title":"An LLVM-based open-source compiler for NVIDIA GPUs","authors":"Da Yan, Wei Wang, X. Chu","doi":"10.1145/3503221.3508428","DOIUrl":"https://doi.org/10.1145/3503221.3508428","url":null,"abstract":"We present GASS, an LLVM-based open-source compiler for NVIDIA GPU's SASS machine assembly. GASS is the first open-source compiler targeting SASS, and it provides a unified toolchain for currently fragmented low-level performance research on NVIDIA GPUs. GASS supports all recent architectures, including Volta, Turing, and Ampere. Our evaluation shows that our specialized optimizations deliver significant speedup over LLVM's algorithms.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122759764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yuyang Jin, Haojie Wang, Runxin Zhong, Chen Zhang, Jidong Zhai
Performance analysis is widely used to identify performance issues in parallel applications. However, complex communication and data dependences, as well as interactions between different kinds of performance issues, make high-efficiency performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis. Implementing each such analysis normally requires significant human effort and domain knowledge. To reduce the burden of implementing accurate performance analyses, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance-analysis sub-tasks, called passes, which can either be provided by PerFlow's built-in analysis library or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and we leverage various graph algorithms on it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks. In addition, PerFlow is able to perform analysis and locate performance bugs automatically and effectively.
{"title":"PerFlow","authors":"Yuyang Jin, Haojie Wang, Runxin Zhong, Chen Zhang, Jidong Zhai","doi":"10.1145/3503221.3508405","DOIUrl":"https://doi.org/10.1145/3503221.3508405","url":null,"abstract":"Performance analysis is widely used to identify performance issues in parallel applications. However, complex communication and data dependences, as well as interactions between different kinds of performance issues, make high-efficiency performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis. Implementing each such analysis normally requires significant human effort and domain knowledge. To reduce the burden of implementing accurate performance analyses, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of the main performance-analysis sub-tasks, called passes, which can either be provided by PerFlow's built-in analysis library or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution, and we leverage various graph algorithms on it to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks. In addition, PerFlow is able to perform analysis and locate performance bugs automatically and effectively.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126483607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
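The pass-pipeline idea in the abstract above can be sketched as a composition of analysis functions over a shared state. The pass names, profile format, and thresholds below are invented for illustration; PerFlow's real passes operate on its Program Abstraction Graph.

```python
def hotspot_pass(state):
    """Keep only functions consuming >10% of total time (assumed threshold)."""
    total = sum(state['profile'].values())
    state['hotspots'] = {f: t for f, t in state['profile'].items() if t / total > 0.10}
    return state

def imbalance_pass(state):
    """Flag hotspots whose per-rank times vary widely (assumed 1.5x metric)."""
    state['imbalanced'] = [f for f in state['hotspots']
                           if max(state['per_rank'][f]) > 1.5 * min(state['per_rank'][f])]
    return state

def run_pipeline(state, passes):
    """A linear dataflow of passes: each pass refines the analysis state."""
    for p in passes:
        state = p(state)
    return state
```

Chaining `hotspot_pass` into `imbalance_pass` mirrors how a developer would wire built-in and custom passes into one analysis dataflow.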
Hongyu Fan, Weiting Liu, Fei He
Concurrent program verification is challenging due to the large number of thread interferences. A popular approach is to encode concurrent programs as SMT formulas and then rely on off-the-shelf SMT solvers to accomplish the verification. In most existing work, the SMT solver is simply treated as the backend; there is little research on improving SMT solving for concurrent program verification. In this paper, we recognize the characteristics of the interference relation in multi-threaded programs and propose a novel approach for utilizing the interference relation in the SMT solving of multi-threaded program verification under various memory models. We show that the backend SMT solver can benefit substantially from the domain knowledge of concurrent programs. We implemented our approach in a prototype tool called Zpre and compared it with the state-of-the-art Z3 tool on benchmarks from the ConcurrencySafety category of SV-COMP 2019. Experimental results show promising improvements attributable to our approach.
{"title":"Interference relation-guided SMT solving for multi-threaded program verification","authors":"Hongyu Fan, Weiting Liu, Fei He","doi":"10.1145/3503221.3508424","DOIUrl":"https://doi.org/10.1145/3503221.3508424","url":null,"abstract":"Concurrent program verification is challenging due to the large number of thread interferences. A popular approach is to encode concurrent programs as SMT formulas and then rely on off-the-shelf SMT solvers to accomplish the verification. In most existing work, the SMT solver is simply treated as the backend; there is little research on improving SMT solving for concurrent program verification. In this paper, we recognize the characteristics of the interference relation in multi-threaded programs and propose a novel approach for utilizing the interference relation in the SMT solving of multi-threaded program verification under various memory models. We show that the backend SMT solver can benefit substantially from the domain knowledge of concurrent programs. We implemented our approach in a prototype tool called Zpre and compared it with the state-of-the-art Z3 tool on benchmarks from the ConcurrencySafety category of SV-COMP 2019. Experimental results show promising improvements attributable to our approach.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115839715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
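The interference problem the abstract above describes can be shown in miniature by exhaustively checking every sequentially consistent interleaving of two tiny threads for a forbidden outcome. The paper instead encodes such programs as SMT formulas; this brute-force enumerator only illustrates the interleaving space a solver must reason about. The store-buffering program below is a standard litmus test, not an example from the paper (under weaker memory models, which the paper also targets, the outcome forbidden here becomes observable).

```python
def interleavings(a, b):
    """Yield every merge of instruction lists a and b that preserves the
    program order of each thread."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(seq):
    """Execute moves (dst, src): src is a shared-variable name or a constant."""
    s = {'x': 0, 'y': 0, 'r1': 0, 'r2': 0}
    for dst, src in seq:
        s[dst] = s[src] if isinstance(src, str) else src
    return s

# Thread 1: x = 1; r1 = y        Thread 2: y = 1; r2 = x
T1 = [('x', 1), ('r1', 'y')]
T2 = [('y', 1), ('r2', 'x')]
outcomes = [run(seq) for seq in interleavings(T1, T2)]
```

Even this 2x2-instruction program has six interleavings; the combinatorial growth of this space is exactly why interference-aware solving matters.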
I. Kuraj, Armando Solar-Lezama, N. Polikarpova
We present a framework that allows programmers to specify replicated data stores through application logic, a data replication scheme, and high-level invariants that need to be satisfied. From such specifications, all the needed consistency requirements can be inferred from traces of executions of the prospective data store, in order to determine the optimal data store coordination. The framework supports arbitrarily complex data store operations and partial data replication. This leads to expressiveness for a wide range of data stores, with significant run-time performance benefits.
{"title":"Optimizing consistency for partially replicated data stores","authors":"I. Kuraj, Armando Solar-Lezama, N. Polikarpova","doi":"10.1145/3503221.3508438","DOIUrl":"https://doi.org/10.1145/3503221.3508438","url":null,"abstract":"We present a framework that allows programmers to specify replicated data stores through application logic, a data replication scheme, and high-level invariants that need to be satisfied. From such specifications, all the needed consistency requirements can be inferred from traces of executions of the prospective data store, in order to determine the optimal data store coordination. The framework supports arbitrarily complex data store operations and partial data replication. This leads to expressiveness for a wide range of data stores, with significant run-time performance benefits.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116273282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
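One ingredient of the kind of inference described above, sketched under simplifying assumptions: operations that commute on all (sampled) replica states can execute on different replicas without coordination, while non-commuting pairs need it. The real framework infers requirements from execution traces and invariants; the counter operations here are invented for illustration.

```python
def commutes(op1, op2, states):
    """True if op1;op2 and op2;op1 agree on every sampled state."""
    return all(op1(op2(s)) == op2(op1(s)) for s in states)

def incr(s):
    # Increment commutes with itself: order never matters.
    return {**s, 'count': s['count'] + 1}

def reset(s):
    # Reset does not commute with increment: reset-then-incr yields 1,
    # incr-then-reset yields 0.
    return {**s, 'count': 0}

sample_states = [{'count': c} for c in range(5)]
```

A coordination planner built on this test would let concurrent increments proceed uncoordinated but serialize resets against increments.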
Junmin Xiao, Qing Xue, Hui Ma, Xiaoyang Zhang, Guangming Tan
As a fundamental factorization operation, the singular value decomposition (SVD) plays a paramount role in a broad range of domains such as scientific computing and machine learning. Because the factorization of many small matrices is a computational bottleneck in real-world applications, many GPU-accelerated batched SVD algorithms have been investigated recently. However, these algorithms fail to achieve a balance between data locality and parallelism because their workflows depend on the size of each matrix. In this work, we propose a matrix-size-independent W-cycle algorithm to accelerate the batched one-sided Jacobi SVD on GPUs, which successfully strikes this balance between data locality and parallelism. Experimental evaluation demonstrates that the proposed algorithm achieves a 4.5X performance speedup on average over the state-of-the-art cuSOLVER.
{"title":"A W-cycle algorithm for efficient batched SVD on GPUs","authors":"Junmin Xiao, Qing Xue, Hui Ma, Xiaoyang Zhang, Guangming Tan","doi":"10.1145/3503221.3508443","DOIUrl":"https://doi.org/10.1145/3503221.3508443","url":null,"abstract":"As a fundamental factorization operation, the singular value decomposition (SVD) plays a paramount role in a broad range of domains such as scientific computing and machine learning. Because the factorization of many small matrices is a computational bottleneck in real-world applications, many GPU-accelerated batched SVD algorithms have been investigated recently. However, these algorithms fail to achieve a balance between data locality and parallelism because their workflows depend on the size of each matrix. In this work, we propose a matrix-size-independent W-cycle algorithm to accelerate the batched one-sided Jacobi SVD on GPUs, which successfully strikes this balance between data locality and parallelism. Experimental evaluation demonstrates that the proposed algorithm achieves a 4.5X performance speedup on average over the state-of-the-art cuSOLVER.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131969723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
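For reference, the scalar kernel behind the batched method above is the one-sided Jacobi SVD: rotate pairs of columns until all columns are mutually orthogonal, at which point the column norms are the singular values. This is only the textbook single-matrix kernel in pure Python; a batched GPU version runs many such small problems in parallel, which is where the paper's W-cycle scheduling applies.

```python
import math

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Return the singular values of a small m x n matrix (list of rows)
    via one-sided Jacobi rotations on column pairs."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # column-major copy
    for _ in range(max_sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                a = sum(x * x for x in cols[p])
                b = sum(x * x for x in cols[q])
                c = sum(x * y for x, y in zip(cols[p], cols[q]))
                off = max(off, abs(c))
                if abs(c) <= tol:
                    continue
                # Jacobi rotation that zeroes the inner product of cols p, q.
                zeta = (b - a) / (2.0 * c)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / math.sqrt(1.0 + t * t)
                sn = cs * t
                for i in range(m):
                    pi, qi = cols[p][i], cols[q][i]
                    cols[p][i] = cs * pi - sn * qi
                    cols[q][i] = sn * pi + cs * qi
        if off <= tol:   # all column pairs already orthogonal
            break
    return sorted(math.sqrt(sum(x * x for x in col)) for col in cols)
```

Each rotation is a right-multiplication by an orthogonal matrix, so singular values are preserved while off-diagonal inner products are driven to zero.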
Shihui Song, Peng Jiang
Existing Graph Neural Network (GNN) systems adopt graph partitioning to divide the graph data for multi-GPU training. Although they support large graphs, we find that the existing techniques incur large data loading overhead. In this work, we are the first to model the data movement overhead between the CPU and GPUs in GNN training. Based on this performance model, we provide an efficient algorithm to divide and distribute the graph data onto multiple GPUs so that the data loading time is minimized. Experiments show that our technique achieves shorter data loading times than the existing graph partitioning methods.
{"title":"Rethinking graph data placement for graph neural network training on multiple GPUs","authors":"Shihui Song, Peng Jiang","doi":"10.1145/3503221.3508435","DOIUrl":"https://doi.org/10.1145/3503221.3508435","url":null,"abstract":"Existing Graph Neural Network (GNN) systems adopt graph partitioning to divide the graph data for multi-GPU training. Although they support large graphs, we find that the existing techniques incur large data loading overhead. In this work, we are the first to model the data movement overhead between the CPU and GPUs in GNN training. Based on this performance model, we provide an efficient algorithm to divide and distribute the graph data onto multiple GPUs so that the data loading time is minimized. Experiments show that our technique achieves shorter data loading times than the existing graph partitioning methods.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
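The flavor of modeling loading cost before placement can be sketched with a simplified assumption: the data a GPU loads is the set of distinct vertices its assigned training nodes need (their 1-hop neighborhoods), so each node goes to the GPU where it adds the fewest new vertices. This greedy heuristic is not the paper's algorithm, just an illustration of optimizing placement under an explicit data movement model.

```python
def place(train_nodes, adj, num_gpus):
    """Greedily assign training nodes to GPUs, minimizing the marginal
    number of new vertices each GPU must load (ties broken by smaller
    current holdings)."""
    holdings = [set() for _ in range(num_gpus)]
    assign = {}
    for v in train_nodes:
        need = {v} | set(adj[v])  # node plus its 1-hop neighborhood
        g = min(range(num_gpus),
                key=lambda g: (len(need - holdings[g]), len(holdings[g])))
        holdings[g] |= need
        assign[v] = g
    return assign, holdings
```

Nodes with overlapping neighborhoods end up co-located, so shared vertices are loaded once instead of once per GPU.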
Zhen Xie, Jie Liu, Sam Ma, Jiajia Li, Dong Li
The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution for memory-consuming HPC applications. However, data objects must be migrated across HM wisely to achieve high performance. In this work, we introduce a load balance-aware page management system, named LB-HM. LB-HM introduces task semantics during memory profiling, rather than being application-agnostic. Evaluating a set of memory-consuming HPC applications, we show that LB-HM reduces existing load imbalance and delivers an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement compared with a hardware-based solution and an industry-quality software-based solution, respectively, on Optane-based HM.
{"title":"LB-HM: load balance-aware data placement on heterogeneous memory for task-parallel HPC applications","authors":"Zhen Xie, Jie Liu, Sam Ma, Jiajia Li, Dong Li","doi":"10.1145/3503221.3508406","DOIUrl":"https://doi.org/10.1145/3503221.3508406","url":null,"abstract":"The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution for memory-consuming HPC applications. However, data objects must be migrated across HM wisely to achieve high performance. In this work, we introduce a load balance-aware page management system, named LB-HM. LB-HM introduces task semantics during memory profiling, rather than being application-agnostic. Evaluating a set of memory-consuming HPC applications, we show that LB-HM reduces existing load imbalance and delivers an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement compared with a hardware-based solution and an industry-quality software-based solution, respectively, on Optane-based HM.","PeriodicalId":398609,"journal":{"name":"Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"41 16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131145719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
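A baseline tier-placement heuristic helps frame what LB-HM improves on: rank data objects by access "hotness" per byte and fill fast memory greedily under a capacity budget. LB-HM goes further by using task semantics to balance load across tasks; the objects, sizes, and hotness numbers below are invented for illustration.

```python
def place_objects(objects, fast_capacity):
    """objects: list of (name, size, hotness). Greedily place the objects
    with the highest hotness-per-byte into fast memory until the capacity
    budget is exhausted; everything else goes to slow memory."""
    fast, slow, used = set(), set(), 0
    for name, size, hot in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
        if used + size <= fast_capacity:
            fast.add(name)
            used += size
        else:
            slow.add(name)
    return fast, slow
```

Being application-agnostic, this baseline can still leave some tasks starved of fast-memory bandwidth even when total hotness is well placed, which is the load imbalance LB-HM's task-semantic profiling targets.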