
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming: Latest Publications

Mashup: making serverless computing useful for HPC workflows via hybrid execution
Rohan Basu Roy, Tirthak Patel, V. Gadepally, Devesh Tiwari
This work introduces Mashup, a novel strategy that leverages the serverless computing model to execute scientific workflows in a hybrid fashion, taking advantage of both the traditional VM-based cloud computing platform and the emerging serverless platform. Mashup outperforms state-of-the-art workflow execution engines on widely used HPC workflows on the Amazon cloud platform (EC2 and Lambda), reducing execution time by an average of 34% and cost by an average of 43%.
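The abstract does not include code; as a rough illustration of the hybrid-execution idea, the Python sketch below picks, for each workflow stage, either a serverless tier or a VM tier from simple per-stage runtime and memory estimates. The price constants, resource limits, stage attributes, and the greedy rule are all assumptions for illustration, not Mashup's actual placement model.

```python
# Illustrative sketch only: a greedy per-stage placement between a serverless
# tier and a VM tier. Prices, limits, and the decision rule are assumptions,
# not Mashup's actual model.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    est_runtime_s: float   # estimated stage runtime in seconds
    mem_gb: float          # peak memory footprint in GB

# Hypothetical price/limit constants (not real AWS pricing).
LAMBDA_GB_SECOND = 0.0000167   # $/GB-s, assumed
LAMBDA_MAX_RUNTIME = 900.0     # seconds, assumed cap for a serverless function
LAMBDA_MAX_MEM_GB = 10.0       # assumed memory cap per function
VM_HOURLY = 0.20               # $/hour for an assumed VM type

def serverless_cost(stage: Stage) -> float:
    return stage.est_runtime_s * stage.mem_gb * LAMBDA_GB_SECOND

def vm_cost(stage: Stage) -> float:
    return stage.est_runtime_s / 3600.0 * VM_HOURLY

def place(stages):
    """Assign each stage to 'serverless' or 'vm' by feasibility, then by cost."""
    plan = {}
    for s in stages:
        fits = s.est_runtime_s <= LAMBDA_MAX_RUNTIME and s.mem_gb <= LAMBDA_MAX_MEM_GB
        if fits and serverless_cost(s) <= vm_cost(s):
            plan[s.name] = "serverless"
        else:
            plan[s.name] = "vm"
    return plan

if __name__ == "__main__":
    workflow = [Stage("preprocess", 60, 1.5), Stage("simulate", 5400, 32.0),
                Stage("postprocess", 120, 2.0)]
    print(place(workflow))  # short, small stages go serverless; the long one goes to a VM
```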
DOI: 10.1145/3503221.3508407 | Published: 2022-04-02
Citations: 15
LOTUS: locality optimizing triangle counting
Mohsen Koohi Esfahani, Peter Kilpatrick, H. Vandierendonck
Triangle Counting (TC) is a basic graph mining problem with numerous applications. However, the large size of real-world graphs severely affects TC performance. This paper studies the TC algorithm from the perspective of memory utilization. We investigate the implications of the skewed degree distribution of real-world graphs on TC and make novel observations on how memory locality is negatively affected. Based on this, we introduce LOTUS, a structure-aware TC algorithm that optimizes locality. An evaluation on 14 real-world graphs with up to 162 billion edges, on three processor architectures with up to 128 cores, shows that LOTUS is 2.2--5.5X faster than previous work.
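For background on the kernel that LOTUS optimizes, here is a minimal reference triangle-counting routine that orients edges by a degree ranking and intersects neighbor lists per edge. It exhibits the irregular memory accesses whose locality LOTUS targets, but reproduces none of LOTUS's structure-aware layout or parallelization; the dictionary-of-sets graph format is an assumption.

```python
# Minimal reference triangle counting (not LOTUS): orient edges from lower- to
# higher-ranked vertices, then count common out-neighbors per edge.
def count_triangles(adj):
    """adj: dict mapping vertex -> set of neighbor vertices (undirected graph)."""
    # Rank vertices by degree so each triangle is counted exactly once.
    rank = {v: i for i, v in enumerate(sorted(adj, key=lambda v: (len(adj[v]), v)))}
    # Directed adjacency: keep only higher-ranked neighbors, sorted by rank.
    out = {v: sorted((u for u in adj[v] if rank[u] > rank[v]), key=rank.get)
           for v in adj}
    triangles = 0
    for v in out:
        for u in out[v]:
            # Intersect out[v] and out[u]; each common vertex closes a triangle.
            triangles += len(set(out[v]) & set(out[u]))
    return triangles

if __name__ == "__main__":
    # A 4-clique contains 4 triangles.
    g = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
    print(count_triangles(g))  # -> 4
```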
DOI: 10.1145/3503221.3508402 | Published: 2022-04-02
Citations: 5
An LLVM-based open-source compiler for NVIDIA GPUs
Da Yan, Wei Wang, X. Chu
We present GASS, an LLVM-based open-source compiler for NVIDIA GPU's SASS machine assembly. GASS is the first open-source compiler targeting SASS, and it provides a unified toolchain for currently fragmented low-level performance research on NVIDIA GPUs. GASS supports all recent architectures, including Volta, Turing, and Ampere. Our evaluation shows that our specialized optimizations deliver significant speedup over LLVM's algorithms.
DOI: 10.1145/3503221.3508428 | Published: 2022-03-28
Citations: 1
PerFlow
Yuyang Jin, Haojie Wang, Runxin Zhong, Chen Zhang, Jidong Zhai
Performance analysis is widely used to identify performance issues of parallel applications. However, complex communication and data dependences, as well as the interactions between different kinds of performance issues, make high-efficiency performance analysis even harder. Although a large number of performance tools have been designed, accurately pinpointing the root causes of such complex performance issues still requires specific in-depth analysis, and implementing each such analysis normally demands significant human effort and domain knowledge. To reduce the burden of implementing accurate performance analysis, we propose a domain-specific programming framework named PerFlow. PerFlow abstracts the step-by-step process of performance analysis as a dataflow graph. This dataflow graph consists of main performance analysis sub-tasks, called passes, which can either be provided by PerFlow's built-in analysis library or be implemented by developers to meet their requirements. Moreover, to achieve effective analysis, we propose a Program Abstraction Graph to represent the performance of a program execution and then leverage various graph algorithms to automate the analysis. We demonstrate the efficacy of PerFlow through three case studies of real-world applications with up to 700K lines of code. Results show that PerFlow significantly eases the implementation of customized analysis tasks. In addition, PerFlow is able to perform analysis and locate performance bugs automatically and effectively.
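PerFlow's actual API is not shown in the abstract; the sketch below only illustrates the general idea of composing performance-analysis passes into a dataflow over profiling records. The pass names, record fields, thresholds, and the linear pipeline (rather than a full dataflow graph) are all illustrative assumptions.

```python
# Illustrative pass-based analysis pipeline (not PerFlow's API): each pass maps
# a performance dataset to a filtered/annotated dataset.
def hotspot_pass(records):
    """Keep the records responsible for the most time (top 20% here, assumed)."""
    ranked = sorted(records, key=lambda r: r["time"], reverse=True)
    return ranked[: max(1, len(ranked) // 5)]

def imbalance_pass(records):
    """Flag call sites whose per-process times differ widely (threshold assumed)."""
    flagged = []
    for r in records:
        times = r["per_process_time"]
        if max(times) > 1.5 * (sum(times) / len(times)):
            flagged.append({**r, "issue": "load imbalance"})
    return flagged

def run_pipeline(records, passes):
    for p in passes:          # a linear dataflow; a real tool would allow a DAG
        records = p(records)
    return records

if __name__ == "__main__":
    data = [
        {"site": "mpi_waitall", "time": 9.0, "per_process_time": [1.0, 1.2, 4.0, 2.8]},
        {"site": "compute_loop", "time": 5.0, "per_process_time": [1.2, 1.3, 1.2, 1.3]},
        {"site": "io_write", "time": 0.5, "per_process_time": [0.1, 0.1, 0.2, 0.1]},
    ]
    print(run_pipeline(data, [hotspot_pass, imbalance_pass]))
```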
DOI: 10.1145/3503221.3508405 | Published: 2022-03-28
Citations: 3
Interference relation-guided SMT solving for multi-threaded program verification
Hongyu Fan, Weiting Liu, Fei He
Concurrent program verification is challenging due to the large number of thread interferences. A popular approach is to encode concurrent programs as SMT formulas and then rely on off-the-shelf SMT solvers to accomplish the verification. In most existing works, the SMT solver is simply treated as a backend, and there is little research on improving SMT solving for concurrent program verification. In this paper, we recognize the characteristics of the interference relation in multi-threaded programs and propose a novel approach that utilizes the interference relation in SMT solving for multi-threaded program verification under various memory models. We show that the backend SMT solver can benefit greatly from this domain knowledge of concurrent programs. We implemented our approach in a prototype tool called Zpre and compared it with the state-of-the-art Z3 tool on credible benchmarks from the ConcurrencySafety category of SV-COMP 2019. Experimental results show promising improvements attributed to our approach.
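The paper's interference-relation-guided encoding is not reproduced here; the sketch below is only a generic example of reducing a thread-interleaving question to SMT, using the Z3 Python bindings (z3-solver). It encodes the classic lost-update pattern, where two threads each read a shared counter and write back the value plus one, and asks whether a final value of 1 is reachable; the event-timestamp encoding is an assumption chosen for brevity.

```python
# Generic interleaving-as-SMT sketch (not the paper's encoding), using Z3.
from z3 import Ints, Int, Solver, If, Distinct

# Timestamps (positions in the interleaving) for the four events:
# thread 1 reads (tr1) then writes (tw1); thread 2 reads (tr2) then writes (tw2).
tr1, tw1, tr2, tw2 = Ints("tr1 tw1 tr2 tw2")
r1, r2 = Ints("r1 r2")        # values observed by each thread's read of x (x starts at 0)
x_final = Int("x_final")

s = Solver()
# Each event gets a distinct slot 0..3; program order holds within each thread.
s.add(Distinct(tr1, tw1, tr2, tw2))
for t in (tr1, tw1, tr2, tw2):
    s.add(t >= 0, t <= 3)
s.add(tr1 < tw1, tr2 < tw2)

# A read returns the value of the latest earlier write, or 0 (the initial value).
# Thread 1's read can only see thread 2's write (its own write comes later).
s.add(r1 == If(tw2 < tr1, r2 + 1, 0))
s.add(r2 == If(tw1 < tr2, r1 + 1, 0))

# The final value of x is produced by whichever write comes last.
s.add(x_final == If(tw1 > tw2, r1 + 1, r2 + 1))

# Is the "lost update" outcome x == 1 reachable under some interleaving?
s.add(x_final == 1)
print(s.check())              # sat: both reads happen before both writes
print(s.model())
```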
DOI: 10.1145/3503221.3508424 | Published: 2022-03-28
Citations: 5
Optimizing consistency for partially replicated data stores
I. Kuraj, Armando Solar-Lezama, N. Polikarpova
We present a framework that allows programmers to specify replicated data stores through application logic, a data replication scheme, and high-level invariants that need to be satisfied. From such specifications, all the needed consistency requirements can be inferred from traces of executions of the potential data store, in order to determine the optimal data store coordination. The framework supports arbitrarily complex data store operations and partial data replication. This provides expressiveness for a wide range of data stores, with significant run-time performance benefits.
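The framework itself is not spelled out in the abstract; as a toy illustration of why some operation pairs need coordination and others do not, the sketch below checks whether two operations, each validated against the same starting state as uncoordinated replicas would, can jointly violate a high-level invariant. Checking a single given state (rather than all reachable states or full execution traces) and the bank-account example are simplifying assumptions.

```python
# Illustrative only (not the paper's framework): two operations need coordination
# if executing them "concurrently" -- each guard checked against the same starting
# state, then both effects applied -- can break the invariant.
def needs_coordination(state, op_a, op_b, invariant):
    if op_a.guard(state) and op_b.guard(state):
        merged = op_b.effect(op_a.effect(dict(state)))
        if not invariant(merged):
            return True
    return False

class Withdraw:
    def __init__(self, amount):
        self.amount = amount
    def guard(self, state):
        return state["balance"] >= self.amount
    def effect(self, state):
        state["balance"] -= self.amount
        return state

if __name__ == "__main__":
    non_negative = lambda s: s["balance"] >= 0
    start = {"balance": 100}
    # Two withdrawals of 80 are each locally safe, but together they violate the
    # invariant, so they must be coordinated.
    print(needs_coordination(start, Withdraw(80), Withdraw(80), non_negative))  # True
    # Two withdrawals of 40 are jointly safe here and need no coordination.
    print(needs_coordination(start, Withdraw(40), Withdraw(40), non_negative))  # False
```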
DOI: 10.1145/3503221.3508438 | Published: 2022-03-28
Citations: 0
A W-cycle algorithm for efficient batched SVD on GPUs
Junmin Xiao, Qing Xue, Hui Ma, Xiaoyang Zhang, Guangming Tan
As a fundamental factorization operation, the singular value decomposition (SVD) plays a paramount role in a broad range of domains such as scientific computing and machine learning. Because the factorization of many small matrices is a computational bottleneck in real-world applications, many GPU-accelerated batched SVD algorithms have been investigated recently. However, these algorithms fail to balance data locality and parallelism because their workflows depend on the size of each matrix. In this work, we propose a matrix-size-independent W-cycle algorithm to accelerate the batched one-sided Jacobi SVD on GPUs, which successfully strikes a balance between data locality and parallelism. The experimental evaluation demonstrates that the proposed algorithm achieves a 4.5X performance speedup on average over the state-of-the-art cuSOLVER.
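As a reference for the per-matrix kernel that the paper batches, the NumPy sketch below implements the classical one-sided (Hestenes) Jacobi SVD for a single tall matrix. It shows the column-pair rotations only; the W-cycle scheduling, batching, and GPU implementation are not reproduced, and the tolerance and sweep limit are assumed.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Reference one-sided (Hestenes) Jacobi SVD for a single m x n matrix, m >= n."""
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                ai, aj = A[:, i], A[:, j]
                a, b, c = ai @ ai, aj @ aj, ai @ aj
                if abs(c) <= tol * np.sqrt(a * b):
                    continue          # columns i and j are already orthogonal enough
                converged = False
                # Plane rotation that orthogonalizes columns i and j.
                zeta = (b - a) / (2.0 * c)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / np.sqrt(1.0 + t * t)
                sn = cs * t
                A[:, i], A[:, j] = cs * ai - sn * aj, sn * ai + cs * aj
                vi, vj = V[:, i].copy(), V[:, j].copy()
                V[:, i], V[:, j] = cs * vi - sn * vj, sn * vi + cs * vj
        if converged:
            break
    sigma = np.linalg.norm(A, axis=0)   # singular values are the column norms
    U = A / sigma                       # normalize columns to get U
    return U, sigma, V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((6, 4))
    U, s, V = one_sided_jacobi_svd(M)
    print(np.allclose(U * s @ V.T, M))  # True: M == U diag(s) V^T
```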
DOI: 10.1145/3503221.3508443 | Published: 2022-03-28
Citations: 1
Rethinking graph data placement for graph neural network training on multiple GPUs
Shihui Song, Peng Jiang
Existing Graph Neural Network (GNN) systems adopt graph partitioning to divide the graph data for multi-GPU training. Although they support large graphs, we find that the existing techniques incur large data loading overhead. In this work, we model, for the first time, the data movement overhead between the CPU and GPUs in GNN training. Based on this performance model, we provide an efficient algorithm to divide and distribute the graph data onto multiple GPUs so that the data loading time is minimized. Experiments show that our technique achieves shorter data loading time than existing graph partitioning methods.
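The paper's performance model and placement algorithm are not detailed in the abstract; the sketch below only conveys the flavor of cost-aware placement: it greedily assigns graph partitions to the GPU that already holds most of their cross-edge neighbors, subject to an assumed per-GPU capacity. The partition sizes, edge counts, capacity, and the greedy rule are all illustrative assumptions.

```python
# Illustrative greedy placement (not the paper's algorithm): put each partition
# on the GPU that currently holds the most of its cross-edge neighbors, subject
# to a per-GPU capacity budget. Sizes, capacities, and edge counts are made up.
def place_partitions(parts, cross_edges, num_gpus, capacity):
    """
    parts: dict partition_id -> size (e.g., feature bytes)
    cross_edges: dict (p, q) with p < q -> number of edges between partitions p and q
    """
    load = [0] * num_gpus
    assignment = {}
    # Place large partitions first: they constrain capacity the most.
    for p in sorted(parts, key=parts.get, reverse=True):
        best_gpu, best_score = None, None
        for g in range(num_gpus):
            if load[g] + parts[p] > capacity:
                continue
            # Affinity: cross-edges to partitions already on GPU g stay local.
            affinity = sum(cross_edges.get((min(p, q), max(p, q)), 0)
                           for q, gq in assignment.items() if gq == g)
            score = (affinity, -load[g])   # prefer affinity, then the lighter GPU
            if best_score is None or score > best_score:
                best_gpu, best_score = g, score
        if best_gpu is None:
            raise ValueError(f"partition {p} does not fit on any GPU")
        assignment[p] = best_gpu
        load[best_gpu] += parts[p]
    return assignment, load

if __name__ == "__main__":
    parts = {0: 6, 1: 5, 2: 4, 3: 3}                      # sizes in GB (assumed)
    cross = {(0, 1): 900, (0, 2): 50, (1, 3): 40, (2, 3): 800}
    print(place_partitions(parts, cross, num_gpus=2, capacity=10))
```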
DOI: 10.1145/3503221.3508435 | Published: 2022-03-28
Citations: 6
LB-HM: load balance-aware data placement on heterogeneous memory for task-parallel HPC applications
Zhen Xie, Jie Liu, Sam Ma, Jiajia Li, Dong Li
The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution for memory-consuming HPC applications. However, when using HM, wisely migrating data objects on it is critical for high performance. In this work, we introduce a load balance-aware page management system named LB-HM. LB-HM introduces task semantics during memory profiling, rather than being application-agnostic. Evaluating with a set of memory-consuming HPC applications, we show that LB-HM reduces existing load imbalance and delivers an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement over a hardware-based solution and an industry-quality software-based solution on Optane-based HM, respectively.
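LB-HM's profiling and placement policy are not given in the abstract; the sketch below is a toy stand-in that keeps the hottest data objects in fast memory while capping each task's share of fast memory so that no task is starved, which is the load-balance intuition in a much simplified form. The capacities, access counts, per-task cap, and the hotness metric are assumptions.

```python
# Illustrative sketch (not LB-HM's algorithm): place the hottest data objects in
# fast memory, but cap how much fast memory any one task's objects may take.
def place_objects(objects, fast_capacity, per_task_cap):
    """
    objects: list of dicts with 'name', 'task', 'size', 'accesses'.
    Returns {name: 'fast' | 'slow'}.
    """
    used_fast = 0
    used_by_task = {}
    placement = {}
    # Hottest bytes first: rank objects by accesses per unit size.
    for obj in sorted(objects, key=lambda o: o["accesses"] / o["size"], reverse=True):
        task_used = used_by_task.get(obj["task"], 0)
        fits_fast = (used_fast + obj["size"] <= fast_capacity and
                     task_used + obj["size"] <= per_task_cap)
        if fits_fast:
            placement[obj["name"]] = "fast"
            used_fast += obj["size"]
            used_by_task[obj["task"]] = task_used + obj["size"]
        else:
            placement[obj["name"]] = "slow"
    return placement

if __name__ == "__main__":
    objs = [
        {"name": "A", "task": "t0", "size": 4, "accesses": 4000},
        {"name": "B", "task": "t0", "size": 4, "accesses": 3000},
        {"name": "C", "task": "t1", "size": 4, "accesses": 2500},
        {"name": "D", "task": "t1", "size": 4, "accesses": 100},
    ]
    # 8 GB of fast memory, at most 4 GB of it per task (both assumed).
    print(place_objects(objs, fast_capacity=8, per_task_cap=4))
```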
DOI: 10.1145/3503221.3508406 | Published: 2022-03-28
Citations: 1