A Practical Approach for Performance Analysis of Shared-Memory Programs

2011 IEEE International Parallel & Distributed Processing Symposium, Pub Date: 2011-05-16, DOI: 10.1109/IPDPS.2011.68

B. Tudor, Y. M. Teo


Citations: 23

Abstract

Parallel programming has moved from HPC into the mainstream, enabled by a growing number of programming models, languages and methodologies, as well as the availability of multicore systems. However, performance analysis of parallel programs remains difficult, especially for large and complex programs or for applications developed using different programming models. This paper proposes a simple analytical model for studying the speedup of shared-memory programs on multicore systems. The proposed model derives the speedup, and the speedup loss caused by data dependency and memory overhead, for various configurations of threads, cores and memory access policies in UMA and NUMA systems. The model is practical because it uses only generally available and non-intrusive inputs derived from the trace of the operating system run-queue and from hardware event counters. On six OpenMP HPC dwarfs from the NPB benchmark, our model differs from measured results by 9% on average on UMA and 11% on NUMA. Our analysis shows that speedup loss is dominated by memory contention, especially for larger problem sizes. For the worst-performing structured grid dwarf on UMA, memory contention accounts for up to 99% of the speedup loss. Based on this insight, we apply our model to determine the optimal number of cores that alleviates memory contention, maximizing speedup and reducing execution time.
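
The quantities named in the abstract can be illustrated with a minimal sketch. It is not the paper's analytical model, which works from run-queue traces and hardware event counters. It only applies the textbook definition of speedup, S(n) = T(1) / T(n), to a set of execution times, and picks the core count with the shortest time. The function names, the definition of loss as the gap to ideal linear speedup, and the timings are all assumptions made for illustration.

```python
from typing import Dict


def speedup_table(times: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Compute speedup and speedup loss for each core count.

    `times` maps a core count n to an execution time in seconds and must
    contain an entry for n = 1 (the sequential baseline).
    Assumption: "loss" is taken here as the gap to ideal linear speedup,
    n - S(n); the paper may define speedup loss differently.
    """
    baseline = times[1]
    table = {}
    for n in sorted(times):
        speedup = baseline / times[n]
        table[n] = {"speedup": speedup, "loss": n - speedup}
    return table


def best_core_count(times: Dict[int, float]) -> int:
    """Return the smallest core count that achieves the minimum execution time."""
    return min(sorted(times), key=lambda n: times[n])


# Hypothetical timings (seconds) for a memory-bound kernel: adding cores
# beyond 4 increases contention and slows the run down.
example_times = {1: 100.0, 2: 55.0, 4: 40.0, 8: 50.0}
example_table = speedup_table(example_times)
example_best = best_core_count(example_times)
```

With timings like these, the fastest configuration is not the one with the most cores, which is the situation the abstract describes when memory contention dominates the speedup loss.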
