(Mis)understanding the NUMA memory system performance of multithreaded workloads

2013 IEEE International Symposium on Workload Characterization (IISWC), Pub Date: 2013-09-01, DOI: 10.1109/IISWC.2013.6704666

Z. Majó, T. Gross


Citations: 38

Abstract

An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA), performance critically depends on the distribution of data and computations. On systems with aggressive prefetcher units, the actual memory access patterns also have a large influence on performance. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard for processor prefetcher units to predict. The memory system performance observed for a given program on a specific platform also depends on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes, coupled with commonly available OS functionality, suffices to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.
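The abstract does not spell out the source-level changes, so the following is only a toy illustration of the general idea of program-level data sharing, not the authors' transformations. It is a minimal sketch that models memory as fixed-size pages and compares two hypothetical ways of assigning array elements to threads: a round-robin (interleaved) assignment, where every thread touches every page, and a contiguous block assignment, where each thread touches its own pages in sequential order. The function names and the `items_per_page` parameter are assumptions made for this example.

```python
from typing import Iterable, List, Set


def block_partition(n_items: int, n_threads: int) -> List[range]:
    """Give each thread one contiguous range of indices (hypothetical helper)."""
    base, extra = divmod(n_items, n_threads)
    ranges = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads take one additional item each.
        size = base + (1 if t < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def round_robin_partition(n_items: int, n_threads: int) -> List[range]:
    """Give thread t the indices t, t + n_threads, t + 2 * n_threads, ..."""
    return [range(t, n_items, n_threads) for t in range(n_threads)]


def pages_touched(indices: Iterable[int], items_per_page: int) -> Set[int]:
    """Page numbers covered by a sequence of element indices."""
    return {i // items_per_page for i in indices}


def shared_pages(assignments: List[range], items_per_page: int) -> Set[int]:
    """Pages that are touched by more than one thread."""
    seen: Set[int] = set()
    shared: Set[int] = set()
    for indices in assignments:
        pages = pages_touched(indices, items_per_page)
        shared |= seen & pages
        seen |= pages
    return shared


if __name__ == "__main__":
    n_items, n_threads, items_per_page = 4096, 4, 512  # assumed toy sizes
    rr = round_robin_partition(n_items, n_threads)
    blk = block_partition(n_items, n_threads)
    print("shared pages, round-robin:", len(shared_pages(rr, items_per_page)))
    print("shared pages, block:", len(shared_pages(blk, items_per_page)))
```

In this model the block assignment leaves no page shared between threads, so a placement policy that puts each page near the thread that uses it has something to work with, and each thread walks its pages sequentially. Whether a real program benefits depends on the algorithm, the hardware, and the OS policy in use.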


