(Mis)understanding the NUMA memory system performance of multithreaded workloads

Z. Majó, T. Gross
{"title":"(Mis)understanding the NUMA memory system performance of multithreaded workloads","authors":"Z. Majó, T. Gross","doi":"10.1109/IISWC.2013.6704666","DOIUrl":null,"url":null,"abstract":"An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence on performance on systems with aggressive prefetcher units. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: Programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard to predict by processor prefetcher units. The memory system performance as observed for a given program on a specific platform depends also on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.","PeriodicalId":365868,"journal":{"name":"2013 IEEE International Symposium on Workload Characterization (IISWC)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"38","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Workload Characterization (IISWC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IISWC.2013.6704666","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 38

Abstract

An important aspect of workload characterization is understanding memory system performance (i.e., understanding a workload's interaction with the memory system). On systems with a non-uniform memory architecture (NUMA) the performance critically depends on the distribution of data and computations. The actual memory access patterns have a large influence on performance on systems with aggressive prefetcher units. This paper describes an analysis of the memory system performance of multithreaded programs and shows that some programs are (unintentionally) structured so that they use the memory system of today's NUMA-multicores inefficiently: Programs exhibit program-level data sharing, a performance-limiting factor that makes data and computation distribution in NUMA systems difficult. Moreover, many programs have irregular memory access patterns that are hard to predict by processor prefetcher units. The memory system performance as observed for a given program on a specific platform depends also on many algorithm and implementation decisions. The paper shows that a set of simple algorithmic changes coupled with commonly available OS functionality suffice to eliminate data sharing and to regularize the memory access patterns for a subset of the PARSEC parallel benchmarks. These simple source-level changes result in performance improvements of up to 3.1X, but more importantly, they lead to a fairer and more accurate performance evaluation on NUMA-multicore systems. They also illustrate the importance of carefully considering all details of algorithms and architectures to avoid drawing incorrect conclusions.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
(2)对多线程工作负载下NUMA内存系统性能的理解不足
工作负载表征的一个重要方面是理解内存系统性能(即,理解工作负载与内存系统的交互)。在具有非统一内存架构(NUMA)的系统上,性能严重依赖于数据和计算的分布。实际的内存访问模式对具有主动预取器单元的系统的性能有很大的影响。本文对多线程程序的内存系统性能进行了分析,并指出一些程序的结构(无意中)导致它们不能有效地使用当前NUMA-多核的内存系统。程序表现出程序级的数据共享,这是一个性能限制因素,使数据和计算在NUMA系统中的分布变得困难。此外,许多程序具有不规则的内存访问模式,处理器预取器单元很难预测。在特定平台上观察到的给定程序的内存系统性能还取决于许多算法和实现决策。本文表明,一组简单的算法更改加上常用的操作系统功能足以消除数据共享,并使PARSEC并行基准测试子集的内存访问模式规范化。这些简单的源代码级更改导致性能提高高达3.1倍,但更重要的是,它们导致在numa多核系统上进行更公平、更准确的性能评估。它们还说明了仔细考虑算法和架构的所有细节以避免得出错误结论的重要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Pannotia: Understanding irregular GPGPU graph applications Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite - MobileBench Power and performance of GPU-accelerated systems: A closer look Hardware-independent application characterization Performance implications of System Management Mode
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1