MOPED: Orchestrating interprocess message data on CMPs

Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun
{"title":"MOPED: Orchestrating interprocess message data on CMPs","authors":"Junli Gu, S. Lumetta, Rakesh Kumar, Yihe Sun","doi":"10.1109/HPCA.2011.5749721","DOIUrl":null,"url":null,"abstract":"Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We extended a 16-core full-system simulator based on Simics and FeS2. MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces execution time of real applications (NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.","PeriodicalId":126976,"journal":{"name":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","volume":"51 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-02-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 17th International Symposium on High Performance Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2011.5749721","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Future CMPs will combine many simple cores with deep cache hierarchies. As core counts grow, cache resources per core shrink and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make these models attractive, synchronization overhead and data copying must also be offloaded from the processors. In this paper, we describe a Message Orchestration and Performance Enhancement Device (MOPED) that provides hardware mechanisms to support state-of-the-art message passing protocols such as MPI. MOPED extends the per-processor cache controllers and coherence protocol to support message synchronization and management in hardware, to transfer message data efficiently without intermediate buffer copies, and to place useful data in caches in a timely manner. MOPED thus allows full overlap between communication and computation on the cores. We modeled MOPED by extending a 16-core full-system simulator based on Simics and FeS2, in which MOPED interacts with the directory controllers to orchestrate message data. We evaluated benefits to performance and coherence traffic by integrating MOPED into the MPICH runtime. Relative to unmodified MPI execution, MOPED reduces the execution time of real applications (the NAS Parallel Benchmarks) by 17–45% and of communication microbenchmarks (Intel's IMB) by 76–94%. Off-chip memory misses are reduced by 43–88% for applications and by 75–100% for microbenchmarks.
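The software pattern MOPED targets is the standard non-blocking MPI exchange, in which a rank posts a send and a receive, computes while the transfer is in flight, and synchronizes only when the data is needed. The sketch below is plain MPI as a point of reference, not MOPED code; the message length and even/odd neighbor pairing are illustrative assumptions.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* message length in doubles (illustrative) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = malloc(N * sizeof *sendbuf);
    double *recvbuf = malloc(N * sizeof *recvbuf);
    for (int i = 0; i < N; i++)
        sendbuf[i] = rank + i;

    int peer = rank ^ 1;              /* pair even/odd neighbor ranks */
    if (peer < size) {
        MPI_Request reqs[2];

        /* Post both transfers up front. A runtime backed by hardware
         * message orchestration (as the paper proposes) can move the
         * payload cache-to-cache without intermediate buffer copies. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Independent computation overlaps with the message in flight. */
        double acc = 0.0;
        for (int i = 0; i < N; i++)
            acc += (double)i * 0.5;

        /* Block only at the point where the received data is needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        printf("rank %d: acc=%g recv[0]=%g\n", rank, acc, recvbuf[0]);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

In a conventional software runtime, the copies and synchronization behind these calls still occupy the cores, so the overlap is only partial. The abstract's point is that MOPED moves that work into the extended cache controllers and coherence protocol, letting the compute loop between the posts and the wait run fully overlapped with the transfer.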