Coherent block data transfer in the FLASH multiprocessor

Proceedings 11th International Parallel Processing Symposium. Pub Date: 1997-04-01. DOI: 10.1109/IPPS.1997.580836

J. Heinlein, K. Gharachorloo, Robert P. Bosch, M. Rosenblum, Anoop Gupta


Citations: 17

Abstract

A key goal of the Stanford FLASH project is to explore the integration of multiple communication protocols in a single multiprocessor architecture. To achieve this goal, FLASH includes a programmable node controller called MAGIC, which contains an embedded protocol processor capable of implementing multiple protocols. In this paper we present a specialized protocol for block data transfer integrated with a conventional cache coherence protocol. Block transfer forms the basis for message passing implementations on top of shared memory, occurs in important workloads such as databases, and is frequently used by the operating system. We discuss the issues that arise in designing a fully integrated protocol and its interactions with cache coherence. Using microbenchmarks, MPI communication primitives, and an application running on the operating system, we compare our protocol with standard bcopy and bcopy augmented with prefetches. Our results show that integrated block transfer can accelerate communication between nodes while off-loading the task from the main processor, utilizing the network more efficiently, and reducing the associated cache pollution. Given the aggressive support for prefetching in FLASH, prefetched bcopy is able to achieve competitive performance in many cases but lacks the other three advantages of our protocol.
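
As a rough illustration of the "bcopy augmented with prefetches" baseline named in the abstract, the sketch below copies a buffer one cache line at a time while issuing a software prefetch a few lines ahead of the read position. It is not the authors' code. The function name `prefetched_copy`, the 128-byte line size, and the prefetch distance of four lines are all hypothetical values chosen for illustration, and `__builtin_prefetch` is a GCC/Clang builtin rather than anything specific to FLASH. Unlike `bcopy`, this version assumes the two buffers do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical tuning values, not taken from the paper. */
#define LINE_BYTES 128      /* assumed cache-line size in bytes */
#define PREFETCH_AHEAD 4    /* assumed prefetch distance, in lines */

/* Copy n bytes from src to dst, prefetching source lines ahead of use.
 * Assumes non-overlapping buffers (memcpy semantics). */
static void prefetched_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (off < n) {
        /* Copy at most one line per iteration; the last chunk may be short. */
        size_t chunk = (n - off < LINE_BYTES) ? (n - off) : LINE_BYTES;

        /* Request a later source line so it is likely cached when reached. */
        size_t ahead = off + (size_t)PREFETCH_AHEAD * LINE_BYTES;
        if (ahead < n) {
            __builtin_prefetch(s + ahead, 0, 0); /* read access, low reuse */
        }

        memcpy(d + off, s + off, chunk);
        off += chunk;
    }
}
```

Even with prefetching, the main processor still executes every load and store and the copied data still passes through its cache, which is consistent with the abstract's point that prefetched bcopy can be competitive in speed but does not off-load the processor or reduce cache pollution.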



