Coherent Network Interfaces for Fine-Grain Communication

Shubhendu S. Mukherjee, B. Falsafi, M. Hill, D. Wood
{"title":"Coherent Network Interfaces for Fine-Grain Communication","authors":"Shubhendu S. Mukherjee, B. Falsafi, M. Hill, D. Wood","doi":"10.1145/232973.232999","DOIUrl":null,"url":null,"abstract":"Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve the performance by 17-53% on the memory bus and 30-88% on the I/O bus.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1996-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"104","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/232973.232999","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 104

Abstract

Historically, processor accesses to memory-mapped device registers have been marked uncachable to insure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).

This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.

Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.

Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.
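To make the cachable-queue mechanism concrete, the following is a minimal sketch in C of a circular queue of cache-block-sized slots in coherent, cachable shared memory. All names, sizes, and the in-block valid flag here are illustrative assumptions, not the paper's API; the sketch only shows why coherence helps: the consumer polls its own cached copy of a slot and touches the bus only when the producer's write invalidates that copy.

```c
/*
 * Minimal sketch of a cachable queue: a circular array of cache-block-sized
 * slots in coherent, cachable memory shared by a producer (e.g., a user
 * process) and a consumer (e.g., a network interface). Sizes, names, and the
 * valid-flag layout are assumptions made for this sketch.
 */
#include <stdint.h>
#include <string.h>

#define CACHE_BLOCK 64              /* assumed cache-block size in bytes    */
#define QUEUE_SLOTS 32              /* assumed queue depth                  */

/* One slot occupies exactly one cache block; the last byte is a valid flag,
 * so the consumer can poll its own cached copy until the producer's write
 * invalidates that copy through the coherence protocol.                    */
typedef struct {
    uint8_t payload[CACHE_BLOCK - 1];
    volatile uint8_t valid;
} __attribute__((aligned(CACHE_BLOCK))) cq_slot_t;

typedef struct {
    cq_slot_t slot[QUEUE_SLOTS];
    unsigned  head;                 /* next slot the producer will fill     */
    unsigned  tail;                 /* next slot the consumer will drain    */
} cachable_queue_t;

/* Producer side: enqueue a small message into the next free slot. */
static int cq_send(cachable_queue_t *q, const void *msg, size_t len)
{
    cq_slot_t *s = &q->slot[q->head];
    if (s->valid || len > sizeof s->payload)
        return -1;                  /* queue full or message too large      */
    memcpy(s->payload, msg, len);
    __sync_synchronize();           /* make payload visible before the flag */
    s->valid = 1;                   /* coherence propagates this write      */
    q->head = (q->head + 1) % QUEUE_SLOTS;
    return 0;
}

/* Consumer side: polling spins on a cached copy of the block, so it only
 * causes bus traffic when the producer's write invalidates that copy.      */
static int cq_recv(cachable_queue_t *q, void *buf, size_t len)
{
    cq_slot_t *s = &q->slot[q->tail];
    while (!s->valid)
        ;                           /* cache-hit polling until invalidation */
    memcpy(buf, s->payload, len < sizeof s->payload ? len : sizeof s->payload);
    __sync_synchronize();
    s->valid = 0;                   /* hand the slot back to the producer   */
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    return 0;
}
```

A single slot used on its own corresponds to a cachable device register; the array of slots managed as a circular queue corresponds to the paper's cachable queues. A real CNI would pair this memory layout with device-side logic and the further optimizations evaluated in the paper.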