
23rd Annual International Symposium on Computer Architecture (ISCA'96): Latest Publications

Early Experience with Message-Passing on the SHRIMP Multicomputer
Pub Date: 1996-05-15 DOI: 10.1145/232973.233004
E. Felten, R. Alpert, A. Bilas, M. Blumrich, D. Clark, Stefanos N. Damianakis, C. Dubnicki, L. Iftode, Kai Li
The SHRIMP multicomputer provides virtual memory-mapped communication (VMMC), which supports protected, user-level message passing, allows user programs to perform their own buffer management, and separates data transfers from control transfers so that a data transfer can be done without the intervention of the receiving node CPU. An important question is whether such a mechanism can indeed deliver all of the available hardware performance to applications that use conventional message-passing libraries. This paper reports our early experience with message-passing on a small, working SHRIMP multicomputer. We have implemented several user-level communication libraries on top of the VMMC mechanism, including the NX message-passing interface, Sun RPC, stream sockets, and specialized RPC. The first three are fully compatible with existing systems. Our experience shows that the VMMC mechanism supports these message-passing interfaces well. When zero-copy protocols are allowed by the semantics of the interface, VMMC can effectively deliver to applications almost all of the raw hardware's communication performance.
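
To make the VMMC pattern concrete, here is a minimal C sketch of the idea the abstract describes: a mapping is set up once, after which a send is an ordinary store into memory mapped onto the receiver. The vmmc_* names, the stand-in remote buffer, and the sizes are hypothetical illustrations, not the actual SHRIMP API.

```c
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Stand-in for memory on the receiving node; on SHRIMP-like hardware a
 * store into the mapped proxy region is propagated over the network
 * without involving the receiving node's CPU. */
static char remote_mem[4096];

typedef struct {
    char  *proxy;   /* sender-side address mapped onto the remote buffer */
    size_t len;
} vmmc_mapping;

/* Hypothetical one-time setup: protection is checked when the mapping is
 * created, so no per-message kernel call is needed afterwards. */
static vmmc_mapping vmmc_import(void)
{
    vmmc_mapping m = { remote_mem, sizeof remote_mem };
    return m;
}

/* A "send" is just a store into the mapped region: data moves directly
 * from the user buffer with no intermediate system buffer (zero-copy
 * from the library's point of view). */
static void vmmc_send(vmmc_mapping *m, const void *msg, size_t n)
{
    if (n <= m->len)
        memcpy(m->proxy, msg, n);
}

int main(void)
{
    vmmc_mapping m = vmmc_import();
    vmmc_send(&m, "hello", 6);
    printf("receiver sees: %s\n", remote_mem);  /* no receiver CPU work */
    return 0;
}
```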
Citations: 64
High-Bandwidth Address Translation for Multiple-Issue Processors
Pub Date: 1996-05-15 DOI: 10.1145/232973.232990
T. Austin, G. Sohi
In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current microprocessor designs meet this demand with a multi-ported TLB. While this design provides an excellent hit rate at each port, its access latency and area grow very quickly as the number of ports is increased. As bandwidth demands continue to increase, multi-ported designs will soon impact memory access latency. We present four high-bandwidth address translation mechanisms with latency and area characteristics that scale better than a multi-ported TLB design. We extend traditional high-bandwidth memory design techniques to address translation, developing interleaved and multi-level TLB designs. In addition, we introduce two new designs crafted specifically for high-bandwidth address translation. Piggyback ports are introduced as a technique to exploit spatial locality in simultaneous translation requests, allowing accesses to the same virtual memory page to combine their requests at the TLB access port. Pretranslation is introduced as a technique for attaching translations to base register values, making it possible to reuse a single translation many times. We perform extensive simulation-based studies to evaluate our designs. We vary key system parameters, such as processor model, page size, and number of architected registers, to see what effects these changes have on the relative merits of each approach. A number of designs show particular promise. Multi-level TLBs with as few as eight entries in the upper-level TLB nearly achieve the performance of a TLB with unlimited bandwidth. Piggyback ports combined with a lesser-ported TLB structure, e.g., an interleaved or multi-ported TLB, also perform well. Pretranslation over a single-ported TLB performs almost as well as a same-sized multi-level TLB, with the added benefit of decreased access latency for physically indexed caches.
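
A minimal sketch of the multi-level TLB idea follows, under assumed sizes rather than the configurations the paper evaluates: a tiny direct-mapped upper level refilled from a larger lower level, with a miss in both levels falling through to the page-table walker.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT 12
#define L1_ENTRIES 8      /* small, fast upper-level TLB */
#define L2_ENTRIES 64     /* larger backing level */

typedef struct { uint64_t vpn, pfn; bool valid; } tlb_entry;

static tlb_entry l1[L1_ENTRIES], l2[L2_ENTRIES];

/* Returns true on a hit and writes the physical frame number to *pfn;
 * a miss in both levels would fall through to the page-table walker. */
static bool translate(uint64_t vaddr, uint64_t *pfn)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;
    tlb_entry *e = &l1[vpn % L1_ENTRIES];       /* direct-mapped L1 probe */
    if (e->valid && e->vpn == vpn) { *pfn = e->pfn; return true; }

    e = &l2[vpn % L2_ENTRIES];                  /* slower lower-level probe */
    if (e->valid && e->vpn == vpn) {
        l1[vpn % L1_ENTRIES] = *e;              /* refill the upper level */
        *pfn = e->pfn;
        return true;
    }
    return false;
}

int main(void)
{
    l2[5] = (tlb_entry){ .vpn = 5, .pfn = 42, .valid = true };
    uint64_t pfn;
    if (translate(5ull << PAGE_SHIFT, &pfn))    /* L2 hit, refills L1 */
        printf("vpn 5 -> pfn %llu\n", (unsigned long long)pfn);
    return 0;
}
```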
Citations: 48
Memory Bandwidth Limitations of Future Microprocessors
Pub Date: 1996-05-15 DOI: 10.1145/232973.232983
D. Burger, J. Goodman, A. Kägi
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a decomposition of execution time, we show that for modern processors that employ aggressive memory latency tolerance techniques, wasted cycles due to insufficient bandwidth generally exceed those due to raw memory latencies. Given the importance of maximizing memory bandwidth, we calculate effective pin bandwidth, then estimate optimal effective pin bandwidth. We measure these quantities by determining the amount by which both caches and minimal-traffic caches filter accesses to the lower levels of the memory hierarchy. We see that there is a gap that can exceed two orders of magnitude between the total memory traffic generated by caches and the minimal-traffic caches---implying that the potential exists to increase effective pin bandwidth substantially. We decompose this traffic gap into four factors, and show they contribute quite differently to traffic reduction for different benchmarks. We conclude that, in the short term, pin bandwidth limitations will make more complex on-chip caches cost-effective. For example, flexible caches may allow individual applications to choose from a range of caching policies. In the long term, we predict that off-chip accesses will be so expensive that all system memory will reside on one or more processor chips.
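
One plausible way to read the effective-pin-bandwidth argument is as raw pin bandwidth scaled by the ratio of necessary to actual off-chip traffic. The arithmetic below uses invented numbers purely to illustrate that relationship; the paper's precise definitions may differ.

```c
#include <stdio.h>

int main(void)
{
    /* All numbers are invented for illustration. */
    double raw_pin_bw      = 1.6e9;   /* bytes/s across the chip's pins */
    double actual_traffic  = 4.0e8;   /* off-chip bytes moved by a real cache */
    double minimal_traffic = 4.0e6;   /* bytes a minimal-traffic cache would move */

    /* Only minimal_traffic bytes were strictly necessary, so the
     * bandwidth effectively delivered to useful data is the raw pin
     * bandwidth scaled down by the traffic ratio. */
    double effective_bw = raw_pin_bw * (minimal_traffic / actual_traffic);

    /* A 100x traffic gap is a 100x opportunity: shrinking the gap
     * raises effective bandwidth toward the raw pin rate. */
    printf("traffic ratio: %.0fx, effective bw: %.2e bytes/s\n",
           actual_traffic / minimal_traffic, effective_bw);
    return 0;
}
```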
Citations: 463
Missing the Memory Wall: The Case for Processor/Memory Integration
Pub Date: 1996-05-15 DOI: 10.1145/232973.232984
Ashley Saulsbury, Fong Pong, A. Nowatzyk
Current high-performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening gap between CPU and main memory speeds. Yet, many large applications do not operate well on these systems and are limited by the memory subsystem performance. This paper argues for an integrated system approach that uses less-powerful CPUs tightly integrated with advanced memory technologies to build competitive systems with greatly reduced cost and complexity. Based on a design study using the next-generation 0.25µm, 256Mbit dynamic random-access memory (DRAM) process and on the analysis of existing machines, we show that processor-memory integration can be used to build competitive, scalable and cost-effective MP systems. We present results from execution-driven uni- and multi-processor simulations showing that the benefits of lower latency and higher bandwidth can compensate for the restrictions on the size and complexity of the integrated processor. In this system, small direct-mapped instruction caches with long lines are very effective, as are column-buffer data caches augmented with a victim cache.
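
The closing observation about cache organization can be illustrated with a small sketch: a direct-mapped cache whose conflict victims are retained in a tiny fully associative victim cache. The sizes and the FIFO replacement policy are assumptions for illustration, not the paper's design points.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE_SHIFT 6          /* 64-byte ("long") lines */
#define SETS       128        /* small direct-mapped array */
#define VICTIMS    4          /* tiny fully associative victim cache */

typedef struct { uint64_t tag; bool valid; } line_t;

static line_t cache[SETS], victim[VICTIMS];
static int victim_next;       /* FIFO replacement in the victim cache */

static bool lookup(uint64_t addr)
{
    uint64_t block = addr >> LINE_SHIFT;
    line_t *d = &cache[block % SETS];
    if (d->valid && d->tag == block) return true;   /* direct-mapped hit */

    for (int i = 0; i < VICTIMS; i++)               /* probe the victims */
        if (victim[i].valid && victim[i].tag == block) {
            line_t tmp = *d;                        /* swap: promote the hit line, */
            *d = victim[i];                         /* demote the conflicting one  */
            victim[i] = tmp;
            return true;
        }

    /* Miss: the evicted line goes to the victim cache, new block fills d. */
    victim[victim_next] = *d;
    victim_next = (victim_next + 1) % VICTIMS;
    d->tag = block; d->valid = true;
    return false;
}

int main(void)
{
    uint64_t a = 0x1000, b = a + (uint64_t)SETS * (1 << LINE_SHIFT);
    lookup(a);                /* miss, installs a                        */
    lookup(b);                /* conflict miss, a moves to victim cache  */
    printf("a after conflict: %s\n", lookup(a) ? "victim hit" : "miss");
    return 0;
}
```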
Citations: 254
Correlation and Aliasing in Dynamic Branch Predictors
Pub Date: 1996-05-15 DOI: 10.1145/232973.232978
S. Sechrest, Chih-Chieh Lee, T. Mudge
Previous branch prediction studies have relied primarily upon the SPECint89 and SPECint92 benchmarks for evaluation. Most of these benchmarks exercise a very small amount of code. As a consequence, the resources required by these schemes for accurate predictions of larger programs have not been clear. Moreover, many of these studies have simulated a very limited number of configurations. Here we report on simulations of a variety of branch prediction schemes using a set of relatively large benchmark programs that we believe to be more representative of likely system workloads. We have examined the sensitivity of these prediction schemes to variation in workload, in resources, and in design and configuration. We show that for predictors with small available resources, aliasing between distinct branches can have the dominant influence on prediction accuracy. For global-history-based schemes, such as GAs and gshare, aliasing in the predictor table can eliminate any advantage gained through inter-branch correlation. For self-history-based prediction schemes, such as PAs, it is aliasing in the buffer recording branch history, rather than in the predictor table, that poses problems. Past studies have sometimes confused these effects and allocated resources incorrectly.
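
To make the aliasing issue concrete, below is a minimal gshare-style predictor: two branches whose PC-xor-history indices collide share one two-bit counter and disturb each other's predictions. The table size and index hash are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define BITS 10                      /* 1024-entry pattern history table */
#define MASK ((1u << BITS) - 1)

static uint8_t pht[1 << BITS];       /* 2-bit saturating counters, start at 0 */
static uint32_t ghr;                 /* global branch history register */

static bool predict(uint32_t pc)
{
    uint32_t idx = ((pc >> 2) ^ ghr) & MASK;   /* gshare: PC xor history */
    return pht[idx] >= 2;                      /* taken if counter is 2 or 3 */
}

static void update(uint32_t pc, bool taken)
{
    uint32_t idx = ((pc >> 2) ^ ghr) & MASK;   /* any other branch hashing to
                                                  idx aliases to this counter */
    if (taken  && pht[idx] < 3) pht[idx]++;    /* saturate upward   */
    if (!taken && pht[idx] > 0) pht[idx]--;    /* saturate downward */
    ghr = ((ghr << 1) | taken) & MASK;         /* shift in the outcome */
}

int main(void)
{
    update(0x4000, true);
    printf("0x4000: %s\n", predict(0x4000) ? "taken" : "not taken");
    return 0;
}
```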
Citations: 110
Decoupled Hardware Support for Distributed Shared Memory
Pub Date: 1996-05-01 DOI: 10.1145/232973.232979
S. Reinhardt, Robert W. Pfile, D. Wood
This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM support, allowing greater use of off-the-shelf devices. We present two decoupled systems, Typhoon-0 and Typhoon-1. Typhoon-0 uses an off-the-shelf protocol processor and network interface; a custom access control device is the only DSM-specific hardware. To demonstrate the feasibility and simplicity of this access control device, we designed and built an FPGA-based version in under one year. Typhoon-1 also uses an off-the-shelf protocol processor, but integrates the network interface and access control devices for higher performance. We compare the performance of the two decoupled systems with two integrated systems via simulation. For six benchmarks on 32 nodes, Typhoon-0 ranges from 30% to 309% slower than the best integrated system, while Typhoon-1 ranges from 13% to 132% slower. Four of the six benchmarks achieve speedups of 12 to 18 on Typhoon-0 and 15 to 26 on Typhoon-1, compared with 19 to 35 on the best integrated system. Two benchmarks are hampered by high communication overheads, but selectively replacing shared-memory operations with message passing provides speedups of at least 16 on both decoupled systems. These speedups indicate that decoupled designs can potentially provide a cost-effective alternative to complex high-end DSM systems.
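
A minimal sketch of the fine-grain access-control check these designs revolve around: each block of local memory carries an access tag, and any access that conflicts with its tag is diverted to protocol software. The names, block size, and trap interface are hypothetical, not the Typhoon-0/Typhoon-1 hardware interface.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT 6                       /* assumed 64-byte blocks */
#define NBLOCKS     1024

typedef enum { INVALID, READ_ONLY, READ_WRITE } access_tag;

static access_tag tags[NBLOCKS];            /* one tag per local block */

/* Stand-in for the protocol processor: a real system would fetch the
 * block from its home node and upgrade the tag here. */
static void protocol_fault(uint64_t addr, int is_write)
{
    printf("protocol fault at %#llx (%s)\n",
           (unsigned long long)addr, is_write ? "write" : "read");
    tags[addr >> BLOCK_SHIFT] = is_write ? READ_WRITE : READ_ONLY;
}

/* The check the access-control device applies to every memory access;
 * tag-conforming accesses proceed at full hardware speed. */
static void check_access(uint64_t addr, int is_write)
{
    access_tag t = tags[addr >> BLOCK_SHIFT];
    if (t == INVALID || (is_write && t == READ_ONLY))
        protocol_fault(addr, is_write);
}

int main(void)
{
    check_access(0x80, 0);    /* read of an INVALID block: faults    */
    check_access(0x80, 1);    /* write of a READ_ONLY block: faults  */
    check_access(0x80, 1);    /* now READ_WRITE: proceeds silently   */
    return 0;
}
```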
Citations: 82
Coherent Network Interfaces for Fine-Grain Communication
Pub Date: 1996-05-01 DOI: 10.1145/232973.232999
Shubhendu S. Mukherjee, B. Falsafi, M. Hill, D. Wood
Historically, processor accesses to memory-mapped device registers have been marked uncachable to ensure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling). This paper begins an exploration of network interfaces (NIs) that use coherence---coherent network interfaces (CNIs)---to improve communication performance. We restrict this study to NIs/CNIs that reside on coherent memory or I/O buses, to NIs/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process. Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register---derived from cachable control registers [39,40]---is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue. Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus, and by 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.
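
The cachable-queue mechanism can be sketched as a circular array of cache-block-sized slots in coherent memory: the consumer polls a sequence word that stays cache-resident while the queue is empty, and the producer's eventual write both invalidates that cached copy and delivers the whole message for a single burst refill. The field layout, sizes, and lap-count convention below are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLOTS   8
#define PAYLOAD 56                    /* with seq, one 64-byte cache block */

typedef struct {
    volatile uint64_t seq;            /* lap count; written last to mark full */
    char payload[PAYLOAD];
} slot_t;

static slot_t queue[SLOTS];           /* lives in coherent, cachable memory */
static uint64_t head, tail;           /* consumer / producer positions */

static void produce(const char *msg)  /* device side */
{
    slot_t *s = &queue[tail % SLOTS];
    strncpy(s->payload, msg, PAYLOAD - 1);
    s->seq = tail / SLOTS + 1;        /* publish after the payload (a real
                                         implementation needs a barrier here) */
    tail++;
}

/* Consumer spin: while the slot is empty this load hits in the local
 * cache and generates no bus traffic; the producer's write invalidates
 * the copy, and the refill brings the whole message in one burst. */
static const char *consume(void)
{
    slot_t *s = &queue[head % SLOTS];
    while (s->seq <= head / SLOTS)
        ;                             /* cache-resident poll */
    head++;
    return s->payload;
}

int main(void)
{
    produce("fine-grain message");
    printf("%s\n", consume());
    return 0;
}
```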
Citations: 104