Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines

23rd Annual International Symposium on Computer Architecture (ISCA'96) Pub Date : 1996-05-15 DOI:10.1145/232973.232988

Chris Holt, Jaswinder Pal Singh, J. Hennessy

{"title":"Application and Architectural Bottlenecks in Large Scale Distributed Shared Memory Machines","authors":"Chris Holt, Jaswinder Pal Singh, J. Hennessy","doi":"10.1145/232973.232988","DOIUrl":null,"url":null,"abstract":"Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problems sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well upto several hundred processors, once the basic techniques for load balancing and data locality are used that are needed for small-scale systems as well. Programs written with some care perform well without substantially compromising the ease of programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/232973.232988","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 47

Abstract

Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have been well explored. Recently, a number of researchers have presented architectural techniques for scaling a cache coherent shared address space to much larger processor counts. In this paper, we examine the extent to which applications can achieve reasonable performance on such large-scale, cache-coherent, distributed shared address space machines, by determining the problems sizes needed to achieve a reasonable level of efficiency. We also look at how much programming effort and optimization is needed to achieve high efficiency, beyond that needed at small processor counts. For each application, we discuss the main architectural bottlenecks that prevent smaller problem sizes or less optimized programs from achieving good efficiency. Our results show that while there are some applications that either do not scale or must be heavily optimized to do so, for most of the applications we studied it is not necessary to heavily modify the code or restructure algorithms to scale well upto several hundred processors, once the basic techniques for load balancing and data locality are used that are needed for small-scale systems as well. Programs written with some care perform well without substantially compromising the ease of programming advantage of a shared address space, and the problem sizes required to achieve good performance are surprisingly small. It is important to be careful about how data structures and layouts interact with system granularities, but these optimizations are usually needed for moderate-scale machines as well.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

大规模分布式共享内存机的应用和架构瓶颈

在小型到中等规模的硬件缓存一致共享内存机器中遇到的许多编程挑战已经得到了广泛的研究。虽然还有工作要做，但有效地对这种机器进行编程所需的基本技术已经得到了很好的探索。最近，许多研究人员提出了将缓存一致共享地址空间扩展到更大处理器数量的体系结构技术。在本文中，我们通过确定实现合理效率水平所需的问题大小，来检查应用程序在这种大规模、缓存一致、分布式共享地址空间机器上可以实现合理性能的程度。我们还研究了需要多少编程工作和优化才能实现高效率，而不仅仅是在小处理器数量下。对于每个应用程序，我们将讨论阻碍较小问题规模或优化程度较低的程序获得良好效率的主要体系结构瓶颈。我们的结果表明，虽然有一些应用程序要么无法扩展，要么必须进行大量优化才能扩展，但对于我们研究的大多数应用程序来说，一旦使用了小型系统所需的负载平衡和数据局域性的基本技术，就不需要大量修改代码或重构算法来扩展到数百个处理器。小心编写的程序性能良好，而不会严重损害共享地址空间的编程便利性优势，并且实现良好性能所需的问题规模非常小。注意数据结构和布局如何与系统粒度交互是很重要的，但是对于中等规模的机器通常也需要这些优化。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

23rd Annual International Symposium on Computer Architecture (ISCA'96)

自引率

0.00%

发文量

期刊最新文献

Memory Bandwidth Limitations of Future Microprocessors Missing the Memory Wall: The Case for Processor/Memory Integration Instruction Prefetching of Systems Codes with Layout Optimized for Reduced Cache Misses STiNG: A CC-NUMA Computer System for the Commercial Marketplace High-Bandwidth Address Translation for Multiple-Issue Processors