Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors
Kenneth M. Wilson, K. Olukotun, M. Rosenblum. ISCA '96. DOI: 10.1145/232973.232989

The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single cache port by using additional buffering in the processor, and by taking maximum advantage of a wider cache port. We evaluate these techniques using realistic applications that include the operating system. Our techniques using a single-ported cache achieve 91% of the performance of a dual-ported cache.
MGS: A Multigrain Shared Memory System
D. Yeung, J. Kubiatowicz, A. Agarwal. ISCA '96. DOI: 10.1145/232973.232980

Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to synthesize larger shared memory systems. We call these systems Distributed Scalable Shared-memory Multiprocessors (DSSMPs). This paper introduces the design of a shared memory system that uses multiple granularities of sharing, and presents an implementation on the Alewife multiprocessor, called MGS. Multigrain shared memory enables the collaboration of hardware and software shared memory, and is effective at exploiting a form of locality called multigrain locality. The system provides efficient support for fine-grain cache-line sharing, and resorts to coarse-grain page-level sharing only when locality is violated. A framework for characterizing application performance on DSSMPs is also introduced. Using MGS, an in-depth study of several shared memory applications is conducted to understand the behavior of DSSMPs. We find that unmodified shared memory applications can exploit multigrain sharing. Keeping the number of processors fixed, applications execute up to 85% faster when each DSSMP node is a multiprocessor as opposed to a uniprocessor. We also show that tightly-coupled multiprocessors hold a significant performance advantage over DSSMPs on unmodified applications. However, a best-effort implementation of a kernel from one of the applications allows a DSSMP to almost match the performance of a tightly-coupled multiprocessor.
An Analysis of Dynamic Branch Prediction Schemes on System Workloads
Nicolas Gloy, C. Young, Bradley Chen, Michael D. Smith. ISCA '96. DOI: 10.1145/232973.232977

Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing our own Atom-generated system traces and the system traces from the Instruction Benchmark Suite, we quantify the effects of kernel and user interactions on branch prediction accuracy. We find that user-only traces yield accurate prediction results only when the kernel accounts for less than 5% of the total executed instructions. Schemes that appear to predict well under user-only traces are not always the most effective on full-system traces: the recently-proposed two-level adaptive schemes can suffer from higher aliasing than the original per-branch 2-bit counter scheme. We also find that flushing the branch history state at fixed intervals does not accurately model the true effects of user/kernel interaction.
Performance Comparison of ILP Machines with Cycle Time Evaluation
Tetsuya Hara, H. Ando, Chikako Nakanishi, M. Nakaya. ISCA '96. DOI: 10.1145/232973.232995

Many studies have investigated performance improvement through exploiting instruction-level parallelism (ILP) with a particular architecture. Unfortunately, these studies report performance improvement in terms of the number of cycles required to execute a program, but do not quantitatively estimate the penalty the architecture imposes on the cycle time. Since the performance of a microprocessor must be measured by its execution time, a cycle time evaluation is required as well as a cycle count speedup evaluation. Currently, superscalar machines are widely accepted as the machines which achieve the highest performance. On the other hand, because of hardware simplicity and instruction scheduling sophistication, there is a perception that the next generation of microprocessors will be implemented with a VLIW architecture. A simple VLIW machine, however, has a serious weakness regarding speculative execution. Thus, it is an open question whether a simple VLIW machine really outperforms a superscalar machine. We recently proposed a mechanism called predicating that supports speculative execution for the VLIW machine, and showed a significant cycle count speedup over a scalar machine. Although the mechanism is simple, it is unknown how large a penalty it imposes on the cycle time, and how much performance improves as a result. This paper evaluates both the cycle count speedup and the cycle time for three ILP machines: a superscalar machine, a simple VLIW machine, and the VLIW machine with predicating. The evaluation results show that the simple VLIW machine slightly outperforms the superscalar machine, while the VLIW machine with predicating achieves a significant speedup of 1.41x over the superscalar machine.
COMA: An Opportunity for Building Fault-Tolerant Scalable Shared Memory Multiprocessors
C. Morin, A. Gefflaut, M. Banâtre, Anne-Marie Kermarrec. ISCA '96. DOI: 10.1145/232973.232981

Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures, particularly if they must be used for long-running computations. In this paper, we show that Cache Only Memory Architectures (COMA) are good candidates for building fault-tolerant SSMMs. A backward error recovery strategy can be implemented without significant hardware modification to previously proposed COMA designs by exploiting their standard replication mechanisms and extending the coherence protocol to transparently manage recovery data. Evaluation of the proposed fault-tolerant COMA is based on execution-driven simulations using some of the SPLASH applications. We show that, for the simulated architecture, the performance degradation caused by fault-tolerance mechanisms varies from 5% in the best case to 35% in the worst case. The standard memory behavior is only slightly perturbed. Moreover, results also show that the proposed scheme preserves the architecture's scalability and that the memory overhead remains low for parallel applications using mostly shared data.
The Difference-Bit Cache
Toni Juan, T. Lang, J. Navarro. ISCA '96. DOI: 10.1145/232973.232986

The difference-bit cache is a two-way set-associative cache with an access time that is smaller than that of a conventional two-way cache and close to or equal to that of a direct-mapped cache. This is achieved by observing that the two tags in a set must differ in at least one bit, and by using this bit to select the way. In contrast with previous approaches that predict the way and therefore have two types of hits (primary hits of one cycle and secondary hits of two to four cycles), all hits in the difference-bit cache take one cycle. The evaluation of the access time of our cache organization has been performed using a recently proposed on-chip cache access model.
DCD --- Disk Caching Disk: A New Approach for Boosting I/O Performance
Yimin Hu, Qing Yang. ISCA '96. DOI: 10.1145/232973.232991

This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk, referred to as the cache-disk, as a secondary disk cache to optimize write performance. While the cache-disk and the normal data disk have the same physical properties, the access speed of the former differs dramatically from that of the latter because of different data units and different ways in which data are accessed. Our objective is to exploit this speed difference by using the log disk as a cache to build a reliable and smooth disk hierarchy. A small RAM buffer is used to collect small write requests to form a log, which is transferred onto the cache-disk whenever the cache-disk is idle. Because of the temporal locality that exists in office/engineering workload environments, the DCD system shows write performance close to that of a RAM of the same size (i.e., a solid-state disk) at the cost of a disk. Moreover, the cache-disk can also be implemented as a logical disk, in which case a small portion of the normal data disk is used as the log disk. Trace-driven simulation experiments are carried out to evaluate the performance of the proposed disk architecture. Under the office/engineering workload environment, the DCD shows superb write performance compared to existing disk systems. Performance improvements of up to two orders of magnitude are observed in terms of average response time for write operations. Furthermore, DCD is very reliable and works at the device or device-driver level. As a result, it can be applied directly to current file systems without the need to change the operating system.
"STiNG" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) Multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four processor Symmetric Multi-processor (SMP) nodes (called Quads), using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to 4 P6 processors, each Quad may contain up to 4 GBytes of system memory, 2 Peripheral Component Interface (PCI) busses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures system-wide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed.
{"title":"STiNG: A CC-NUMA Computer System for the Commercial Marketplace","authors":"Thomas D. Lovett, R. Clapp","doi":"10.1145/232973.233006","DOIUrl":"https://doi.org/10.1145/232973.233006","url":null,"abstract":"\"STiNG\" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) Multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four processor Symmetric Multi-processor (SMP) nodes (called Quads), using a Scalable Coherent Interface (SCI) based coherent interconnect. The Quads are based on the Intel P6 processor and the external bus it defines. In addition to 4 P6 processors, each Quad may contain up to 4 GBytes of system memory, 2 Peripheral Component Interface (PCI) busses for I/O, and a Lynx board. The Lynx board provides the datapath to the SCI-based interconnect and ensures system-wide cache coherency. STiNG is one of the first commercial CC-NUMA systems to be built. This paper describes the motivation for building STiNG as well as its architecture and implementation. In addition, performance analysis is provided for On-Line Transaction Processing (OLTP) and Decision Support System (DSS) workloads. Finally, the status of the current implementation is reviewed.","PeriodicalId":415354,"journal":{"name":"23rd Annual International Symposium on Computer Architecture (ISCA'96)","volume":"252 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115331741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Don't Use the Page Number, but a Pointer to It
André Seznec. ISCA '96. DOI: 10.1145/232973.232985

Most newly announced high performance microprocessors support 64-bit virtual addresses, and the width of physical addresses is also growing. As a result, the size of the address tags in the L1 cache is increasing. The impact on on-chip area is particularly dramatic when small block sizes are used. At the same time, the performance of high performance microprocessors depends more and more on the accuracy of branch prediction, and, for reasons similar to those for caches, the size of the Branch Target Buffer is also increasing linearly with the address width. In this paper, we apply the simple principle stated in the title for limiting the tag size of on-chip caches. In the resulting indirect-tagged cache, the duplication of the page number in the processor (in the TLB and in the cache tags) is removed. The tag check is then simplified, and the tag cost does not depend on the address width. Applying the same principle to Branch Target Buffers, we propose the Reduced Branch Target Buffer. The storage size of a Reduced Branch Target Buffer does not depend on the address width and is dramatically smaller than that of a conventional Branch Target Buffer.
Understanding Application Performance on Shared Virtual Memory Systems
L. Iftode, J. Singh, Kai Li. ISCA '96. DOI: 10.1145/232973.232987

Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for different classes of applications. This paper begins to fill this gap by studying the performance of a range of applications in detail and understanding it in light of application characteristics. We first develop a brief classification of the inherent data sharing patterns in the applications, and how they interact with system granularities to yield the communication patterns relevant to SVM systems. We then use detailed simulation to compare the performance of two SVM approaches---Lazy Release Consistency (LRC) and Automatic Update Release Consistency (AURC)---with each other and with an all-hardware CC-NUMA approach. We examine how performance is affected by problem size, machine size, key system parameters, and the use of less optimized program implementations. We find that SVM can indeed perform quite well for systems of up to at least 32 processors for several nontrivial applications. However, performance is much more variable across applications than on CC-NUMA systems, and the problem sizes needed to obtain good parallel performance are substantially larger. The hardware-assisted AURC system tends to perform significantly better than the all-software LRC under our system assumptions, particularly when realistic cache hierarchies are used.