
Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture: Latest Publications

Origin-based fault-tolerant routing in the mesh
Pub Date : 1995-10-01 DOI: 10.1109/HPCA.1995.386551
R. Libeskind-Hadas, Eli Brandt
The ability to tolerate faults is critical in multicomputers employing large numbers of processors. This paper describes a class of fault-tolerant routing algorithms for n-dimensional meshes that can tolerate large numbers of faults without using virtual channels. We show that these routing algorithms prevent livelock and deadlock while remaining highly adaptive.
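The routing algorithms themselves are not spelled out in the abstract; as a hedged illustration of the general idea, the sketch below routes a packet minimally around faulty nodes in a 2D mesh. The function name, coordinates, and fault set are hypothetical, not the authors' construction.

```python
# Minimal sketch of adaptive minimal routing in a 2D mesh that routes
# around faulty nodes. Illustrative only; not the paper's algorithm.

def next_hops(cur, dst, faults):
    """Return non-faulty neighbors of `cur` that make minimal progress
    toward `dst`. An adaptive router may pick any of them."""
    (x, y), (dx, dy) = cur, dst
    candidates = []
    if x != dx:
        candidates.append((x + (1 if dx > x else -1), y))
    if y != dy:
        candidates.append((x, y + (1 if dy > y else -1)))
    return [hop for hop in candidates if hop not in faults]

# A packet at (0, 0) headed for (2, 2) with node (1, 0) faulty can
# still make progress adaptively through (0, 1).
print(next_hops((0, 0), (2, 2), faults={(1, 0)}))  # [(0, 1)]
```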
Citations: 49
Software assistance for data caches
Pub Date : 1995-10-01 DOI: 10.1109/HPCA.1995.386546
O. Temam, Nathalie Drach-Temam
Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined, though simple, software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines, which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows spatial locality to be exploited. Subsequently, it is shown that simple software information on the spatial/temporal locality of array references, as provided by current data-locality-optimizing algorithms, can be used to significantly increase cache performance. The performance and design trade-offs of the proposed mechanisms are discussed. Software-assisted caches are further shown to provide convenient support for hardware and software optimizations.
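As a rough illustration of the bypass mechanism described above, the toy cache below lets software tag a reference as non-temporal so that a miss does not allocate a line and pollute the cache. The class and parameters are assumptions for illustration, not the paper's hardware design.

```python
# Toy direct-mapped cache in which software may tag a reference as
# non-temporal so it bypasses allocation (limiting pollution). A sketch
# of the general bypass idea, not the authors' exact hardware.

class BypassCache:
    def __init__(self, num_lines, line_size):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = [None] * num_lines
        self.hits = self.misses = 0

    def access(self, addr, temporal=True):
        block = addr // self.line_size
        index = block % self.num_lines
        if self.tags[index] == block:
            self.hits += 1
        else:
            self.misses += 1
            if temporal:              # only temporal data is allocated
                self.tags[index] = block

cache = BypassCache(num_lines=4, line_size=32)
for a in range(0, 256, 32):          # streaming data: tag non-temporal
    cache.access(a, temporal=False)
cache.access(0, temporal=True)       # reused data still allocates...
cache.access(0)                      # ...and hits on re-reference
print(cache.hits, cache.misses)      # 1 9
```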
Citations: 20
Design and performance evaluation of a multithreaded architecture
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386533
R. Govindarajan, S. Nemawarkar, Philip LeNir
Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. We propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism and fine-grain instruction-level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The architecture supports multiple resident activations to improve the extent of locality exploited. Further, a distributed data structure cache organization is proposed to reduce both the network traffic and the latency in accessing remote locations. Initial performance evaluation using discrete-event simulation indicates that the architecture is capable of achieving very high processor throughput. The introduction of the data structure cache reduces the network latency significantly. The impact of various cache organizations on the performance of the architecture is also discussed in this paper.
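The abstract's runtime grouping of instructions from several active threads can be sketched as a per-cycle dispatch loop; the round-robin policy and issue width below are illustrative assumptions, not the proposed architecture's actual selection logic.

```python
# Sketch of grouping instructions from several active threads to fill a
# wide issue slot each cycle (illustrative scheduling policy only).
from collections import deque

def dispatch(threads, issue_width):
    """threads: list of deques of pending instructions.
    Returns one cycle's instruction group, drawn round-robin."""
    group = []
    ready = [t for t in threads if t]
    while ready and len(group) < issue_width:
        for t in list(ready):
            if len(group) == issue_width:
                break
            group.append(t.popleft())
            if not t:
                ready.remove(t)
    return group

t0 = deque(["i0", "i1", "i2"])
t1 = deque(["j0"])
t2 = deque(["k0", "k1"])
print(dispatch([t0, t1, t2], issue_width=4))  # ['i0', 'j0', 'k0', 'i1']
```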
Citations: 40
The effects of STEF in finely parallel multithreaded processors
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386531
Yamin Li, Wanming Chu
The throughput of a multiple-pipelined processor suffers due to a lack of sufficient instructions to keep multiple pipelines busy and due to delays associated with pipeline dependencies. Finely Parallel Multithreaded Processor (FPMP) architectures try to solve these problems by dispatching multiple instructions from multiple instruction threads in parallel. This paper proposes an analytic model that is used to quantify the advantage of FPMP architectures. The effects of four important parameters in FPMP, S, T, E, and F (STEF), are evaluated. Unlike previous analytic models of multithreaded architectures, the model presented concerns the performance of multiple pipelines. It deals not only with pipeline dependencies but also with structure conflicts. The model accepts the configuration parameters of a FPMP, the distribution of instruction types, and the distribution of interlock delay cycles. The model provides a quick performance prediction and a quick utilization prediction, which are helpful in processor design.
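The paper's analytic model is not given in the abstract, so the sketch below substitutes a simple Monte Carlo estimate of multi-pipeline utilization as a function of thread count and stall probability; the model itself and every parameter are stand-in assumptions, not the STEF model.

```python
# Stand-in Monte Carlo sketch: utilization of multiple pipelines fed by
# several threads, where each thread's next instruction stalls with
# some probability. Illustrative assumptions only, not the STEF model.
import random

def estimate_utilization(threads, pipelines, p_stall, cycles=10_000, seed=0):
    rng = random.Random(seed)
    issued = 0
    for _ in range(cycles):
        # Count threads able to supply an instruction this cycle.
        ready = sum(1 for _ in range(threads) if rng.random() >= p_stall)
        issued += min(ready, pipelines)   # structural limit: pipeline count
    return issued / (cycles * pipelines)

# More threads hide stalls and push multi-pipeline utilization up.
for n in (1, 2, 4, 8):
    print(n, round(estimate_utilization(n, pipelines=4, p_stall=0.3), 2))
```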
Citations: 4
The Named-State Register File: implementation and performance
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386560
P. Nuth, W. Dally
Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper introduces the Named-State Register File (NSF), a fine-grain associative register file. The NSF uses hardware and software techniques to efficiently manage registers among sequential or parallel procedure activations. The NSF holds more live data per register than conventional register files and requires much less spill and reload traffic to switch between concurrent contexts. The NSF speeds execution of some sequential and parallel programs by 9% to 17% over alternative register file organizations. The NSF has an access time comparable to a conventional register file and adds only 5% to the area of a typical processor chip.
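A hedged sketch of the fine-grain associative idea: registers named by (context, register) pairs share one file, and only overflowing entries are spilled, so a context switch saves nothing up front. The class, capacity, and LRU policy below are illustrative, not the NSF implementation.

```python
# Sketch of a fine-grain associative register file: registers are named
# by (context, register) pairs, and on overflow the least recently used
# entry is spilled to memory. Illustrative of the NSF idea only.
from collections import OrderedDict

class NamedStateRegisterFile:
    def __init__(self, capacity):
        self.capacity = capacity
        self.regs = OrderedDict()   # (context, name) -> value
        self.backing = {}           # spill area in memory
        self.spills = 0

    def write(self, context, name, value):
        key = (context, name)
        self.regs[key] = value
        self.regs.move_to_end(key)
        if len(self.regs) > self.capacity:
            victim, v = self.regs.popitem(last=False)
            self.backing[victim] = v
            self.spills += 1

    def read(self, context, name):
        key = (context, name)
        if key in self.regs:           # hit: no save/restore needed
            self.regs.move_to_end(key)
            return self.regs[key]
        value = self.backing.pop(key)  # demand reload of spilled value
        self.write(context, name, value)
        return value

rf = NamedStateRegisterFile(capacity=2)
rf.write(0, "r1", 10)
rf.write(1, "r1", 20)    # another context shares the same file
rf.write(0, "r2", 30)    # overflow spills (0, 'r1') only
print(rf.read(0, "r1"), rf.spills)  # 10 2 (the reload spills (1, 'r1'))
```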
Citations: 31
Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386554
F. Dahlgren, P. Stenström
We study the relative efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes for reducing read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does better than stride prefetching for three applications. This is because (i) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for stride accesses, and (ii) sequential prefetching also exploits the locality of read misses for non-stride accesses. However, we find that since stride prefetching causes fewer useless prefetches, it consumes less memory-system bandwidth.
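For concreteness, the sketch below shows minimal versions of both schemes: a sequential prefetcher that fetches the next block on a miss, and a per-instruction stride detector in the style of a reference prediction table. The block size and table layout are assumptions, not the evaluated hardware.

```python
# Sketches of the two compared schemes: sequential prefetching (fetch
# the next block on a miss) and per-PC stride detection. Illustrative
# assumptions for block size and table layout.
BLOCK = 32

def sequential_prefetch(miss_addr):
    """On a miss, prefetch the next sequential block."""
    return (miss_addr // BLOCK + 1) * BLOCK

class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_stride)

    def observe(self, pc, addr):
        last = self.table.get(pc)
        prediction = None
        if last is not None:
            stride = addr - last[0]
            if stride != 0 and stride == last[1]:  # stride confirmed twice
                prediction = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prediction

sp = StridePrefetcher()
for addr in (100, 108, 116):        # one load instruction striding by 8
    pred = sp.observe(0x40, addr)
print(pred)                          # 124: stride shorter than the block
print(sequential_prefetch(100))      # 128: next 32-byte block covers it too
```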
Citations: 61
Two techniques for improving performance on bus-based multiprocessors
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386536
Craig Anderson, J. Baer
We explore two techniques for reducing memory latency in bus-based multiprocessors. The first one, designed for sector caches, is a snoopy cache coherence protocol that uses a large transfer block to take advantage of spatial locality, while using a small coherence block (called a subblock) to avoid false sharing. The second technique is read snarfing (or read broadcasting), in which all caches can acquire data transmitted in response to a read request to update invalid blocks in their own cache. We evaluated the two techniques by simulating six applications that exhibit a variety of reference patterns. We compared the performance of the new protocol against that of the Illinois protocol with both small and large block sizes and found that it was effective in reducing memory latency and provided more consistent, good results than the Illinois protocol with a given line size. Read snarfing also improved performance, mostly for protocols that use large line sizes.
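A minimal sketch of read snarfing, assuming simple cache structures: when the reply to one cache's bus read passes by, any other cache holding that block in the Invalid state updates its copy.

```python
# Sketch of read snarfing: when a read reply for a block crosses the
# bus, every cache holding that block in state Invalid grabs the data
# too, turning a future miss into a hit. Illustrative structures only.

class SnarfingCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # block -> (state, data)

    def snoop_fill(self, block, data):
        # Snarf: refresh an invalidated copy as the reply passes by.
        if self.lines.get(block, (None, None))[0] == "I":
            self.lines[block] = ("S", data)

def bus_read(requester, others, block, memory):
    data = memory[block]
    requester.lines[block] = ("S", data)
    for cache in others:
        cache.snoop_fill(block, data)

memory = {0x80: "payload"}
a, b = SnarfingCache("A"), SnarfingCache("B")
b.lines[0x80] = ("I", None)          # B's copy was invalidated earlier
bus_read(a, [b], 0x80, memory)       # A misses; B snarfs the reply
print(b.lines[0x80])                 # ('S', 'payload')
```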
Citations: 25
Improving performance by cache driven memory management
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386539
K. Westerholz, Stephen Honal, J. Plankl, C. Hafer
The efficient utilization of caches is crucial for a competitive memory hierarchy. Access times required by modern processors are continuously decreasing. Direct-mapped caches provide the shortest access time. Using them yields reduced hardware costs and fast memory access, but implies additional misses in the cache, resulting in performance degradation. Another source of conflicts is the addressing scheme if caches are physically addressed. For such caches, memory management affects cache utilization. Enhancements in virtual memory management as presented in this paper reduce cache misses by as much as 80% for real-indexed caches. We developed three algorithms that use runtime information. All of them are suitable for direct-mapped and set-associative caches. Applied to the SPECint92 benchmark suite, we measured a performance improvement of 6.9% in a multiprogramming environment for an R4000-based UNIX workstation. This figure also includes the overhead caused by the more complex memory management.
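The paper's three algorithms are not detailed in the abstract; the sketch below illustrates the general cache-driven idea with page coloring, where the allocator prefers a physical frame whose cache color matches the virtual page's. The color count and allocator interface are hypothetical.

```python
# Sketch of cache-driven page placement (page coloring): prefer a free
# frame whose cache color matches the virtual page's color, so pages
# contiguous in virtual space do not collide in a real-indexed cache.
# All parameters are illustrative assumptions.

NUM_COLORS = 4  # e.g. cache size / (page size * associativity)

def color(page_number):
    return page_number % NUM_COLORS

def allocate_frame(virtual_page, free_frames):
    want = color(virtual_page)
    for frame in free_frames:
        if color(frame) == want:     # conflict-avoiding choice
            free_frames.remove(frame)
            return frame
    return free_frames.pop()         # fall back to any free frame

free = [5, 9, 15, 12]
print(allocate_frame(7, free))       # 15: color 3 matches virtual page 7
```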
Citations: 3
A design framework for hybrid-access caches
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386547
K. B. Theobald, H. Hum, G. Gao
High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid access cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call half-and-half caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.
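A toy model of a half-and-half cache, under assumed banking and replacement choices (not the paper's exact organization): half of the lines form a direct-mapped bank and half a 2-way set-associative bank, and a lookup probes the block's candidate location in each.

```python
# Toy half-and-half cache: a direct-mapped bank plus a 2-way
# set-associative bank; a lookup probes the block's slot in both.
# A sketch of the design point only, not the paper's organization.

class HalfAndHalfCache:
    def __init__(self, dm_lines, assoc_sets):
        self.dm = [None] * dm_lines
        self.assoc = [[] for _ in range(assoc_sets)]  # 2-way sets

    def access(self, block):
        i = block % len(self.dm)
        s = self.assoc[block % len(self.assoc)]
        if self.dm[i] == block or block in s:
            return "hit"
        if self.dm[i] is None:            # fill the fast DM bank first
            self.dm[i] = block
        else:                             # DM conflict: use the assoc bank
            s.insert(0, block)
            del s[2:]                     # keep 2 ways, LRU-ish eviction
        return "miss"

c = HalfAndHalfCache(dm_lines=4, assoc_sets=2)
print(c.access(0), c.access(4), c.access(0))  # miss miss hit
```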
Citations: 30
Fine-grain multi-thread processor architecture for massively parallel processing
Pub Date : 1995-01-22 DOI: 10.1109/HPCA.1995.386532
T. Kawano, S. Kusakabe, R. Taniguchi, M. Amamiya
Latency, caused by remote memory access and remote procedure calls, is one of the most serious problems in massively parallel computers. In order to eliminate the processor idle time caused by these latencies, processors must perform fast context switching among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that promotes efficient fine-grain multi-thread execution by performing fast context switching among fine-grain concurrent processes. In the Datarol-II processor, an implicit register load/store mechanism is embedded in the execution pipeline in order to reduce the memory access overhead caused by context switching. In order to reduce local memory access latency, a two-level hierarchical memory system and a load control mechanism are also introduced. We describe the Datarol-II processor architecture and show its evaluation results.
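The benefit of fast context switching can be sketched with a standard back-of-the-envelope multithreading model (not a Datarol-II result): remote latency is hidden once the work of the other resident threads covers the remote access time. All figures below are illustrative assumptions.

```python
# Illustrative model of latency hiding by switch-on-remote-access
# multithreading; parameters are assumptions, not Datarol-II numbers.

def busy_fraction(threads, run_cycles, remote_latency, switch_cost):
    """Each thread runs `run_cycles`, then waits `remote_latency` on a
    remote access; the processor switches threads at `switch_cost`."""
    work = run_cycles + switch_cost
    # Latency is fully hidden once the other threads' work covers it.
    hidden = min(remote_latency, (threads - 1) * work)
    period = run_cycles + switch_cost + (remote_latency - hidden)
    return run_cycles / period

# With a cheap context switch, a few resident threads keep the
# processor mostly busy despite a 50-cycle remote latency.
for n in (1, 2, 4, 8):
    print(n, round(busy_fraction(n, run_cycles=10,
                                 remote_latency=50, switch_cost=2), 2))
```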
Citations: 21