Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be performed in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write misses to shared data.
{"title":"Race-free interconnection networks and multiprocessor consistency","authors":"A. Landin, Erik Hagersten, Seif Haridi","doi":"10.1145/115952.115964","DOIUrl":"https://doi.org/10.1145/115952.115964","url":null,"abstract":"Modern shared-memory multiprocmors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be made in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can he reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in racefree networks without the need for a transaction to be globally performed before the next transaction can be issued: We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. - The proposed methods reduce the latencies associated with processor write-misses to shared data.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123778780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1991-04-01. DOI: 10.1109/ISCA.1991.1021605.
X. Lin, L. Ni
Efficient routing of messages is the key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Wormhole routing is the most promising switching technique used in new-generation multicomputers. In this paper, we present multicast wormhole routing methods for multicomputers adopting 2D-mesh and hypercube topologies. The dual-path routing algorithm requires fewer system resources, while the multipath routing algorithm creates less traffic. More importantly, both routing algorithms are deadlock-free, which is essential for wormhole networks.
{"title":"Deadlock-fyee multicast wormhole routing in multicomputer networks","authors":"X. Lin, L. Ni","doi":"10.1109/ISCA.1991.1021605","DOIUrl":"https://doi.org/10.1109/ISCA.1991.1021605","url":null,"abstract":"Efficient routing of messages is the key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Wormhole routing is the most promising switching technique used in new generation multicomputers. In this paper, we present multicast wormhole routing methods for multicomputers adopting 2D-mesh and hypercube topologies. The dual-path routing algorithm requires less system resource, while the multipath routing algorithm creates less traffic. More import antly, both routing algorithms are deadlock-free, which is essential to wormhole networks.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127075002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Nishida, H. Kadota
This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture performs instruction-level scheduling statically by the compiler, and performs out-of-order issuing and execution of instructions to decrease the pipeline stalls that dynamically occur in execution. In this architecture, a pair of instructions is fetched in every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. For ease of instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of instructions, including Store-Store and Load-Store pairs; ii) a simple, low-latency, and easily paired execution pipeline structure; and iii) high-capacity multi-ported floating-point and integer registers. Enhanced performance through the dynamic reduction of pipeline hazards is achieved by i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer ability through a pipelined data cache and a 128-bit-wide bus. An efficient data dependency resolution mechanism, realized by the novel DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order instruction issuing and execution. The idea of the DTC method is similar to that of dynamic data-flow architecture with tagged tokens. The non-penalty branches are realized by three techniques: delayed branch; a LOOP instruction that executes counter decrement, compare, and branch in one clock cycle; and non-penalty conditional branch with predicted condition codes. These techniques contribute to the decrease of pipeline stalls occurring at run time. The architecture can achieve 80 MFLOPS/80 MIPS peak performance at a 40 MHz clock and sustains 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.
{"title":"OHMEGA : a VLSI superscalar processor architecture for numerical applications","authors":"M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Xishida, H. Kadota","doi":"10.1145/115952.115969","DOIUrl":"https://doi.org/10.1145/115952.115969","url":null,"abstract":"multiple instructions per clock cycle, there are at least four This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numericai applications. The architecture performs instructionlevel scheduling statically by the compiler, and performs outof-order issuing and executing of instructions to decrease the stall on the pipelines that dynamically occurs in execution. In this architecture, a pair of instructions are fetched in every clock cycle, decoded simultaneously, and issued to corresponding execution pipelines independently. For ease of instruction-level scheduling by the compiler, the architecture provider: -i) rimultaneoun execution df almost all pairs of instructions including Store-Stare pair and Load-Store pair, ii) simple, low-latency, and easily-paired execution pipeline structure, and iii) high-capacity muiti-ported floating point registers and integer registers. Enhanced performance by the dynamic decrease of the pipeline hazards is achieved by i) efficient data dependency resolution with novel Directly Tag Compare (DTC) method, ii) non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer ability by the pipelined data cache and 128 bit wide bus bandwidth. An efficient data dependency resolutioi mechanism, which is realized by using the novel DTC method, synchronized pipeiine operation, and data bypassing network, permits out-of-order instruction issuing and execution. The idea of DTC method is similar to that of dynamic data-flaw architecture with tagged token. The non-penalty branches are realized by three techniques, delayed branch, LOOP instruction that executer counter decrement, compare, and branch in one clock cycle, and non-penalty conditional branch with predicted condition codes. These 'techniques contribute to the decrease of the pipeline stalls occurring at run time. The architecture can achieve 80MFLOPS/80MIPS peak performance at 4OMHz clock and sustain 1.4 to 3.6 times higher performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117249772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. The model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple-job interactions, such as contention for shared resources. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.
{"title":"Performance prediction and tuning on a multiprocessor","authors":"R. Dimpsey, R. Iyer","doi":"10.1145/115952.115972","DOIUrl":"https://doi.org/10.1145/115952.115972","url":null,"abstract":"This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. me model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple job interactions, such as contention for shared resources. The model is a finitestate, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115161471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ideally, a pipelined processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation. A Branch History Table (BHT) stores the past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2% on average.
{"title":"Branch history table prediction of moving target branches due to subroutine returns","authors":"D. Kaeli, P. Emma","doi":"10.1145/115952.115957","DOIUrl":"https://doi.org/10.1145/115952.115957","url":null,"abstract":"Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeIine, and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate thk degradation. A Branch History Table (BHT) stores past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target dlfflcult. We propose a new stack mechanism for reducing this type of mispredlction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2°/0 on average.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124232473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) packaging technology to reduce chip-crossing delays. In this paper we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor that employs MCM technology to improve performance. The design study for the resulting two-level split cache starts with a baseline cache architecture and then examines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write policy; 3) secondary cache size and organization; 4) primary cache fetch size; 5) concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4KW (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub-block placement without extra valid bits. The second is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associative matching. Keywords: two-level caches, high-performance processors, gallium arsenide, multichip modules, trace-driven cache simulation.
{"title":"Implementing a cache for a high-performance GaAs microprocessor","authors":"K. Olukotun, T. Mudge, Richard B. Brown","doi":"10.1145/115952.115967","DOIUrl":"https://doi.org/10.1145/115952.115967","url":null,"abstract":"In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) pack- aging technology to reduce chip-crossing delays. In this paper we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor t,lrat employs h4CM technology to improve performance. The design study for the resulting two-level split cache st.arts with a baseline cache architecture and then ex- amines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write pol- icy; 3) secondary cache size and organization; 4) pri- mary cache fetch size; 5) concurrency between instruc- tion and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints ef- Cectively limit the size of the primary data and instruc- tion caches to 4I<W (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub- Ilod placernenl without extra valid bits. The second, is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associa- tive matching. Keywords-two-level caches, high performance pro- cessors, gallium arsenide, multichip modules, trace- driven cache simulation.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133516865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper examines the effect of code generation strategy and register set size and structure on the performance of RISC processors. We vary the number of registers from 16 to 128, in both split and shared organizations, and use three different code generation strategies that differ in the way their instruction schedulers and register allocators cooperate in utilizing registers. The architectures used in the experiments incorporate features of the Motorola 88000 and the MIPS R2000. We observed three things. First, more sophisticated code generation strategies require fewer registers. In our experiments, more than 32 registers yielded only marginal performance improvement over 32. Using a simpler strategy, the point of diminishing returns appeared after 64 registers. Second, given a small number of registers (e.g. 16), a machine with a shared register organization executes faster than one with a split organization; given a larger number of registers, the write-back bus to the shared register set becomes the bottleneck, and a split organization is better. Third, a machine with a floating-point coprocessor does not always execute faster than one with a slower on-chip implementation, if the coprocessor does not perform expensive integer operations as well. The problem can be solved by transferring operands to the floating-point unit, doing a multiply or divide there, and then shipping the data back to the CPU.
{"title":"The effect on RISC performance of register set size and structure versus code generation strategy","authors":"David G. Bradlee, S. Eggers, R. Henry","doi":"10.1145/115953.115985","DOIUrl":"https://doi.org/10.1145/115953.115985","url":null,"abstract":"This paper examines the effect of code generation strategy and register set size and structure on the performance of RISC processors. We vary the number of registers from 16 to 128, in both split and shared organizations, and use three different code generation strategies that differ in the way their instruction schedulers and register allocators cooperate in utilizing registers. The architectnres used in the experiments incorporate fealures of the Motorola 88000 and the MIPS R2000. We observed three things. First, more sophisticated code generation strategies require fewer registers. In our experiments more than 32 registers yielded only marginal performance improvement over 32. Using a simpler strategy, the point of diminishing returns appeared after 64 registers. Second, given a small number of registers (e.g. 16), a machine with a shared register organization executes faster than one with a split organization; given a larger number of registers, the write-back bus to the shared register set becomes the bottleneck, and a split organization is better. Third, a machine with a floating point coprocessor does not always execute faster than one with a slower on-chip implementation, if the coprocessor does not perform expensive integer operations as well. The problem can be solved by transferring operands to the floating point unit, doing a multiply or divide there, and then shipping the data back to the CPU.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114158195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directory-based protocols have been proposed as an efficient means of implementing cache consistency in large-scale shared-memory multiprocessors. One class of these protocols utilizes a limited pointers directory, which stores the identities of a small number of caches containing a given block of data. However, the performance potential of these directories in large-scale machines has been speculative at best. In this paper we introduce an analytic model that not only explains the behavior seen in small-scale simulation studies, but also allows us to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. Our model shows that miss rates inherent to invalidation-based consistency schemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. We find that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in large-scale machines, even if there are sufficient pointers most of the time. On the other hand, no-broadcast strategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traffic under a wide range of workloads, including those in which data blocks are actively accessed by a large number of processors.
{"title":"Modeling the performance of limited pointers directories for cache coherence","authors":"R. Simoni, M. Horowitz","doi":"10.1145/115952.115983","DOIUrl":"https://doi.org/10.1145/115952.115983","url":null,"abstract":"Directory-hsed protocols have been proposed as an efficient means of implementing cache consistency in large-scale sharedmemory multiprocessors. One class of these protocols utilizes a limired pointers directory, which Stores the identities of a Small number of caches mntaining a given block of data. However. the performance potential of these directories in large-scale machines has been speculative at best. In this paper we introduce an analytic model that not only explains the behavior seen in small-scale simulation studies, but also allows us to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. Our model shows that miss rates inherent to invalidation-based consistencyschemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. We find that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in largescale machines, even if there are sufficient pointas most of the time. On the other hand, no-broadcast slrategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traflic under a wide range of workloads. including those in which data blocks are actively accessed by a large number of processors.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127061266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes an architecture and related compiler support for software-controlled data prefetching, a technique to hide memory latency in high-performance processors. At compile time, FETCH instructions are inserted into the instruction stream by the compiler, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fetch unit, interprets these instructions and initiates appropriate memory reads. Prefetched data is kept in a small, fully associative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewriteback technique that can reduce the impact of stalls due to replacement writebacks in the cache. A detailed hardware model is presented and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
{"title":"An architecture for software-controlled data prefetching","authors":"A. Klaiber, H. Levy","doi":"10.1145/115953.115958","DOIUrl":"https://doi.org/10.1145/115953.115958","url":null,"abstract":"This paper describes an architecture and related compiler support for software-controlled daia prefetching, a technique to hide memory latency in high-performance processors. At compile-time, FETCB instructions are inserted into the instruction-stream by the compiler, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fe tch uni t , interprets these instructions and initiates appropriate memory reads. Prefetched data is kept in a small, fullyassociative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewrileback technique that can reduce the impact.of stalls due to replacement writebacks in the cache. A detailed hardware model is presented and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most current architectures have registers organized in one of two ways: single register sets, or register stacks, implemented as either overlapping register windows or register caches. Each has particular strengths and weaknesses. For example, a single register set excels over a stack if a program requires frequent access to globals. However, a register stack performs better if deep recursive chains exist. One drawback of all current systems is that the hardware limits the manner in which the software can use registers. In this paper, a register hardware organization called threaded windows, or t-windows, which is being developed by the authors to enhance the performance of concurrent systems, is evaluated for sequential programs. The organization allows the registers to be dynamically restructured into any of the above forms, and any combination of the above forms. This permits the compiler, or the programmer, to capitalize upon each register organization's strong points and avoid their disadvantages.
{"title":"Flexible register management for sequential programs","authors":"D. Quammen, D. Miller","doi":"10.1145/115953.115984","DOIUrl":"https://doi.org/10.1145/115953.115984","url":null,"abstract":"Most current architectures have registers organized in one of two ways: single register sets; or register stacks, implemented as either overlapping register windows or register-caches, Each has particular strengths and weaknesses. For example, a single register set excels over a stack if a program requires frequent access to globals. ~ However, a register stack performs better if deep^ recursive chains~exist. One drawback of all current systems is that the hardware limits the manner in which the software can use registers. In this paper, a register hardware organization called fhreoded windows or f-windows, which is being developed by the authors to enhance the performance of concurrent systems, is evaluated for sequential programs. The organization allows the registers to be dynamically restructured in any of the above forms, and any combination of the above forms. This permits the compiler, or the programmer, to capitalize upon each register organization’s strong points and avoid their disadvantages.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127490068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}