Origin-based fault-tolerant routing in the mesh
R. Libeskind-Hadas, Eli Brandt
Pub Date: 1995-10-01 | DOI: 10.1109/HPCA.1995.386551
The ability to tolerate faults is critical in multicomputers employing large numbers of processors. This paper describes a class of fault-tolerant routing algorithms for n-dimensional meshes that can tolerate large numbers of faults without using virtual channels. We show that these routing algorithms prevent livelock and deadlock while remaining highly adaptive.
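The origin-based algorithms themselves are not reproduced in this abstract; as a rough illustration of the general problem, the toy sketch below (a Python sketch under stated assumptions, not the authors' scheme) routes a packet hop by hop through a 2D mesh, preferring minimal directions and stepping around faulty nodes without any virtual channels.

```python
# Toy sketch of adaptive, fault-avoiding routing in a 2D mesh (an
# illustrative assumption, NOT the origin-based algorithms of the paper,
# which additionally guarantee deadlock and livelock freedom): at each
# hop, prefer a minimal ("productive") direction whose neighbour is
# healthy, and misroute only when forced.

def route(src, dst, faulty, width, height, max_hops=100):
    """Return a path from src to dst that avoids nodes in `faulty`, or None."""
    x, y = src
    path = [src]
    for _ in range(max_hops):
        if (x, y) == dst:
            return path
        dx = (1 if dst[0] > x else -1) if dst[0] != x else 0
        dy = (1 if dst[1] > y else -1) if dst[1] != y else 0
        productive = []
        if dx:
            productive.append((x + dx, y))
        if dy:
            productive.append((x, y + dy))
        candidates = [n for n in productive if n not in faulty]
        if not candidates:                       # forced to misroute
            candidates = [n for n in ((x+1, y), (x-1, y), (x, y+1), (x, y-1))
                          if 0 <= n[0] < width and 0 <= n[1] < height
                          and n not in faulty and n not in path]
        if not candidates:
            return None                          # walled in by faults
        x, y = candidates[0]
        path.append((x, y))
    return None                                  # hop budget exceeded

print(route((0, 0), (3, 3), faulty={(1, 1), (2, 2)}, width=4, height=4))
```

Unlike this greedy sketch, the algorithms described in the paper also guarantee freedom from deadlock and livelock.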
{"title":"Origin-based fault-tolerant routing in the mesh","authors":"R. Libeskind-Hadas, Eli Brandt","doi":"10.1109/HPCA.1995.386551","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386551","url":null,"abstract":"The ability to tolerate faults is critical in multi-computers employing large numbers of processors. This paper describes a class of fault-tolerant routing algorithms for n-dimensional meshes that can tolerate large numbers of faults without using virtual channels. We show that these routing algorithms prevent livelock and deadlock while remaining highly adaptive.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131512627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software assistance for data caches
O. Temam, Nathalie Drach-Temam
Pub Date: 1995-10-01 | DOI: 10.1109/HPCA.1995.386546
Hardware and software cache optimizations are active fields of research that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of simple, combined software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines, which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows spatial locality to be exploited. It is then shown that simple software information on the spatial/temporal locality of array references, as provided by current data-locality optimizing algorithms, can be used to significantly increase cache performance. The performance and design trade-offs of the proposed mechanisms are discussed. Software-assisted caches are further shown to provide convenient support for hardware and software optimizations.
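As a minimal sketch of the bypass idea (the per-access flag and class below are assumptions for illustration, not the paper's hardware mechanism), a direct-mapped cache model can honour a software hint that streaming data should not allocate a line.

```python
# Minimal sketch of a bypass hint for a direct-mapped cache: references
# flagged by software as having little temporal locality are serviced
# without allocating a line, so they cannot evict reusable data.
# The interface (a per-access `bypass` flag) is an assumption for
# illustration, not the mechanism described in the paper.

class DirectMappedCache:
    def __init__(self, num_lines, line_size):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = [None] * num_lines
        self.hits = self.misses = 0

    def access(self, addr, bypass=False):
        block = addr // self.line_size
        index, tag = block % self.num_lines, block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            if not bypass:          # non-temporal data never allocates
                self.tags[index] = tag

# A streaming scan (no reuse) interleaved with a small, reused array:
cache = DirectMappedCache(num_lines=64, line_size=32)
for i in range(10000):
    cache.access(i * 32, bypass=True)    # streaming reference, hinted
    cache.access((i % 64) * 32)          # hot data, allowed to stay cached
print(cache.hits, cache.misses)
```

In this toy trace the hinted streaming scan no longer evicts the small reused array, which is the kind of pollution the paper's bypass mechanism targets.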
{"title":"Software assistance for data caches","authors":"O. Temam, Nathalie Drach-Temam","doi":"10.1109/HPCA.1995.386546","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386546","url":null,"abstract":"Hardware and software cache optimizations are active fields of research, that have yielded powerful but occasionally complex designs and algorithms. The purpose of this paper is to investigate the performance of combined though simple software and hardware optimizations. Because current caches provide little flexibility for exploiting temporal and spatial locality, two hardware modifications are proposed to support these two kinds of locality. Spatial locality is exploited by using large virtual cache lines which do not exhibit the performance flaws of large physical cache lines. Temporal locality is exploited by minimizing cache pollution with a bypass mechanism that still allows to exploit spatial locality. Subsequently, it is shown that simple software informations on the spatial/temporal locality of array references, as provided by current data locality optimizing algorithms, can be used to significantly increase cache performance. The performance and design trade-offs of the proposed mechanisms are discussed. Software assisted caches are further shown to provide a convenient support for hardware and software optimizations.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"517 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123104996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and performance evaluation of a multithreaded architecture
R. Govindarajan, S. Nemawarkar, Philip LeNir
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386533
Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. We propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism and fine-grain instruction-level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The architecture supports multiple resident activations to improve the extent of locality exploited. Further, a distributed data-structure cache organization is proposed to reduce both the network traffic and the latency in accessing remote locations. Initial performance evaluation using discrete-event simulation indicates that the architecture is capable of achieving very high processor throughput. The introduction of the data-structure cache reduces the network latency significantly. The impact of various cache organizations on the performance of the architecture is also discussed.
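A toy sketch of the instruction-grouping idea (round-robin selection is an illustrative assumption, not necessarily the proposed policy): each cycle, ready instructions are drawn from several active threads until the issue width is filled, so a stalled thread does not idle the pipeline.

```python
# Toy sketch of grouping instructions from several active threads each
# cycle; the scheduling policy here is an assumption for illustration.

from collections import deque

def issue(threads, issue_width, start):
    group = []
    for k in range(len(threads)):
        if len(group) >= issue_width:
            break
        t = threads[(start + k) % len(threads)]
        if t:                              # thread has a ready instruction
            group.append(t.popleft())
    return group

threads = [deque(f"T{i}.I{j}" for j in range(3)) for i in range(4)]
cycle = 0
while any(threads):
    print(f"cycle {cycle}:", issue(threads, issue_width=2, start=cycle % 4))
    cycle += 1
```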
{"title":"Design and performance evaluation of a multithreaded architecture","authors":"R. Govindarajan, S. Nemawarkar, Philip LeNir","doi":"10.1109/HPCA.1995.386533","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386533","url":null,"abstract":"Multithreaded architectures have the ability to tolerate long memory latencies and unpredictable synchronization delays. We propose a multithreaded architecture that is capable of exploiting both coarse-grain parallelism, and fine-grain instruction level parallelism in a program. Instruction-level parallelism is exploited by grouping instructions from a number of active threads at runtime. The architecture supports multiple resident activations to improve the extent of locality exploited. Further, a distributed data structure cache organization is proposed to reduce both the network: traffic and the latency in accessing remote locations. Initial performance evaluation using discrete-event simulation indicates that the architecture is capable of achieving very high processor throughput. The introduction of the data structure cache reduces the network latency significantly. The impact of various cache organizations on the performance of the architecture is also discussed in this paper.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115187231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The effects of STEF in finely parallel multithreaded processors
Yamin Li, Wanming Chu
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386531
The throughput of a multiple-pipelined processor suffers from a lack of sufficient instructions to keep multiple pipelines busy and from delays associated with pipeline dependencies. Finely Parallel Multithreaded Processor (FPMP) architectures try to solve these problems by dispatching multiple instructions from multiple instruction threads in parallel. This paper proposes an analytic model used to quantify the advantage of FPMP architectures. The effects of four important FPMP parameters, S, T, E, and F (STEF), are evaluated. Unlike previous analytic models of multithreaded architectures, the model presented here addresses the performance of multiple pipelines. It deals not only with pipeline dependencies but also with structural conflicts. The model accepts the configuration parameters of an FPMP, the distribution of instruction types, and the distribution of interlock delay cycles, and provides quick performance and utilization predictions that are helpful in processor design.
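The STEF model itself is not given in the abstract; for orientation, a generic textbook-style multithreading model (an assumption for illustration, not the paper's model) relates processor utilization to the run length between long-latency events, the event latency, the context-switch cost, and the number of threads.

```python
# Generic back-of-the-envelope utilization model for a multithreaded
# processor (a textbook-style approximation, not the paper's STEF model):
# a thread runs R cycles between long-latency events of L cycles, and
# switching threads costs C cycles. With N threads, utilization grows
# roughly linearly and then saturates at R / (R + C).

def utilization(n_threads, run_length, latency, switch_cost):
    linear = n_threads * run_length / (run_length + latency + switch_cost)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)

for n in (1, 2, 4, 8, 16):
    print(n, round(utilization(n, run_length=20, latency=100, switch_cost=2), 2))
```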
{"title":"The effects of STEF in finely parallel multithreaded processors","authors":"Yamin Li, Wanming Chu","doi":"10.1109/HPCA.1995.386531","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386531","url":null,"abstract":"The throughput of a multiple-pipelined processor suffers due to lack of sufficient instructions to make multiple pipelines busy and due to delays associated with pipeline dependencies. Finely Parallel Multithreaded Processor (FPMP) architectures try to solve these problems by dispatching multiple instructions from multiple instruction threads in parallel. This paper proposes an analytic model which is used to quantify the advantage of FPMP architectures. The effects of four important parameters in FPMP, S,T,E, and F (STEF) are evaluated. Unlike previous analytic models of multithreaded architecture, the model presented concerns the performance of multiple pipelines. It deals not only with pipelines dependencies but also with structure conflicts. The model accepts the configuration parameters of a FPMP, the distribution of instruction types, and the distribution of interlock delay cycles. The model provides a quick performance prediction and a quick utilization prediction which are helpful in the processor design.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121291344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Named-State Register File: implementation and performance
P. Nuth, W. Dally
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386560
Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper introduces the Named-State Register File (NSF), a fine-grain, associative register file. The NSF uses hardware and software techniques to efficiently manage registers among sequential or parallel procedure activations. It holds more live data per register than conventional register files and requires much less spill and reload traffic to switch between concurrent contexts. The NSF speeds execution of some sequential and parallel programs by 9% to 17% over alternative register file organizations, has an access time comparable to that of a conventional register file, and adds only 5% to the area of a typical processor chip.
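A toy software model of the idea (the real NSF is a hardware structure; the class and replacement policy below are illustrative assumptions): a fully associative file addressed by (context id, register name) that spills the least recently used entry to memory only when the file is full, so idle contexts cost nothing until their registers are actually displaced.

```python
from collections import OrderedDict

class NamedStateRegFile:
    """Toy model: a fully associative register file addressed by
    (context id, register name); least recently used entries are spilled
    to a backing store when the file is full. Illustration only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.regs = OrderedDict()          # (context, reg) -> value
        self.backing = {}                  # spilled entries
        self.spills = self.reloads = 0

    def _make_room(self, key):
        if key not in self.regs and len(self.regs) >= self.capacity:
            old_key, old_val = self.regs.popitem(last=False)   # LRU entry
            self.backing[old_key] = old_val
            self.spills += 1

    def write(self, context, reg, value):
        key = (context, reg)
        self._make_room(key)
        self.regs[key] = value
        self.regs.move_to_end(key)

    def read(self, context, reg):
        key = (context, reg)
        if key not in self.regs:           # reload a spilled value
            self._make_room(key)
            self.regs[key] = self.backing.pop(key)
            self.reloads += 1
        self.regs.move_to_end(key)
        return self.regs[key]

rf = NamedStateRegFile(capacity=2)
for ctx in range(3):                       # three interleaved contexts
    rf.write(ctx, "r1", ctx * 10)
print(rf.read(0, "r1"), rf.spills, rf.reloads)   # -> 0 2 1
```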
{"title":"The Named-State Register File: implementation and performance","authors":"P. Nuth, W. Dally","doi":"10.1109/HPCA.1995.386560","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386560","url":null,"abstract":"Context switches are slow in conventional processors because the entire processor state must be saved and restored, even if much of the state is not used before the next context switch. This paper introduces the Named-State Register File, a fine-grain associative register file. The NSF uses hardware and software techniques to efficiently manage registers among sequential or parallel procedure activations. The NSF holds more live data per register than conventional register files, and requires much less spill and reload traffic to switch between concurrent contexts. The NSF speeds execution of some sequential and parallel programs by 9% to 17% over alternative register file organizations. The NSF has access time comparable to a conventional register file and only adds 5% to the area of a typical processor chip.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"204 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114558848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors
F. Dahlgren, P. Stenström
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386554
We study the relative efficiency of previously proposed stride and sequential prefetching, two promising hardware-based prefetching schemes for reducing read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four of the six applications we study, we find that sequential prefetching does better than stride prefetching for three applications. This is because (i) most strides are shorter than the block size (we assume 32-byte blocks), which means that sequential prefetching is as effective for stride accesses, and (ii) sequential prefetching also exploits the locality of read misses for non-stride accesses. However, we find that since stride prefetching causes fewer useless prefetches, it consumes less memory-system bandwidth.
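As a minimal sketch of the two policies (table organization and parameters are assumptions, not the simulated protocols from the paper): sequential prefetching fetches the block following each reference, while stride prefetching keeps a per-instruction table of last address and stride and only prefetches ahead once a stride repeats.

```python
# Minimal sketch of the two prefetching policies; the actual simulated
# protocols and their parameters in the paper differ.

BLOCK = 32  # bytes, matching the block size assumed in the paper

def sequential_prefetch(addr):
    """Prefetch the block after the one being referenced."""
    return [(addr // BLOCK + 1) * BLOCK]

stride_table = {}   # pc -> (last_addr, last_stride)

def stride_prefetch(pc, addr):
    last_addr, last_stride = stride_table.get(pc, (None, None))
    prefetches = []
    if last_addr is not None:
        stride = addr - last_addr
        if stride != 0 and stride == last_stride:   # stride confirmed
            prefetches = [addr + stride]
        stride_table[pc] = (addr, stride)
    else:
        stride_table[pc] = (addr, None)
    return prefetches

# An array accessed with a 128-byte stride (four blocks apart):
for i in range(4):
    addr = 0x1000 + i * 128
    print(sequential_prefetch(addr), stride_prefetch(pc=0x40, addr=addr))
```

The toy trace shows the trade-off in the abstract: the sequential prefetcher always fetches the adjacent block (useless for long strides, but cheap and effective when strides fit within a block), while the stride prefetcher waits until a stride is confirmed and then prefetches exactly the block that will be used.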
{"title":"Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors","authors":"F. Dahlgren, P. Stenström","doi":"10.1109/HPCA.1995.386554","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386554","url":null,"abstract":"We study the relative efficiency of previously proposed stride and sequential prefetching-two promising hardware-based prefetching schemes to reduce read-miss penalties in shared-memory multiprocessors. Although stride accesses dominate in four out of six of the applications we study, we find that sequential prefetching does better than stride prefetching for three applications. This is because (i) most strides are shorter than the block size (we assume 32 byte blocks), which means that sequential prefetching is as effective for stride accesses, and (ii) sequential prefetching also exploits the locality of read misses for non-stride accesses. However we find that since stride prefetching causes fewer useless prefetches, it consumes less memory-system bandwidth.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117319138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two techniques for improving performance on bus-based multiprocessors
Craig Anderson, J. Baer
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386536
We explore two techniques for reducing memory latency in bus-based multiprocessors. The first, designed for sector caches, is a snoopy cache coherence protocol that uses a large transfer block to take advantage of spatial locality while using a small coherence block (called a subblock) to avoid false sharing. The second technique is read snarfing (or read broadcasting), in which all caches can acquire data transmitted in response to a read request in order to update invalid blocks in their own cache. We evaluated the two techniques by simulating six applications that exhibit a variety of reference patterns. We compared the performance of the new protocol against that of the Illinois protocol with both small and large block sizes and found that it was effective in reducing memory latency and provided more consistently good results than the Illinois protocol with a given line size. Read snarfing also improved performance, mostly for protocols that use large line sizes.
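A toy sketch of read snarfing on a shared bus (the states and classes below are illustrative assumptions, not the sector-cache protocol evaluated in the paper): when a read response appears on the bus, every cache holding that block in the invalid state captures the data instead of staying invalid.

```python
# Toy sketch of read snarfing on a shared bus; illustration only.

class SnoopyCache:
    def __init__(self, name):
        self.name = name
        self.lines = {}                 # block -> state ('S' or 'I')

    def read(self, block, bus):
        if self.lines.get(block) == 'S':
            return 'hit'
        bus.read_miss(block, requester=self)
        return 'miss'

    def snoop_fill(self, block):
        if self.lines.get(block) == 'I':    # snarf the data off the bus
            self.lines[block] = 'S'

class Bus:
    def __init__(self, caches):
        self.caches = caches

    def read_miss(self, block, requester):
        requester.lines[block] = 'S'        # fill from memory
        for c in self.caches:
            if c is not requester:
                c.snoop_fill(block)

c0, c1 = SnoopyCache('c0'), SnoopyCache('c1')
bus = Bus([c0, c1])
c0.read(0x10, bus)                          # c0 misses, fills block 0x10
c1.lines[0x20] = 'I'                        # c1 holds 0x20 invalidated
c0.read(0x20, bus)                          # c0's fill is snarfed by c1
print(c1.read(0x20, bus))                   # -> 'hit' thanks to snarfing
```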
{"title":"Two techniques for improving performance on bus-based multiprocessors","authors":"Craig Anderson, J. Baer","doi":"10.1109/HPCA.1995.386536","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386536","url":null,"abstract":"We explore two techniques for reducing memory latency in bus-based multiprocessors. The first one, designed for sector caches, is a snoopy cache coherence protocol that uses a large transfer block to take advantage of spatial locality, while using a small coherence block (called a subblock to avoid false sharing). The second technique is read snarfing (or read broadcasting), in which all caches can acquire data transmitted in response to a read request to update invalid blocks in their own cache. We evaluated the two techniques by simulating 6 applications that exhibit a variety of reference patterns. We compared the performance of the new protocol against that of the Illinois protocol with both small and large block sizes and found that it was effective in reducing memory latency and providing more consistent, good results than the Illinois protocol with a given line size. Read snarfing also improved performance mostly for protocols that use large line sizes.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131668712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improving performance by cache driven memory management
K. Westerholz, Stephen Honal, J. Plankl, C. Hafer
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386539
The efficient utilization of caches is crucial for a competitive memory hierarchy. The access times required by modern processors are continuously decreasing, and direct-mapped caches provide the shortest access time. Using them yields reduced hardware cost and fast memory access, but implies additional cache misses and thus performance degradation. Another source of conflicts is the addressing scheme when caches are physically addressed: for such caches, memory management affects cache utilization. Enhancements to virtual memory management presented in this paper reduce cache misses by as much as 80% for real-indexed caches. We developed three algorithms that use runtime information, all of which are suitable for direct-mapped and set-associative caches. Applied to the SPECint92 benchmark suite, we measured a performance improvement of 6.9% in a multiprogramming environment on an R4000-based UNIX workstation. This figure includes the overhead caused by the more complex memory management.
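The paper's three algorithms are not detailed in the abstract; as a minimal sketch of the underlying idea for real-indexed caches (page colouring, with hypothetical names and sizes), the allocator prefers a physical frame whose cache colour matches the virtual page, so pages that do not conflict in the virtual address space do not conflict in the physically indexed cache either.

```python
# Minimal sketch of page colouring for a physically indexed cache;
# the sizes and allocator below are assumptions for illustration, not
# the paper's runtime-information algorithms.

PAGE_SIZE = 4096
CACHE_SIZE = 256 * 1024          # 256 KB direct-mapped, real-indexed
NUM_COLOURS = CACHE_SIZE // PAGE_SIZE

def colour_of(page_number):
    return page_number % NUM_COLOURS

class ColouringAllocator:
    def __init__(self, num_frames):
        self.free = list(range(num_frames))

    def allocate(self, virtual_page):
        wanted = colour_of(virtual_page)
        for i, frame in enumerate(self.free):
            if colour_of(frame) == wanted:
                return self.free.pop(i)
        return self.free.pop(0)       # fall back to any free frame

alloc = ColouringAllocator(num_frames=1024)
frames = [alloc.allocate(vp) for vp in range(8)]
print([colour_of(f) for f in frames])   # colours match virtual pages 0..7
```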
{"title":"Improving performance by cache driven memory management","authors":"K. Westerholz, Stephen Honal, J. Plankl, C. Hafer","doi":"10.1109/HPCA.1995.386539","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386539","url":null,"abstract":"The efficient utilization of caches is crucial for a competitive memory hierarchy. Access times required by modern processors are continuously decreasing. Direct mapped caches provide the shortest access time. Using them yields reduced hardware costs and fast memory access but implies additional misses in the cache, resulting in performance degradation. Another source of conflicts is the addressing scheme if caches are physically addressed. For such caches, memory management affects cache utilization. Enhancements in virtual memory management as presented in this paper reduce cache misses by as much as 80% for real-indexed caches. We developed three algorithms that use runtime information. All of them are suitable for direct-mapped and set associative caches. Applied to SPECint92 benchmark suite, we measured a performance improvement of 6.9% in a multiprogramming environment for a R4000 based UNIX workstation. This figure also includes the overhead caused by the more complex memory management.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123829492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A design framework for hybrid-access caches
K. B. Theobald, H. Hum, G. Gao
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386547
High-speed microprocessors need fast on-chip caches in order to stay busy. Direct-mapped caches have better access times than set-associative caches but poorer miss rates. This has led to several hybrid on-chip caches that combine the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid access cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, which we call half-and-half caches: half direct-mapped and half set-associative. Simulations confirm the predictive value of the HAC model and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.
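As a toy illustration of the trade-off that motivates hybrid organizations (this is not the HAC model or the half-and-half design itself): a direct-mapped cache thrashes when two hot blocks share an index, while a small associative victim buffer, one well-known hybrid near the edge of this design space, removes the ping-pong misses.

```python
# Toy conflict-miss illustration; the half-and-half organizations studied
# in the paper are a different design point, this is only a sketch.

def simulate(trace, num_lines, victim_entries=0):
    tags = [None] * num_lines
    victims, misses = [], 0
    for block in trace:
        index, tag = block % num_lines, block
        if tags[index] == tag:
            continue                            # direct-mapped hit
        if tag in victims:
            victims.remove(tag)                 # hit in the victim buffer
        else:
            misses += 1
        if tags[index] is not None and victim_entries:
            victims.append(tags[index])         # evicted line becomes victim
            victims = victims[-victim_entries:]
        tags[index] = tag
    return misses

trace = [0, 64] * 1000          # two blocks that map to the same index
print(simulate(trace, num_lines=64))                    # 2000 misses
print(simulate(trace, num_lines=64, victim_entries=1))  # 2 misses
```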
{"title":"A design framework for hybrid-access caches","authors":"K. B. Theobald, H. Hum, G. Gao","doi":"10.1109/HPCA.1995.386547","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386547","url":null,"abstract":"High-speed microprocessors need fast on-chip caches in order to keep busy. Direct-mapped caches have better access times than set-associative caches, but poorer miss rates. This has led to several hybrid on-chip caches combining the speed of direct-mapped caches with the hit rates of associative caches. In this paper, we unify these hybrids within a single framework which we call the hybrid access cache (HAC) model. Existing hybrid caches lie near the edges of the HAC design space, leaving the middle untouched. We study a group of caches in this middle region, a group we call half-and-half caches, which are half direct-mapped and half set-associative. Simulations confirm the predictive valve of the HAC model, and demonstrate that, for medium to large caches, this middle region yields more efficient cache designs.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133753347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-grain multi-thread processor architecture for massively parallel processing
T. Kawano, S. Kusakabe, R. Taniguchi, M. Amamiya
Pub Date: 1995-01-22 | DOI: 10.1109/HPCA.1995.386532
Latency caused by remote memory accesses and remote procedure calls is one of the most serious problems in massively parallel computers. In order to eliminate the processor idle time caused by these latencies, processors must perform fast context switching among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that promotes efficient fine-grain multi-thread execution by performing fast context switching among fine-grain concurrent processes. In the Datarol-II processor, an implicit register load/store mechanism is embedded in the execution pipeline to reduce the memory access overhead caused by context switching. To reduce local memory access latency, a two-level hierarchical memory system and a load control mechanism are also introduced. We describe the Datarol-II processor architecture and show its evaluation results.
{"title":"Fine-grain multi-thread processor architecture for massively parallel processing","authors":"T. Kawano, S. Kusakabe, R. Taniguchi, M. Amamiya","doi":"10.1109/HPCA.1995.386532","DOIUrl":"https://doi.org/10.1109/HPCA.1995.386532","url":null,"abstract":"Latency, caused by remote memory access and remote procedure call, is one of the most serious problems in massively parallel computers. In order to eliminate the processors' idle time caused by these latencies, processors must perform fast context switching among fine-grain concurrent processes. In this paper, we propose a processor architecture, called Datarol-II, that promotes efficient fine-grain multi-thread execution by performing fast context switching among fine-grain concurrent processes. In the Datarol-II processor, an implicit register load/store mechanism is embedded in the execution pipeline in order to reduce memory access overhead caused by context switching. In order to reduce local memory access latency, a two-level hierarchical memory system and a load control mechanism are also introduced. We describe the Datarol-II processor architecture, and show its evaluation results.<<ETX>>","PeriodicalId":330315,"journal":{"name":"Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115206539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}