Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors
Pub Date : 1993-05-01  DOI: 10.1109/ISCA.1993.698542
E. Rothberg, J. Singh, Anoop Gupta
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.
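As a rough illustration of how such working-set scaling information might feed into cache sizing, the sketch below checks hypothetical per-processor working-set curves against a candidate cache size. The scaling laws and constants are invented for illustration only; they are not the working sets measured in the paper.

```python
# Illustrative only: hypothetical per-processor working-set scaling laws,
# NOT the measured working sets from the paper.
import math

def working_set_bytes(dataset_bytes, num_procs, scaling):
    """Return a hypothetical per-processor working-set size.

    scaling: 'constant' -> does not grow with problem/machine size
             'sqrt'     -> grows as the square root of the per-processor partition
             'per_proc' -> proportional to the per-processor data partition
    """
    partition = dataset_bytes / num_procs
    if scaling == "constant":
        return 64 * 1024                         # assumed fixed 64 KB working set
    if scaling == "sqrt":
        return 4 * 1024 * math.sqrt(partition)   # arbitrary illustrative constant
    if scaling == "per_proc":
        return partition
    raise ValueError(scaling)

def fits(dataset_bytes, num_procs, cache_bytes, scaling):
    return working_set_bytes(dataset_bytes, num_procs, scaling) <= cache_bytes

# Example: a 4 GB problem on 1024 processors with a 256 KB per-processor cache.
for s in ("constant", "sqrt", "per_proc"):
    print(s, fits(4 << 30, 1024, 256 << 10, s))
```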
{"title":"Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors","authors":"E. Rothberg, J. Singh, Anoop Gupta","doi":"10.1109/ISCA.1993.698542","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698542","url":null,"abstract":"The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines?\u0000In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors.\u0000We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134520873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Detection And Elimination Of Useless Misses In Multiprocessors
Pub Date : 1993-05-01  DOI: 10.1109/ISCA.1993.698548
M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström
In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.
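A minimal sketch of the classification idea for a single cache block under an invalidation-based protocol: a miss is cold on a processor's first reference, a true sharing miss if it fetches a word that another processor wrote since this processor lost its copy, and useless otherwise. This is a drastic simplification for illustration, not the paper's per-lifetime algorithm.

```python
# Toy miss classifier for a single cache block under invalidation-based
# coherence, tracking sharing at word granularity.
from collections import defaultdict

class BlockSim:
    def __init__(self):
        self.valid = set()               # processors currently holding a copy
        self.seen = set()                # processors that ever touched the block
        self.stale = defaultdict(set)    # proc -> words written by others
                                         #         since proc lost its copy
        self.counts = defaultdict(int)

    def access(self, proc, word, is_write):
        if proc not in self.valid:            # this access misses
            if proc not in self.seen:
                kind = "cold"
            elif word in self.stale[proc]:
                kind = "true_sharing"         # needs a word someone else wrote
            else:
                kind = "useless"              # invalidated, but no new data needed
            self.counts[kind] += 1
            self.valid.add(proc)
            self.stale[proc].clear()
        self.seen.add(proc)
        if is_write:                          # invalidate every other copy and
            for other in list(self.valid):    # remember which word they now lack
                if other != proc:
                    self.valid.discard(other)
            for other in self.stale:
                if other != proc:
                    self.stale[other].add(word)

sim = BlockSim()
trace = [(0, 0, True),    # P0: cold miss
         (1, 1, False),   # P1: cold miss
         (0, 0, True),    # P0 rewrites word 0, invalidating P1
         (1, 1, False),   # P1 re-reads word 1: useless (false-sharing) miss
         (0, 0, True),    # P0 rewrites word 0 again, invalidating P1
         (1, 0, False)]   # P1 reads the new word 0: true sharing miss
for p, w, wr in trace:
    sim.access(p, w, wr)
print(dict(sim.counts))   # essential misses = cold + true sharing
```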
{"title":"The Detection And Elimination Of Useless Misses In Multiprocessors","authors":"M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström","doi":"10.1109/ISCA.1993.698548","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698548","url":null,"abstract":"In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127592898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Column-associative Caches: A Technique For Reducing The Miss Rate Of Direct-mapped Caches
Pub Date : 1993-05-01  DOI: 10.1145/165123.165153
A. Agarwal, S. Pudar
Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.
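A simplified sketch of the lookup flow the abstract describes: probe the bit-selection index first, fall back to a bit-flipping rehash index, swap on a second-probe hit, and preferentially replace locations whose rehash bit is set. Details beyond the abstract (such as victim placement on a double miss) are assumptions, not the paper's exact state machine.

```python
# Simplified column-associative cache: a direct-mapped tag array whose sets can
# also be reached through an alternate hash (flip the high index bit).  Each set
# carries a rehash bit marking data stored under the alternate hash.
class ColumnAssociativeCache:
    def __init__(self, num_sets):
        assert num_sets & (num_sets - 1) == 0, "power-of-two number of sets"
        self.num_sets = num_sets
        self.tags = [None] * num_sets
        self.rehash = [False] * num_sets

    def _h1(self, addr):                              # bit selection
        return addr % self.num_sets

    def _h2(self, addr):                              # bit flipping
        return self._h1(addr) ^ (self.num_sets >> 1)

    def access(self, addr):
        """Return True on a hit; install the block on a miss."""
        i1, i2 = self._h1(addr), self._h2(addr)
        if self.tags[i1] == addr:                     # first-probe hit
            return True
        if self.rehash[i1]:                           # occupant is itself rehashed:
            self.tags[i1] = addr                      # replace it, no second probe
            self.rehash[i1] = False
            return False
        if self.tags[i2] == addr:                     # second-probe hit: swap so the
            self.tags[i1], self.tags[i2] = self.tags[i2], self.tags[i1]
            self.rehash[i1], self.rehash[i2] = False, True   # hot block sits at h1
            return True
        # Double miss: install at h1 and displace the old occupant into the h2
        # slot with its rehash bit set (this placement is an assumption).
        self.tags[i2], self.rehash[i2] = self.tags[i1], True
        self.tags[i1], self.rehash[i1] = addr, False
        return False

cache = ColumnAssociativeCache(8)
a, b = 0x10, 0x18         # both select set 0 under bit selection
print([cache.access(x) for x in (a, b, a, b)])   # [False, False, True, True]
```

Two blocks that would thrash in a plain direct-mapped cache can co-reside, while an access that hits on the first probe pays no extra latency.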
{"title":"Column-associative Caches: A Technique For Reducing The Miss Rate Of Direct-mapped Caches","authors":"A. Agarwal, S. Pudar","doi":"10.1145/165123.165153","DOIUrl":"https://doi.org/10.1145/165123.165153","url":null,"abstract":"Direct-mapped caches are a popular design choice for highperfortnsnce processors;unfortunately, direct-mapped cachessuffer systematic interference misses when more than one address maps into the sensecache set. This paper &scribes the design of column-ossociotive caches.which minhize the cofllcrs that arise in direct-mapped accessesby allowing conflicting addressesto dynamically choose alternate hashing functions, so that most of the cordiicting datacanreside in thecache. At the sametime, however, the critical hit accesspath is unchanged. The key to implementing this schemeefficiently is the addition of a reho.dsM to eachcache se~ which indicates whether that set storesdata that is referenced by an alternate hashing timction. When multiple addressesmap into the samelocatioz theserehoshed locatwns are preferentially replaced. Using trace-driven simulations and en analytical model, we demonstrate that a column-associative cacheremovesvirtually all interference missesfor large caches,without altering the critical hit accesstime.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124554078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance Of Cached Dram Organizations In Vector Supercomputers
Pub Date : 1993-05-01  DOI: 10.1145/165123.165170
W. Hsu, James E. Smith
DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost. We study both DRAMs with a single, long cache line and with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e., hot banks. This suggests there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.
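As a generic illustration of the tension described above, the sketch below compares two textbook interleavings (not necessarily the paper's proposals): low-order word interleaving spreads consecutive words across banks, while block interleaving keeps them in one bank, which helps that bank's cache line but makes references bursty and non-uniform.

```python
# Generic bank-interleaving comparison; parameters and schemes are illustrative.
from collections import Counter

NUM_BANKS = 16
BLOCK_WORDS = 64            # assumed cached-DRAM cache-line length, in words

def low_order_bank(word_addr):
    return word_addr % NUM_BANKS                      # spread consecutive words

def block_bank(word_addr):
    return (word_addr // BLOCK_WORDS) % NUM_BANKS     # keep a line's words together

def bank_counts(addresses, mapping):
    return Counter(mapping(a) for a in addresses)

def longest_same_bank_run(addresses, mapping):
    best = run = 0
    prev = None
    for a in addresses:
        b = mapping(a)
        run = run + 1 if b == prev else 1
        best, prev = max(best, run), b
    return best

unit_stride = list(range(4096))                       # stride-1 vector access
print("busiest bank, low-order:", max(bank_counts(unit_stride, low_order_bank).values()))
print("busiest bank, block    :", max(bank_counts(unit_stride, block_bank).values()))
print("same-bank run, low-order:", longest_same_bank_run(unit_stride, low_order_bank))
print("same-bank run, block    :", longest_same_bank_run(unit_stride, block_bank))
# Block interleaving produces long same-bank runs: good reuse in that bank's
# cache line, but bursty load on one bank at a time -- the hot-bank tradeoff.
```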
{"title":"Performance Of Cached Dram Organizations In Vector Supercomputers","authors":"W. Hsu, James E. Smith","doi":"10.1145/165123.165170","DOIUrl":"https://doi.org/10.1145/165123.165170","url":null,"abstract":"DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost.\u0000We study both DRAMs with a single, long cache line and with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e. hot banks. This suggest there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126915209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Threads In Cyclic Register Windows
Pub Date : 1993-05-01  DOI: 10.1109/ISCA.1993.698552
Yasuo Hidaka, H. Koike, Hidehiko Tanaka
Multi-threading is often used to compile logic and functional languages, and to implement parallel C libraries. Fine-grain multi-threading requires rapid context switching, which can be slow on architectures with register windows. In the past, researchers have either proposed new hardware support for dynamic allocation of windows to threads, or have sacrificed fast procedure calls by fixed allocation of windows to threads. In this paper, a novel window management algorithm, which retains both fast procedure calls and fast context switching, is proposed. The algorithm has been implemented on the SPARC processor by modifying window trap handlers. A quantitative evaluation of the scheme using a multi-threaded application with various concurrency and granularity levels is given. The evaluation shows that the proposed scheme always does better than the other schemes. Some implications for multi-threaded architectures are also presented.
{"title":"Multiple Threads In Cyclic Register Windows","authors":"Yasuo Hidaka, H. Koike, Hidehiko Tanaka","doi":"10.1109/ISCA.1993.698552","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698552","url":null,"abstract":"Multi-threading is often used to compile logic and functional languages, and implement parallel C libraries. Fine-grain multi-threading requires rapid context switching, which can be slow on architectures with register windows. In past, researchers have either proposed new hardware support for dynamic allocation of windows to threads, or have sacrificed fast procedure calls by fixed allocation of windows to threads. In this paper, a novel window management algorithm, which retains both fast procedure calls and fast context switching, is proposed. The algorithm has been implemented on the SPARC processor by modifying window trap handlers. A quantitative evaluation of the scheme using a multi-threaded application with various concurrency and granularity levels is given. The evaluation shows that the proposed scheme always does better than the other schemes. Some implications for multi-threaded architectures are also presented.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129969784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Limitations Of Cache Prefetching On A Bus-based Multiprocessor
Pub Date : 1993-05-01  DOI: 10.1145/165123.165163
D. Tullsen, S. Eggers
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm, running on a bus-based multiprocessor. We showed that, despite a high memory latency, this architecture is not very well-suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses, in light of several different prefetching strategies, and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.
{"title":"Limitations Of Cache Prefetching On A Bus-based Multiprocessor","authors":"D. Tullsen, S. Eggers","doi":"10.1145/165123.165163","DOIUrl":"https://doi.org/10.1145/165123.165163","url":null,"abstract":"Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm, running on a bus-based multiprocesssor. We showed that, despite a high memory latency, this architecture is not very well-suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses, in light of several different prefetching strategies, and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121467658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cache Write Policies And Performance
Pub Date : 1993-05-01  DOI: 10.1109/ISCA.1993.698560
N. Jouppi
This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before hit or miss is known are considered. Depending on the combination of these policies chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching, is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.
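A toy comparison of the write-hit tradeoff: it counts next-level write transactions for pure write-through, for write-back, and for write-through backed by a small fully-associative write cache that coalesces writes to the same block. The stream, block size, and LRU management are assumptions for illustration, not the configurations measured in the paper.

```python
# Toy write-traffic comparison; all parameters are illustrative assumptions.
from collections import OrderedDict
import random

BLOCK_WORDS = 4
random.seed(1)

# Synthetic word-write stream: a sequential sweep interleaved with writes to a
# small set of frequently rewritten words (to give some coalescing opportunity).
writes = []
for i in range(5000):
    writes.append(i % 4096)                # sequential sweep
    writes.append(random.randrange(32))    # hot words

def write_through(stream):
    return len(stream)                     # one next-level write per store

def coalescing_buffer(stream, entries):
    """LRU buffer of dirty blocks; counts block writes to the next level.
    With many entries this approximates a write-back cache; with few entries it
    approximates the small fully-associative write cache behind a write-through
    cache."""
    buf, written = OrderedDict(), 0
    for w in stream:
        block = w // BLOCK_WORDS
        if block in buf:
            buf.move_to_end(block)         # coalesce with the pending block
        else:
            if len(buf) >= entries:
                buf.popitem(last=False)    # write the LRU dirty block back
                written += 1
            buf[block] = True
    return written + len(buf)              # flush what is still pending

print("write-through      :", write_through(writes))
print("write-back (1024)  :", coalescing_buffer(writes, 1024))
print("write cache (8)    :", coalescing_buffer(writes, 8))
```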
{"title":"Cache Write Policies And Performance","authors":"N. Jouppi","doi":"10.1109/ISCA.1993.698560","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698560","url":null,"abstract":"This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before hit or miss is known are considered. Depending on the combination of these polices chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122418175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Odd Memory Systems May Be Quite Interesting
Pub Date : 1993-05-01  DOI: 10.1109/ISCA.1993.698574
André Seznec, J. Lenfant
Using a prime number N of memory banks on a vector processor allows conflict-free access for any slice of N consecutive elements of a vector stored with a stride that is not a multiple of N. To reject the use of a prime (or odd) number N of memory banks, it is generally argued that address computation for such a memory system would require systematic Euclidean division by the number N. We first show that the well-known Chinese Remainder Theorem allows us to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. Massively parallel SIMD computers may have several thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot generally be computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors from memory to the processors through the interconnection network has very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be self-routed.
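A short sketch of the kind of mapping the Chinese Remainder Theorem makes possible: with N prime banks of M words each and gcd(N, M) = 1, the pair (a mod N, a mod M) addresses every word of an N*M-word memory exactly once, and when M is a power of two the in-bank offset is a simple bit mask rather than a division by N. The parameters below are illustrative.

```python
from math import gcd

N = 17           # prime number of memory banks (illustrative)
M = 1 << 10      # words per bank; power of two, so a mod M is just a bit mask
assert gcd(N, M) == 1

def map_addr(a):
    """CRT-style mapping: (bank index, in-bank offset); the offset needs no
    division by N, only a mask."""
    return a % N, a & (M - 1)

# The Chinese Remainder Theorem guarantees this is a bijection on [0, N*M):
assert len({map_addr(a) for a in range(N * M)}) == N * M

# A slice of N consecutive vector elements whose stride is not a multiple of N
# touches N distinct banks, i.e. the access is conflict-free:
def banks_touched(base, stride, count=N):
    return {map_addr(base + i * stride)[0] for i in range(count)}

print(len(banks_touched(base=3, stride=5)))    # 17 -> conflict-free
print(len(banks_touched(base=0, stride=N)))    # 1  -> all in one bank
```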
{"title":"Odd Memory Systems May Be Quite Interesting","authors":"André Seznec, J. Lenfant","doi":"10.1109/ISCA.1993.698574","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698574","url":null,"abstract":"Using a prime number of N of memory banks on a vector processor allows a conflict-free access for any slice of N consecutive elements of a vector stored with a stride not multiple of N.\u0000To reject the use of a prime (or odd) number N of memory banks, it is generally advanced that address computation for such a memory system would require systematic Euclidean Division by the number N. We first show that the well known Chinese Remainder Theorem allows to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean Division.\u0000Massively parallel SIMD computers may have several thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot be generally computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors for memory to the processors through the interconnection network have very specific properties. The Chinese Remainder Network presented in the paper is able to execute all these permutations in a single path and may be self-routed.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131764388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Cedar System And An Initial Performance Study
Pub Date : 1993-05-01  DOI: 10.1145/285930.286005
D. Kuck, E. Davidson, D. Lawrie, A. Sameh, Chuanqi Zhu
In this paper, we give an overview of the Cedar multiprocessor and present recent performance results. These include the performance of some computational kernels and the Perfect Benchmarks®. We also present a methodology for judging parallel system performance and apply this methodology to Cedar, Cray YMP-8, and Thinking Machines CM-5.
{"title":"The Cedar System And An Initial Performance Study","authors":"D. Kuck, E. Davidson, D. Lawrie, A. Sameh, Chuanqi Zhu","doi":"10.1145/285930.286005","DOIUrl":"https://doi.org/10.1145/285930.286005","url":null,"abstract":"In thts paper. we give an overmew of the Cedar mutliprocessor and present recent performance results. These tnclude the performance of some computational kernels and the Perfect Benchmarks@ . We also pnsent a methodology for judging parallel system performance and apply this methodology to Cedar, Gray YMP-8, and Thinking Machines CM-S.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124871943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}