
Proceedings of the 20th Annual International Symposium on Computer Architecture: Latest Publications

Working Sets, Cache Sizes, And Node Granularity Issues For Large-scale Multiprocessors
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698542
E. Rothberg, J. Singh, Anoop Gupta
The distribution of resources among processors, memory and caches is a crucial question faced by designers of large-scale parallel machines. If a machine is to solve problems with a certain data set size, should it be built with a large number of processors each with a small amount of memory, or a smaller number of processors each with a large amount of memory? How much cache memory should be provided per processor for cost-effectiveness? And how do these decisions change as larger problems are run on larger machines? In this paper, we explore the above questions based on the characteristics of five important classes of large-scale parallel scientific applications. We first show that all the applications have a hierarchy of well-defined per-processor working sets, whose size, performance impact and scaling characteristics can help determine how large different levels of a multiprocessor's cache hierarchy should be. Then, we use these working sets together with certain other important characteristics of the applications—such as communication to computation ratios, concurrency, and load balancing behavior—to reflect upon the broader question of the granularity of processing nodes in high-performance multiprocessors. We find that very small caches whose sizes do not increase with the problem or machine size are adequate for all but two of the application classes. Even in the two exceptions, the working sets scale quite slowly with problem size, and the cache sizes needed for problems that will be run in the foreseeable future are small. We also find that relatively fine-grained machines, with large numbers of processors and quite small amounts of memory per processor, are appropriate for all the applications.
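The working-set notion used here can be made concrete by sweeping the cache size for a reference trace and looking for plateaus in the miss-rate curve; each plateau boundary marks one level of the working-set hierarchy. The sketch below illustrates that measurement on a synthetic two-level trace; the trace generator, the sizes, and the fully-associative LRU cache model are illustrative assumptions, not the paper's methodology or applications.

```python
from collections import OrderedDict

def miss_rate(trace, capacity):
    """Miss rate of a fully-associative LRU cache holding `capacity` blocks."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)           # refresh LRU position on a hit
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)      # evict the least recently used block
    return misses / len(trace)

# Synthetic trace with two working sets: a small inner set touched constantly
# and a larger outer set touched only occasionally.
trace = []
for i in range(200):
    trace.extend(range(64))                    # inner working set: 64 blocks
    if i % 10 == 0:
        trace.extend(range(64, 1024))          # outer working set: 1024 blocks total

for size in (16, 64, 256, 1024):
    print(f"{size:4d}-block cache: miss rate {miss_rate(trace, size):.4f}")
```

The knees of the resulting miss-rate curve are the working-set boundaries that the abstract uses to reason about how large each level of the cache hierarchy should be.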
Cited by: 118
The Detection And Elimination Of Useless Misses In Multiprocessors
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698548
M. Dubois, J. Skeppstedt, L. Ricciulli, Krishnan Ramamurthy, P. Stenström
In this paper we introduce a new classification of misses in shared-memory multiprocessors based on interprocessor communication. We identify the set of essential misses, i.e., the smallest set of misses necessary for correct execution. Essential misses include cold misses and true sharing misses. All other misses are useless misses and can be ignored without affecting the correctness of program execution. Based on the new classification we compare the effectiveness of five different protocols which delay and combine invalidations leading to useless misses. In cache-based systems the protocols are very effective and have miss rates close to the essential miss rate. In virtual shared memory systems the techniques are also effective but leave room for improvements.
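The distinction between essential and useless misses can be sketched with a toy trace-driven classifier. The version below assumes infinite per-CPU caches, a write-invalidate protocol, and an invented trace format, and it checks only the word that triggers a re-fetch; it is a first-access approximation of the flavor of the paper's classification, not its exact definitions.

```python
from collections import defaultdict

def classify_misses(trace):
    """Classify misses in a word-access trace as cold, true-sharing, or useless.

    `trace` is a list of (cpu, block, word, is_write) accesses. Infinite
    per-CPU caches and a write-invalidate protocol are assumed, and only the
    word that triggers a re-fetch is checked against the words other CPUs
    wrote since this CPU lost its copy (a first-access approximation).
    """
    cpus = {cpu for cpu, _, _, _ in trace}
    present = defaultdict(set)      # cpu -> blocks currently cached
    seen = defaultdict(set)         # cpu -> blocks ever touched
    stale = defaultdict(set)        # (cpu, block) -> words written by other CPUs
    counts = {"cold": 0, "true_sharing": 0, "useless": 0, "hit": 0}

    for cpu, block, word, is_write in trace:
        if block in present[cpu]:
            counts["hit"] += 1
        elif block not in seen[cpu]:
            counts["cold"] += 1
        elif word in stale[(cpu, block)]:
            counts["true_sharing"] += 1
        else:
            counts["useless"] += 1
        present[cpu].add(block)
        seen[cpu].add(block)
        stale[(cpu, block)].clear()

        if is_write:                # write-invalidate: remove every other copy
            for other in cpus - {cpu}:
                present[other].discard(block)
                stale[(other, block)].add(word)
    return counts

# Two CPUs falsely sharing one block: CPU 0 keeps writing word 0 while CPU 1
# only ever reads word 1, so CPU 1's re-fetches bring in no value it uses.
trace = [(0, 7, 0, True), (1, 7, 1, False)] * 4
print(classify_misses(trace))       # CPU 1's repeat misses are classified as useless
```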
Cited by: 126
Column-associative Caches: A Technique For Reducing The Miss Rate Of Direct-mapped Caches
Pub Date : 1993-05-01 DOI: 10.1145/165123.165153
A. Agarwal, S. Pudar
Direct-mapped caches are a popular design choice for high-performance processors; unfortunately, direct-mapped caches suffer systematic interference misses when more than one address maps into the same cache set. This paper describes the design of column-associative caches, which minimize the conflicts that arise in direct-mapped accesses by allowing conflicting addresses to dynamically choose alternate hashing functions, so that most of the conflicting data can reside in the cache. At the same time, however, the critical hit access path is unchanged. The key to implementing this scheme efficiently is the addition of a rehash bit to each cache set, which indicates whether that set stores data that is referenced by an alternate hashing function. When multiple addresses map into the same location, these rehashed locations are preferentially replaced. Using trace-driven simulations and an analytical model, we demonstrate that a column-associative cache removes virtually all interference misses for large caches, without altering the critical hit access time.
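The lookup and replacement flow described in the abstract (a direct-mapped first probe, a second probe under an alternate hash, a rehash bit per set, and preferential replacement of rehashed entries) can be sketched as follows. This is a tags-only illustration in which the alternate hash simply flips the top index bit and the replacement details are simplified; it is not a faithful reproduction of the published algorithm.

```python
class ColumnAssociativeCache:
    """Tags-only sketch of a column-associative cache."""

    def __init__(self, num_sets):
        assert num_sets & (num_sets - 1) == 0, "power-of-two number of sets assumed"
        self.num_sets = num_sets
        self.tag = [None] * num_sets            # one block per set, as in a direct-mapped cache
        self.rehashed = [False] * num_sets      # rehash bit: entry was placed via the alternate hash

    def _index(self, addr):
        return addr % self.num_sets                        # primary (direct-mapped) hash

    def _alt_index(self, addr):
        return self._index(addr) ^ (self.num_sets >> 1)    # alternate hash: flip the top index bit

    def access(self, addr):
        """Return True on a hit, False on a miss (and install the block)."""
        i, j = self._index(addr), self._alt_index(addr)
        if self.tag[i] == addr:                  # first probe: unchanged direct-mapped path
            return True
        if self.tag[j] == addr:                  # second probe: hit under the alternate hash
            # Swap so the block is found on the first probe next time; the displaced
            # block becomes the rehashed occupant of the alternate set.
            displaced = self.tag[i]
            self.tag[i], self.rehashed[i] = addr, False
            self.tag[j], self.rehashed[j] = displaced, displaced is not None
            return True
        # Miss: rehashed entries are preferentially replaced; otherwise the primary
        # occupant is displaced into the alternate set and marked rehashed.
        if not self.rehashed[i] and self.tag[i] is not None:
            self.tag[j], self.rehashed[j] = self.tag[i], True
        self.tag[i], self.rehashed[i] = addr, False
        return False

cache = ColumnAssociativeCache(num_sets=8)
# Addresses 3 and 11 collide in a direct-mapped cache with 8 sets; here only
# the first reference to each of them misses.
for addr in (3, 11, 3, 11, 3, 11):
    print(addr, "hit" if cache.access(addr) else "miss")
```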
Cited by: 280
Performance Of Cached Dram Organizations In Vector Supercomputers
Pub Date : 1993-05-01 DOI: 10.1145/165123.165170
W. Hsu, James E. Smith
DRAMs containing cache memory are studied in the context of vector supercomputers. In particular, we consider systems where processors have no internal data caches and memory reference streams are generated by vector instructions. For this application, we expect that cached DRAMs can provide high bandwidth at relatively low cost. We study both DRAMs with a single, long cache line and with smaller, multiple cache lines. Memory interleaving schemes that increase data locality are proposed and studied. The interleaving schemes are also shown to lead to non-uniform bank accesses, i.e. hot banks. This suggests there is an important optimization problem involving methods that increase locality to improve performance, but not so much that hot banks diminish performance. We show that for uniprocessor systems, both types of cached DRAMs work well with the proposed interleave methods. For multiprogrammed multiprocessors, the multiple cache line DRAMs work better.
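The hot-bank effect mentioned above is easy to reproduce with a toy address calculation: under simple low-order interleaving, any vector stride that shares a factor with the number of banks concentrates references on a subset of the banks. The scheme below is the textbook interleaving, chosen only to illustrate the effect; it is not one of the organizations evaluated in the paper.

```python
from collections import Counter

def bank_usage(stride, num_banks, refs=1024):
    """Distribution of vector references across banks under low-order
    interleaving (bank = word address mod num_banks)."""
    return Counter((i * stride) % num_banks for i in range(refs))

for stride in (1, 2, 8, 9):
    print(f"stride {stride}: {dict(sorted(bank_usage(stride, num_banks=8).items()))}")
```

With eight banks, a stride of 8 sends every reference to bank 0, while strides relatively prime to the bank count spread references evenly.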
Cited by: 56
Multiple Threads In Cyclic Register Windows
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698552
Yasuo Hidaka, H. Koike, Hidehiko Tanaka
Multi-threading is often used to compile logic and functional languages, and to implement parallel C libraries. Fine-grain multi-threading requires rapid context switching, which can be slow on architectures with register windows. In the past, researchers have either proposed new hardware support for dynamic allocation of windows to threads, or have sacrificed fast procedure calls by fixed allocation of windows to threads. In this paper, a novel window management algorithm, which retains both fast procedure calls and fast context switching, is proposed. The algorithm has been implemented on the SPARC processor by modifying window trap handlers. A quantitative evaluation of the scheme using a multi-threaded application with various concurrency and granularity levels is given. The evaluation shows that the proposed scheme always does better than the other schemes. Some implications for multi-threaded architectures are also presented.
Cited by: 22
Limitations Of Cache Prefetching On A Bus-based Multiprocessor
Pub Date : 1993-05-01 DOI: 10.1145/165123.165163
D. Tullsen, S. Eggers
Compiler-directed cache prefetching has the potential to hide much of the high memory latency seen by current and future high-performance processors. However, prefetching is not without costs, particularly on a multiprocessor. Prefetching can negatively affect bus utilization, overall cache miss rates, memory latencies and data sharing. We simulated the effects of a particular compiler-directed prefetching algorithm, running on a bus-based multiprocessor. We showed that, despite a high memory latency, this architecture is not very well-suited for prefetching. For several variations on the architecture, speedups for five parallel programs were no greater than 39%, and degradations were as high as 7%, when prefetching was added to the workload. We examined the sources of cache misses, in light of several different prefetching strategies, and pinpointed the causes of the performance changes. Invalidation misses pose a particular problem for current compiler-directed prefetchers. We applied two techniques that reduced their impact: a special prefetching heuristic tailored to write-shared data, and restructuring shared data to reduce false sharing, thus allowing traditional prefetching algorithms to work well.
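Restructuring shared data to reduce false sharing, one of the two techniques mentioned above, usually amounts to padding per-processor data out to cache-line boundaries. The arithmetic below shows the general idea with invented sizes (8-byte counters, 64-byte lines); it illustrates the technique, not the paper's specific restructuring.

```python
LINE_SIZE = 64          # bytes per cache line (assumed)
COUNTER_SIZE = 8        # bytes per per-processor counter (assumed)
NUM_CPUS = 8

def lines_touched(element_stride):
    """Cache line holding each CPU's counter for a given element stride in bytes."""
    return [(cpu * element_stride) // LINE_SIZE for cpu in range(NUM_CPUS)]

print("packed :", lines_touched(COUNTER_SIZE))   # all 8 counters share line 0 -> false sharing
print("padded :", lines_touched(LINE_SIZE))      # one counter per line -> no false sharing
```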
Cited by: 77
Cache Write Policies And Performance
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698560
N. Jouppi
This paper investigates issues involving writes and caches. First, tradeoffs on writes that miss in the cache are investigated. In particular, we consider whether the missed cache block is fetched on a write miss, whether the missed cache block is allocated in the cache, and whether the cache line is written before the hit or miss outcome is known. Depending on the combination of these policies chosen, the entire cache miss rate can vary by a factor of two on some applications. The combination of no-fetch-on-write and write-allocate can provide better performance than cache line allocation instructions. Second, tradeoffs between write-through and write-back caching when writes hit in a cache are considered. A mixture of these two alternatives, called write caching, is proposed. Write caching places a small fully-associative cache behind a write-through cache. A write cache can eliminate almost as much write traffic as a write-back cache.
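The write cache described above (a small fully-associative buffer behind a write-through cache that coalesces writes to the same block) can be sketched in a few lines. The entry count, the LRU replacement, and the traffic accounting below are illustrative assumptions rather than the paper's exact design.

```python
from collections import OrderedDict

class WriteCache:
    """Tiny fully-associative write cache behind a write-through cache (sketch)."""

    def __init__(self, num_entries=4):
        self.num_entries = num_entries
        self.entries = OrderedDict()           # block -> set of dirty word offsets
        self.blocks_written_back = 0

    def write(self, block, offset):
        if block not in self.entries:
            if len(self.entries) == self.num_entries:
                self.entries.popitem(last=False)       # evict the LRU block to memory
                self.blocks_written_back += 1
            self.entries[block] = set()
        self.entries[block].add(offset)                # coalesce writes to the same block
        self.entries.move_to_end(block)

    def flush(self):
        self.blocks_written_back += len(self.entries)
        self.entries.clear()

# 64 blocks written one word at a time: plain write-through would send all 256
# word writes to memory; the write cache turns them into 64 block write-backs.
wc = WriteCache(num_entries=4)
for block in range(64):
    for offset in range(4):
        wc.write(block, offset)
wc.flush()
print(wc.blocks_written_back)        # 64
```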
Cited by: 254
Odd Memory Systems May Be Quite Interesting
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698574
André Seznec, J. Lenfant
Using a prime number N of memory banks on a vector processor allows conflict-free access to any slice of N consecutive elements of a vector stored with a stride that is not a multiple of N. To reject the use of a prime (or odd) number N of memory banks, it is generally argued that address computation for such a memory system would require a systematic Euclidean division by the number N. We first show that the well-known Chinese Remainder Theorem allows one to define a very simple mapping of data onto the memory banks for which address computation does not require any Euclidean division. Massively parallel SIMD computers may have several thousands of processors. When the memory on such a machine is globally shared, routing vectors from memory to the processors is a major difficulty; the control for the interconnection network cannot generally be computed at execution time. When the number of memory banks and processors is a product of prime numbers, the family of permutations needed for routing vectors from memory to the processors through the interconnection network has very specific properties. The Chinese Remainder Network presented in this paper is able to execute all these permutations in a single pass and may be self-routed.
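The central observation can be checked directly: with a prime number of banks N and a power-of-two bank size 2^k, gcd(N, 2^k) = 1, so by the Chinese Remainder Theorem the pair (a mod N, a mod 2^k) identifies an address a uniquely for a < N * 2^k. The local offset is therefore just the low-order address bits, and no division by N is ever needed. The sizes below are arbitrary illustrative choices.

```python
def crt_map(addr, num_banks, bank_size):
    """Bank number and local offset under the Chinese-Remainder mapping.

    num_banks is prime and bank_size a power of two, so the offset is simply
    the low-order address bits; no Euclidean division by num_banks occurs."""
    return addr % num_banks, addr & (bank_size - 1)

NUM_BANKS, BANK_SIZE = 17, 1 << 10           # 17 banks of 1024 words (illustrative)

# The mapping is a bijection over the whole address space ...
locations = {crt_map(a, NUM_BANKS, BANK_SIZE) for a in range(NUM_BANKS * BANK_SIZE)}
assert len(locations) == NUM_BANKS * BANK_SIZE

# ... and any 17 consecutive elements accessed with a stride not divisible by
# 17 fall into 17 distinct banks, i.e. the access is conflict-free.
for stride in (1, 2, 3, 5, 16, 100):
    banks = {crt_map(i * stride, NUM_BANKS, BANK_SIZE)[0] for i in range(NUM_BANKS)}
    assert len(banks) == NUM_BANKS

print("bijection and conflict-free strided access verified")
```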
Cited by: 15
The Cedar System And An Initial Performance Study
Pub Date : 1993-05-01 DOI: 10.1145/285930.286005
D. Kuck, E. Davidson, D. Lawrie, A. Sameh, Chuanqi Zhu
In this paper, we give an overview of the Cedar multiprocessor and present recent performance results. These include the performance of some computational kernels and the Perfect Benchmarks®. We also present a methodology for judging parallel system performance and apply this methodology to Cedar, Cray YMP-8, and Thinking Machines CM-5.
Cited by: 49