
Proceedings of the 20th Annual International Symposium on Computer Architecture: Latest Publications

Evaluation Of Mechanisms For Fine-grained Parallel Programs In The J-Machine And The CM-5
Ellen Spertus, S. Goldstein, K. Schauser, T. V. Eicken, D. Culler, W. Dally
This paper uses an abstract machine approach to compare the mechanisms of two parallel machines: the J-Machine and the CM-5. High-level parallel programs are translated by a single optimizing compiler to a fine-grained abstract parallel machine, TAM. A final compilation step is unique to each machine and optimizes for specifics of the architecture. By determining the cost of the primitives and weighting them by their dynamic frequency in parallel programs, we quantify the effectiveness of the following mechanisms individually and in combination. Efficient processor/network coupling proves valuable. Message dispatch is found to be less valuable without atomic operations that allow the scheduling levels to cooperate. Multiple hardware contexts are of small value when the contexts cooperate and the compiler can partition the register set. Tagged memory provides little gain. Finally, the performance of the overall system is strongly influenced by the performance of the memory system and the frequency of control operations.
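The paper's cost-weighting methodology can be illustrated with a small sketch: multiply each primitive's cycle cost by its dynamic frequency to estimate its contribution to average runtime. All numbers and operation names below are invented for illustration, not the paper's measurements.

```python
# Hypothetical sketch of cost-weighting: each mechanism primitive has a
# per-use cost (cycles) and a dynamic frequency (uses per TAM instruction).
# Their products estimate each primitive's contribution to runtime.

def weighted_cost(costs, freqs):
    """Cycles per instruction attributable to each primitive."""
    return {op: costs[op] * freqs[op] for op in costs}

# Illustrative inputs (made up, not from the paper):
costs = {"send": 10, "dispatch": 25, "sync": 8}         # cycles per use
freqs = {"send": 0.05, "dispatch": 0.04, "sync": 0.10}  # uses per instruction

contrib = weighted_cost(costs, freqs)
total = sum(contrib.values())  # overall overhead, cycles per instruction
```

Comparing two machines then amounts to evaluating this sum with each machine's primitive costs under the same dynamic frequencies.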
Pub Date : 1993-05-01 DOI: 10.1145/165123.165165
Citations: 56
Architectural Requirements Of Parallel Scientific Applications With Explicit Communication
R. Cypher, Alex Ho, S. Konstantinidou, P. Messina
This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. We use the results to evaluate the trade-offs in the design of multicomputer architectures.
Pub Date : 1993-05-01 DOI: 10.1145/165123.165124
Citations: 158
A Case For Two-way Skewed-associative Caches
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698558
André Seznec
We introduce a new organization for multi-bank caches: the skewed-associative cache. A two-way skewed-associative cache has the same hardware complexity as a two-way set-associative cache, yet simulations show that it typically exhibits the same hit ratio as a four-way set-associative cache of the same size. Skewed-associative caches should therefore be preferred to set-associative caches. Until the last three years, external caches were used and their size could be relatively large. Previous studies have shown that, for cache sizes larger than 64 Kbytes, direct-mapped caches exhibit hit ratios nearly as good as set-associative caches at a lower hardware cost. Moreover, the cache hit time of a direct-mapped cache can be significantly smaller than that of a set-associative cache, because optimistic use of data flowing out from the cache is quite natural. But now, microprocessors are designed with small on-chip caches. The performance of low-end microprocessor systems depends strongly on cache behavior. Simulations show that using some associativity in on-chip caches can boost the performance of these low-end systems. When considering optimistic use of data (or instructions) flowing out from the cache, the cache hit time of a two-way skewed-associative (or set-associative) cache is very close to that of a direct-mapped cache. Therefore two-way skewed-associative caches represent the best tradeoff for today's microprocessors with on-chip caches in the 4-8 Kbyte range.
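A minimal sketch of the lookup idea, assuming invented skewing functions (the paper defines its own): each bank indexes with a different function, so addresses that conflict in one bank usually fall into different sets of the other.

```python
# Sketch of a two-way skewed-associative cache lookup. The two banks use
# different indexing functions, so tags that collide in bank 0 rarely
# collide in bank 1 as well. Both functions below are illustrative.

NUM_SETS = 64  # sets per bank (illustrative)

def index0(tag):
    return tag % NUM_SETS

def index1(tag):
    # XOR-fold higher address bits into the index: a simple skewing function.
    return (tag ^ (tag >> 6)) % NUM_SETS

class SkewedCache:
    def __init__(self):
        self.banks = [dict(), dict()]  # set index -> stored tag

    def access(self, tag):
        """Return True on a hit; on a miss, fill bank 0 (replacement policy omitted)."""
        for bank, idx in zip(self.banks, (index0(tag), index1(tag))):
            if bank.get(idx) == tag:
                return True
        self.banks[0][index0(tag)] = tag
        return False
```

For example, tags 0 and 64 map to the same set in bank 0 but to different sets in bank 1, which is the conflict-spreading effect the organization relies on.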
Citations: 278
Register Connection: A New Approach To Adding Registers Into Instruction Set Architectures
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698565
T. Kiyohara, S. Mahlke, William Y. Chen, Roger A. Bringmann, R. Hank, S. Anik, Wen-mei W. Hwu
Code optimization and scheduling for superscalar and superpipelined processors often increase the register requirement of programs. For existing instruction sets with a small to moderate number of registers, this increased register requirement can be a factor that limits the effectiveness of the compiler. In this paper, we introduce a new architectural method for adding a set of extended registers into an architecture. Using a novel concept of connection, this method allows the data stored in the extended registers to be accessed by instructions that apparently reference core registers. Furthermore, we address the technical issues involved in applying the new method to an architecture: instruction set extension, procedure call convention, context switching considerations, upward compatibility, efficient implementation, compiler support, and performance. Experimental results based on a prototype compiler and execution driven simulation show that the proposed method can significantly improve the performance of superscalar processors with a small or moderate number of registers.
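A hypothetical sketch of the connection concept: a connect operation redirects a core register name to an extended register, so unmodified instructions that name core registers transparently reach the extended file. Class, method, and size choices below are illustrative, not the paper's ISA.

```python
# Illustrative model of register connection: reads and writes that name a
# core register are redirected to an extended register while a connection
# is active, and fall back to the core register once disconnected.

class RegisterFile:
    def __init__(self, num_core=32, num_ext=64):
        self.core = [0] * num_core
        self.ext = [0] * num_ext
        self.connection = {}  # core register number -> extended register number

    def connect(self, core_r, ext_r):
        self.connection[core_r] = ext_r

    def disconnect(self, core_r):
        self.connection.pop(core_r, None)

    def read(self, r):
        return self.ext[self.connection[r]] if r in self.connection else self.core[r]

    def write(self, r, value):
        if r in self.connection:
            self.ext[self.connection[r]] = value
        else:
            self.core[r] = value
```

The point of the indirection is that instruction encodings never change: only the (small) connection state determines which physical register a core register name reaches.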
Citations: 39
Architectural Support For Translation Table Management In Large Address Space Machines
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698544
Jerome C. Huck, Jim Hays
Virtual memory page translation tables provide mappings from virtual to physical addresses. When the hardware-controlled Translation Lookaside Buffers (TLBs) do not contain a translation, these tables provide it. Approaches to the structure and management of these tables vary from full hardware implementations to completely software-based algorithms. The size of the virtual address space used by processes is rapidly growing beyond 32 bits of address. As the utilized address space increases, new problems and issues surface. Traditional methods for managing the page translation tables are inappropriate for large address space architectures. The Hashed Page Table (HPT), described here, provides a very fast and space-efficient translation table that reduces overhead by splitting TLB management responsibilities between hardware and software. Measurements demonstrate its applicability to a diverse range of operating systems and workloads and, in particular, to large virtual address space machines. In simulations of over 4 billion instructions, improvements of 5 to 10% were observed.
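The HPT idea can be sketched as a hash table keyed by virtual page number, so the table's size tracks physical memory rather than the (huge) virtual address space. The hash function and sizes below are illustrative assumptions, not the paper's design.

```python
# Toy hashed page table: virtual page numbers hash into a fixed number of
# buckets; each bucket chains (vpn, pfn) entries. A miss here would be a
# page fault handled by software.

TABLE_SIZE = 1024  # number of hash buckets (illustrative)

def hpt_hash(vpn):
    return (vpn ^ (vpn >> 10)) % TABLE_SIZE  # simple illustrative hash

class HashedPageTable:
    def __init__(self):
        self.buckets = [[] for _ in range(TABLE_SIZE)]  # chains of (vpn, pfn)

    def insert(self, vpn, pfn):
        self.buckets[hpt_hash(vpn)].append((vpn, pfn))

    def lookup(self, vpn):
        """Return the physical frame number, or None if the page is unmapped."""
        for v, p in self.buckets[hpt_hash(vpn)]:
            if v == vpn:
                return p
        return None
```

Because the table is indexed by a hash of the full virtual page number, the same structure works unchanged whether virtual addresses are 32 or 64 bits wide.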
Citations: 144
Parity Logging Overcoming The Small Write Problem In Redundant Disk Arrays
Daniel Stodolsky, G. Gibson, M. Holland
Parity encoded redundant disk arrays provide highly reliable, cost effective secondary storage with high performance for read accesses and large write accesses. Their performance on small writes, however, is much worse than that of mirrored disks, the traditional, highly reliable, but expensive organization for secondary storage. Unfortunately, small writes are a substantial portion of the I/O workload of many important, demanding applications such as on-line transaction processing. This paper presents parity logging, a novel solution to the small write problem for redundant disk arrays. Parity logging applies journalling techniques to substantially reduce the cost of small writes. We provide a detailed analysis of parity logging and competing schemes (mirroring, floating storage, and RAID level 5) and verify these models by simulation. Parity logging provides performance competitive with mirroring, the best of the alternative single-failure-tolerating disk array organizations. However, its overhead cost is close to the minimum offered by RAID level 5. Finally, parity logging can exploit data caching much more effectively than all three alternative approaches.
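The small write being optimized follows the standard RAID parity identity: new parity = old parity XOR old data XOR new data. In parity logging, the XOR of old and new data (the parity update image) is appended to a log and applied to parity in large batches rather than immediately. The helper names below are illustrative, and the toy operates on byte strings where a real array works on disk blocks.

```python
# Sketch of the RAID small-write parity update that parity logging
# amortizes. The "update" value is what a parity-logging array would
# append to its log instead of rewriting parity in place.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, new_data, old_parity):
    """Return (parity update image, new parity) for one small write."""
    update = xor_bytes(old_data, new_data)        # logged by parity logging
    return update, xor_bytes(old_parity, update)  # applied later, in batches
```

Deferring the second XOR is what converts many small random parity writes into a few large sequential log writes.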
Pub Date : 1993-05-01 DOI: 10.1145/165123.165143
Citations: 164
Mechanisms For Cooperative Shared Memory
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698554
D. Wood, S. Chandra, B. Falsafi, M. Hill, J. Larus, A. Lebeck, James C. Lewis, Shubhendu S. Mukherjee, Subbarao Palacharla, S. Reinhardt
This paper explores the complexity of implementing directory protocols by examining their mechanisms: primitive operations on directories, caches, and network interfaces. We compare the following protocols: Dir1B, Dir4B, Dir4NB, DirnNB [2], Dir1SW [9], and an improved version of Dir1SW (Dir1SW+). The comparison shows that the mechanisms and mechanism sequencing of Dir1SW and Dir1SW+ are simpler than those of the other protocols. We also compare protocol performance by running eight benchmarks on 32-processor systems. Simulations show that Dir1SW+'s performance is comparable to more complex directory protocols. The significant disparity in hardware complexity and the small difference in performance argue that Dir1SW+ may be a more effective use of resources. The small performance difference is attributable to two factors: the low degree of sharing in the benchmarks and Check-In/Check-Out (CICO) directives [9].
Citations: 56
A Comparison Of Adaptive Wormhole Routing Algorithms
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698575
R. Boppana, S. Chalasani
We study improving message latency and network utilization in torus interconnection networks by increasing the adaptivity of wormhole routing algorithms. A recently proposed partially adaptive algorithm and four new fully adaptive routing algorithms are compared with the well-known e-cube algorithm for uniform, hotspot, and local traffic patterns. Our simulations indicate that the partially adaptive north-last algorithm, which causes unbalanced traffic in the network, performs worse than the nonadaptive e-cube routing algorithm for all three traffic patterns. Another result of our study is that performance does not necessarily improve with full adaptivity. In particular, a commonly discussed fully adaptive routing algorithm, which uses 2n virtual channels per physical channel of a k-ary n-cube, performs worse than e-cube for uniform and hotspot traffic patterns. The other three fully adaptive algorithms, which give priority to messages based on distance traveled, perform much better than the e-cube and partially adaptive algorithms for all three traffic patterns. One of the conclusions of this study is that adaptivity, full or partial, is not necessarily a benefit in wormhole routing.
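The e-cube baseline compared against here is dimension-order routing: correct the address one dimension at a time, lowest dimension first, taking the shorter way around each torus ring. A sketch of the path computation only (virtual channels and flow control omitted; the function name is ours):

```python
# Dimension-order (e-cube) routing on a k-ary n-cube torus. Nodes are
# tuples of n digits in base k; the route fixes dimension 0 first, then
# dimension 1, and so on, choosing the shorter direction on each ring.

def e_cube_route(src, dst, k):
    """Return the sequence of nodes visited from src to dst."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(src)):              # fixed dimension order
        while cur[dim] != dst[dim]:
            fwd = (dst[dim] - cur[dim]) % k  # hops going "up" the ring
            step = 1 if fwd <= k - fwd else -1
            cur[dim] = (cur[dim] + step) % k
            path.append(tuple(cur))
    return path
```

Because every message traverses dimensions in the same fixed order, e-cube is deadlock-free on meshes but offers no adaptivity: the comparison in the paper asks how much the extra freedom of adaptive algorithms actually buys.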
Citations: 184
The J-machine Multicomputer: An Architectural Evaluation
M. Noakes, D. Wallach, W. Dally
The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an integrated multicomputer component, the Message-Driven Processor (MDP), and 1 MByte of DRAM. The MDP provides mechanisms to support efficient communication, synchronization, and naming. A 512 node J-Machine is operational and is due to be expanded to 1024 nodes in March 1993. In this paper we discuss the design of the J-Machine and evaluate the effectiveness of the mechanisms incorporated into the MDP. We measure the performance of the communication and synchronization mechanisms directly and investigate the behavior of four complete applications.
Pub Date : 1993-05-01 DOI: 10.1145/165123.165158
Citations: 303
The TickerTAIP Parallel RAID Architecture
Pub Date : 1993-05-01 DOI: 10.1109/ISCA.1993.698545
P. Cao, S. Lim, S. Venkataraman, J. Wilkes
Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum size to which the array can grow. We describe here TickerTAIP, a parallel architecture for disk arrays that distributes the controller functions across several loosely-coupled processors. The result is better scalability, fault tolerance, and flexibility. This paper presents the TickerTAIP architecture and an evaluation of its behavior. We demonstrate the feasibility by an existence proof; describe a family of distributed algorithms for calculating RAID parity; discuss techniques for establishing request atomicity, sequencing and recovery; and evaluate the performance of the TickerTAIP design in both absolute terms and by comparison to a centralized RAID implementation. We conclude that the TickerTAIP architectural approach is feasible, useful, and effective.
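The distributed parity calculations mentioned above rest on XOR being associative and commutative, so partial parities computed at different nodes can be combined in any order. A toy sketch of the underlying arithmetic (block layout and function names are illustrative, not TickerTAIP's algorithms):

```python
# Stripe parity and single-block recovery for a parity-protected array.
# Because XOR is associative and commutative, partial results from
# different controller nodes may be merged in any order, which is what
# makes a distributed parity computation possible.

from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_parity(blocks):
    """Parity of a stripe: XOR of all its data blocks."""
    return reduce(xor_blocks, blocks)

def recover(surviving_blocks, parity):
    """Rebuild one lost block from the survivors plus the parity block."""
    return reduce(xor_blocks, surviving_blocks, parity)
```

Fault tolerance in the array then reduces to keeping parity consistent with the data blocks across these unordered partial combinations.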
{"title":"The TickerTAIP Parallel RAID Architecture","authors":"P. Cao, S. Lim, S. Venkataraman, J. Wilkes","doi":"10.1109/ISCA.1993.698545","DOIUrl":"https://doi.org/10.1109/ISCA.1993.698545","url":null,"abstract":"Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum size that the array can grow to. We describe here TickerTAIP, a parallel architecture for disk arrays that distributed the controller functions across several loosely-coupled processors. The result is better scalability, fault tolerance, and flexibility.\u0000This paper presents the TickerTAIP architecture and an evaluation of its behavior. We demonstrate the feasibility by an existence proof; describe a family of distributed algorithms for calculating RAID parity; discuss techniques for establishing request atomicity, sequencing and recovery; and evaluate the performance of the TickerTAIP design in both absolute terms and by comparison to a centralized RAID implementation. We conclude that the TickerTAIP architectural approach is feasible, useful, and effective.","PeriodicalId":410022,"journal":{"name":"Proceedings of the 20th Annual International Symposium on Computer Architecture","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1993-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115025883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 147
Journal
Proceedings of the 20th Annual International Symposium on Computer Architecture