
[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture - Latest Publications

A study of I/O behavior of Perfect benchmarks on a multiprocessor
A. Reddy, P. Banerjee
The I/O behavior of some scientific applications, a subset of Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access of large scientific applications, and to understand the impact of this observed behavior on the I/O subsystem architecture. I/O behavior of the program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O or paging is not a major problem for the applications considered and the I/O problem is mainly manifest in the explicit I/O done in the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed.
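The kind of characterization the abstract describes can be illustrated with a small script. This is a minimal sketch, not the authors' instrumentation: it assumes a hypothetical log of explicit I/O requests as (timestamp in seconds, bytes transferred) pairs and derives request-size, inter-arrival, and total-demand figures from it.

# Minimal sketch (not the authors' instrumentation): summarize explicit I/O
# demand from a hypothetical request log of (timestamp_seconds, bytes) pairs.
from statistics import mean

requests = [(0.00, 65536), (0.02, 65536), (0.05, 131072),
            (1.10, 65536), (1.12, 65536)]          # made-up trace entries

sizes = [b for _, b in requests]
gaps  = [t2 - t1 for (t1, _), (t2, _) in zip(requests, requests[1:])]

print("requests observed:       ", len(requests))
print("mean request size (KB):  ", mean(sizes) / 1024)
print("mean inter-arrival (ms): ", round(1000 * mean(gaps), 1))
print("total data demanded (MB):", sum(sizes) / 2**20)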
{"title":"A study of I/O behavior of Perfect benchmarks on a multiprocessor","authors":"A. Reddy, P. Banerjee","doi":"10.1145/325164.325157","DOIUrl":"https://doi.org/10.1145/325164.325157","url":null,"abstract":"The I/O behavior of some scientific applications, a subset of Perfect benchmarks, executing on a multiprocessor is studied. The aim of this study is to explore the various patterns of I/O access of large scientific applications, and to understand the impact of this observed behavior on the I/O subsystem architecture. I/O behavior of the program is characterized by the demands it imposes on the I/O subsystem. It is observed that implicit I/O or paging is not a major problem for the applications considered and the I/O problem is mainly manifest in the explicit I/O done in the program. Various characteristics of I/O accesses are studied, and their impact on architecture design is discussed.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122266865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
The directory-based cache coherence protocol for the DASH multiprocessor
D. Lenoski, J. Laudon, K. Gharachorloo, Anoop Gupta, J. Hennessy
DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance, and protocol complexity. The design of the DASH coherence protocol is presented and discussed from the viewpoint of how it addresses the above issues. Also discussed is a strategy for verifying the correctness of the protocol. A brief comparison of the protocol with the IEEE Scalable Coherent Interface protocol is made.
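A toy sketch of the directory idea described above: each memory block has a directory entry naming its sharers, and a write triggers point-to-point invalidations rather than a broadcast. This only illustrates the principle; the real DASH protocol also covers ownership transfer, request forwarding, and release consistency, none of which is modeled here.

# Toy directory-based invalidation (an illustration, not the DASH protocol).
class Directory:
    def __init__(self):
        self.sharers = {}                      # block address -> set of caching nodes

    def handle_read(self, node, addr):
        self.sharers.setdefault(addr, set()).add(node)
        return f"data for {hex(addr)} sent point-to-point to node {node}"

    def handle_write(self, node, addr):
        # Invalidate every other sharer with an individual message; no broadcast.
        msgs = [f"INV {hex(addr)} -> node {s}"
                for s in self.sharers.get(addr, set()) if s != node]
        self.sharers[addr] = {node}            # the writer becomes the sole holder
        return msgs

d = Directory()
d.handle_read(0, 0x1000)
d.handle_read(2, 0x1000)
print(d.handle_write(1, 0x1000))               # exactly two invalidations are sent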
{"title":"The directory-based cache coherence protocol for the DASH multiprocessor","authors":"D. Lenoski, J. Laudon, K. Gharachorloo, Anoop Gupta, J. Hennessy","doi":"10.1145/325164.325132","DOIUrl":"https://doi.org/10.1145/325164.325132","url":null,"abstract":"DASH is a scalable shared-memory multiprocessor whose architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a scalable interconnection network. A key feature of DASH is its distributed direction-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol. These include the issues of correctness, performance, and protocol complexity. The design of the DASH coherence protocol is presented and discussed from the viewpoint of how it addresses the above issues. Also discussed is a strategy for verifying the correctness of the protocol. A brief comparison of the protocol with the IEEE Scalable Coherent Interface protocol is made.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115896369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 736
An empirical evaluation of two memory-efficient directory methods
Brian W. O'Krafka, A. Richard Newton
The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme each main memory block is sectored into sub-blocks for which the large tag overhead is shared. In the second scheme a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writable data are not cached. The large block sizes required for the sectored scheme, however, promote sufficient false sharing for its performance to be markedly worse than when a tag cache is used.
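The storage argument behind the two schemes can be made concrete with a back-of-the-envelope comparison. The sketch below assumes illustrative parameter values (64 processors, 64-byte blocks, 256 MB of memory, 4 sub-blocks per sectored tag, an 8K-entry tag cache); none of these figures comes from the paper.

# Rough directory-storage comparison; all parameter values are illustrative
# assumptions, not numbers taken from the paper.
processors   = 64
block_bytes  = 64
memory_bytes = 256 * 2**20                           # 256 MB of main memory
blocks       = memory_bytes // block_bytes

full_map_bits  = blocks * processors                 # one presence bit per processor per block
sectored_bits  = (blocks // 4) * processors          # one tag shared by 4 sub-blocks
tag_cache_bits = 8 * 1024 * (processors + 24)        # 8K cached tags, each with an address field

for name, bits in [("full map", full_map_bits),
                   ("sectored (4 sub-blocks/tag)", sectored_bits),
                   ("tag cache (8K entries)", tag_cache_bits)]:
    print(f"{name:30s} {bits / 8 / 2**20:8.2f} MB")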
{"title":"An empirical evaluation of two memory-efficient directory methods","authors":"Brian W. O'Krafka, A. Richard Newton","doi":"10.1145/325164.325130","DOIUrl":"https://doi.org/10.1145/325164.325130","url":null,"abstract":"The authors present an empirical evaluation of two memory-efficient directory methods for maintaining coherent caches in large shared-memory multiprocessors. Both directory methods are modifications of a scheme proposed by L.M. Censier and P. Feautrier (1978) that does not rely on a specific interconnection network and can be readily distributed across interleaved main memory. The schemes considered here overcome the large amount of memory required for tags in the original scheme in two different ways. In the first scheme each main memory block is sectored into sub-blocks for which the large tag overhead is shared. In the second scheme a limited number of large tags are stored in an associative cache and shared among a much larger number of main memory blocks. Simulations show that in terms of access time and network traffic both directory methods provide significant performance improvements over a memory system in which shared-writable data are not cached. The large block sizes required for the sectored scheme, however, promote sufficient false sharing for its performance to be markedly worse than when a tag cache is used.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125034042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 135
Performance comparison of load/store and symmetric instruction set architectures
D. Alpert, A. Averbuch, O. Danieli
Two pipeline models, one implementing a load/store architecture, the other a symmetric architecture, are compared under identical simulation environments. The symmetric architecture instructions are more powerful, but also more complex; therefore the pipeline model for the symmetric architecture contains an additional stage with an additional adder, more bypasses, and an extra port to the register file. The authors' simulations show that the path length of the load/store architecture is 1.12 times that of the symmetric architecture. Nevertheless, most of this advantage is lost because of various pipeline delays that reduce the speedup factor from 1.12 to 1.0375. The main delaying contributions are due to resource dependency (0.064 CPI) and control dependency (0.048 CPI).
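The abstract's numbers can be read as a short CPI calculation. The sketch below is illustrative only: the path-length ratio and the two delay components are quoted from the abstract, while the base CPI and the assumption that those two components make up the whole delay are mine, chosen so the result lands near the reported 1.0375.

# Illustrative arithmetic only: how extra pipeline-delay CPI erodes an
# instruction-count advantage.  The base CPI and total delay figures are
# assumptions; the abstract reports only the two largest delay contributors.
path_length_ratio = 1.12     # load/store executes 1.12x the instructions (from the abstract)
base_cpi          = 1.40     # assumed CPI of the load/store pipeline
extra_delay_cpi   = 0.112    # assumed total extra CPI of the symmetric pipeline
                             # (resource 0.064 + control 0.048 are the main parts)

ideal_speedup  = path_length_ratio                                  # if CPIs were equal
actual_speedup = (path_length_ratio * base_cpi) / (base_cpi + extra_delay_cpi)

print(f"ideal speedup of symmetric ISA: {ideal_speedup:.3f}")
print(f"speedup after pipeline delays:  {actual_speedup:.3f}")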
{"title":"Performance comparison of load/store and symmetric instruction set architectures","authors":"D. Alpert, A. Averbuch, O. Danieli","doi":"10.1145/325164.325137","DOIUrl":"https://doi.org/10.1145/325164.325137","url":null,"abstract":"Two pipeline models, one implementing a load/store architecture, the other a symmetric architecture, are compared under identical simulation environments. The symmetric architecture instructions are more powerful, but also more complex; therefore the pipeline model for the symmetric architecture contains an additional stage with an additional adder, more bypasses, and an extra port to the register file. The authors' simulations show that the path length of the load/store architecture is 1.12 longer than that of the symmetric architecture. Nevertheless, most of this advantage is lost because of various pipeline delays that reduce the speedup factor from 1.12 to 1.0375. The main delaying contribution is due to resource dependency (0.064 CPI) and control dependency (0.048 CPI).<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132051764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Performance of an OLTP application on Symmetry multiprocessor system
S. Thakkar, Mark Sweiger
Sequent's Symmetry series is a bus-based shared-memory multiprocessor. System performance in an OLTP (online transaction processing) relational database application was investigated using the TP1 benchmark. System performance was tested with fully cached benchmarks and with scaled benchmarks. In fully cached tests, the entire database fits inside main memory. In scaled tests, the database is larger than available memory. In the fully cached benchmark, performance was initially limited by bus saturation. The cause was the transfer of process context from processor to processor. This was eliminated by assigning each process to a processor. Processor affinity was combined with reductions in message passing within the database. Throughput was dramatically improved. The scaled tests were I/O bound. This bottleneck can be eliminated by connecting more disk drives or by increasing the main memory size.
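The processor-affinity fix described above can be sketched on a modern Linux system with the standard affinity call. This illustrates the idea of keeping a process's context on one processor, not Sequent's original mechanism, and os.sched_setaffinity is Linux-specific.

# Illustration of the processor-affinity idea on Linux (not Sequent's own
# mechanism): pin each worker process to a single CPU so its cached context
# is not dragged from processor to processor.
import os
from multiprocessing import Process

def worker(cpu):
    os.sched_setaffinity(0, {cpu})          # restrict this process to one CPU
    total = sum(range(10**6))               # stand-in for transaction work
    print(f"pid {os.getpid()} ran on CPUs {os.sched_getaffinity(0)}, result {total}")

if __name__ == "__main__":
    cpus = sorted(os.sched_getaffinity(0))
    procs = [Process(target=worker, args=(c,)) for c in cpus[:4]]
    for p in procs: p.start()
    for p in procs: p.join()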
{"title":"Performance of an OLTP application on Symmetry multiprocessor system","authors":"S. Thakkar, Mark Sweiger","doi":"10.1145/325164.325149","DOIUrl":"https://doi.org/10.1145/325164.325149","url":null,"abstract":"Sequent's Symmetry series is a bus-based shared-memory multiprocessor. System performance in an OLTP (online transaction processing) relational database application was investigated using the TP1 benchmark. System performance was tested with fully cached benchmarks and with scaled benchmarks. In fully-cached tests, the entire database fits inside main memory. In scaled tests, the database is larger than available memory. In the fully-cached benchmark, performance was initially limited by bus saturation. The cause was the transfer of process context from processor to processor. This was eliminated by assigning each process to a processor. Processor affinity was combined with reductions in message passing within the database. Throughput was dramatically improved. The scaled tests were I/O bound. This bottleneck can be eliminated by connecting more disk drives or by increasing the main memory size.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126844368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 75
Balance in architectural design
Samuel Ho, L. Snyder
A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature, is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial versus word-parallel, RISC versus CISC (reduced versus complex instruction set computer), and VLIW (very long instruction word) designs.
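The flavor of the metric can be illustrated with a cost-weighted time comparison. The exact definition of normalized time is given in the paper; the sketch below only captures the cost-times-time intuition, and both design points and their numbers are invented for illustration.

# Sketch of the cost-performance flavor of "normalized time": weight execution
# time by hardware cost, so a bigger unit must speed the program up at least in
# proportion to what it costs.  The definition used here and the numbers below
# are illustrative assumptions, not the paper's.
designs = {
    # name: (relative cost in area, relative execution time)
    "bit-serial ALU":    (1.0, 8.0),
    "word-parallel ALU": (6.0, 1.0),
}

for name, (cost, time) in designs.items():
    print(f"{name:18s} cost x time = {cost * time:5.1f}")
# Here the word-parallel unit wins (6.0 < 8.0): its 8x speedup more than
# covers its 6x cost, so the extra hardware is "pulling its own weight".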
{"title":"Balance in architectural design","authors":"Samuel Ho, L. Snyder","doi":"10.1145/325164.325156","DOIUrl":"https://doi.org/10.1145/325164.325156","url":null,"abstract":"A performance metric, normalized time, which is closely related to such measures as the area-time product of VLSI theory and the price/performance ratio of advertising literature is introduced. This metric captures the idea of a piece of hardware 'pulling its own weight', that is, contributing as much to performance as it costs in resources. The authors prove general theorems for stating when the size of a given part is in balance with its utilization and give specific formulas for commonly found linear and quadratic devices. They also apply these formulas to an analysis of a specific processor element and discuss the implications for bit-serial-versus-word-parallel, RISC-versus-CISC (reduced-versus complex-instruction-set-computer), and VLIW (very-long-instruction-word) designs.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128085581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Performance measurement and trace driven simulation of parallel CAD and numeric applications on a hypercube multicomputer
Jiun-Ming Hsu, P. Banerjee
The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Six representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. On the basis of the measurement results, the authors investigated both the computation and communication behavior of these parallel programs, including CPU utilization, computation task granularity, message interarrival distribution, the distribution of waiting times in receiving messages, and message length and destination distributions. The localities in communication were also studied. A trace-driven simulation environment was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.
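A minimal sketch of the kind of post-processing described: given a hypothetical message trace of (send time in microseconds, source, destination, length in bytes) records, derive inter-arrival, length, and destination statistics. The record layout is an assumption, not the authors' trace format.

# Minimal sketch (not the authors' tools): summarize communication behavior
# from a hypothetical message trace of (send_time_us, src, dest, length_bytes).
from collections import Counter
from statistics import mean

trace = [(0, 0, 1, 128), (40, 0, 3, 1024), (55, 2, 3, 128),
         (90, 1, 0, 512), (300, 3, 2, 128)]

times         = [t for t, *_ in trace]
interarrivals = [b - a for a, b in zip(times, times[1:])]
dest_hist     = Counter(d for _, _, d, _ in trace)

print("mean inter-arrival (us):    ", mean(interarrivals))
print("mean message length (bytes):", mean(l for *_, l in trace))
print("destination histogram:      ", dict(dest_hist))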
{"title":"Performance measurement and trace driven simulation of parallel CAD and numeric applications on a hypercube multicomputer","authors":"Jiun-Ming Hsu, P. Banerjee","doi":"10.1145/325164.325152","DOIUrl":"https://doi.org/10.1145/325164.325152","url":null,"abstract":"The performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads are presented. Six representative parallel applications were selected as benchmarks. Software monitoring techniques were then used to collect execution traces. On the basis of the measurement results, the authors investigated both the computation and communication behavior of these parallel programs, including CPU utilization, computation task granularity, message interarrival distribution, the distribution of waiting times in receiving messages, and message length and destination distributions. The localities in communication were also studied. A trace-driven simulation environment was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132623794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 63
Supporting systolic and memory communication in iWarp
S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, J. Webb
The iWarp communication system supports two widely used interprocessor communication styles: memory communication and systolic communication. A description is given of the rationale, architecture, and implementation for the iWarp communication system. Memory communication is flexible and well suited for general computing, whereas systolic communication is efficient and well suited for speed-critical applications. The iWarp design is made possible by two important innovations in communication: (1) program access to communication and (2) logical channels. The former allows programs to access data as they are transmitted and to redirect portions of messages to different destinations efficiently. The latter increases the connectivity between the processors and guarantees communication bandwidth for classes of messages. These innovations have provided a focus for the iWarp architecture. The result is a communication system that provides a total bandwidth of 320 MBytes/sec and that is integrated on a single VLSI component with a 20 MFLOPS plus 20 MIPS long-instruction-word computation engine.
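The difference between the two styles can be sketched in a few lines. This is only an analogy in Python, not iWarp code: memory communication buffers the whole message before computing, while the systolic style consumes each word directly off the incoming channel.

# Illustrative contrast of the two communication styles (not iWarp code).
def incoming_channel():
    for word in [3, 1, 4, 1, 5, 9, 2, 6]:    # stand-in for words arriving on a link
        yield word

# Memory communication: receive the whole message into memory, then compute.
msg = list(incoming_channel())
print("memory style, sum after full receive:", sum(msg))

# Systolic-style: operate on each word as it streams past, no message buffer.
running = 0
for word in incoming_channel():
    running += word                           # compute directly from the channel
print("systolic style, running sum:", running)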
{"title":"Supporting systolic and memory communication in iWarp","authors":"S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, J. Webb","doi":"10.1145/325164.325116","DOIUrl":"https://doi.org/10.1145/325164.325116","url":null,"abstract":"The iWarp communication system supports two widely used interprocessor communication styles: memory communication and systolic communication. A description is given of the rationale, architecture, and implementation for the iWarp communication system. Memory communication is flexible and well suited for general computing, whereas systolic communication is efficient and well suited for speed-critical applications. The iWarp design is made possible by two important innovations in communication: (1) program access to communication and (2) logical channels. The former allows programs to access data as they are transmitted and to redirect portions of messages to different destinations efficiently. The latter increases the connectivity between the processors and guarantees communication bandwidth for classes of messages. These innovations have provided a focus for the iWarp architecture. The result is a communication system that provides a total bandwidth of 320 MBytes/sec and that is integrated on a single VLSI component with a 20 MFLOPS plus 20 MIPS long instruction work computation engine.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127090417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 210
Virtual-channel flow control
W. Dally
Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual channels decouple buffer resources from transmission resources. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be left idle. Simulation studies show that, given a fixed amount of buffer storage per link, virtual-channel flow control increases throughput by a factor of 3.5, approaching the capacity of the network.
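A toy model of the mechanism, not Dally's simulator: several per-virtual-channel flit queues share one physical link, and flits are multiplexed round-robin, so a message whose virtual channel is blocked downstream does not leave the link idle. The queue contents and the blocked set are invented for illustration.

# Toy virtual-channel multiplexing over one physical link (illustration only).
from collections import deque

vcs = {0: deque(["A0", "A1", "A2"]),      # message A; its VC is blocked downstream
       1: deque(["B0", "B1"]),            # message B can still advance
       2: deque(["C0"])}                  # message C can still advance
blocked = {0}                             # VC 0 has no downstream buffer credit

schedule = []
while any(vcs[v] and v not in blocked for v in vcs):
    for v in sorted(vcs):                 # round-robin over virtual channels
        if vcs[v] and v not in blocked:
            schedule.append(vcs[v].popleft())
print("flits on the physical channel:", schedule)   # B and C flits pass the blocked A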
{"title":"Virtual-channel flow control","authors":"W. Dally","doi":"10.1145/325164.325115","DOIUrl":"https://doi.org/10.1145/325164.325115","url":null,"abstract":"Network throughput can be increased by dividing the buffer storage associated with each network channel into several virtual channels. Each physical channel is associated with several small queues, virtual channels, rather than a single deep queue. The virtual channels associated with one physical channel are allocated independently but compete with each other for physical bandwidth. Virtual channels decouple buffer resources from transmission resources. This decoupling allows active messages to pass blocked messages using network bandwidth that would otherwise be left idle. Simulation studies show that, given a fixed amount of buffer storage per link, virtual-channel flow control increases throughput by a factor of 3.5, approaching the capacity of the network.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128095396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1658
Weak ordering-a new definition
S. Adve, M. Hill
A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency, which guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than have to think about weaker memory, or even write buffers. Following this hypothesis, weak ordering is defined as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent, at least to the software that obeys those constraints. The authors illustrate the power of the new definition with a set of software constraints that forbid data races and with an implementation for cache-coherent systems that is not allowed by the old definition.
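The software side of the contract can be sketched with ordinary synchronization. The example below is an illustration in Python, not the paper's formal constraint set: as long as every access to the shared variable is protected by a lock, the program has no data races, and under the definition above a weakly ordered machine may still be reasoned about as if it were sequentially consistent.

# Sketch of the software obligation: forbid data races by synchronizing every
# access to shared data, and keep reasoning as if memory were sequentially
# consistent.  Illustration only; not the paper's formalism.
import threading

counter = 0
lock = threading.Lock()                 # every access to counter goes through this lock

def increment(n):
    global counter
    for _ in range(n):
        with lock:                      # no data race on counter
            counter += 1

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print("counter:", counter)              # always 40000 under the contract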
{"title":"Weak ordering-a new definition","authors":"S. Adve, M. Hill","doi":"10.1145/285930.285996","DOIUrl":"https://doi.org/10.1145/285930.285996","url":null,"abstract":"A memory model for a shared-memory multiprocessor commonly and often implicitly assumed by programmers is that of sequential consistency, which guarantees that all memory accesses will appear to execute atomically and in program order. An alternative model, weak ordering, offers greater performance potential. The central hypothesis of this work is that programmers prefer to reason about sequentially consistent memory, rather than have to think about weaker memory, or even write buffers. Following this hypothesis, weak ordering is defined as a contract between software and hardware. By this contract, software agrees to some formally specified constraints, and hardware agrees to appear sequentially consistent, at least to the software that obeys those constraints. The authors illustrate the power of the new definition with a set of software constraints that forbid data races and with an implementation for cache-coherent systems that is not allowed by the old definition.<<ETX>>","PeriodicalId":297046,"journal":{"name":"[1990] Proceedings. The 17th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115173988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 178