Modern shared-memory multiprocessors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be performed in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can be reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in race-free networks without the need for a transaction to be globally performed before the next transaction can be issued. We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. The proposed methods reduce the latencies associated with processor write misses to shared data.
{"title":"Race-free interconnection networks and multiprocessor consistency","authors":"A. Landin, Erik Hagersten, Seif Haridi","doi":"10.1145/115952.115964","DOIUrl":"https://doi.org/10.1145/115952.115964","url":null,"abstract":"Modern shared-memory multiprocmors require complex interconnection networks to provide sufficient communication bandwidth between processors. They also rely on advanced memory systems that allow multiple memory operations to be made in parallel. It is expensive to maintain a high consistency level in a machine based on a general network, but for special interconnection topologies, some of these costs can he reduced. We define and study one class of interconnection networks, race-free networks. New conditions for sequential consistency are presented which show that sequential consistency can be maintained if all accesses in a multiprocessor can be ordered in an acyclic graph. We show that this can be done in racefree networks without the need for a transaction to be globally performed before the next transaction can be issued: We also investigate what is required to maintain processor consistency in race-free networks. In a race-free network which maintains processor consistency, writes may be pipelined, and reads may bypass writes. - The proposed methods reduce the latencies associated with processor write-misses to shared data.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123778780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 1991-04-01. DOI: 10.1109/ISCA.1991.1021605.
X. Lin, L. Ni
Efficient routing of messages is the key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Wormhole routing is the most promising switching technique used in new-generation multicomputers. In this paper, we present multicast wormhole routing methods for multicomputers adopting 2D-mesh and hypercube topologies. The dual-path routing algorithm requires fewer system resources, while the multipath routing algorithm creates less traffic. More importantly, both routing algorithms are deadlock-free, which is essential for wormhole networks.
{"title":"Deadlock-fyee multicast wormhole routing in multicomputer networks","authors":"X. Lin, L. Ni","doi":"10.1109/ISCA.1991.1021605","DOIUrl":"https://doi.org/10.1109/ISCA.1991.1021605","url":null,"abstract":"Efficient routing of messages is the key to the performance of multicomputers. Multicast communication refers to the delivery of the same message from a source node to an arbitrary number of destination nodes. Wormhole routing is the most promising switching technique used in new generation multicomputers. In this paper, we present multicast wormhole routing methods for multicomputers adopting 2D-mesh and hypercube topologies. The dual-path routing algorithm requires less system resource, while the multipath routing algorithm creates less traffic. More import antly, both routing algorithms are deadlock-free, which is essential to wormhole networks.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127075002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Nishida, H. Kadota
This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numerical applications. The architecture performs instruction-level scheduling statically by the compiler, and performs out-of-order issuing and execution of instructions to decrease the pipeline stalls that dynamically occur in execution. In this architecture, a pair of instructions is fetched in every clock cycle, decoded simultaneously, and issued to the corresponding execution pipelines independently. For ease of instruction-level scheduling by the compiler, the architecture provides: i) simultaneous execution of almost all pairs of instructions, including Store-Store and Load-Store pairs; ii) a simple, low-latency, and easily paired execution pipeline structure; and iii) high-capacity multi-ported floating-point and integer registers. Enhanced performance through the dynamic reduction of pipeline hazards is achieved by i) efficient data dependency resolution with the novel Directly Tag Compare (DTC) method, ii) a non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer ability through a pipelined data cache and a 128-bit-wide bus. An efficient data dependency resolution mechanism, realized by the novel DTC method, synchronized pipeline operation, and a data bypassing network, permits out-of-order instruction issuing and execution. The idea of the DTC method is similar to that of dynamic data-flow architecture with tagged tokens. The non-penalty branches are realized by three techniques: delayed branch; a LOOP instruction that executes counter decrement, compare, and branch in one clock cycle; and non-penalty conditional branch with predicted condition codes. These techniques contribute to the decrease of pipeline stalls occurring at run time. The architecture can achieve 80 MFLOPS/80 MIPS peak performance at a 40 MHz clock and sustains 1.4 to 3.6 times the performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.
{"title":"OHMEGA : a VLSI superscalar processor architecture for numerical applications","authors":"M. Nakajima, H. Nakano, Y. Nakakura, T. Yoshida, Y. Goi, Y. Nakai, R. Segawa, T. Xishida, H. Kadota","doi":"10.1145/115952.115969","DOIUrl":"https://doi.org/10.1145/115952.115969","url":null,"abstract":"multiple instructions per clock cycle, there are at least four This paper describes a VLSI superscalar processor architecture which can sustain very high performance in numericai applications. The architecture performs instructionlevel scheduling statically by the compiler, and performs outof-order issuing and executing of instructions to decrease the stall on the pipelines that dynamically occurs in execution. In this architecture, a pair of instructions are fetched in every clock cycle, decoded simultaneously, and issued to corresponding execution pipelines independently. For ease of instruction-level scheduling by the compiler, the architecture provider: -i) rimultaneoun execution df almost all pairs of instructions including Store-Stare pair and Load-Store pair, ii) simple, low-latency, and easily-paired execution pipeline structure, and iii) high-capacity muiti-ported floating point registers and integer registers. Enhanced performance by the dynamic decrease of the pipeline hazards is achieved by i) efficient data dependency resolution with novel Directly Tag Compare (DTC) method, ii) non-penalty branch mechanism and simple control dependency resolution, and iii) large data transfer ability by the pipelined data cache and 128 bit wide bus bandwidth. An efficient data dependency resolutioi mechanism, which is realized by using the novel DTC method, synchronized pipeiine operation, and data bypassing network, permits out-of-order instruction issuing and execution. The idea of DTC method is similar to that of dynamic data-flaw architecture with tagged token. The non-penalty branches are realized by three techniques, delayed branch, LOOP instruction that executer counter decrement, compare, and branch in one clock cycle, and non-penalty conditional branch with predicted condition codes. These 'techniques contribute to the decrease of the pipeline stalls occurring at run time. The architecture can achieve 80MFLOPS/80MIPS peak performance at 4OMHz clock and sustain 1.4 to 3.6 times higher performance of simple Multiple Function Unit (MFU) type RISC processors by taking advantage of these techniques.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117249772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. The model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple-job interactions, such as contention for shared resources. The model is a finite-state, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.
{"title":"Performance prediction and tuning on a multiprocessor","authors":"R. Dimpsey, R. Iyer","doi":"10.1145/115952.115972","DOIUrl":"https://doi.org/10.1145/115952.115972","url":null,"abstract":"This paper presents a methodology for modeling the behavior of a given class of applications executing in real workloads on a particular machine. The methodology is illustrated by modeling the execution of computationally bound, parallel applications running in real workloads on an Alliant FX/80. me model is constructed from real measured data obtained during normal machine operation and is capable of capturing intricate multiple job interactions, such as contention for shared resources. The model is a finitestate, discrete-time Markov model with rewards and costs associated with each state. The model is capable of predicting the distribution of completion times in real workloads for a given application. The predictions are useful in gauging how quickly an application will execute, or in predicting the performance impact of a system change. The model constructed in this study is validated with three separate sets of empirical data. In one validation, the model successfully predicts the effects of operating the machine with one less processor.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115161471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ideally, a pipelined processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeline and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate this degradation. A Branch History Table (BHT) stores the past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target difficult. We propose a new stack mechanism for reducing this type of misprediction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2% on average.
{"title":"Branch history table prediction of moving target branches due to subroutine returns","authors":"D. Kaeli, P. Emma","doi":"10.1145/115952.115957","DOIUrl":"https://doi.org/10.1145/115952.115957","url":null,"abstract":"Ideally, a pipeline processor can run at a rate that is limited by its slowest stage. Branches in the instruction stream disrupt the pipeIine, and reduce processor performance to well below ideal. Since workloads contain a high percentage of taken branches, techniques are needed to reduce or eliminate thk degradation. A Branch History Table (BHT) stores past action and target for branches, and predicts that future behavior will repeat. Although past action is a good indicator of future action, the subroutine CALL/RETURN paradigm makes correct prediction of the branch target dlfflcult. We propose a new stack mechanism for reducing this type of mispredlction. Using traces of the SPEC benchmark suite running on an RS/6000, we provide an analysis of the performance enhancements possible using a BHT. We show that the proposed mechanism can reduce the number of branch wrong guesses by 18.2°/0 on average.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124232473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) packaging technology to reduce chip-crossing delays. In this paper we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor that employs MCM technology to improve performance. The design study for the resulting two-level split cache starts with a baseline cache architecture and then examines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write policy; 3) secondary cache size and organization; 4) primary cache fetch size; 5) concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4KW (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub-block placement without extra valid bits. The second is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associative matching. Keywords: two-level caches, high-performance processors, gallium arsenide, multichip modules, trace-driven cache simulation.
{"title":"Implementing a cache for a high-performance GaAs microprocessor","authors":"K. Olukotun, T. Mudge, Richard B. Brown","doi":"10.1145/115952.115967","DOIUrl":"https://doi.org/10.1145/115952.115967","url":null,"abstract":"In the near future, microprocessor systems with very high clock rates will use multichip module (MCM) pack- aging technology to reduce chip-crossing delays. In this paper we present the results of a study for the design of a 250 MHz Gallium Arsenide (GaAs) microprocessor t,lrat employs h4CM technology to improve performance. The design study for the resulting two-level split cache st.arts with a baseline cache architecture and then ex- amines the following aspects: 1) primary cache size and degree of associativity; 2) primary data-cache write pol- icy; 3) secondary cache size and organization; 4) pri- mary cache fetch size; 5) concurrency between instruc- tion and data accesses. A trace-driven simulator is used to analyze each design's performance. The results show that memory access time and page-size constraints ef- Cectively limit the size of the primary data and instruc- tion caches to 4I<W (16KB). For such cache sizes, a write-through policy is better than a write-back policy. Three cache mechanisms that contribute to improved performance are introduced. The first is a variant of the write-through policy called write-only. This write policy provides most of the performance benefits of sub- Ilod placernenl without extra valid bits. The second, is the use of a split secondary cache. Finally, the third mechanism allows loads to pass stores without associa- tive matching. Keywords-two-level caches, high performance pro- cessors, gallium arsenide, multichip modules, trace- driven cache simulation.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"163 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133516865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper examines the effect of code generation strategy and register set size and structure on the performance of RISC processors. We vary the number of registers from 16 to 128, in both split and shared organizations, and use three different code generation strategies that differ in the way their instruction schedulers and register allocators cooperate in utilizing registers. The architectures used in the experiments incorporate features of the Motorola 88000 and the MIPS R2000. We observed three things. First, more sophisticated code generation strategies require fewer registers. In our experiments, more than 32 registers yielded only marginal performance improvement over 32. Using a simpler strategy, the point of diminishing returns appeared after 64 registers. Second, given a small number of registers (e.g. 16), a machine with a shared register organization executes faster than one with a split organization; given a larger number of registers, the write-back bus to the shared register set becomes the bottleneck, and a split organization is better. Third, a machine with a floating-point coprocessor does not always execute faster than one with a slower on-chip implementation, if the coprocessor does not perform expensive integer operations as well. The problem can be solved by transferring operands to the floating-point unit, doing a multiply or divide there, and then shipping the data back to the CPU.
{"title":"The effect on RISC performance of register set size and structure versus code generation strategy","authors":"David G. Bradlee, S. Eggers, R. Henry","doi":"10.1145/115953.115985","DOIUrl":"https://doi.org/10.1145/115953.115985","url":null,"abstract":"This paper examines the effect of code generation strategy and register set size and structure on the performance of RISC processors. We vary the number of registers from 16 to 128, in both split and shared organizations, and use three different code generation strategies that differ in the way their instruction schedulers and register allocators cooperate in utilizing registers. The architectnres used in the experiments incorporate fealures of the Motorola 88000 and the MIPS R2000. We observed three things. First, more sophisticated code generation strategies require fewer registers. In our experiments more than 32 registers yielded only marginal performance improvement over 32. Using a simpler strategy, the point of diminishing returns appeared after 64 registers. Second, given a small number of registers (e.g. 16), a machine with a shared register organization executes faster than one with a split organization; given a larger number of registers, the write-back bus to the shared register set becomes the bottleneck, and a split organization is better. Third, a machine with a floating point coprocessor does not always execute faster than one with a slower on-chip implementation, if the coprocessor does not perform expensive integer operations as well. The problem can be solved by transferring operands to the floating point unit, doing a multiply or divide there, and then shipping the data back to the CPU.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114158195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Directory-based protocols have been proposed as an efficient means of implementing cache consistency in large-scale shared-memory multiprocessors. One class of these protocols utilizes a limited pointers directory, which stores the identities of a small number of caches containing a given block of data. However, the performance potential of these directories in large-scale machines has been speculative at best. In this paper we introduce an analytic model that not only explains the behavior seen in small-scale simulation studies, but also allows us to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. Our model shows that miss rates inherent to invalidation-based consistency schemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. We find that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in large-scale machines, even if there are sufficient pointers most of the time. On the other hand, no-broadcast strategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traffic under a wide range of workloads, including those in which data blocks are actively accessed by a large number of processors.
{"title":"Modeling the performance of limited pointers directories for cache coherence","authors":"R. Simoni, M. Horowitz","doi":"10.1145/115952.115983","DOIUrl":"https://doi.org/10.1145/115952.115983","url":null,"abstract":"Directory-hsed protocols have been proposed as an efficient means of implementing cache consistency in large-scale sharedmemory multiprocessors. One class of these protocols utilizes a limired pointers directory, which Stores the identities of a Small number of caches mntaining a given block of data. However. the performance potential of these directories in large-scale machines has been speculative at best. In this paper we introduce an analytic model that not only explains the behavior seen in small-scale simulation studies, but also allows us to extrapolate forward to evaluate the efficiency of limited pointers directories in large-scale systems. Our model shows that miss rates inherent to invalidation-based consistencyschemes are relatively high (typically 10% to 60%) for actively shared data, across a variety of workloads. We find that limited pointers schemes that resort to broadcasting invalidations when the pointers are exhausted perform very poorly in largescale machines, even if there are sufficient pointas most of the time. On the other hand, no-broadcast slrategies that limit the degree of caching to the number of pointers in an entry have only a modest impact on the cache miss rate and network traflic under a wide range of workloads. including those in which data blocks are actively accessed by a large number of processors.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127061266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper describes an architecture and related compiler support for software-controlled data prefetching, a technique to hide memory latency in high-performance processors. At compile time, FETCH instructions are inserted into the instruction stream by the compiler, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fetch unit, interprets these instructions and initiates appropriate memory reads. Prefetched data is kept in a small, fully associative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewriteback technique that can reduce the impact of stalls due to replacement writebacks in the cache. A detailed hardware model is presented and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
{"title":"An architecture for software-controlled data prefetching","authors":"A. Klaiber, H. Levy","doi":"10.1145/115953.115958","DOIUrl":"https://doi.org/10.1145/115953.115958","url":null,"abstract":"This paper describes an architecture and related compiler support for software-controlled daia prefetching, a technique to hide memory latency in high-performance processors. At compile-time, FETCB instructions are inserted into the instruction-stream by the compiler, based on anticipated data references and detailed information about the memory system. At run time, a separate functional unit in the CPU, the fe tch uni t , interprets these instructions and initiates appropriate memory reads. Prefetched data is kept in a small, fullyassociative cache, called the fetchbuffer, to reduce contention with the conventional direct-mapped cache. We also introduce a prewrileback technique that can reduce the impact.of stalls due to replacement writebacks in the cache. A detailed hardware model is presented and the required compiler support is developed. Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most current architectures have registers organized in one of two ways: single register sets, or register stacks, implemented as either overlapping register windows or register caches. Each has particular strengths and weaknesses. For example, a single register set excels over a stack if a program requires frequent access to globals. However, a register stack performs better if deep recursive chains exist. One drawback of all current systems is that the hardware limits the manner in which the software can use registers. In this paper, a register hardware organization called threaded windows, or t-windows, which is being developed by the authors to enhance the performance of concurrent systems, is evaluated for sequential programs. The organization allows the registers to be dynamically restructured into any of the above forms, and any combination of the above forms. This permits the compiler, or the programmer, to capitalize upon each register organization's strong points and avoid their disadvantages.
{"title":"Flexible register management for sequential programs","authors":"D. Quammen, D. Miller","doi":"10.1145/115953.115984","DOIUrl":"https://doi.org/10.1145/115953.115984","url":null,"abstract":"Most current architectures have registers organized in one of two ways: single register sets; or register stacks, implemented as either overlapping register windows or register-caches, Each has particular strengths and weaknesses. For example, a single register set excels over a stack if a program requires frequent access to globals. ~ However, a register stack performs better if deep^ recursive chains~exist. One drawback of all current systems is that the hardware limits the manner in which the software can use registers. In this paper, a register hardware organization called fhreoded windows or f-windows, which is being developed by the authors to enhance the performance of concurrent systems, is evaluated for sequential programs. The organization allows the registers to be dynamically restructured in any of the above forms, and any combination of the above forms. This permits the compiler, or the programmer, to capitalize upon each register organization’s strong points and avoid their disadvantages.","PeriodicalId":187095,"journal":{"name":"[1991] Proceedings. The 18th Annual International Symposium on Computer Architecture","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127490068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}