{"title":"Taraflops into laptops","authors":"S. Wallach","doi":"10.1109/M-PDT.1994.329787","DOIUrl":null,"url":null,"abstract":"BANDWIDTH W e need a t least 100 Mbyte/sec/node, which after the normal expansion for head-ers and ECC is around 1 Gbidsec of raw data on the link. This represents 22 T3 (44.736-Mbidsec) interfaces per node! LATENCY W e need an end-to-end latency through the switch network which is in line with the rest of the store hierarchy. If we look a t current processors, we see performance characteristics something like this for the different levels of the store hierarchy: Level Clocks Slowdown Register 1 Level 1 cache 2-3 2-3 Level 2 cache 6-10 2-3 Store 2 0+ 2-3 So each level down the hierarchy is a factor of 2 or 3 slower than the previous one. If we view store accessed over the switch as the next level of the memory hierarchy, this implies that we want to achieve an access through the switch in around 40-60 CPU cycles-that is, in 400-600 nanoseconds for a 1 00-MHz clocked C P U (probably a low estimate). ATiM is currently viewed as the lowest latency nonproprietary switch structure, but such switches have a single switch latency of around 1.25 sec; this implies a full switch network latency of around 4 Fsec for a 256-node machine, a factor of 10 too large. So far I have ignored the latency in getting from a user request out to the switch network. If the network is accessed as a communications device (as will happen with a naive ATM interface), this will involve system calls and the kernel of the operating system. Many thousands of instructions will be executed, translating Teraflops into laptops Stl?UP WUllUCh. COYlZE'X At a recent meeting of the High Performance Computing and Communications and Information Technology Subconi-mittee, the topic was software for scalable parallel processing. Various suppliers of hardware systems and software applications participated, including me. The consensus was that standard third-party software was beginning to emerge on scalable parallel processors, and that as a result, a new world of computing was coming. One participant went so far as to state that \" one day we will run parallelized finite element code on a laptop. \" I share the same view: Scalable parallel processing (SPP) will be the norm, and will pervade all computing from the laptop to the teraflop. For server systems costing $50,000 or more, parallel processors will be standard in the next year, with price erosion of 1 …","PeriodicalId":325213,"journal":{"name":"IEEE Parallel & Distributed Technology: Systems & Applications","volume":"405 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Parallel & Distributed Technology: Systems & Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/M-PDT.1994.329787","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

BANDWIDTH

We need at least 100 Mbyte/sec/node, which after the normal expansion for headers and ECC is around 1 Gbit/sec of raw data on the link. This represents 22 T3 (44.736-Mbit/sec) interfaces per node!

LATENCY

We need an end-to-end latency through the switch network that is in line with the rest of the store hierarchy. If we look at current processors, we see performance characteristics something like this for the different levels of the store hierarchy:

Level            Clocks   Slowdown
Register         1        -
Level 1 cache    2-3      2-3
Level 2 cache    6-10     2-3
Store            20+      2-3

So each level down the hierarchy is a factor of 2 or 3 slower than the previous one. If we view store accessed over the switch as the next level of the memory hierarchy, this implies that we want to achieve an access through the switch in around 40-60 CPU cycles, that is, in 400-600 nanoseconds for a 100-MHz clocked CPU (probably a low estimate). ATM is currently viewed as the lowest-latency nonproprietary switch structure, but such switches have a single-switch latency of around 1.25 μsec; this implies a full switch-network latency of around 4 μsec for a 256-node machine, a factor of 10 too large. So far I have ignored the latency in getting from a user request out to the switch network. If the network is accessed as a communications device (as will happen with a naive ATM interface), this will involve system calls and the kernel of the operating system. Many thousands of instructions will be executed, translating …

Teraflops into laptops
Steve Wallach, Convex

At a recent meeting of the High Performance Computing and Communications and Information Technology Subcommittee, the topic was software for scalable parallel processing. Various suppliers of hardware systems and software applications participated, including me. The consensus was that standard third-party software was beginning to emerge on scalable parallel processors and that, as a result, a new world of computing was coming. One participant went so far as to state that "one day we will run parallelized finite element code on a laptop." I share the same view: scalable parallel processing (SPP) will be the norm, and will pervade all computing from the laptop to the teraflop. For server systems costing $50,000 or more, parallel processors will be standard in the next year, with price erosion of 1 …
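The bandwidth figure above can be sanity-checked with quick arithmetic. The following is a minimal Python sketch using only the numbers quoted in the abstract; the variable names are mine, and the exact header/ECC overhead factor is implied by the text rather than stated:

```python
# Bandwidth check: 100 Mbyte/sec/node of payload expands, with headers
# and ECC, to roughly 1 Gbit/sec of raw link data; express that in T3s.
payload_mbit_per_s = 100 * 8   # 100 Mbyte/sec payload -> 800 Mbit/sec
raw_mbit_per_s = 1000          # ~1 Gbit/sec raw after header/ECC expansion (from the text)
t3_mbit_per_s = 44.736         # capacity of a single T3 circuit

print(f"T3 interfaces per node: {raw_mbit_per_s / t3_mbit_per_s:.1f}")
# -> 22.4, matching the "22 T3 interfaces per node" in the abstract
```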
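The latency argument can be checked the same way. In the sketch below, the per-switch latency and the 400-600 ns budget come from the abstract, but the switch radix is my assumption; the text gives only the ~4 μsec total for a 256-node machine:

```python
import math

# Latency check: a switch access should fit in 40-60 CPU cycles,
# i.e. 400-600 ns on a 100-MHz processor, but a multistage ATM
# network at 1.25 us per switch overshoots that budget ~10x.
cycle_ns = 1000 / 100                       # 10 ns per cycle at 100 MHz
budget_ns = (40 * cycle_ns, 60 * cycle_ns)  # 400-600 ns target

nodes = 256
switch_latency_us = 1.25                    # per ATM switch stage (from the text)
radix = 8                                   # ASSUMED switch radix, not given in the text
stages = math.ceil(math.log(nodes, radix))  # 3 stages to connect 256 nodes
network_us = stages * switch_latency_us     # 3.75 us, near the ~4 us the text cites

print(f"budget: {budget_ns[0]:.0f}-{budget_ns[1]:.0f} ns, network: {network_us:.2f} us")
print(f"overshoot: ~{network_us * 1000 / budget_ns[0]:.0f}x")  # ~9x, i.e. the factor of 10 cited
```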