{"title":"Taraflops into laptops","authors":"S. Wallach","doi":"10.1109/M-PDT.1994.329787","DOIUrl":null,"url":null,"abstract":"BANDWIDTH W e need a t least 100 Mbyte/sec/node, which after the normal expansion for head-ers and ECC is around 1 Gbidsec of raw data on the link. This represents 22 T3 (44.736-Mbidsec) interfaces per node! LATENCY W e need an end-to-end latency through the switch network which is in line with the rest of the store hierarchy. If we look a t current processors, we see performance characteristics something like this for the different levels of the store hierarchy: Level Clocks Slowdown Register 1 Level 1 cache 2-3 2-3 Level 2 cache 6-10 2-3 Store 2 0+ 2-3 So each level down the hierarchy is a factor of 2 or 3 slower than the previous one. If we view store accessed over the switch as the next level of the memory hierarchy, this implies that we want to achieve an access through the switch in around 40-60 CPU cycles-that is, in 400-600 nanoseconds for a 1 00-MHz clocked C P U (probably a low estimate). ATiM is currently viewed as the lowest latency nonproprietary switch structure, but such switches have a single switch latency of around 1.25 sec; this implies a full switch network latency of around 4 Fsec for a 256-node machine, a factor of 10 too large. So far I have ignored the latency in getting from a user request out to the switch network. If the network is accessed as a communications device (as will happen with a naive ATM interface), this will involve system calls and the kernel of the operating system. Many thousands of instructions will be executed, translating Teraflops into laptops Stl?UP WUllUCh. COYlZE'X At a recent meeting of the High Performance Computing and Communications and Information Technology Subconi-mittee, the topic was software for scalable parallel processing. Various suppliers of hardware systems and software applications participated, including me. The consensus was that standard third-party software was beginning to emerge on scalable parallel processors, and that as a result, a new world of computing was coming. One participant went so far as to state that \" one day we will run parallelized finite element code on a laptop. \" I share the same view: Scalable parallel processing (SPP) will be the norm, and will pervade all computing from the laptop to the teraflop. For server systems costing $50,000 or more, parallel processors will be standard in the next year, with price erosion of 1 …","PeriodicalId":325213,"journal":{"name":"IEEE Parallel & Distributed Technology: Systems & Applications","volume":"405 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Parallel & Distributed Technology: Systems & Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/M-PDT.1994.329787","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
BANDWIDTH

We need at least 100 Mbyte/sec/node, which after the normal expansion for headers and ECC is around 1 Gbit/sec of raw data on the link. This represents 22 T3 (44.736-Mbit/sec) interfaces per node!

LATENCY

We need an end-to-end latency through the switch network that is in line with the rest of the store hierarchy. If we look at current processors, we see performance characteristics something like this for the different levels of the store hierarchy:

Level          Clocks  Slowdown
Register       1
Level 1 cache  2-3     2-3
Level 2 cache  6-10    2-3
Store          20+     2-3

So each level down the hierarchy is a factor of 2 or 3 slower than the previous one. If we view store accessed over the switch as the next level of the memory hierarchy, this implies that we want to achieve an access through the switch in around 40-60 CPU cycles, that is, in 400-600 nanoseconds for a 100-MHz clocked CPU (probably a low estimate). ATM is currently viewed as the lowest-latency nonproprietary switch structure, but such switches have a single-switch latency of around 1.25 µsec; this implies a full switch-network latency of around 4 µsec for a 256-node machine, a factor of 10 too large.

So far I have ignored the latency in getting from a user request out to the switch network. If the network is accessed as a communications device (as will happen with a naive ATM interface), this will involve system calls and the kernel of the operating system. Many thousands of instructions will be executed, translating …

Teraflops into laptops
Steve Wallach, Convex

At a recent meeting of the High Performance Computing and Communications and Information Technology Subcommittee, the topic was software for scalable parallel processing. Various suppliers of hardware systems and software applications participated, including me. The consensus was that standard third-party software was beginning to emerge on scalable parallel processors, and that as a result, a new world of computing was coming. One participant went so far as to state that "one day we will run parallelized finite element code on a laptop." I share the same view: Scalable parallel processing (SPP) will be the norm, and will pervade all computing from the laptop to the teraflop. For server systems costing $50,000 or more, parallel processors will be standard in the next year, with price erosion of 1 …
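A minimal sketch of the arithmetic behind the bandwidth and latency figures above, in Python. The 25% header/ECC expansion factor is an assumption chosen to match the ~1 Gbit/sec figure quoted in the text; the T3 rate, per-switch latency, and 4-µsec network latency are taken from the text as given.

```python
# Back-of-the-envelope check of the bandwidth and latency figures above.
# Assumption (not from the article): a 25% header/ECC expansion factor.

T3_MBITS = 44.736                     # T3 line rate, Mbit/sec (from the text)

# Bandwidth: 100 Mbyte/sec/node of payload, expanded for headers and ECC.
payload_mbits = 100 * 8               # 800 Mbit/sec of payload
raw_mbits = payload_mbits * 1.25      # ~1000 Mbit/sec on the link (assumed 25% overhead)
t3_links = raw_mbits / T3_MBITS       # ~22.4, i.e., the 22 T3 interfaces per node

# Latency: a 40-60 cycle budget on a 100-MHz CPU.
cycle_ns = 1_000 / 100                # 10 ns per cycle at 100 MHz
budget_ns = (40 * cycle_ns, 60 * cycle_ns)    # 400-600 ns

hop_us = 1.25                         # single ATM switch latency (from the text)
network_ns = 4.0 * 1_000              # quoted full-network latency for 256 nodes
implied_hops = network_ns / (hop_us * 1_000)  # ~3.2 switch traversals implied
factor = network_ns / budget_ns[0]    # 4000 / 400 = the quoted factor of 10

print(f"raw link rate: {raw_mbits:.0f} Mbit/sec -> {t3_links:.1f} T3 links/node")
print(f"latency budget: {budget_ns[0]:.0f}-{budget_ns[1]:.0f} ns; "
      f"network: {network_ns:.0f} ns "
      f"(~{implied_hops:.1f} hops, {factor:.0f}x over budget)")
```

Running this reproduces the numbers in the text: roughly 1 Gbit/sec of raw link rate (22 T3 interfaces), a 400-600 ns access budget, and a 4-µsec network latency about a factor of 10 too large.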