Scale-out processors

P. Lotfi-Kamran, Boris Grot, M. Ferdman, Stavros Volos, Yusuf Onur Koçberber, Javier Picorel, Almutaz Adileh, Djordje Jevdjic, Sachin Idgunji, Emre Ozer, B. Falsafi
{"title":"Scale-out processors","authors":"P. Lotfi-Kamran, Boris Grot, M. Ferdman, Stavros Volos, Yusuf Onur Koçberber, Javier Picorel, Almutaz Adileh, Djordje Jevdjic, Sachin Idgunji, Emre Ozer, B. Falsafi","doi":"10.1145/2366231.2337217","DOIUrl":null,"url":null,"abstract":"Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., interpod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scaleout chips improve throughput by 5x-6.5x over conventional and by 1.6x-1.9x over emerging tiled organizations.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"180","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2366231.2337217","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 180

Abstract

Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by the on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors with optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., inter-pod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional designs and by 1.6x-1.9x over emerging tiled organizations.
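To make the performance-density idea concrete, below is a minimal, self-contained Python sketch (not from the paper) that sweeps hypothetical pod configurations, scores each by throughput per unit area, and tiles a die with replicas of the winner. All area and throughput numbers, and the toy cache-benefit and interconnect-penalty models, are invented placeholders for illustration only.

```python
# Illustrative sketch of the performance-density methodology: pick the pod
# (cores + small LLC) that maximizes throughput per mm^2, then replicate it
# to fill the die. Every constant below is an assumed placeholder, not a
# figure from the paper.
from dataclasses import dataclass

@dataclass
class PodConfig:
    cores: int    # cores per pod
    llc_mb: int   # last-level cache capacity per pod (MB)

CORE_AREA_MM2 = 2.0         # assumed area of one core
LLC_AREA_MM2_PER_MB = 1.5   # assumed LLC area per MB
DIE_AREA_MM2 = 200.0        # assumed die budget available for pods

def pod_area(cfg: PodConfig) -> float:
    return cfg.cores * CORE_AREA_MM2 + cfg.llc_mb * LLC_AREA_MM2_PER_MB

def pod_throughput(cfg: PodConfig) -> float:
    # Toy model: per-core throughput improves as the LLC captures more of
    # the instruction footprint (diminishing returns past ~4 MB), while a
    # larger pod pays a growing intra-pod interconnect latency penalty.
    cache_benefit = min(1.0, cfg.llc_mb / 4.0) ** 0.5
    interconnect_penalty = 1.0 / (1.0 + 0.01 * cfg.cores)
    return cfg.cores * cache_benefit * interconnect_penalty

def performance_density(cfg: PodConfig) -> float:
    return pod_throughput(cfg) / pod_area(cfg)

# Sweep candidate pod configurations and keep the densest one.
candidates = [PodConfig(c, m) for c in (8, 16, 32, 64) for m in (2, 4, 8, 16)]
best = max(candidates, key=performance_density)

# Tile the die with replicas of the best pod; chip throughput scales with
# pod count because pods share no global interconnect or coherence state.
pods_per_die = int(DIE_AREA_MM2 // pod_area(best))
chip_throughput = pods_per_die * pod_throughput(best)

print(f"best pod: {best.cores} cores + {best.llc_mb} MB LLC")
print(f"pods per die: {pods_per_die}, chip throughput (arb. units): {chip_throughput:.1f}")
```

Under this toy model the sweep favors a mid-sized pod with a modest LLC, mirroring the paper's point that performance density, rather than raw cache capacity or core count, is what maximizes per-chip throughput.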