Scale-out processors

P. Lotfi-Kamran, Boris Grot, M. Ferdman, Stavros Volos, Yusuf Onur Koçberber, Javier Picorel, Almutaz Adileh, Djordje Jevdjic, Sachin Idgunji, Emre Ozer, B. Falsafi
{"title":"Scale-out processors","authors":"P. Lotfi-Kamran, Boris Grot, M. Ferdman, Stavros Volos, Yusuf Onur Koçberber, Javier Picorel, Almutaz Adileh, Djordje Jevdjic, Sachin Idgunji, Emre Ozer, B. Falsafi","doi":"10.1145/2366231.2337217","DOIUrl":null,"url":null,"abstract":"Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly-sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance-density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors which have optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., interpod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scaleout chips improve throughput by 5x-6.5x over conventional and by 1.6x-1.9x over emerging tiled organizations.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"89 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"180","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2366231.2337217","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 180

Abstract

Scale-out datacenters mandate high per-server throughput to get the maximum benefit from the large TCO investment. Emerging applications (e.g., data serving and web search) that run in these datacenters operate on vast datasets that are not accommodated by the on-die caches of existing server chips. Large caches reduce the die area available for cores and lower performance through long access latency when instructions are fetched. Performance on scale-out workloads is maximized through a modestly sized last-level cache that captures the instruction footprint at the lowest possible access latency. In this work, we introduce a methodology for designing scalable and efficient scale-out server processors. Based on a metric of performance density, we facilitate the design of optimal multi-core configurations, called pods. Each pod is a complete server that tightly couples a number of cores to a small last-level cache using a fast interconnect. Replicating the pod to fill the die area yields processors with optimal performance density, leading to maximum per-chip throughput. Moreover, as each pod is a stand-alone server, scale-out processors avoid the expense of global (i.e., inter-pod) interconnect and coherence. These features synergistically maximize throughput, lower design complexity, and improve technology scalability. In 20nm technology, scale-out chips improve throughput by 5x-6.5x over conventional designs and by 1.6x-1.9x over emerging tiled organizations.
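To make the performance-density idea concrete, below is a minimal, self-contained Python sketch (not from the paper) that sweeps hypothetical pod configurations, scores each by throughput per unit area, and tiles a die with replicas of the winner. All area and throughput numbers, and the toy cache-benefit and interconnect-penalty models, are invented placeholders for illustration only.

```python
# Illustrative sketch of the performance-density methodology: pick the pod
# (cores + small LLC) that maximizes throughput per mm^2, then replicate it
# to fill the die. Every constant below is an assumed placeholder, not a
# figure from the paper.
from dataclasses import dataclass

@dataclass
class PodConfig:
    cores: int    # cores per pod
    llc_mb: int   # last-level cache capacity per pod (MB)

CORE_AREA_MM2 = 2.0         # assumed area of one core
LLC_AREA_MM2_PER_MB = 1.5   # assumed LLC area per MB
DIE_AREA_MM2 = 200.0        # assumed die budget available for pods

def pod_area(cfg: PodConfig) -> float:
    return cfg.cores * CORE_AREA_MM2 + cfg.llc_mb * LLC_AREA_MM2_PER_MB

def pod_throughput(cfg: PodConfig) -> float:
    # Toy model: per-core throughput improves as the LLC captures more of
    # the instruction footprint (diminishing returns past ~4 MB), while a
    # larger pod pays a growing intra-pod interconnect latency penalty.
    cache_benefit = min(1.0, cfg.llc_mb / 4.0) ** 0.5
    interconnect_penalty = 1.0 / (1.0 + 0.01 * cfg.cores)
    return cfg.cores * cache_benefit * interconnect_penalty

def performance_density(cfg: PodConfig) -> float:
    return pod_throughput(cfg) / pod_area(cfg)

# Sweep candidate pod configurations and keep the densest one.
candidates = [PodConfig(c, m) for c in (8, 16, 32, 64) for m in (2, 4, 8, 16)]
best = max(candidates, key=performance_density)

# Tile the die with replicas of the best pod; chip throughput scales with
# pod count because pods share no global interconnect or coherence state.
pods_per_die = int(DIE_AREA_MM2 // pod_area(best))
chip_throughput = pods_per_die * pod_throughput(best)

print(f"best pod: {best.cores} cores + {best.llc_mb} MB LLC")
print(f"pods per die: {pods_per_die}, chip throughput (arb. units): {chip_throughput:.1f}")
```

Under this toy model the sweep favors a mid-sized pod with a modest LLC, mirroring the paper's point that performance density, rather than raw cache capacity or core count, is what maximizes per-chip throughput.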