4.2 A 20nm 32-Core 64MB L3 cache SPARC M7 processor

H. Li, Jinuk Luke Shin, G. Konstadinidis, F. Schumacher, V. Krishnaswamy, Hoyeol Cho, Sudesna Dash, R. Masleid, Chaoyang Zheng, Yuanjung David Lin, P. Loewenstein, Heechoul Park, V. Srinivasan, Dawei Huang, C. Hwang, W. Hsu, C. McAllister
Published in: 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers
Publication date: 2015-03-19
DOI: 10.1109/ISSCC.2015.7062931 (https://doi.org/10.1109/ISSCC.2015.7062931)
Citations: 17

Abstract

The SPARC M7 processor delivers a throughput improvement of more than 3× over its predecessor, the SPARC M6, for commercial applications. It introduces new design features, including the S4 core, a 64MB L3 cache subsystem with application data integrity, a low-latency, high-throughput on-chip network (OCN), a database analytic accelerator (DAX), fine-grain adaptive power management, and 1.5× higher SerDes I/O bandwidth for the memory, coherency and system interfaces (Fig. 4.2.1) [1]. The enhancements in the S4 core over the S3 core [2] include a new L2 cache scheme, support for visual instruction set (VIS) extensions, virtual address masking, and user-level synchronization instructions, continuing the single-thread performance improvement that SPARC processors have shown since the SPARC T4.

In addition, a hierarchical modular approach, called the SPARC cache cluster (SCC), is used for the core-L2-L3 cache system. Within each SCC, all four cores share a single 256KB L2 instruction cache, and each core pair has its own 256KB L2 data cache. The L2 caches are organized as two banks and eight ways and deliver more than 1TB/s of bandwidth to the four cores. Compared with the previous-generation L2 cache scheme, this L2 system delivers 2× more throughput per core with a 1.5× increase in size and the same latency. The L2 caches connect to an 8MB, 8-way set-associative partitioned L3 cache. Localizing the L3 cache within each SCC reduces L3 latency by 25%. The chip contains eight SCCs, for a total of 32 cores with 256 threads and a 64MB L3 cache with 1.6TB/s of bandwidth.

To support the bandwidth and latency requirements of 256 threads and other system agents, the OCN architecture replaces the crossbar-based network used in previous SPARC processors. Each SCC connects to the OCN, which in turn connects to four on-chip memory controllers (MCUs), the coherency systems and eight DAX engines. The SPARC M7 introduces a customized DAX engine to optimize performance for Oracle databases. The eight DAX engines handle simple query predicates, decompression, message passing and interrupts across cluster nodes. This query accelerator provides up to 10× better performance for single-stream decompression.
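
The chip-level totals quoted in the abstract follow from the per-SCC figures. The short Python sketch below tallies them as a sanity check. The class and function names (SparcCacheCluster, chip_totals) are hypothetical and chosen only for illustration; the sizes and counts come from the abstract, and the aggregate L2 capacity is a derived figure that the abstract does not state explicitly.

```python
from dataclasses import dataclass

KB = 1024
MB = 1024 * KB


@dataclass(frozen=True)
class SparcCacheCluster:
    """Per-SCC figures taken from the abstract (names are illustrative)."""
    cores: int = 4
    l2_icache_bytes: int = 256 * KB  # one L2 instruction cache shared by all four cores
    l2_dcache_bytes: int = 256 * KB  # one L2 data cache per core pair
    l3_bytes: int = 8 * MB           # local L3 partition

    @property
    def core_pairs(self) -> int:
        return self.cores // 2

    @property
    def l2_total_bytes(self) -> int:
        # One shared I-cache plus one D-cache for each core pair.
        return self.l2_icache_bytes + self.core_pairs * self.l2_dcache_bytes


def chip_totals(num_sccs: int = 8, threads: int = 256) -> dict:
    """Aggregate the per-SCC figures across the whole chip."""
    scc = SparcCacheCluster()
    cores = num_sccs * scc.cores
    return {
        "cores": cores,
        "threads_per_core": threads // cores,
        "l2_total_kb": num_sccs * scc.l2_total_bytes // KB,  # derived, not stated in the abstract
        "l3_total_mb": num_sccs * scc.l3_bytes // MB,
    }


totals = chip_totals()
print(totals)
```

Under these figures, eight SCCs of four cores give the 32 cores, 256 threads correspond to eight threads per core, and eight 8MB partitions give the 64MB L3.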