4.2 A 20nm 32-Core 64MB L3 Cache SPARC M7 Processor

2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers
Pub Date: 2015-03-19
DOI: 10.1109/ISSCC.2015.7062931

H. Li, Jinuk Luke Shin, G. Konstadinidis, F. Schumacher, V. Krishnaswamy, Hoyeol Cho, Sudesna Dash, R. Masleid, Chaoyang Zheng, Yuanjung David Lin, P. Loewenstein, Heechoul Park, V. Srinivasan, Dawei Huang, C. Hwang, W. Hsu, C. McAllister


Citations: 17

Abstract

The SPARC M7 processor delivers more than 3× the throughput of its predecessor, the SPARC M6, for commercial applications. It introduces new design features, such as the S4 core, a 64MB L3 cache subsystem with application data integrity, a low-latency, high-throughput on-chip network (OCN), a database analytic accelerator (DAX), fine-grain adaptive power management and 1.5× higher SerDes I/O bandwidth for memory, coherency and system interfaces (Fig. 4.2.1) [1].

The enhancements in the S4 core over the S3 core [2] include a new L2 cache scheme, support for visual instruction set (VIS) extensions, virtual address masking and user-level synchronization instructions. These continue the single-thread performance improvement that SPARC processors have delivered since the SPARC T4. In addition, a hierarchical modular approach, called the SPARC cache cluster (SCC), is used for the core-L2-L3 cache system. Within an SCC, all four cores share a single 256KB L2 instruction cache and each core pair has its own 256KB L2 data cache. The L2 caches are organized as 2 banks and 8 ways to deliver more than 1TB/s of bandwidth to the four cores. This L2 system delivers 2× more throughput for each core, with a 1.5× increase in size and the same latency as the previous-generation L2 cache scheme. The L2 caches connect to an 8MB, 8-way set-associative partitioned L3 cache. Having a localized L3 cache within each SCC reduces L3 latency by 25%. The chip contains eight SCCs for a total of 32 cores with 256 threads and a 64MB L3 cache with 1.6TB/s bandwidth.

To support the bandwidth and latency requirements of 256 threads and other system agents, the OCN architecture replaces the crossbar-based network used in previous SPARC processors. Each SCC connects to the OCN, which in turn connects to four on-chip memory controllers (MCUs), coherency systems and eight DAX engines. The SPARC M7 introduces a customized DAX engine in an effort to optimize performance for Oracle databases. The eight DAX engines handle simple query predicates, decompression, message passing and interrupts across cluster nodes. This query accelerator provides up to 10× better performance for single-stream decompression.
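The chip-level totals in the abstract follow from the per-SCC figures by simple multiplication. The sketch below redoes that arithmetic in Python as a sanity check. The constant and function names are illustrative, not taken from the paper, and the derived values that the abstract does not state outright (threads per core, L2 capacity per SCC and per chip) are computed only from the quoted numbers.

```python
# Figures quoted in the abstract.
SCC_COUNT = 8          # SPARC cache clusters per chip
CORES_PER_SCC = 4      # S4 cores per SCC
TOTAL_THREADS = 256    # hardware threads per chip
L2_ICACHE_KB = 256     # one L2 instruction cache shared by all four cores in an SCC
L2_DCACHE_KB = 256     # one L2 data cache per core pair
L3_PER_SCC_MB = 8      # localized L3 partition per SCC


def chip_totals():
    """Derive chip-level totals from the per-SCC figures (illustrative helper)."""
    cores = SCC_COUNT * CORES_PER_SCC
    core_pairs_per_scc = CORES_PER_SCC // 2
    # Per SCC: one shared instruction cache plus one data cache per core pair.
    l2_per_scc_kb = L2_ICACHE_KB + core_pairs_per_scc * L2_DCACHE_KB
    return {
        "cores": cores,
        "threads_per_core": TOTAL_THREADS // cores,
        "l2_per_scc_kb": l2_per_scc_kb,
        "l2_total_mb": SCC_COUNT * l2_per_scc_kb / 1024,
        "l3_total_mb": SCC_COUNT * L3_PER_SCC_MB,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

The core count and L3 capacity reproduce the 32 cores and 64MB stated in the abstract. The remaining entries are derived quantities and should be read as consequences of the quoted figures rather than as numbers reported by the authors.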
