H. Li, Jinuk Luke Shin, G. Konstadinidis, F. Schumacher, V. Krishnaswamy, Hoyeol Cho, Sudesna Dash, R. Masleid, Chaoyang Zheng, Yuanjung David Lin, P. Loewenstein, Heechoul Park, V. Srinivasan, Dawei Huang, C. Hwang, W. Hsu, C. McAllister
2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers. Published 2015-03-19. DOI: 10.1109/ISSCC.2015.7062931 (https://doi.org/10.1109/ISSCC.2015.7062931)
4.2 A 20nm 32-Core 64MB L3 cache SPARC M7 processor
The SPARC M7 processor delivers a more than 3× throughput improvement over its predecessor, the SPARC M6, for commercial applications. It introduces new design features, such as the S4 core, a 64MB L3 cache subsystem with application data integrity, a low-latency, high-throughput on-chip network (OCN), a database analytic accelerator (DAX), fine-grain adaptive power management and 1.5× higher SerDes I/O bandwidth for memory, coherency and system interfaces (Fig. 4.2.1) [1]. The enhancements in the S4 core over the S3 core [2] include a new L2 cache scheme, support for visual instruction set (VIS) extensions, virtual address masking and user-level synchronization instructions, continuing the single-thread performance improvement that SPARC processors have seen since the SPARC T4. In addition, a hierarchical modular approach, called the SPARC cache cluster (SCC), is used for the core-L2-L3 cache system. Within each SCC, all four cores share a single 256KB L2 instruction cache and each core pair has its own 256KB L2 data cache. The L2 caches are organized as two banks and eight ways to deliver greater than 1TB/s of bandwidth to the four cores. This L2 system delivers 2× more throughput per core with a 1.5× increase in size and the same latency as the previous-generation L2 cache scheme. The L2 caches connect to an 8MB, 8-way set-associative partitioned L3 cache. Having a localized L3 cache within each SCC reduces L3 latency by 25%. The chip contains eight SCCs for a total of 32 cores with 256 threads and a 64MB L3 cache with 1.6TB/s of bandwidth. To support the bandwidth and latency requirements of 256 threads and other system agents, the OCN architecture is implemented in place of the crossbar-based network used in previous SPARC processors. Each SCC connects to the OCN, which in turn connects to four on-chip memory controllers (MCUs), coherency systems and eight database analytic accelerator (DAX) engines.
The SPARC M7 introduces a customized DAX engine to optimize performance for Oracle databases. The eight DAX engines handle simple query predicates, decompression, message passing and interrupts across cluster nodes. This query accelerator provides up to 10× better performance for single-stream decompression.
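
The per-cluster figures in the abstract can be cross-checked against the chip-level totals with a little arithmetic. The following is a minimal sketch, not part of the paper: the constant and function names are hypothetical, the sizes are taken to be binary (1KB = 1024 bytes), and the threads-per-core value is derived from the stated totals rather than quoted from the source.

```python
KB = 1024
MB = 1024 * KB

# Figures quoted in the abstract (names are illustrative only).
SCC_COUNT = 8                 # SPARC cache clusters per chip
CORES_PER_SCC = 4             # cores in each cluster
THREADS_TOTAL = 256           # hardware threads per chip
L2_ICACHE_BYTES = 256 * KB    # one L2 instruction cache shared by all four cores
L2_DCACHE_BYTES = 256 * KB    # one L2 data cache per core pair
CORES_PER_L2_DCACHE = 2
L3_BYTES_PER_SCC = 8 * MB     # local L3 partition in each cluster


def chip_totals() -> dict:
    """Derive chip-level totals from the per-cluster figures."""
    cores = SCC_COUNT * CORES_PER_SCC
    l2_dcaches_per_scc = CORES_PER_SCC // CORES_PER_L2_DCACHE
    l2_bytes_per_scc = L2_ICACHE_BYTES + l2_dcaches_per_scc * L2_DCACHE_BYTES
    return {
        "cores": cores,
        # Derived value: total threads divided by total cores.
        "threads_per_core": THREADS_TOTAL // cores,
        "l2_kb_per_scc": l2_bytes_per_scc // KB,
        "l2_mb_total": SCC_COUNT * l2_bytes_per_scc // MB,
        "l3_mb_total": SCC_COUNT * L3_BYTES_PER_SCC // MB,
    }


if __name__ == "__main__":
    for name, value in chip_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions, eight clusters of four cores give the 32 cores stated in the abstract, and eight 8MB partitions give the 64MB L3. The L2 total per cluster (one instruction cache plus two data caches) is a derived figure that the abstract does not state directly.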