Partitioning Multi-Threaded Processors with a Large Number of Threads

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005. Pub Date: 2005-03-20. DOI: 10.1109/ISPASS.2005.1430566

A. El-Moursy, Rajeev Garg, D. Albonesi, S. Dwarkadas


Citations: 33

Abstract

Today's general-purpose processors are increasingly using multithreading in order to better leverage the additional on-chip real estate available with each technology generation. Simultaneous multi-threading (SMT) was originally proposed as a large dynamic superscalar processor with monolithic hardware structures shared among all threads. Intel's hyper-threaded Pentium 4 processor partitions the queue structures between two threads, demonstrating more balanced performance by reducing the hoarding of structures by a single thread. IBM's Power5 processor is a 2-way chip multiprocessor (CMP) of SMT processors, each supporting 2 threads, which significantly reduces design complexity and can improve power efficiency. This paper examines processor partitioning options for larger numbers of threads on a chip. While growing transistor budgets permit four- and eight-thread processors to be designed, design complexity, power dissipation, and wire scaling limitations create significant barriers to their actual realization. We explore the design choices of sharing, or of partitioning and distributing, the front end (instruction cache, instruction fetch, and dispatch), the execution units and associated state, as well as the L1 Dcache banks, in a clustered multi-threaded (CMT) processor. We show that the best performance is obtained by restricting the sharing of the L1 Dcache banks and the execution engines among threads. On the other hand, significant sharing of the front-end resources is the best approach. When compared against large monolithic SMT processors, a CMT processor provides very competitive IPC performance: on average, 90-96% of that of partitioned SMT, while being more scalable and much more power efficient. In a CMP organization, the gap between SMT and CMT processors shrinks further, making a CMP of CMT processors a highly viable alternative for the future.
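The design space described in the abstract (how many threads share each front end, execution cluster, and L1 Dcache bank) can be pictured with a small sketch. This is only an illustration, not the authors' simulator or methodology: the `SharingConfig` class, the `groups` and `layout` helpers, and the example values (8 threads, a fully shared front end, 2 threads per execution cluster and per Dcache bank) are all hypothetical and chosen only to mirror the qualitative finding that the front end is shared widely while execution engines and Dcache banks are shared narrowly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharingConfig:
    """Hypothetical description of how many threads share each resource."""
    threads: int          # hardware threads on the chip
    front_end_share: int  # threads sharing one front end (I-cache, fetch, dispatch)
    exec_share: int       # threads sharing one execution cluster
    dcache_share: int     # threads sharing one L1 Dcache bank


def groups(threads: int, share: int) -> list[list[int]]:
    """Split thread IDs 0..threads-1 into consecutive groups of size `share`."""
    if share <= 0 or threads % share != 0:
        raise ValueError("share must be positive and divide the thread count")
    return [list(range(start, start + share)) for start in range(0, threads, share)]


def layout(cfg: SharingConfig) -> dict[str, list[list[int]]]:
    """Map each resource type to the thread groups that share one instance of it."""
    return {
        "front_end": groups(cfg.threads, cfg.front_end_share),
        "execution": groups(cfg.threads, cfg.exec_share),
        "l1_dcache": groups(cfg.threads, cfg.dcache_share),
    }


# Illustrative values only; they are assumptions, not figures from the paper.
cfg = SharingConfig(threads=8, front_end_share=8, exec_share=2, dcache_share=2)
plan = layout(cfg)
for resource, thread_groups in plan.items():
    print(resource, thread_groups)
```

Setting every share equal to the thread count would describe a monolithic SMT-like organization in this toy model, while setting every share to 1 would describe fully private resources per thread; the configurations the abstract favors lie between those extremes.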




Source journal

IEEE International Symposium on Performance Analysis of Systems and Software, 2005. ISPASS 2005.

Self-citation rate

0.00%

Number of publications