A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

IF 2 4区 计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS ACM Transactions on Computer Systems Pub Date : 2012-04-01 DOI:10.1145/2166879.2166882
Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron
{"title":"A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors","authors":"Mark Gebhart, Daniel R. Johnson, D. Tarjan, S. Keckler, W. Dally, Erik Lindholm, K. Skadron","doi":"10.1145/2166879.2166882","DOIUrl":null,"url":null,"abstract":"Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.","PeriodicalId":50918,"journal":{"name":"ACM Transactions on Computer Systems","volume":"28 1","pages":"8:1-8:38"},"PeriodicalIF":2.0000,"publicationDate":"2012-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Computer Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/2166879.2166882","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 39

Abstract

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
高效处理量处理器的分级线程调度和寄存器文件
现代图形处理单元(gpu)使用大量的硬件线程来隐藏功能单元和内存访问延迟。极端多线程需要一个复杂的线程调度器和一个大的寄存器文件,这在能量和延迟方面都是昂贵的。我们提出了两种互补的技术来减少像gpu这样的大线程处理器上的能量。首先,我们研究了一个两级线程调度器,它维护一小组活动线程来隐藏ALU和本地内存访问延迟,以及一组较大的挂起线程来隐藏主内存延迟。减少调度器每个周期必须考虑的线程数可以提高调度器的能源效率。其次,我们建议用分层寄存器文件替换现代设计中的单片寄存器文件。我们探讨了层次结构的各种权衡,包括层次结构中的级别数量和每个级别上的条目数量。我们考虑硬件管理的缓存方案和软件管理的方案,其中编译器负责编排寄存器文件层次结构中的所有数据移动。结合分层寄存器文件,我们的两级线程调度器通过仅为活动线程在寄存器文件层次结构的上层分配条目,进一步减少了能量。在各种现实世界的图形和计算工作负载中进行平均计算,活动线程计数可以减少4倍,对性能的影响最小,我们最有效的三级软件管理的寄存器文件层次结构可以将寄存器文件能量减少54%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Transactions on Computer Systems
ACM Transactions on Computer Systems 工程技术-计算机:理论方法
CiteScore
4.00
自引率
0.00%
发文量
7
审稿时长
1 months
期刊介绍: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.
期刊最新文献
PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation Trinity: High-Performance and Reliable Mobile Emulation through Graphics Projection Hardware-software Collaborative Tiered-memory Management Framework for Virtualization Diciclo: Flexible User-level Services for Efficient Multitenant Isolation Modeling the Interplay between Loop Tiling and Fusion in Optimizing Compilers Using Affine Relations
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1