Feedback-directed pipeline parallelism

M. A. Suleman, Moinuddin K. Qureshi, Khubaib, Y. Patt
{"title":"Feedback-directed pipeline parallelism","authors":"M. A. Suleman, Moinuddin K. Qureshi, Khubaib, Y. Patt","doi":"10.1145/1854273.1854296","DOIUrl":null,"url":null,"abstract":"Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"64","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1854273.1854296","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 64

Abstract

Extracting high performance from Chip Multiprocessors requires that the application be parallelized. A common software technique to parallelize loops is pipeline parallelism in which the programmer/compiler splits each loop iteration into stages and each stage runs on a certain number of cores. It is important to choose the number of cores for each stage carefully because the core-to-stage allocation determines performance and power consumption. Finding the best core-to-stage allocation for an application is challenging because the number of possible allocations is large, and the best allocation depends on the input set and machine configuration. This paper proposes Feedback-Directed Pipelining (FDP), a software framework that chooses the core-to-stage allocation at run-time. FDP first maximizes the performance of the workload and then saves power by reducing the number of active cores, without impacting performance. Our evaluation on a real SMP system with two Core2Quad processors (8 cores) shows that FDP provides an average speedup of 4.2x which is significantly higher than the 2.3x speedup obtained with a practical profile-based allocation. We also show that FDP is robust to changes in machine configuration and input set.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
反馈导向的管道并行性
从芯片多处理器中提取高性能要求应用程序并行化。一种常见的软件并行循环技术是管道并行,在这种技术中,程序员/编译器将每个循环迭代分成几个阶段,每个阶段在一定数量的内核上运行。仔细选择每个阶段的核心数量非常重要,因为核心到阶段的分配决定了性能和功耗。为应用程序找到最佳的核心到阶段分配是一项挑战,因为可能的分配数量很大,而最佳分配取决于输入集和机器配置。本文提出了一种在运行时选择核心到阶段分配的软件框架——反馈导向管道(FDP)。FDP首先最大化工作负载的性能,然后通过减少活动核的数量来节省功耗,而不会影响性能。我们对具有两个Core2Quad处理器(8核)的真实SMP系统的评估表明,FDP提供了4.2x的平均加速,这明显高于实际基于配置文件分配获得的2.3x加速。我们还证明了FDP对机器配置和输入集的变化具有鲁棒性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Reducing task creation and termination overhead in explicitly parallel programs An intra-tile cache set balancing scheme NUcache: A multicore cache organization based on Next-Use distance Towards a science of parallel programming Discovering and understanding performance bottlenecks in transactional applications
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1