Can traditional programming bridge the Ninja performance gap for parallel computing applications?

N. Satish, Changkyu Kim, J. Chhugani, Hideki Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, P. Dubey
{"title":"Can traditional programming bridge the Ninja performance gap for parallel computing applications?","authors":"N. Satish, Changkyu Kim, J. Chhugani, Hideki Saito, R. Krishnaiyer, M. Smelyanskiy, M. Girkar, P. Dubey","doi":"10.1145/2742910","DOIUrl":null,"url":null,"abstract":"Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and the performance-vs-programming effort effectiveness of common multi-core processors and upcoming manycore architectures in delivering significant speedup, and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the “Ninja gap”, which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap if left unaddressed will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, as compared to the very high effort in producing Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and even further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC) which has more cores and wider SIMD. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer a more stable and predictable performance growth over future architectures, offering strong evidence that radical language changes are not required.","PeriodicalId":193578,"journal":{"name":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"88","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 39th Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2742910","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 88

Abstract

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and more complex memory hierarchy, have made it increasingly challenging to extract performance from applications. It is believed by some that traditional approaches to programming do not apply to these modern processors and hence radical new languages must be discovered. In this paper, we question this thinking and offer evidence in support of traditional programming methods and of the performance-vs-programming-effort effectiveness of common multi-core processors and upcoming many-core architectures in delivering significant speedups and close-to-optimal performance for commonly used parallel computing workloads. We first quantify the extent of the “Ninja gap”, which is the performance gap between naively written C/C++ code that is parallelism unaware (often serial) and best-optimized code on modern multi-/many-core processors. Using a set of representative throughput computing benchmarks, we show that there is an average Ninja gap of 24X (up to 53X) for a recent 6-core Intel® Core™ i7 X980 Westmere CPU, and that this gap, if left unaddressed, will inevitably increase. We show how a set of well-known algorithmic changes coupled with advancements in modern compiler technology can bring down the Ninja gap to an average of just 1.3X. These changes typically require low programming effort, compared to the very high effort required to produce Ninja code. We also discuss hardware support for programmability that can reduce the impact of these changes and further increase programmer productivity. We show equally encouraging results for the upcoming Intel® Many Integrated Core architecture (Intel® MIC), which has more cores and wider SIMD units. We thus demonstrate that we can contain the otherwise uncontrolled growth of the Ninja gap and offer more stable and predictable performance growth across future architectures, providing strong evidence that radical language changes are not required.
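To make the abstract's central claim concrete, the sketch below shows the kind of low-effort, "traditional" change it describes: the same C loop written naively, then annotated so that a modern compiler can parallelize it across cores and vectorize it for the available SIMD width. The saxpy kernel and the specific OpenMP pragmas are illustrative assumptions, not the paper's benchmarks or its exact compiler directives.

```c
/*
 * Minimal sketch (not taken from the paper): a naive, parallelism-unaware
 * loop versus the same loop with standard compiler annotations.
 */
#include <stddef.h>

/* Naive baseline: one scalar iteration at a time, no parallelism. */
void saxpy_naive(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Same algorithm with traditional annotations: OpenMP distributes the
 * iterations across cores, and the simd clause asks the compiler to
 * vectorize the loop body for the target's SIMD width. The restrict
 * qualifiers assert no aliasing, which helps auto-vectorization.
 * No new language is required. */
void saxpy_annotated(size_t n, float a, const float *restrict x,
                     float *restrict y)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Compiled with an OpenMP-capable compiler (e.g. `-fopenmp -O3`), the annotated version is the sort of low-programming-effort change the paper argues can close most of the Ninja gap, in contrast to hand-written intrinsics or assembly.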