vSensor

Q1 Computer Science ACM Sigplan Notices Pub Date : 2018-02-10 DOI:10.1145/3200691.3178497
Xiongchao Tang, Jidong Zhai, Xuehai Qian, Bingsheng He, W. Xue, Wenguang Chen
{"title":"vSensor","authors":"Xiongchao Tang, Jidong Zhai, Xuehai Qian, Bingsheng He, W. Xue, Wenguang Chen","doi":"10.1145/3200691.3178497","DOIUrl":null,"url":null,"abstract":"Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer from such variance. Performance variance not only causes unpredictable performance requirement violations, but also makes it unintuitive to understand the program behavior. Despite prior efforts, efficient on-line detection of performance variance remains an open problem. In this paper, we propose vSensor, a novel approach for light-weight and on-line performance variance detection. The key insight is that, instead of solely relying on an external detector, the source code of a program itself could reveal the runtime performance characteristics. Specifically, many parallel programs contain code snippets that are executed repeatedly with an invariant quantity of work. Based on this observation, we use compiler techniques to automatically identify these fixed-workload snippets and use them as performance variance sensors (v-sensors) that enable effective detection. We evaluate vSensor with a variety of parallel programs on the Tianhe-2 system. Results show that vSensor can effectively detect performance variance on HPC systems. The performance overhead is smaller than 4% with up to 16,384 processes. In particular, with vSensor, we found a bad node with slow memory that slowed a program's performance by 21%. As a showcase, we also detected a severe network performance problem that caused a 3.37X slowdown for an HPC kernel program on the Tianhe-2 system.","PeriodicalId":50923,"journal":{"name":"ACM Sigplan Notices","volume":"17 1","pages":"124 - 136"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Sigplan Notices","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3200691.3178497","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 0

Abstract

Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer from such variance. Performance variance not only causes unpredictable performance requirement violations, but also makes it unintuitive to understand the program behavior. Despite prior efforts, efficient on-line detection of performance variance remains an open problem. In this paper, we propose vSensor, a novel approach for light-weight and on-line performance variance detection. The key insight is that, instead of solely relying on an external detector, the source code of a program itself could reveal the runtime performance characteristics. Specifically, many parallel programs contain code snippets that are executed repeatedly with an invariant quantity of work. Based on this observation, we use compiler techniques to automatically identify these fixed-workload snippets and use them as performance variance sensors (v-sensors) that enable effective detection. We evaluate vSensor with a variety of parallel programs on the Tianhe-2 system. Results show that vSensor can effectively detect performance variance on HPC systems. The performance overhead is smaller than 4% with up to 16,384 processes. In particular, with vSensor, we found a bad node with slow memory that slowed a program's performance by 21%. As a showcase, we also detected a severe network performance problem that caused a 3.37X slowdown for an HPC kernel program on the Tianhe-2 system.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
在当前的大规模高性能计算系统中,性能差异变得越来越具有挑战性。即使使用固定数量的计算节点,几次运行的执行时间也会有很大差异。在超级计算机上执行的许多并行程序都存在这种差异。性能差异不仅会导致不可预测的性能需求违反,而且还会使理解程序行为变得不直观。尽管以前的努力,有效的在线检测性能差异仍然是一个悬而未决的问题。在本文中,我们提出了vSensor,一种轻量级和在线性能方差检测的新方法。关键的见解是,程序本身的源代码可以揭示运行时性能特征,而不是仅仅依赖于外部检测器。具体来说,许多并行程序包含重复执行的代码片段,其工作量是不变的。基于这种观察,我们使用编译器技术来自动识别这些固定工作负载片段,并将它们用作能够有效检测的性能差异传感器(v-sensor)。我们在天河二号系统上用各种并行程序对vSensor进行了评估。结果表明,vSensor可以有效地检测高性能计算系统的性能差异。在多达16,384个进程的情况下,性能开销小于4%。特别是在使用vSensor时,我们发现一个内存较慢的坏节点使程序的性能降低了21%。作为演示,我们还检测到一个严重的网络性能问题,该问题导致天河二号系统上的一个HPC内核程序的速度下降了3.37倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
ACM Sigplan Notices
ACM Sigplan Notices 工程技术-计算机:软件工程
CiteScore
4.90
自引率
0.00%
发文量
0
审稿时长
2-4 weeks
期刊介绍: The ACM Special Interest Group on Programming Languages explores programming language concepts and tools, focusing on design, implementation, practice, and theory. Its members are programming language developers, educators, implementers, researchers, theoreticians, and users. SIGPLAN sponsors several major annual conferences, including the Symposium on Principles of Programming Languages (POPL), the Symposium on Principles and Practice of Parallel Programming (PPoPP), the Conference on Programming Language Design and Implementation (PLDI), the International Conference on Functional Programming (ICFP), the International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), as well as more than a dozen other events of either smaller size or in-cooperation with other SIGs. The monthly "ACM SIGPLAN Notices" publishes proceedings of selected sponsored events and an annual report on SIGPLAN activities. Members receive discounts on conference registrations and free access to ACM SIGPLAN publications in the ACM Digital Library. SIGPLAN recognizes significant research and service contributions of individuals with a variety of awards, supports current members through the Professional Activities Committee, and encourages future programming language enthusiasts with frequent Programming Languages Mentoring Workshops (PLMW).
期刊最新文献
Outcomes of Endoscopic Drainage in Children with Pancreatic Fluid Collections: A Systematic Review and Meta-Analysis. Letter from the Chair SEIS Proceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management, ISMM 2018, Philadelphia, PA, USA, June 18, 2018 Proceedings of the 19th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES 2018, Philadelphia, PA, USA, June 19-20, 2018
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1