Lessons learned at 208K: Towards debugging millions of cores

Gregory L. Lee, D. Ahn, D. Arnold, B. Supinski, M. LeGendre, B. Miller, M. Schulz, B. Liblit
Venue: 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
Published: 2008-11-15
DOI: 10.1109/SC.2008.5218557 (https://doi.org/10.1109/SC.2008.5218557)
Citations: 53

Abstract

Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and process application data. In addition, at such scales, each tool itself will become a large parallel application; already, debugging the full BlueGene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results at up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
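
To make the idea of merging stack traces into process equivalence classes concrete, the following is a minimal single-process sketch in Python. It is not STAT's implementation, and it does not model the tool daemons or the communication infrastructure mentioned in the abstract. The function names (`merge_stack_traces`, `equivalence_classes`), the data layout, and the sample traces are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def merge_stack_traces(traces):
    """Merge per-rank stack traces into a call-prefix tree.

    traces: dict mapping rank (int) -> list of frame names, outermost first.
    Returns a nested dict: frame -> {"ranks": set of ranks, "children": {...}}.
    """
    root = {}
    for rank, frames in traces.items():
        level = root
        for frame in frames:
            node = level.setdefault(frame, {"ranks": set(), "children": {}})
            node["ranks"].add(rank)  # this rank passed through this frame
            level = node["children"]
    return root


def equivalence_classes(traces):
    """Group ranks whose complete call paths are identical."""
    classes = defaultdict(list)
    for rank, frames in sorted(traces.items()):
        classes[tuple(frames)].append(rank)
    return dict(classes)


# Hypothetical traces from a four-process job (illustrative data only).
traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}

tree = merge_stack_traces(traces)
classes = equivalence_classes(traces)

for path, ranks in classes.items():
    print(" > ".join(path), "ranks:", ranks)
```

In this sketch, ranks 0, 1, and 3 share one call path while rank 2 stands alone, so a developer could attach a full debugger to one representative of each class rather than to every process.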