Durango: Scalable Synthetic Workload Generation for Extreme-Scale Application Performance Modeling and Simulation
C. Carothers, J. Meredith, Mark P. Blanco, J. Vetter, M. Mubarak, Justin M. LaPre, S. Moore
Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, published 2017-05-16
DOI: 10.1145/3064911.3064923
Citations: 18
Abstract
Performance modeling of extreme-scale applications on accurate representations of potential architectures is critical for designing next-generation supercomputing systems, because it is impractical to construct prototype systems at scale with new network hardware in order to explore designs and policies. However, these simulations often rely on static application traces, which can be difficult to work with because of their size and because they cannot be extended or scaled up without rerunning the original application. To address this problem, we have created a new technique for generating scalable, flexible workloads from real applications and have implemented a prototype, called Durango, that combines a proven analytical performance modeling language, Aspen, with the massively parallel HPC network modeling capabilities of the CODES framework. Our models are compact, parameterized, and representative of real applications with computation events. They are not resource-intensive to create and are portable across simulator environments. We demonstrate the utility of Durango by simulating the LULESH application in the CODES simulation environment on several topologies and show that Durango is practical to use for simulation without loss of fidelity, as quantified by simulation metrics. During our validation of Durango's generated communication model of LULESH, we found that the original LULESH miniapp code had a latent bug in which the MPI_Waitall operation was used incorrectly. This finding underscores the potential need for a tool such as Durango, beyond its benefits for flexible workload generation and modeling. Additionally, we demonstrate the efficacy of Durango's direct integration approach, which links Aspen into CODES as part of the running network simulation model. Here, Aspen generates the application-level computation timing events, which in turn drive the start of a network communication phase. Results show that Durango's performance scales well when executing both torus and dragonfly network models on up to 4K Blue Gene/Q nodes using 32K MPI ranks. Durango also avoids the overheads and complexities associated with extreme-scale trace files.
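The idea of a compact, parameterized workload in which computation timing events drive the start of each communication phase can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `Compute`, `Send`, `compute_seconds`, and `ring_workload` are hypothetical names and are not part of Aspen, CODES, or Durango; the flop-count formula is a stand-in for an analytical compute model; and the ring exchange is a placeholder pattern, not the LULESH communication pattern.

```python
from dataclasses import dataclass
from typing import Iterator, Union


@dataclass(frozen=True)
class Compute:
    """A computation timing event for one rank (hypothetical type)."""
    rank: int
    seconds: float


@dataclass(frozen=True)
class Send:
    """A point-to-point message in a communication phase (hypothetical type)."""
    src: int
    dst: int
    nbytes: int


Event = Union[Compute, Send]


def compute_seconds(elements_per_rank: int,
                    flops_per_element: float,
                    flops_per_second: float) -> float:
    # Assumed stand-in for an analytical model: total flops / machine rate.
    return elements_per_rank * flops_per_element / flops_per_second


def ring_workload(num_ranks: int,
                  iterations: int,
                  elements_per_rank: int,
                  msg_bytes: int,
                  flops_per_element: float = 100.0,
                  flops_per_second: float = 1e9) -> Iterator[Event]:
    """Generate events on demand instead of reading them from a trace file."""
    t = compute_seconds(elements_per_rank, flops_per_element, flops_per_second)
    for _ in range(iterations):
        for rank in range(num_ranks):
            # The compute event comes first; the sends that follow it
            # represent the communication phase it triggers.
            yield Compute(rank, t)
            yield Send(rank, (rank + 1) % num_ranks, msg_bytes)
            yield Send(rank, (rank - 1) % num_ranks, msg_bytes)


# Scaling the workload up means changing parameters, not rerunning an application.
events = list(ring_workload(num_ranks=4, iterations=2,
                            elements_per_rank=1000, msg_bytes=8192))
```

Because the generator is driven only by its parameters, the same description could in principle be replayed at a different rank count or problem size, which is the flexibility that static traces lack.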