
Latest publications: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

ACR: Automatic checkpoint/restart for soft and hard error protection
Xiang Ni, Esteban Meneses, Nikhil Jain, L. Kalé
As machines increase in scale, many researchers have predicted that failure rates will correspondingly increase. Soft errors do not inhibit execution, but may silently generate incorrect results. Recent trends have shown that soft error rates are increasing, and hence they must be detected and handled to maintain correctness. We present a holistic methodology for automatically detecting and recovering from soft or hard faults with minimal application intervention. This is demonstrated by ACR: an automatic checkpoint/restart framework that performs application replication and automatically adapts the checkpoint period using online information about the current failure rate. ACR performs an application- and user-oblivious recovery. We empirically test ACR by injecting failures that follow different distributions for five applications and show low overhead when scaled to 131,072 cores. We also analyze the interaction between soft and hard errors and propose three recovery schemes that explore the trade-off between performance and reliability requirements.
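The abstract does not spell out how ACR derives the checkpoint period from the observed failure rate; a minimal sketch of the classic first-order rule such a system could build on is Young's approximation, τ ≈ √(2δM), where δ is the checkpoint cost and M the (online-estimated) mean time between failures. The function and the numbers below are illustrative, not ACR's implementation:

```python
import math

def optimal_checkpoint_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint period.

    checkpoint_cost_s: time to write one checkpoint (delta)
    mtbf_s: current mean time between failures, e.g. estimated online
            from failures observed so far.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: 30 s checkpoints on a machine currently failing every 4 hours.
print(f"{optimal_checkpoint_period(30.0, 4 * 3600.0):.0f} s")  # ~930 s

# As failures become more frequent, the period shrinks automatically:
print(f"{optimal_checkpoint_period(30.0, 1 * 3600.0):.0f} s")  # ~465 s
```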
Citations: 91
Effective sampling-driven performance tools for GPU-accelerated supercomputers
Milind Chabbi, K. Murthy, M. Fagan, J. Mellor-Crummey
Performance analysis of GPU-accelerated systems requires a system-wide view that considers both CPU and GPU components. In this paper, we describe how to extend system-wide, sampling-based performance analysis methods to GPU-accelerated systems. Since current GPUs do not support sampling, our implementation required careful coordination of instrumentation-based performance data collection on GPUs with sampling-based methods employed on CPUs. In addition, we also introduce a novel technique for analyzing systemic idleness in CPU/GPU systems. We demonstrate the effectiveness of our techniques with application case studies on Titan and Keeneland. Some of the highlights of our case studies are: 1) we improved performance for LULESH 1.0 by 30%, 2) we identified a hardware performance problem on Keeneland, 3) we identified a scaling problem in LAMMPS derived from CUDA initialization, and 4) we identified a performance problem that is caused by GPU synchronization operations that suffer delays due to blocking system calls.
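The paper's idleness analysis charges GPU idle time to the CPU code executing at that moment. The toy sketch below (not the authors' implementation) illustrates that blame-shifting idea by merging GPU busy intervals gathered via instrumentation with timer-based CPU samples:

```python
# Toy illustration: attribute GPU idle time to the CPU contexts
# that were sampled while no kernel was running.
from bisect import bisect_right

# (start_s, end_s) intervals during which some GPU kernel was executing,
# sorted and non-overlapping -- as collected via GPU instrumentation.
gpu_busy = [(0.0, 1.0), (3.0, 4.5), (6.0, 6.2)]

# (timestamp_s, cpu_context) pairs from timer-based CPU sampling.
cpu_samples = [(0.5, "solver"), (1.5, "mpi_wait"), (2.5, "mpi_wait"),
               (3.5, "solver"), (5.0, "cuda_init"), (5.5, "cuda_init")]

starts = [s for s, _ in gpu_busy]

def gpu_idle_at(t: float) -> bool:
    i = bisect_right(starts, t) - 1
    return i < 0 or t >= gpu_busy[i][1]

blame: dict[str, int] = {}
for t, ctx in cpu_samples:
    if gpu_idle_at(t):               # GPU was idle at this CPU sample
        blame[ctx] = blame.get(ctx, 0) + 1

# Each sample charges one sampling period of GPU idleness to a context.
print(blame)  # {'mpi_wait': 2, 'cuda_init': 2}
```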
Citations: 24
Scalable domain decomposition preconditioners for heterogeneous elliptic problems
P. Jolivet, F. Hecht, F. Nataf, C. Prud'homme
Domain decomposition methods are, alongside multigrid methods, one of the dominant paradigms in contemporary large-scale partial differential equation simulation. In this paper, a lightweight implementation of a theoretically and numerically scalable preconditioner is presented in the context of overlapping methods. The performance of this work is assessed by numerical simulations executed on thousands of cores, solving various highly heterogeneous elliptic problems in both 2D and 3D with billions of degrees of freedom. Such problems arise in computational science and engineering, in solid and fluid mechanics. While focusing on overlapping domain decomposition methods might seem too restrictive, we show how this work can be applied to a variety of other methods, such as non-overlapping methods and abstract deflation-based preconditioners. We also show how multilevel preconditioners can be used to avoid communication during an iterative process such as a Krylov method.
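The authors' preconditioner and its coarse space are not reproduced here; as a minimal sketch of the one-level overlapping additive Schwarz idea it builds on, the code below applies M⁻¹v = Σᵢ RᵢᵀAᵢ⁻¹Rᵢv for a 1D Laplacian split into overlapping subdomains (problem size and overlap are illustrative):

```python
# Minimal one-level additive Schwarz sketch on a 1D Laplacian; the
# paper adds a coarse space on top of this to make it scalable.
import numpy as np

n, nsub, overlap = 64, 4, 2
size = n // nsub
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian

# Overlapping index sets R_i for each subdomain.
subdomains = [np.arange(max(0, i * size - overlap),
                        min(n, (i + 1) * size + overlap))
              for i in range(nsub)]

def apply_schwarz(v: np.ndarray) -> np.ndarray:
    """M^{-1} v = sum_i R_i^T A_i^{-1} R_i v."""
    z = np.zeros_like(v)
    for idx in subdomains:
        Ai = A[np.ix_(idx, idx)]             # local subdomain operator
        z[idx] += np.linalg.solve(Ai, v[idx])
    return z

b = np.random.default_rng(0).standard_normal(n)
r = b - A @ apply_schwarz(b)                 # preconditioned residual
print(np.linalg.norm(r) / np.linalg.norm(b))
```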
Citations: 75
Load-balanced pipeline parallelism
M. Kamruzzaman, S. Swanson, D. Tullsen
Accelerating a single thread in current parallel systems remains a challenging problem, because sequential threads do not naturally take advantage of the additional cores. Recent work shows that automatic extraction of pipeline parallelism is an effective way to speed up single-thread execution. However, two problems remain challenging: load balancing and inter-thread communication. This work presents a new mechanism for exploiting pipeline parallelism that naturally solves the load-balancing and communication problems. This compiler-based technique automatically extracts the pipeline stages and executes them in a data-parallel fashion, using token-based chunked synchronization to handle sequential stages. The technique provides linear speedup for several applications, and outperforms prior techniques for exploiting pipeline parallelism by as much as 50%.
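A toy sketch of the execution model described above, assuming the simplest possible realization: each worker runs every stage on its own chunks in a data-parallel fashion, while a token (here a shared chunk counter) serializes the sequential stage in chunk order. This is an illustration of the idea, not the paper's compiler output:

```python
# Token-based chunked synchronization: all workers run every stage on
# their own chunks; a token forces the sequential stage into chunk order.
import threading

chunks = [list(range(i * 4, (i + 1) * 4)) for i in range(8)]
results, log = [0] * len(chunks), []
token, cond = 0, threading.Condition()

def worker(chunk_ids):
    global token
    for cid in chunk_ids:
        partial = sum(x * x for x in chunks[cid])   # parallel stage
        with cond:                                  # sequential stage:
            cond.wait_for(lambda: token == cid)     # wait for the token
            log.append(cid)                         # runs in chunk order
            results[cid] = partial
            token += 1                              # pass the token on
            cond.notify_all()

threads = [threading.Thread(target=worker, args=([0, 2, 4, 6],)),
           threading.Thread(target=worker, args=([1, 3, 5, 7],))]
for t in threads: t.start()
for t in threads: t.join()
print(log)  # [0, 1, 2, ..., 7] -- sequential stage kept in order
```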
Citations: 11
Kinetic turbulence simulations at extreme scale on leadership-class systems
Bei Wang, S. Ethier, W. Tang, T. Williams, K. Ibrahim, Kamesh Madduri, Samuel Williams, L. Oliker
Reliable predictive simulation capability addressing confinement properties in magnetically confined fusion plasmas is critically important for ITER, a 20-billion-dollar international burning plasma device under construction in France. The complex study of kinetic turbulence, which can severely limit the energy confinement and impact the economic viability of fusion systems, requires simulations at extreme scale for such an unprecedented device size. Our newly optimized, global, ab initio particle-in-cell code, which solves the nonlinear equations underlying gyrokinetic theory, achieves excellent performance with respect to “time to solution” at the full capacity of the IBM Blue Gene/Q, on the 786,432 cores of Mira at ALCF and, recently, on the 1,572,864 cores of Sequoia at LLNL. Recent multithreading and domain decomposition optimizations in the new GTC-P code represent critically important software advances for modern, low-memory-per-core systems, enabling routine simulations at unprecedented size (130 million grid points at ITER scale) and resolution (65 billion particles).
Citations: 24
Compiling affine loop nests for distributed-memory parallel architectures
Uday Bondhugula
We present new techniques for compiling arbitrarily nested loops with affine dependences for distributed-memory parallel architectures. Our framework is implemented as a source-level transformer that uses the polyhedral model and generates parallel code whose communication is expressed with the Message Passing Interface (MPI) library. Compared to all previous approaches, ours is a significant advance in (1) the generality of input code handled, (2) the efficiency of the communication code, or both. We provide experimental results on a cluster of multicores demonstrating its effectiveness. In some cases, the code we generate outperforms manually parallelized code, and in another case it comes within 25% of it. To the best of our knowledge, this is the first work reporting end-to-end, fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those we allow.
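The communication code such a compiler generates is driven by dependence analysis: for each rank it must determine the "flow-out" set of array elements that other ranks' iterations read. The sketch below computes such a set by hand for a block-distributed 1D Jacobi update; the owner-computes distribution and helper names are illustrative, not the paper's algorithm:

```python
# Toy sketch: the "flow-out" set a distributed-memory code generator
# derives for a block-distributed 1D Jacobi update
#   for i in 1..n-2: b[i] = (a[i-1] + a[i+1]) / 2
# Rank r owns a[lb(r):ub(r)] and the iterations writing that block; it
# must send the boundary elements that other ranks' iterations read.
n, nranks = 16, 4
size = n // nranks
own = lambda r: (r * size, (r + 1) * size)   # [lb, ub) owned by rank r

def flow_out(r: int) -> dict[int, list[int]]:
    lb, ub = own(r)
    sends: dict[int, list[int]] = {}
    for i in range(1, n - 1):                # every iteration of the loop
        reader = min(i // size, nranks - 1)  # rank executing iteration i
        if reader == r:
            continue
        for j in (i - 1, i + 1):             # iteration i reads a[i-1], a[i+1]
            if lb <= j < ub:                 # we own it, the reader does not
                sends.setdefault(reader, []).append(j)
    return sends

for r in range(nranks):
    print(r, flow_out(r))
# rank 0 sends a[3] to rank 1; rank 1 sends a[4] to rank 0 and a[7]
# to rank 2; and so on for the remaining block boundaries.
```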
Citations: 70
Taming parallel I/O complexity with auto-tuning
Babak Behzad, Huong Vu, Thanh Luu, Joseph Huchette, S. Byna, R. Aydt, Q. Koziol, M. Snir
We present an auto-tuning system for optimizing I/O performance of HDF5 applications and demonstrate its value across platforms, applications, and at scale. The system uses a genetic algorithm to search a large space of tunable parameters and to identify effective settings at all layers of the parallel I/O stack. The parameter settings are applied transparently by the auto-tuning system via dynamically intercepted HDF5 calls. To validate our auto-tuning system, we applied it to three I/O benchmarks (VPIC, VORPAL, and GCRM) that replicate the I/O activity of their respective applications. We tested the system with different weak-scaling configurations (128, 2048, and 4096 CPU cores) that generate 30 GB to 1 TB of data, and executed these configurations on diverse HPC platforms (Cray XE6, IBM BG/P, and Dell Cluster). In all cases, the auto-tuning framework identified tunable parameters that substantially improved write performance over default system settings. We consistently demonstrate I/O write speedups between 2× and 100× for test configurations.
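The framework's actual parameter space spans all layers of the parallel I/O stack; the sketch below shows only the genetic-algorithm skeleton over a hypothetical space, with a deterministic stub standing in for a measured benchmark run (the parameter names and the fitness stub are illustrative, not the framework's API):

```python
import random

# Hypothetical tunables at three layers of the I/O stack (illustrative).
SPACE = {
    "stripe_count":   [4, 8, 16, 32, 64],   # parallel file system
    "stripe_size_mb": [1, 4, 16, 64],
    "cb_nodes":       [8, 16, 32, 64],      # MPI-IO collective buffering
    "alignment_kb":   [64, 256, 1024],      # HDF5 layer
}

def fitness(s):
    # Stand-in for "run the benchmark with these settings, measure GB/s".
    return (0.1 * s["stripe_count"] + 0.02 * s["cb_nodes"]
            - 0.05 * abs(s["stripe_size_mb"] - 16))

def mutate(s):
    s = dict(s)
    k = random.choice(list(SPACE))
    s[k] = random.choice(SPACE[k])
    return s

def crossover(a, b):
    return {k: random.choice((a[k], b[k])) for k in SPACE}

pop = [{k: random.choice(v) for k, v in SPACE.items()} for _ in range(20)]
for _ in range(10):                        # evolve for ten generations
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                     # selection: keep the fittest half
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(10)]
print("best setting:", max(pop, key=fitness))
```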
Citations: 114
Parallelizing the execution of sequential scripts
Zhao Zhang, D. Katz, Timothy G. Armstrong, J. Wozniak, Ian T Foster
Scripting is often used in science to create applications via the composition of existing programs. Parallel scripting systems allow the creation of such applications, but each system introduces the need to adopt a somewhat specialized programming model. We present an alternative scripting approach, AMFS Shell, that lets programmers express parallel scripting applications via minor extensions to existing sequential scripting languages, such as Bash, and then execute them in-memory on large-scale computers. We define a small set of commands between the scripts and a parallel scripting runtime system, so that programmers can compose their scripts in a familiar scripting language. The underlying AMFS implements both collective (fast file movement) and functional (transformation based on content) file management. Tasks are handled by AMFS's built-in execution engine. AMFS Shell is expressive enough for a wide range of applications, and the framework can run such applications efficiently on large-scale computers.
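AMFS Shell's command set is not reproduced here; as a rough analogy (in Python rather than Bash) for what such a runtime does with independent script tasks, the sketch below runs an ordinary per-file command concurrently and collects the outputs. All file and helper names are hypothetical:

```python
# Rough analogy, not AMFS Shell itself: run independent shell-style
# tasks concurrently and collect their outputs.
import subprocess
from concurrent.futures import ThreadPoolExecutor

inputs = [f"part-{i:04d}.txt" for i in range(8)]    # hypothetical inputs
for p in inputs:                                    # create sample data
    with open(p, "w") as f:
        f.write("one line of data\n")

def task(path: str) -> str:
    # Each task is an ordinary sequential composition of programs.
    out = path + ".wc"
    with open(out, "w") as f:
        subprocess.run(["wc", "-l", path], stdout=f, check=True)
    return out

# The runtime's job: execute independent tasks in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(task, inputs))
print(outputs)
```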
Citations: 22
A ‘cool’ way of improving the reliability of HPC machines
O. Sarood, Esteban Meneses, L. Kalé
Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
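The rule of thumb quoted above, fault rates doubling per 10°C, corresponds to MTBF(T) = MTBF(T_ref) · 2^((T_ref − T)/10). A minimal sketch of the implied reliability gain from restraining core temperature:

```python
# Fault rates double per 10 degC, so restraining core temperature by
# dT degC improves MTBF by a factor of 2**(dT/10).
def reliability_factor(temp_drop_c: float) -> float:
    return 2.0 ** (temp_drop_c / 10.0)

# Example: cooling cores by 12 degC more than doubles the MTBF,
# in the ballpark of the paper's reported 2.3x improvement.
print(f"{reliability_factor(12.0):.2f}x")   # 2.30x
```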
Citations: 44
Scalable virtual machine deployment using VM image caches
Kaveh Razavi, T. Kielmann
In IaaS clouds, VM startup times are frequently perceived as slow, negatively impacting both dynamic scaling of web applications and the startup of high-performance computing applications consisting of many VM nodes. A significant part of the startup time is due to the large transfers of VM image content from a storage node to the actual compute nodes, even when copy-on-write schemes are used. We have observed that only a tiny part of the VM image is needed for the VM to be able to start up. Based on this observation, we propose using small caches for VM images to overcome the VM startup bottlenecks. We have implemented such caches as an extension to KVM/QEMU. Our evaluation with up to 64 VMs shows that using our caches reduces the time needed for simultaneous VM startups to that of a single VM.
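The KVM/QEMU extension itself is not shown in the abstract; the sketch below is a conceptual block-level LRU cache in front of a remote image, the kind of small cache that can serve the few blocks a VM actually touches at boot (block size and API are illustrative):

```python
# Conceptual sketch: a small block-level LRU cache in front of a remote
# VM image. Only the few blocks a VM touches at boot need to be cached.
from collections import OrderedDict

BLOCK = 64 * 1024  # 64 KiB cache blocks (illustrative)

class ImageCache:
    def __init__(self, backing_read, capacity_blocks: int):
        self.backing_read = backing_read   # fetches one block from storage
        self.capacity = capacity_blocks
        self.blocks: OrderedDict[int, bytes] = OrderedDict()

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        for bno in range(offset // BLOCK, (offset + length - 1) // BLOCK + 1):
            if bno in self.blocks:
                self.blocks.move_to_end(bno)         # most recently used
            else:
                self.blocks[bno] = self.backing_read(bno)
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict LRU block
            out += self.blocks[bno]
        start = offset - (offset // BLOCK) * BLOCK
        return bytes(out[start:start + length])

# Usage: back the cache with a (stub) remote fetch of zeroed blocks.
cache = ImageCache(lambda bno: bytes(BLOCK), capacity_blocks=256)
data = cache.read(4096, 123)   # first read fetches; later reads hit cache
print(len(data))               # 123
```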
Citations: 51