
Latest publications: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Guide-copy: Fast and silent migration of virtual machine for datacenters
Jihun Kim, Dongju Chae, Jangwoong Kim, Jong Kim
Cloud infrastructure providers deploy Dynamic Resource Management (DRM) to minimize the cost of datacenter operation, while maintaining the Service Level Agreement (SLA). Such DRM schemes depend on the capability to migrate virtual machine (VM) images. However, existing migration techniques are not suitable for highly utilized clouds due to their latency- and bandwidth-critical memory transfer mechanisms. In this paper, we propose guide-copy migration, a novel VM migration scheme to provide a fast and silent migration, which works nicely under highly utilized clouds. The guide-copy migration transfers only the memory pages accessed at the destination node in the near future by running a guide version of the VM at the source node and a migrated VM at the destination node simultaneously during the migration. The guide-copy migration's highly accurate and low-bandwidth memory transfer mechanism enables a fast and silent VM migration to maintain the SLA of all VMs in the cloud.
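The page-selection idea can be sketched in a few lines: run a "guide" copy of the workload and transfer only the pages it actually touches, rather than the whole dirty set. The following toy simulation is illustrative only; the page counts, trace, and function names are invented and are not from the paper's implementation:

```python
# Toy simulation of the guide-copy idea: during migration, only pages that the
# "guide" execution touches are transferred, instead of every dirty page.

def guide_copy_transfer(all_pages, guide_access_trace):
    """Return the set of pages the guide run predicts the destination will need."""
    needed = set()
    for page in guide_access_trace:   # pages touched while the guide VM runs
        if page in all_pages:
            needed.add(page)
    return needed

# A VM with 1000 memory pages, but the guide run touches only a small working set.
all_pages = set(range(1000))
trace = [5, 17, 5, 42, 999, 17]      # hypothetical access trace
transferred = guide_copy_transfer(all_pages, trace)

print(len(transferred), len(all_pages))   # far fewer pages move than a full copy
```

The point of the sketch is the ratio: the transfer set is bounded by the guide run's working set, not by total VM memory.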
DOI: 10.1145/2503210.2503251
Citations: 16
Practical nonvolatile multilevel-cell phase change memory
D. Yoon, Jichuan Chang, R. Schreiber, N. Jouppi
Multilevel-cell (MLC) phase change memory (PCM) may provide both high capacity main memory and faster-than-Flash persistent storage. But slow growth in cell resistance with time, resistance drift, can cause transient errors in MLC-PCM. Drift errors increase with time, and prior work suggests refresh before the cell loses data. The need for refresh makes MLC-PCM volatile, taking away a key advantage. Based on the observation that most drift errors occur in a particular state in four-level-cell PCM, we propose to change from four levels to three levels, eliminating the most vulnerable state. This simple change lowers cell drift error rates by many orders of magnitude: three-level-cell PCM can retain data without power for more than ten years. With optimized encoding/decoding and a wearout tolerance mechanism, we can narrow the capacity gap between three-level and four-level cells. These techniques together enable low-cost, high-performance, genuinely nonvolatile MLC-PCM.
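The capacity trade-off behind the three-level proposal is easy to quantify: dropping one of four resistance levels costs log2(4) − log2(3) ≈ 0.415 bits per cell, about a 21% raw loss before encoding. A minimal sketch, with an illustrative base-3 packing that is not the paper's optimized encoding:

```python
import math

# Capacity cost of going from four resistance levels to three.
bits_4lc = math.log2(4)          # 2.0 bits per cell
bits_3lc = math.log2(3)          # ~1.585 bits per cell
gap = 1 - bits_3lc / bits_4lc    # ~20.8% raw capacity loss

# A minimal base-3 packing (illustrative only): store integers as ternary
# digits, i.e. sequences over the three retained resistance levels.
def to_ternary(value, n_cells):
    digits = []
    for _ in range(n_cells):
        digits.append(value % 3)
        value //= 3
    return digits

def from_ternary(digits):
    value = 0
    for d in reversed(digits):
        value = value * 3 + d
    return value

print(round(gap * 100, 1), from_ternary(to_ternary(12345, 10)))
```

The optimized encoding/decoding mentioned in the abstract exists precisely to claw back part of this ~21% gap.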
DOI: 10.1145/2503210.2503221
Citations: 32
Solving the compressible Navier-Stokes equations on up to 1.97 million cores and 4.1 trillion grid points
I. Bermejo-Moreno, J. Bodart, J. Larsson, Blaise M. Barney, J. Nichols, Steve Jones
We present weak and strong scaling studies as well as performance analyses of the Hybrid code, a finite-difference solver of the compressible Navier-Stokes equations on structured grids used for the direct numerical simulation of isotropic turbulence and its interaction with shock waves. Parallelization is achieved through MPI, emphasizing the use of nonblocking communication with concurrent computation. The simulations, scaling and performance studies were done on the Sequoia, Vulcan and Vesta Blue Gene/Q systems, the first two accounting for a total of 1,966,080 cores when used in combination. The maximum number of grid points simulated was 4.12 trillion, with a memory usage of approximately 1.6 PB. We discuss the use of hyperthreading, which significantly improves the parallel performance of the code on this architecture.
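The headline numbers imply a concrete per-core load, which the following back-of-the-envelope script works out using only the figures quoted in the abstract:

```python
# Per-core load implied by the abstract's numbers.
grid_points = 4.12e12            # maximum grid points simulated
cores = 1_966_080                # Sequoia + Vulcan combined
memory_bytes = 1.6e15            # ~1.6 PB total memory usage

points_per_core = grid_points / cores
bytes_per_point = memory_bytes / grid_points

print(round(points_per_core))    # ~2.1 million grid points per core
print(round(bytes_per_point))    # a few hundred bytes of state per grid point
```

At roughly 2 million points and under a gigabyte of state per core, the computation is communication-sensitive, which is why the abstract emphasizes nonblocking MPI overlapped with computation.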
DOI: 10.1145/2503210.2503265
Citations: 57
Deterministic scale-free pipeline parallelism with hyperqueues
H. Vandierendonck, Kallia Chronaki, Dimitrios S. Nikolopoulos
Ubiquitous parallel computing aims to make parallel programming accessible to a wide variety of programming areas using deterministic and scale-free programming models built on a task abstraction. However, it remains hard to reconcile these attributes with pipeline parallelism, where the number of pipeline stages is typically hard-coded in the program and defines the degree of parallelism. This paper introduces hyperqueues, a programming abstraction that enables the construction of deterministic and scale-free pipeline parallel programs. Hyperqueues extend the concept of Cilk++ hyperobjects to provide thread-local views on a shared data structure. While hyperobjects are organized around private local views, hyperqueues require shared concurrent views on the underlying data structure. We define the semantics of hyperqueues and describe their implementation in a work-stealing scheduler. We demonstrate scalable performance on pipeline-parallel PARSEC benchmarks and find that hyperqueues provide comparable or up to 30% better performance than POSIX threads and Intel's Threading Building Blocks. The latter are highly tuned to the number of available processing cores, while programs using hyperqueues are scale-free.
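The baseline pattern that hyperqueues generalize can be sketched with ordinary thread-safe queues. Note how the stage count is hard-coded here, which is exactly the limitation the abstract describes; hyperqueues themselves are a Cilk++-style abstraction, so this Python sketch shows only the baseline for comparison:

```python
import queue
import threading

# A minimal deterministic two-stage pipeline over plain thread-safe queues.
# One thread per stage makes output order deterministic; the stage count is
# fixed in the code, which is the scalability limitation hyperqueues remove.

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:          # sentinel: shut the stage down and propagate
            q_out.put(None)
            break
        q_out.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2)).start()
threading.Thread(target=stage, args=(lambda x: x * 2, q2, q3)).start()

for x in range(5):
    q1.put(x)
q1.put(None)

results = []
while (item := q3.get()) is not None:
    results.append(item)
print(results)    # [2, 4, 6, 8, 10] — same order on every run
```

A hyperqueue, by contrast, gives each worker a local view on the shared queue, so the runtime can use however many workers are available without changing the program.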
DOI: 10.1145/2503210.2503233
Citations: 12
Assessing the effects of data compression in simulations using physically motivated metrics
D. Laney, S. Langer, Christopher Weber, Peter Lindstrom, Al Wegener
This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3-5X can be applied without causing significant changes to important physical quantities. Rather than applying signal processing error metrics, we utilize physics-based metrics appropriate for each code to assess the impact of compression. We evaluate three different simulation codes: a Lagrangian shock-hydrodynamics code, an Eulerian higher-order hydrodynamics turbulence modeling code, and an Eulerian coupled laser-plasma interaction code. We compress relevant quantities after each time-step to approximate the effects of tightly coupled compression and study the compression rates to estimate memory and disk-bandwidth reduction. We find that the error characteristics of compression algorithms must be carefully considered in the context of the underlying physics being modeled.
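The distinction between signal-processing metrics and physics-based metrics can be illustrated with a toy quantizer: a pointwise error bound says little about how well a conserved quantity such as total energy survives. The data and the quantizer below are invented for illustration; the paper evaluates real compressors against real simulation codes:

```python
import numpy as np

# Contrast a signal-processing metric (max pointwise error) with a
# physics-motivated one (conservation of a summed quantity) after lossy
# quantization of a synthetic field.

rng = np.random.default_rng(0)
field = rng.normal(loc=10.0, scale=1.0, size=100_000)   # e.g. an energy density

def quantize(x, bits):
    """Uniform quantization: a stand-in for roughly (64/bits)x compression."""
    lo, hi = x.min(), x.max()
    levels = 2 ** bits
    q = np.round((x - lo) / (hi - lo) * (levels - 1))
    return lo + q / (levels - 1) * (hi - lo)

decompressed = quantize(field, bits=16)     # ~4x vs float64 storage

pointwise = np.max(np.abs(decompressed - field))                   # SP view
energy_err = abs(decompressed.sum() - field.sum()) / field.sum()   # physics view

print(pointwise, energy_err)
```

Because quantization errors largely cancel in the sum, the conserved total is perturbed far less than the pointwise bound suggests, which is the sense in which physics-based metrics can justify more aggressive compression.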
DOI: 10.1145/2503210.2503283
Citations: 57
Predicting application performance using supervised learning on communication features
Nikhil Jain, A. Bhatele, Michael P. Robson, T. Gamblin, L. Kalé
Task mapping on torus networks has traditionally focused on either reducing the maximum dilation or average number of hops per byte for messages in an application. These metrics make simplified assumptions about the cause of network congestion, and do not provide accurate correlation with execution time. Hence, these metrics cannot be used to reasonably predict or compare application performance for different mappings. In this paper, we attempt to model the performance of an application using communication data, such as the communication graph and network hardware counters. We use supervised learning algorithms, such as randomized decision trees, to correlate performance with prior and new metrics. We propose new hybrid metrics that provide high correlation with application performance, and may be useful for accurate performance prediction. For three different communication patterns and a production application, we demonstrate a very strong correlation between the proposed metrics and the execution time of these codes.
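Why a hybrid metric can correlate with runtime better than hops or bytes alone can be shown on synthetic data. The cost model below is invented for illustration and is not the paper's model; the paper uses hardware counters and randomized decision trees rather than a closed-form formula:

```python
import numpy as np

# Synthetic demonstration: if congestion grows with bytes travelling far,
# then neither average hops nor bytes alone tracks runtime as well as their
# product (a simple "hybrid" metric).

rng = np.random.default_rng(1)
n = 200
avg_hops = rng.uniform(1, 8, n)          # per-message average hop count
bytes_sent = rng.uniform(1e3, 1e6, n)    # per-rank communication volume

# Hypothetical ground-truth cost model, plus measurement noise.
exec_time = 1.0 + 2e-6 * bytes_sent * avg_hops + rng.normal(0, 0.1, n)

def corr(metric):
    return np.corrcoef(metric, exec_time)[0, 1]

c_hops = corr(avg_hops)
c_bytes = corr(bytes_sent)
c_hybrid = corr(bytes_sent * avg_hops)
print(c_hops, c_bytes, c_hybrid)   # the hybrid metric correlates best
```

The same logic motivates feeding several such derived features to a supervised learner instead of committing to one metric in advance.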
DOI: 10.1145/2503210.2503263
Citations: 47
Parallel design and performance of nested filtering factorization preconditioner
Long Qu, L. Grigori, F. Nataf
We present the parallel design and performance of the nested filtering factorization preconditioner (NFF), which can be used for solving linear systems arising from the discretization of a system of PDEs on unstructured grids. NFF has limited memory requirements, and it is based on a two-level recursive decomposition that exploits a nested block arrow structure of the input matrix, obtained beforehand by using graph partitioning techniques. It also preserves several directions of interest of the input matrix to alleviate the effect of low-frequency modes on the convergence of iterative methods. For a boundary value problem with highly heterogeneous coefficients, discretized on three-dimensional grids with 64 million unknowns and 447 million nonzero entries, we show experimentally that NFF scales up to 2048 cores of Genci's Bull system (Curie), and it is up to 2.6 times faster than the domain decomposition preconditioner Restricted Additive Schwarz implemented in PETSc.
DOI: 10.1145/2503210.2503287
Citations: 8
Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach
Dong Li, Zizhong Chen, Panruo Wu, J. Vetter
Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many widely-used scientific computing kernels. However, in the context of the resilience ecosystem, ABFT is completely opaque to any underlying hardware resilience mechanisms. As a result, some data structures are over-protected by ABFT and hardware, which leads to redundant costs in terms of performance and energy. In this paper, we rethink ABFT using an integrated view including both software and hardware with the goal of improving performance and energy efficiency of ABFT-enabled applications. In particular, we study how to coordinate ABFT and error-correcting code (ECC) for main memory, and investigate the impact of this coordination on performance, energy, and resilience for ABFT-enabled applications. Scaling tests and analysis indicate that our approach saves up to 25% for system energy (and up to 40% for dynamic memory energy) with up to 18% performance improvement over traditional approaches of ABFT with ECC.
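The software half of ABFT can be illustrated with the classic checksum-carrying matrix multiply (in the style of Huang and Abraham), which detects, locates, and corrects a single corrupted entry of the product. The paper's contribution is coordinating such schemes with hardware ECC, which this sketch does not model:

```python
import numpy as np

# ABFT for C = A @ B: append a column-checksum row to A and a row-checksum
# column to B, multiply, then use checksum mismatches to locate and repair
# a single faulty entry of the product.

n = 4
rng = np.random.default_rng(2)
A = rng.integers(0, 10, (n, n)).astype(float)
B = rng.integers(0, 10, (n, n)).astype(float)

Ac = np.vstack([A, A.sum(axis=0)])                  # (n+1, n): extra checksum row
Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # (n, n+1): extra checksum col
Cf = Ac @ Br                                        # (n+1, n+1) checksum product

Cf[1, 2] += 7.0                                     # inject a single fault

# A mismatch in column checksums pinpoints the faulty column, and in row
# checksums the faulty row; the fault sits at their intersection.
bad_cols = np.flatnonzero(np.abs(Cf[:n, :n].sum(axis=0) - Cf[n, :n]) > 1e-6)
bad_rows = np.flatnonzero(np.abs(Cf[:n, :n].sum(axis=1) - Cf[:n, n]) > 1e-6)
i, j = bad_rows[0], bad_cols[0]

# Repair from the column checksum: true value = checksum minus the others.
Cf[i, j] = Cf[n, j] - (Cf[:n, j].sum() - Cf[i, j])

print((i, j), np.allclose(Cf[:n, :n], A @ B))
```

The appeal of ABFT is that this protection rides along with the computation at O(n^2) extra work for an O(n^3) kernel, which is why double-protecting the same data with ECC as well can be wasteful.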
DOI: 10.1145/2503210.2503226
Citations: 47
Exploring the future of out-of-core computing with compute-local non-volatile memory
Myoungsoo Jung, E. Wilson, Wonil Choi, J. Shalf, H. Aktulga, Chao Yang, Erik Saule, Ümit V. Çatalyürek, M. Kandemir
Drawing parallels to the rise of general purpose graphical processing units (GPGPUs) as accelerators for specific high-performance computing (HPC) workloads, there is a rise in the use of non-volatile memory (NVM) as accelerators for I/O-intensive scientific applications. However, existing works have explored use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to out-pace point-to-point network capacity, we argue for the need to break from the archetype of completely separated storage. Therefore, in this work we investigate co-location of NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks implicit in these various levels in the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world Out-of-Core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.
DOI: 10.1145/2503210.2503261
Citations: 25
Scalable parallel graph partitioning
Shad Kirmani, P. Raghavan
We consider partitioning a graph in parallel using a large number of processors. Parallel multilevel partitioners, such as Pt-Scotch and ParMetis, produce good quality partitions but their performance scales poorly. Coordinate bisection schemes such as those in Zoltan, which can be applied only to graphs with coordinates, scale well but partition quality is often compromised. We seek to address this gap by developing a scalable parallel scheme which imparts coordinates to a graph through a lattice-based multilevel embedding. Partitions are computed with a parallel formulation of a geometric scheme that has been shown to provide provably good cuts on certain classes of graphs. We analyze the parallel complexity of our scheme and we observe speed-ups and cut-sizes on large graphs. Our results indicate that our method is substantially faster than ParMetis and Pt-Scotch for hundreds to thousands of processors, while producing high quality cuts.
{"title":"Scalable parallel graph partitioning","authors":"Shad Kirmani, P. Raghavan","doi":"10.1145/2503210.2503280","DOIUrl":"https://doi.org/10.1145/2503210.2503280","url":null,"abstract":"We consider partitioning a graph in parallel using a large number of processors. Parallel multilevel partitioners, such as Pt-Scotch and ParMetis, produce good quality partitions but their performance scales poorly. Coordinate bisection schemes such as those in Zoltan, which can be applied only to graphs with coordinates, scale well but partition quality is often compromised. We seek to address this gap by developing a scalable parallel scheme which imparts coordinates to a graph through a lattice-based multilevel embedding. Partitions are computed with a parallel formulation of a geometric scheme that has been shown to provide provably good cuts on certain classes of graphs. We analyze the parallel complexity of our scheme and we observe speed-ups and cut-sizes on large graphs. Our results indicate that our method is substantially faster than ParMetis and Pt-Scotch for hundreds to thousands of processors, while producing high quality cuts.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125448954","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29