
Latest publications from SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

A System Software Approach to Proactive Memory-Error Avoidance
Carlos H. A. Costa, Yoonho Park, Bryan S. Rosenburg, Chen-Yong Cher, K. D. Ryu
Today's HPC systems use two mechanisms to address main-memory errors. Error-correcting codes make correctable errors transparent to software, while checkpoint/restart (CR) enables recovery from uncorrectable errors. Unfortunately, CR overhead will be enormous at exascale due to the high failure rate of memory. We propose a new OS-based approach that proactively avoids memory errors using prediction. This scheme exposes correctable error information to the OS, which migrates pages and off-lines unhealthy memory to avoid application crashes. We analyze memory error patterns in extensive logs from a BG/P system and show how correctable error patterns can be used to identify memory likely to fail. We implement a proactive memory management system on BG/Q by extending the firmware and Linux. We evaluate our approach with a realistic workload and compare our overhead against CR. We show improved resilience with negligible performance overhead for applications.
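The OS-level policy the abstract describes (watch correctable-error reports, migrate pages away from unhealthy frames, retire those frames) can be sketched roughly as follows; the threshold and the frame pool are illustrative stand-ins, not the paper's values:

```python
from collections import defaultdict

# Hypothetical policy parameters; the paper derives its predictor from BG/P logs.
CE_THRESHOLD = 3                  # correctable errors before a frame is deemed unhealthy
HEALTHY_FRAMES = set(range(8))    # toy pool of physical frames

class ProactivePageManager:
    """Toy model of OS page migration driven by correctable-error reports."""
    def __init__(self, page_to_frame):
        self.page_to_frame = dict(page_to_frame)   # virtual page -> physical frame
        self.ce_count = defaultdict(int)           # frame -> correctable errors seen
        self.offlined = set()

    def report_correctable_error(self, frame):
        self.ce_count[frame] += 1
        if self.ce_count[frame] >= CE_THRESHOLD and frame not in self.offlined:
            self._offline(frame)

    def _offline(self, bad_frame):
        # Migrate every page mapped to the unhealthy frame, then retire it.
        free = sorted(HEALTHY_FRAMES - set(self.page_to_frame.values()) - self.offlined)
        for page, frame in list(self.page_to_frame.items()):
            if frame == bad_frame:
                self.page_to_frame[page] = free.pop(0)
        self.offlined.add(bad_frame)

mgr = ProactivePageManager({0: 0, 1: 1, 2: 1})
for _ in range(3):
    mgr.report_correctable_error(1)   # repeated CEs on frame 1 predict failure
print(mgr.offlined, mgr.page_to_frame)
```

The application never sees an uncorrectable error on the retired frame; only pages backed by it pay a one-time migration cost.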
DOI: 10.1109/SC.2014.63 | Published: 2014-11-16
Citations: 29
Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy
E. Totoni, J. Torrellas, L. Kalé
The cache hierarchy often consumes a large portion of a processor's energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations associated with other methods that reconfigure the caches: predicting the application's future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application's pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to find the best configuration in parallel quickly. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.
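The paper evaluates candidate cache configurations in parallel across SPMD ranks; a much-simplified serial sketch of the final selection step, with made-up measurements and a hypothetical slowdown budget:

```python
# Illustrative only: pick the lowest-energy cache configuration whose measured
# slowdown stays within a performance-penalty budget (numbers are invented).
MAX_SLOWDOWN = 1.03   # hypothetical 3% performance-penalty budget

# (config name, relative energy, relative runtime) for one application phase
candidates = [
    ("full-L2",       1.00, 1.000),
    ("half-L2",       0.55, 1.020),
    ("quarter-L2",    0.35, 1.060),
    ("stream-buffer", 0.25, 0.900),   # some codes even speed up (paper: up to 30%)
]

def pick_config(measurements, max_slowdown=MAX_SLOWDOWN):
    ok = [m for m in measurements if m[2] <= max_slowdown]
    return min(ok, key=lambda m: m[1])[0]

print(pick_config(candidates))
```

In the paper this choice is made per phase, with the phase pattern predicted by a formal-language model of the application's iteration structure.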
DOI: 10.1109/SC.2014.90 | Published: 2014-11-16
Citations: 13
RAHTM: Routing Algorithm Aware Hierarchical Task Mapping
Ahmed H. Abdel-Gawad, Mithuna Thottethodi, A. Bhatele
The mapping of MPI processes to compute nodes on a supercomputer can have a significant impact on communication performance. For high performance computing (HPC) applications with iterative communication, rich offline analysis of such communication can improve performance by optimizing the mapping. Unfortunately, current practices for at-scale HPC consider only the communication graph and network topology in solving this problem. We propose Routing Algorithm aware Hierarchical Task Mapping (RAHTM) which leverages the knowledge of the routing algorithm to improve task mapping. RAHTM achieves high quality mappings by combining (1) a divide-and-conquer strategy to achieve scalability, (2) a limited search of mappings, and (3) a linear programming based routing-aware approach to evaluate possible mappings in the search space. RAHTM achieves 20% reduction in the communication time and 9% reduction in the overall execution time for three communication-heavy benchmarks scaled up to 16,384 processes on a Blue Gene/Q platform.
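The core idea, scoring a candidate mapping using the actual routing function rather than the communication graph alone, can be illustrated on a toy ring network. This exhaustive search stands in for RAHTM's divide-and-conquer plus LP evaluation and is purely illustrative:

```python
import itertools

NODES = 4  # a 4-node ring with deterministic clockwise routing (toy example)

def route(src, dst):
    """Links, as (a, b) tuples, used by clockwise routing on the ring."""
    links, cur = [], src
    while cur != dst:
        nxt = (cur + 1) % NODES
        links.append((cur, nxt))
        cur = nxt
    return links

def max_link_load(mapping, traffic):
    """Routing-aware cost: the most heavily loaded link under this mapping."""
    load = {}
    for (t1, t2), vol in traffic.items():
        for link in route(mapping[t1], mapping[t2]):
            load[link] = load.get(link, 0) + vol
    return max(load.values())

traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10}   # a communication chain of tasks
best = min(itertools.permutations(range(NODES)),
           key=lambda m: max_link_load(dict(enumerate(m)), traffic))
print(best, max_link_load(dict(enumerate(best)), traffic))
```

A topology-only hop-count metric would rank many mappings equally here; accounting for the routing function distinguishes mappings that overload a single link from those that spread traffic.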
DOI: 10.1109/SC.2014.32 | Published: 2014-11-16
Citations: 20
Finding Constant from Change: Revisiting Network Performance Aware Optimizations on IaaS Clouds
Yifan Gong, Bingsheng He, Dan Li
Network performance aware optimizations have long been an effective approach to optimizing distributed applications on traditional network environments. However, the assumptions of network topology or direct use of several measurements of pair-wise network performance for optimizations are no longer valid on IaaS clouds. Virtualization hides network topology from users, and direct use of network performance measurements may not represent long-term performance. To enable existing network performance aware optimizations on IaaS clouds, we propose to decouple constant component from dynamic network performance while minimizing the difference by a mathematical method called RPCA (Robust Principal Component Analysis). We use the constant component to guide network performance aware optimizations and demonstrate the efficiency of our approach by adopting network aware optimizations for collective communications of MPI and generic topology mapping as well as two real-world applications, N-body and conjugate gradient (CG). Our experiments on Amazon EC2 and simulations demonstrate significant performance improvement on guiding the optimizations.
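Extracting a constant component from noisy pair-wise latency measurements via RPCA can be sketched with a minimal textbook inexact-ALM solver; this is a generic implementation under assumed toy data, not the paper's code:

```python
import numpy as np

def shrink(x, tau):
    """Soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca(M, n_iter=100, tol=1e-7):
    """Minimal inexact-ALM RPCA: split M into L (low rank) + S (sparse)."""
    lam = 1.0 / np.sqrt(max(M.shape))
    norm2 = np.linalg.norm(M, 2)
    Y = M / max(norm2, np.abs(M).max() / lam)   # standard dual initialization
    mu, rho = 1.25 / norm2, 1.5
    S = np.zeros_like(M)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(s, 1.0 / mu)) @ Vt       # singular-value thresholding
        S = shrink(M - L + Y / mu, lam / mu)
        Y += mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
        mu *= rho
    return L, S

# Toy latency matrix: a rank-1 "constant" baseline plus transient spikes.
rng = np.random.default_rng(0)
base = np.outer(np.ones(20), np.linspace(10.0, 20.0, 10))
spikes = np.zeros_like(base)
spikes[rng.integers(0, 20, 5), rng.integers(0, 10, 5)] = 50.0
L, S = rpca(base + spikes)
print(np.linalg.norm(L - base) / np.linalg.norm(base))  # small recovery error
```

The recovered low-rank part L plays the role of the constant component that guides the mapping and collective-communication optimizations; the sparse part S absorbs the transient interference that raw measurements would otherwise bake in.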
DOI: 10.1109/SC.2014.85 | Published: 2014-11-16
Citations: 19
Scalable Kernel Fusion for Memory-Bound GPU Applications
M. Wahib, N. Maruyama
GPU implementations of HPC applications relying on finite difference methods can include tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing data traffic to off-chip memory: kernels that share data arrays are fused into larger kernels where on-chip cache is used to hold the data reused by instructions originating from different kernels. The main challenges are a) searching for the optimal kernel fusions while constrained by data dependencies and kernels' precedences and b) effectively applying kernel fusion to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that using the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
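A greedy stand-in for the fusion decision (not the paper's scalable search) shows the constraint structure: fuse neighbouring kernels only if they share an array and the fused working set still fits in on-chip cache. Kernel names, sizes, and the uniform per-array working-set assumption are all invented:

```python
CACHE_BYTES = 48 * 1024  # assumed on-chip cache budget, illustrative

kernels = [  # (name, arrays touched, per-array working set in bytes)
    ("diff_x",  {"u", "ux"},         16 * 1024),
    ("diff_y",  {"u", "uy"},         16 * 1024),
    ("combine", {"ux", "uy", "out"}, 16 * 1024),
]

def greedy_fuse(kernels, cache=CACHE_BYTES):
    """Fuse consecutive kernels that share arrays while the combined
    working set (uniform per-array sizes, for simplicity) fits in cache."""
    fused = [list(kernels[:1])]
    for k in kernels[1:]:
        group = fused[-1]
        arrays = set().union(*(g[1] for g in group)) | k[1]
        shares = any(g[1] & k[1] for g in group)
        if shares and len(arrays) * k[2] <= cache:
            group.append(k)
        else:
            fused.append([k])
    return [[g[0] for g in group] for group in fused]

print(greedy_fuse(kernels))
```

Here `diff_x` and `diff_y` fuse (they share `u` and their three arrays fit), while `combine` would push the working set past the cache budget and stays separate; the paper's method searches this space globally under data-dependency and precedence constraints instead of greedily.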
DOI: 10.1109/SC.2014.21 | Published: 2014-11-16
Citations: 73
A Volume Integral Equation Stokes Solver for Problems with Variable Coefficients
D. Malhotra, A. Gholami, G. Biros
We present a novel numerical scheme for solving the Stokes equation with variable coefficients in the unit box. Our scheme is based on a volume integral equation formulation. Compared to finite element methods, our formulation decouples the velocity and pressure, generates velocity fields that are by construction divergence free to high accuracy and its performance does not depend on the order of the basis used for discretization. In addition, we employ a novel adaptive fast multipole method for volume integrals to obtain a scheme that is algorithmically optimal. Our scheme supports non-uniform discretizations and is spectrally accurate. To increase per node performance, we have integrated our code with both NVIDIA and Intel accelerators. In our largest scalability test, we solved a problem with 20 billion unknowns, using a 14th-order approximation for the velocity, on 2048 nodes of the Stampede system at the Texas Advanced Computing Center. We achieved 0.656 peta FLOPS for the overall code (23% efficiency) and one peta FLOPS for the volume integrals (33% efficiency). As an application example, we simulate Stokes flow in a porous medium with highly complex pore structure using a penalty formulation to enforce the no-slip condition.
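For reference, the variable-coefficient Stokes problem in the unit box can be written in a standard form (the paper's exact formulation, including its penalty term for the no-slip condition, may differ in detail):

```latex
-\nabla \cdot \left( \mu(x)\, \big( \nabla u + \nabla u^{\mathsf{T}} \big) \right) + \nabla p = f,
\qquad \nabla \cdot u = 0, \qquad x \in \Omega = [0,1]^3,
```

where $u$ is the velocity field, $p$ the pressure, $\mu(x)$ the spatially varying viscosity, and $f$ the body force; the volume integral equation formulation solves for a density from which $u$ and $p$ are recovered, with $\nabla \cdot u = 0$ satisfied by construction.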
DOI: 10.1109/SC.2014.13 | Published: 2014-11-16
Citations: 14
ECC Parity: A Technique for Efficient Memory Error Resilience for Multi-Channel Memory Systems
Xun Jian, Rakesh Kumar
Servers and HPC systems often use a strong memory error correction code, or ECC, to meet their reliability and availability requirements. However, these ECCs often require significant capacity and/or power overheads. We observe that since memory channels are independent from one another, error correction typically needs to be performed for one channel at a time. Based on this observation, we show that instead of always storing in memory the actual ECC correction bits as do existing systems, it is sufficient to store the bitwise parity of the ECC correction bits of different channels for fault-free memory regions, and store the actual ECC correction bits only for faulty memory regions. By trading off the resultant ECC capacity overhead reduction for improved memory energy efficiency, the proposed technique reduces memory energy per instruction by 54.4% and 20.6%, respectively, compared to a commercial chipkill-correct ECC and a DIMM-kill-correct ECC, while incurring similar or lower capacity overheads.
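The parity trick itself is just XOR across channels; the sketch below illustrates it with a toy checksum standing in for real ECC correction bits, which is a deliberate simplification of actual chipkill-class codes:

```python
# For fault-free regions, store only the XOR (bitwise parity) of the
# per-channel ECC correction bits; one channel's correction bits can be
# rebuilt on demand from the other channels' data plus the stored parity.
def ecc_bits(channel_data):
    """Stand-in 8-bit 'ECC correction bits' for a channel (toy checksum)."""
    return sum(channel_data) & 0xFF

channels = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
per_channel_ecc = [ecc_bits(c) for c in channels]

# Stored in memory: only the parity of the four channels' ECC bits.
stored_parity = 0
for e in per_channel_ecc:
    stored_parity ^= e

# On an error in channel 2: recompute the other channels' ECC bits and XOR
# them with the stored parity to recover channel 2's correction bits.
recovered = stored_parity
for i, c in enumerate(channels):
    if i != 2:
        recovered ^= ecc_bits(c)
print(recovered == per_channel_ecc[2])
```

Storing one parity word instead of one correction word per channel is what yields the capacity savings; full correction bits are kept only for regions already known to be faulty.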
DOI: 10.1109/SC.2014.89 | Published: 2014-11-16
Citations: 18
Dissecting On-Node Memory Access Performance: A Semantic Approach
Alfredo Giménez, T. Gamblin, B. Rountree, A. Bhatele, Ilir Jusufi, P. Bremer, B. Hamann
Optimizing memory access is critical for performance and power efficiency. CPU manufacturers have developed sampling-based performance measurement units (PMUs) that report precise costs of memory accesses at specific addresses. However, this data is too low-level to be meaningfully interpreted and contains an excessive amount of irrelevant or uninteresting information. We have developed a method to gather fine-grained memory access performance data for specific data objects and regions of code with low overhead and attribute semantic information to the sampled memory accesses. This information provides the context necessary to more effectively interpret the data. We have developed a tool that performs this sampling and attribution and used the tool to discover and diagnose performance problems in real-world applications. Our techniques provide useful insight into the memory behaviour of applications and allow programmers to understand the performance ramifications of key design decisions: domain decomposition, multi-threading, and data motion within distributed memory systems.
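The attribution step, mapping each sampled address back to the data object that owns it, reduces to an interval lookup over registered allocations. A minimal sketch with hypothetical address ranges and PMU-style (address, latency) samples:

```python
import bisect

# Registered data objects: (start address, size in bytes, name).
# These allocations and samples are invented for illustration.
objects = [
    (0x1000, 0x800,  "grid_A"),
    (0x2000, 0x400,  "halo_buffer"),
    (0x3000, 0x1000, "grid_B"),
]
starts = [o[0] for o in objects]   # sorted start addresses for bisect

def attribute(address):
    """Return the name of the object whose range encloses the address."""
    i = bisect.bisect_right(starts, address) - 1
    if i >= 0 and address < objects[i][0] + objects[i][1]:
        return objects[i][2]
    return "<unknown>"

samples = [(0x1010, 12), (0x2100, 300), (0x3FFF, 45), (0x5000, 7)]
cost = {}
for addr, latency in samples:
    name = attribute(addr)
    cost[name] = cost.get(name, 0) + latency
print(cost)
```

Aggregating sampled access costs per object (rather than per raw address) is what turns low-level PMU data into the semantic view the paper argues for.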
DOI: 10.1109/SC.2014.19 | Published: 2014-11-16
Citations: 30
Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems
S. Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, M. Ezell, Ross G. Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari, Sudharshan S. Vazhkudai, James H. Rogers, D. Dillow, G. Shipman, Arthur S. Bland
The Oak Ridge Leadership Computing Facility (OLCF) has deployed multiple large-scale parallel file systems (PFS) to support its operations. During this process, OLCF acquired significant expertise in large-scale storage system design, file system software development, technology evaluation, benchmarking, procurement, deployment, and operational practices. Based on the lessons learned from each new PFS deployment, OLCF improved its operating procedures, and strategies. This paper provides an account of our experience and lessons learned in acquiring, deploying, and operating large-scale parallel file systems. We believe that these lessons will be useful to the wider HPC community.
DOI: 10.1109/SC.2014.23 | Published: 2014-11-16
Citations: 41
Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales
S. Di, L. Bautista-Gomez, F. Cappello
Future extreme-scale systems are expected to experience different types of failures affecting applications with different failure scales, from transient uncorrectable memory errors in processes to massive system outages. In this paper, we propose a multilevel checkpoint model by taking into account uncertain execution scales (different numbers of processes/cores). The contribution is threefold: (1) we provide an in-depth analysis on why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously, (2) we devise a novel method that can quickly obtain an optimized solution -- the first successful attempt in multilevel checkpoint models with uncertain scales, and (3) we perform both large scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. The experiments confirm that our optimized solution outperforms other state-of-the-art solutions by 4.3 -- 88% on wall-clock length.
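As a single-level point of comparison for the multilevel model the paper optimizes, the classic Young/Daly first-order rule derives a checkpoint interval from the checkpoint cost and the mean time between failures; the numbers below are hypothetical:

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young/Daly first-order optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical system: 60 s to write a checkpoint, 6-hour system MTBF.
t = young_interval(60.0, 6 * 3600.0)
print(round(t))  # optimal interval in seconds
```

The paper's harder problem is that each checkpoint level has its own cost and covers a different failure class, and the failure rates themselves depend on the (uncertain) number of processes, so the per-level intervals and the core count must be optimized jointly rather than with one closed-form interval.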
未来的极端规模系统预计将经历不同类型的故障,影响具有不同故障规模的应用程序,从进程中暂时的不可纠正的内存错误到大规模的系统中断。在本文中,我们提出了一个考虑到不确定的执行规模(不同数量的进程/核心)的多级检查点模型。贡献有三个方面:(1)我们深入分析了为什么难以获得不同检查点级别的最佳检查点间隔并同时优化核心数量;(2)我们设计了一种新的方法,可以快速获得优化解——这是在具有不确定尺度的多层检查点模型中首次成功尝试;(3)我们进行了大规模的真实实验和极端尺度的数值模拟来验证我们设计的有效性。实验证实,我们优化的解决方案在时钟长度上优于其他最先进的解决方案4.3 - 88%。
{"title":"Optimization of a Multilevel Checkpoint Model with Uncertain Execution Scales","authors":"S. Di, L. Bautista-Gomez, F. Cappello","doi":"10.1109/SC.2014.79","DOIUrl":"https://doi.org/10.1109/SC.2014.79","url":null,"abstract":"Future extreme-scale systems are expected to experience different types of failures affecting applications with different failure scales, from transient uncorrectable memory errors in processes to massive system outages. In this paper, we propose a multilevel checkpoint model by taking into account uncertain execution scales (different numbers of processes/cores). The contribution is threefold: (1) we provide an in-depth analysis on why it is difficult to derive the optimal checkpoint intervals for different checkpoint levels and optimize the number of cores simultaneously, (2) we devise a novel method that can quickly obtain an optimized solution -- the first successful attempt in multilevel checkpoint models with uncertain scales, and (3) we perform both large scale real experiments and extreme-scale numerical simulation to validate the effectiveness of our design. The experiments confirm that our optimized solution outperforms other state of-the-art solutions by 4.3 -- 88% on wall-clock length.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126710832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
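For context on the interval-optimization problem this abstract tackles, a minimal single-level baseline is the classic Young/Daly approximation, which sets the optimal time between checkpoints to √(2·C·MTBF) for checkpoint cost C and mean time between failures MTBF. The sketch below is illustrative background only, not the paper's multilevel method; the function name and the example numbers (60 s checkpoint, 24 h MTBF) are assumptions.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Classic single-level Young/Daly approximation of the optimal
    checkpoint interval: sqrt(2 * C * MTBF).

    Background only: the paper above optimizes intervals for *several*
    checkpoint levels jointly, under uncertain process/core counts,
    which this one-level formula does not capture.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Hypothetical numbers: a 60 s checkpoint cost and a 24 h system MTBF.
interval_s = young_daly_interval(60.0, 24 * 3600.0)
print(round(interval_s))  # → 3220 (checkpoint roughly every 54 minutes)
```

The formula makes the trade-off in the abstract concrete: cheaper checkpoints or less reliable hardware both shrink the optimal interval, and a multilevel scheme must balance one such interval per failure level simultaneously.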