
Latest publications: 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

Petascale direct numerical simulation of turbulent channel flow on up to 786K cores
Myoungkyu Lee, Nicholas Malaya, R. Moser
We present results of performance optimization for direct numerical simulation (DNS) of wall bounded turbulent flow (channel flow). DNS is a technique in which the fluid flow equations are solved without subgrid modeling. Of particular interest are high Reynolds number (Re) turbulent flows over walls, because of their importance in technological applications. Simulating high Re turbulence is a challenging computational problem, due to the high spatial and temporal resolution requirements.
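The resolution requirement mentioned in the abstract can be made concrete with a textbook back-of-envelope estimate (ours, not the paper's channel-flow analysis): resolving down to the Kolmogorov scale makes the grid grow much faster than the Reynolds number itself.

```latex
% Standard Kolmogorov-scaling estimate for homogeneous turbulence (illustrative only;
% the paper's wall-bounded case has its own, stricter near-wall requirements).
% \eta: Kolmogorov length scale, L: integral scale, Re: Reynolds number.
\[
  \frac{L}{\eta} \sim Re^{3/4}
  \quad\Longrightarrow\quad
  N_{\mathrm{grid}} \sim \left(\frac{L}{\eta}\right)^{3} \sim Re^{9/4},
\]
\[
  \text{so a } 10\times \text{ increase in } Re \text{ costs roughly } 10^{9/4} \approx 180\times
  \text{ more grid points, before the finer time step is accounted for.}
\]
```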
{"title":"Petascale direct numerical simulation of turbulent channel flow on up to 786K cores","authors":"Myoungkyu Lee, Nicholas Malaya, R. Moser","doi":"10.1145/2503210.2503298","DOIUrl":"https://doi.org/10.1145/2503210.2503298","url":null,"abstract":"We present results of performance optimization for direct numerical simulation (DNS) of wall bounded turbulent flow (channel flow). DNS is a technique in which the fluid flow equations are solved without subgrid modeling. Of particular interest are high Reynolds number (Re) turbulent flows over walls, because of their importance in technological applications. Simulating high Re turbulence is a challenging computational problem, due to the high spatial and temporal resolution requirements.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"126 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133320622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 101
Petascale WRF simulation of Hurricane Sandy: Deployment of NCSA's Cray XE6 Blue Waters
P. Johnsen, M. Straka, M. Shapiro, A. Norton, Thomas J. Galarneau
The National Center for Atmospheric Research (NCAR) Weather Research and Forecasting (WRF) model has been employed on the largest yet storm prediction model using real data of over 4 billion points to simulate the landfall of Hurricane Sandy. Using an unprecedented 13,680 nodes (437,760 cores) of the Cray XE6 “Blue Waters” at NCSA at the University of Illinois, researchers achieved a sustained rate of 285 Tflops while simulating an 18-hour forecast. A grid of size 9120×9216×48 (1.4Tbytes of input) was used, with horizontal resolution of 500 meters and a 2-second time step. 86 Gbytes of forecast data was written every 6 forecast hours at a rate of up to 2 Gbytes/second and collaboratively post-processed and displayed using the Vapor suite at NCAR. Opportunities to enhance scalability in the source code, run-time, and operating system realms were exploited. The output of this numerical model is now under study for model validation.
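As a quick consistency check using only the numbers quoted in the abstract, the grid dimensions do give the "over 4 billion points", and the quoted peak write rate implies well under a minute of I/O per output interval:

```latex
% Arithmetic taken directly from the figures in the abstract.
\[
  9120 \times 9216 \times 48 \;=\; 4{,}034{,}396{,}160 \;\approx\; 4.0 \times 10^{9}\ \text{grid points},
\]
\[
  \frac{86\ \text{GB}}{2\ \text{GB/s}} \;=\; 43\ \text{s of output time per 6 forecast hours at the peak quoted rate}.
\]
```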
{"title":"Petascale WRF simulation of hurricane sandy: Deployment of NCSA's cray XE6 blue waters","authors":"P. Johnsen, M. Straka, M. Shapiro, A. Norton, Thomas J. Galarneau","doi":"10.1145/2503210.2503231","DOIUrl":"https://doi.org/10.1145/2503210.2503231","url":null,"abstract":"The National Center for Atmospheric Research (NCAR) Weather Research and Forecasting (WRF) model has been employed on the largest yet storm prediction model using real data of over 4 billion points to simulate the landfall of Hurricane Sandy. Using an unprecedented 13,680 nodes (437,760 cores) of the Cray XE6 “Blue Waters” at NCSA at the University of Illinois, researchers achieved a sustained rate of 285 Tflops while simulating an 18-hour forecast. A grid of size 9120×9216×48 (1.4Tbytes of input) was used, with horizontal resolution of 500 meters and a 2-second time step. 86 Gbytes of forecast data was written every 6 forecast hours at a rate of up to 2 Gbytes/second and collaboratively post-processed and displayed using the Vapor suite at NCAR. Opportunities to enhance scalability in the source code, run-time, and operating system realms were exploited. The output of this numerical model is now under study for model validation.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133062698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 35
Distributed wait state tracking for runtime MPI deadlock detection
Tobias Hilbrich, B. Supinski, W. Nagel, Joachim Protze, C. Baier, Matthias S. Müller
The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.
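A runtime detector such as MUST has to recognize cyclic wait states like the one in the toy program below. This is our own illustration of the problem class, not code from the paper; it deadlocks because both ranks block in MPI_Recv before either reaches its matching MPI_Send.

```c
/* Toy illustration of a cyclic wait state: both ranks block in MPI_Recv,
 * so neither reaches the matching MPI_Send. A runtime deadlock detector
 * must recognize this cycle among the blocking conditions.
 * (Illustrative example, not taken from the MUST paper.) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, sendbuf = 42, recvbuf = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2 && rank < 2) {
        int peer = 1 - rank;   /* rank 0 <-> rank 1 */
        /* Both ranks receive first: each waits for a message the other
         * has not sent yet -> circular wait -> deadlock. */
        MPI_Recv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        printf("rank %d got %d\n", rank, recvbuf);  /* never reached */
    }

    MPI_Finalize();
    return 0;
}
```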
{"title":"Distributed wait state tracking for runtime MPI deadlock detection","authors":"Tobias Hilbrich, B. Supinski, W. Nagel, Joachim Protze, C. Baier, Matthias S. Müller","doi":"10.1145/2503210.2503237","DOIUrl":"https://doi.org/10.1145/2503210.2503237","url":null,"abstract":"The widely used Message Passing Interface (MPI) with its multitude of communication functions is prone to usage errors. Runtime error detection tools aid in the removal of these errors. We develop MUST as one such tool that provides a wide variety of automatic correctness checks. Its correctness checks can be run in a distributed mode, except for its deadlock detection. This limitation applies to a wide range of tools that either use centralized detection algorithms or a timeout approach. In order to provide scalable and distributed deadlock detection with detailed insight into deadlock situations, we propose a model for MPI blocking conditions that we use to formulate a distributed algorithm. This algorithm implements scalable MPI deadlock detection in MUST. Stress tests at up to 4,096 processes demonstrate the scalability of our approach. Finally, overhead results for a complex benchmark suite demonstrate an average runtime increase of 34% at 2,048 processes.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129325002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
An improved parallel singular value algorithm and its implementation for multicore hardware
A. Haidar, J. Kurzak, P. Luszczek
The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges, starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All of these lead to the development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task-based approach, and a hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.
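For orientation, the comparison baseline in the abstract is the vendor's LAPACK SVD. Below is a minimal call through the standard LAPACKE interface to dgesvd (one common entry point to the dense SVD in MKL or OpenBLAS), shown only as a reference point, not as the authors' solver; the jobu/jobvt flags select between "all singular vectors" ('A') and "singular values only" ('N'), the two extremes the speedup figures refer to.

```c
/* Reference point only: the standard LAPACK interface (via LAPACKE, as shipped
 * with MKL or OpenBLAS) that solvers like the one in this paper are benchmarked
 * against. This is not the authors' code. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* 4x3 example matrix, row-major. */
    double a[4 * 3] = {
        1.0, 2.0, 3.0,
        4.0, 5.0, 6.0,
        7.0, 8.0, 9.0,
        1.0, 0.0, 1.0
    };
    lapack_int m = 4, n = 3;
    double s[3];        /* singular values                     */
    double u[4 * 4];    /* left singular vectors  (jobu  = 'A') */
    double vt[3 * 3];   /* right singular vectors (jobvt = 'A') */
    double superb[2];   /* workspace of size min(m,n) - 1       */

    /* 'A','A' requests all singular vectors, the most expensive case the
     * abstract refers to; 'N','N' would request singular values only. */
    lapack_int info = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'A', 'A', m, n,
                                     a, n, s, u, m, vt, n, superb);
    if (info != 0) {
        fprintf(stderr, "dgesvd failed: info = %d\n", (int)info);
        return 1;
    }
    for (int i = 0; i < 3; ++i)
        printf("sigma[%d] = %f\n", i, s[i]);
    return 0;
}
```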
{"title":"An improved parallel singular value algorithm and its implementation for multicore hardware","authors":"A. Haidar, J. Kurzak, P. Luszczek","doi":"10.1145/2503210.2503292","DOIUrl":"https://doi.org/10.1145/2503210.2503292","url":null,"abstract":"The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges-starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125471144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 42
A large-scale cross-architecture evaluation of thread-coarsening
A. Magni, Christophe Dubach, M. O’Boyle
OpenCL has become the de-facto data parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed with the goal of guaranteeing program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device. In this paper we consider a data parallel compiler transformation - thread-coarsening - and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters giving over 43,000 different experiments. We achieve speedups over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.
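To make the transformation concrete, here is a minimal before/after sketch of thread coarsening on an OpenCL kernel. This is our illustration with a coarsening factor of 2 and unit stride; the paper's compiler applies such source-to-source rewrites automatically across a range of coarsening parameters.

```c
/* Illustration of the thread-coarsening transformation itself (our sketch,
 * not the compiler output from the paper): the coarsened kernel makes each
 * work-item do the work of two original work-items, so the NDRange is
 * launched with half as many work-items in dimension 0. */

/* Original kernel: one element per work-item. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float       *c,
                   const int n)
{
    int i = (int)get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
}

/* Coarsened by a factor of 2 with stride 1: work-item i now handles
 * elements 2*i and 2*i + 1. */
__kernel void vadd_coarsened2(__global const float *a,
                              __global const float *b,
                              __global float       *c,
                              const int n)
{
    int i = 2 * (int)get_global_id(0);
    if (i < n)
        c[i] = a[i] + b[i];
    if (i + 1 < n)
        c[i + 1] = a[i + 1] + b[i + 1];
}
```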
{"title":"A large-scale cross-architecture evaluation of thread-coarsening","authors":"A. Magni, Christophe Dubach, M. O’Boyle","doi":"10.1145/2503210.2503268","DOIUrl":"https://doi.org/10.1145/2503210.2503268","url":null,"abstract":"OpenCL has become the de-facto data parallel programming model for parallel devices in today's high-performance supercomputers. OpenCL was designed with the goal of guaranteeing program portability across hardware from different vendors. However, achieving good performance is hard, requiring manual tuning of the program and expert knowledge of each target device. In this paper we consider a data parallel compiler transformation - thread-coarsening - and evaluate its effects across a range of devices by developing a source-to-source OpenCL compiler based on LLVM. We thoroughly evaluate this transformation on 17 benchmarks and five platforms with different coarsening parameters giving over 43,000 different experiments. We achieve speedups over 9x on individual applications and average speedups ranging from 1.15x on the Nvidia Kepler GPU to 1.50x on the AMD Cypress GPU. Finally, we use statistical regression to analyse and explain program performance in terms of hardware-based performance counters.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130025402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 81
Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance
Yulu Jia, G. Bosilca, P. Luszczek, J. Dongarra
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
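For background, the column-checksum encoding that ABFT schemes build on can be written in one line; this is the classical idea only, since the abstract does not spell out the paper's HR-specific encoding.

```latex
% Classical ABFT column checksums (background sketch, not the paper's exact scheme).
\[
  A^{c} \;=\; \begin{pmatrix} A \\ e^{\mathsf{T}} A \end{pmatrix},
  \qquad e = (1,\dots,1)^{\mathsf{T}},
  \qquad
  A^{c} Q \;=\; \begin{pmatrix} A Q \\ e^{\mathsf{T}} (A Q) \end{pmatrix},
\]
\[
  \text{recovery of a lost entry:}\qquad
  a_{kj} \;=\; (e^{\mathsf{T}} A)_{j} \;-\; \sum_{i \neq k} a_{ij}.
\]
```

Right-multiplications preserve the column-checksum relation (and, symmetrically, left-multiplications preserve a row checksum $Ae$), which is why checksums can protect the actively updated trailing and initial parts of the matrix, while the finished panels are covered by diskless checkpoints, as the abstract describes.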
{"title":"Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance","authors":"Yulu Jia, G. Bosilca, P. Luszczek, J. Dongarra","doi":"10.1145/2503210.2503249","DOIUrl":"https://doi.org/10.1145/2503210.2503249","url":null,"abstract":"This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLA-PACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129701687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
A new routing scheme for jellyfish and its performance with HPC workloads
Xin Yuan, S. Mahapatra, Wickus Nienaber, S. Pakin, M. Lang
The jellyfish topology where switches are connected using a random graph has recently been proposed for large scale data-center networks. It has been shown to offer higher bisection bandwidth and better permutation throughput than the corresponding fat-tree topology with a similar cost. In this work, we propose a new routing scheme for jellyfish that outperforms existing schemes by more effectively exploiting the path diversity, and comprehensively compare the performance of jellyfish and fat-tree topologies with HPC workloads. The results indicate that both jellyfish and fat-tree topologies offer comparable high performance for HPC workloads on systems that can be realized by 3-level fat-trees using the current technology and the corresponding jellyfish topologies with similar costs. Fat-trees are more effective for smaller systems while jellyfish is more scalable.
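For readers unfamiliar with the topology, a jellyfish network simply wires inter-switch ports together at random. The toy sketch below builds such a random switch-to-switch graph; it is our illustration of the topology's construction, not the construction or routing code from the paper, and it leaves a few ports unmatched rather than handling every corner case.

```c
/* Toy sketch (our illustration): build a jellyfish-style random inter-switch
 * topology by repeatedly connecting two distinct switches that still have
 * free ports and are not yet linked. Unmatchable leftover ports are simply
 * left unconnected to keep the sketch short. */
#include <stdio.h>
#include <stdlib.h>

#define NSW   16   /* number of switches            */
#define PORTS  4   /* inter-switch ports per switch */

int free_ports[NSW];
int linked[NSW][NSW];   /* adjacency matrix of switch-to-switch links */

int main(void) {
    srand(1234);
    for (int i = 0; i < NSW; ++i) free_ports[i] = PORTS;

    /* Try a bounded number of random pairings. */
    for (int attempt = 0; attempt < 100000; ++attempt) {
        int a = rand() % NSW, b = rand() % NSW;
        if (a == b || linked[a][b] || free_ports[a] == 0 || free_ports[b] == 0)
            continue;
        linked[a][b] = linked[b][a] = 1;
        free_ports[a]--;
        free_ports[b]--;
    }

    for (int i = 0; i < NSW; ++i) {
        printf("switch %2d:", i);
        for (int j = 0; j < NSW; ++j)
            if (linked[i][j]) printf(" %d", j);
        printf("\n");
    }
    return 0;
}
```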
{"title":"A new routing scheme for jellyfish and its performance with HPC workloads","authors":"Xin Yuan, S. Mahapatra, Wickus Nienaber, S. Pakin, M. Lang","doi":"10.1145/2503210.2503229","DOIUrl":"https://doi.org/10.1145/2503210.2503229","url":null,"abstract":"The jellyfish topology where switches are connected using a random graph has recently been proposed for large scale data-center networks. It has been shown to offer higher bisection bandwidth and better permutation throughput than the corresponding fat-tree topology with a similar cost. In this work, we propose a new routing scheme for jellyfish that out-performs existing schemes by more effectively exploiting the path diversity, and comprehensively compare the performance of jellyfish and fat-tree topologies with HPC workloads. The results indicate that both jellyfish and fat-tree topologies offer comparable high performance for HPC workloads on systems that can be realized by 3-level fat-trees using the current technology and the corresponding jellyfish topologies with similar costs. Fat-trees are more effective for smaller systems while jellyfish is more scalable.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129456152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Exploring DRAM organizations for energy-efficient and resilient exascale memories
Bharan Giridhar, Michael Cieslak, Deepankar Duggal, R. Dreslinski, H. Chen, R. Patti, B. Hold, C. Chakrabarti, T. Mudge, D. Blaauw
The power target for exascale supercomputing is 20MW, with about 30% budgeted for the memory subsystem. Commodity DRAMs will not satisfy this requirement. Additionally, the large number of memory chips (>10M) required will result in crippling failure rates. Although specialized DRAM memories have been reorganized to reduce power through 3D-stacking or row buffer resizing, their implications on fault tolerance have not been considered. We show that addressing reliability and energy is a co-optimization problem involving tradeoffs between error correction cost, access energy and refresh power: reducing the physical page size to decrease access energy increases the energy/area overhead of error resilience. Additionally, power can be reduced by optimizing bitline lengths. The proposed 3D-stacked memory uses a page size of 4kb and consumes 5.1pJ/bit based on simulations with NEK5000 benchmarks. Scaling to 100PB, the memory consumes 4.7MW at 100PB/s which, while well within the total power budget (20MW), is also error-resilient.
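A back-of-envelope check from the abstract's own numbers (our arithmetic; the gap to the reported 4.7 MW presumably covers refresh and peripheral overheads) shows how the quoted access energy translates into power at the quoted bandwidth:

```latex
% Memory power budget and access-energy contribution, from the quoted figures.
\[
  0.30 \times 20\ \text{MW} \;=\; 6\ \text{MW memory budget},
\]
\[
  5.1\ \tfrac{\text{pJ}}{\text{bit}} \times 100\ \tfrac{\text{PB}}{\text{s}} \times 8\ \tfrac{\text{bit}}{\text{byte}}
  \;=\; 5.1 \times 10^{-12} \times 8 \times 10^{17}\ \text{W}
  \;\approx\; 4.1\ \text{MW}.
\]
```

That is consistent with the reported 4.7 MW at 100 PB/s and comfortably inside the roughly 6 MW slice of the 20 MW target.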
{"title":"Exploring DRAM organizations for energy-efficient and resilient exascale memories","authors":"Bharan Giridhar, Michael Cieslak, Deepankar Duggal, R. Dreslinski, H. Chen, R. Patti, B. Hold, C. Chakrabarti, T. Mudge, D. Blaauw","doi":"10.1145/2503210.2503215","DOIUrl":"https://doi.org/10.1145/2503210.2503215","url":null,"abstract":"The power target for exascale supercomputing is 20MW, with about 30% budgeted for the memory subsystem. Commodity DRAMs will not satisfy this requirement. Additionally, the large number of memory chips (>10M) required will result in crippling failure rates. Although specialized DRAM memories have been reorganized to reduce power through 3D-stacking or row buffer resizing, their implications on fault tolerance have not been considered. We show that addressing reliability and energy is a co-optimization problem involving tradeoffs between error correction cost, access energy and refresh power-reducing the physical page size to decrease access energy increases the energy/area overhead of error resilience. Additionally, power can be reduced by optimizing bitline lengths. The proposed 3D-stacked memory uses a page size of 4kb and consumes 5.1pJ/bit based on simulations with NEK5000 benchmarks. Scaling to 100PB, the memory consumes 4.7MW at 100PB/s which, while well within the total power budget (20MW), is also error-resilient.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117165580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 87
Location-aware cache management for many-core processors with deep cache hierarchy
Jongsoo Park, Richard M. Yoo, D. Khudia, C. Hughes, Daehyun Kim
As cache hierarchies become deeper and the number of cores on a chip increases, managing caches becomes more important for performance and energy. However, current hardware cache management policies do not always adapt optimally to the applications' behavior: e.g., caches may be polluted by data structures whose locality cannot be captured by the caches, and producer-consumer communication incurs multiple round trips of coherence messages per cache line transferred. We propose load and store instructions that carry hints regarding into which cache(s) the accessed data should be placed. Our instructions allow software to convey locality information to the hardware, while incurring minimal hardware cost and not affecting correctness. Our instructions provide a 1.07× speedup and a 1.24× energy efficiency boost, on average, according to simulations on a 64-core system with private L1 and L2 caches. With a large shared L3 cache added, the benefits increase, providing 1.33× energy reduction on average.
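The proposed loads and stores carry placement hints in the ISA, which cannot be shown with existing instructions. As a loose existing analogy (our example, not the paper's mechanism), GCC and Clang expose a temporal-locality hint on software prefetches, which likewise lets software tell the hardware how high in the cache hierarchy data is worth keeping:

```c
/* Loose analogy only: __builtin_prefetch's locality argument (0 = no temporal
 * locality, 3 = keep as high in the hierarchy as possible) is an existing way
 * for software to pass placement hints to the memory system. The paper's
 * proposal attaches such hints to the loads and stores themselves. */
#include <stddef.h>

/* Sum a large streaming array that will not be reused: hint locality 0 so the
 * prefetched lines are not worth keeping in the upper cache levels. */
double sum_streaming(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 64 < n)
            __builtin_prefetch(&a[i + 64], /*rw=*/0, /*locality=*/0);
        s += a[i];
    }
    return s;
}
```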
{"title":"Location-aware cache management for many-core processors with deep cache hierarchy","authors":"Jongsoo Park, Richard M. Yoo, D. Khudia, C. Hughes, Daehyun Kim","doi":"10.1145/2503210.2503224","DOIUrl":"https://doi.org/10.1145/2503210.2503224","url":null,"abstract":"As cache hierarchies become deeper and the number of cores on a chip increases, managing caches becomes more important for performance and energy. However, current hardware cache management policies do not always adapt optimally to the applications behavior: e.g., caches may be polluted by data structures whose locality cannot be captured by the caches, and producer-consumer communication incurs multiple round trips of coherence messages per cache line transferred. We propose load and store instructions that carry hints regarding into which cache(s) the accessed data should be placed. Our instructions allow software to convey locality information to the hardware, while incurring minimal hardware cost and not affecting correctness. Our instructions provide a 1.07× speedup and a 1.24× energy efficiency boost, on average, according to simulations on a 64-core system with private L1 and L2 caches. With a large shared L3 cache added, the benefits increase, providing 1.33× energy reduction on average.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Efficient data partitioning model for heterogeneous graphs in the cloud
Kisung Lee, Ling Liu
As the size and variety of information networks continue to grow in many scientific and engineering domains, we witness a growing demand for efficient processing of large heterogeneous graphs using a cluster of compute nodes in the Cloud. One open issue is how to effectively partition a large graph to process complex graph operations efficiently. In this paper, we present VB-Partitioner - a distributed data partitioning model and algorithms for efficient processing of graph operations over large-scale graphs in the Cloud. Our VB-Partitioner has three salient features. First, it introduces vertex blocks (VBs) and extended vertex blocks (EVBs) as the building blocks for semantic partitioning of large graphs. Second, VB-Partitioner utilizes vertex block grouping algorithms to place those vertex blocks that have high correlation in graph structure into the same partition. Third, VB-Partitioner employs a VB-partition guided query partitioning model to speed up the parallel processing of graph pattern queries by reducing the amount of inter-partition query processing. We conduct extensive experiments on several real-world graphs with millions of vertices and billions of edges. Our results show that VB-Partitioner significantly outperforms the popular random block-based data partitioner in terms of query latency and scalability over large-scale graphs.
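The sketch below illustrates one simple reading of the vertex-block idea from the abstract (our interpretation, not the paper's implementation): a 1-hop vertex block anchored at a vertex consists of the vertex plus its direct in- and out-neighbors, and co-locating whole blocks in one partition lets 1-hop graph-pattern queries run without crossing partitions.

```c
/* Simplified sketch of a 1-hop vertex block (our reading of the abstract,
 * not the paper's VB-Partitioner code): VB(v) = v plus its direct in- and
 * out-neighbors in a small directed example graph. */
#include <stdio.h>

#define NV 6

/* Small directed example graph as an adjacency matrix. */
int adj[NV][NV] = {
    {0,1,1,0,0,0},
    {0,0,0,1,0,0},
    {0,0,0,1,1,0},
    {0,0,0,0,0,1},
    {0,0,0,0,0,1},
    {0,0,0,0,0,0},
};

/* Print the 1-hop vertex block anchored at v. */
void vertex_block(int v)
{
    printf("VB(%d):", v);
    for (int u = 0; u < NV; ++u)
        if (u == v || adj[v][u] || adj[u][v])
            printf(" %d", u);
    printf("\n");
}

int main(void)
{
    for (int v = 0; v < NV; ++v)
        vertex_block(v);
    return 0;
}
```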
{"title":"Efficient data partitioning model for heterogeneous graphs in the cloud","authors":"Kisung Lee, Ling Liu","doi":"10.1145/2503210.2503302","DOIUrl":"https://doi.org/10.1145/2503210.2503302","url":null,"abstract":"As the size and variety of information networks continue to grow in many scientific and engineering domains, we witness a growing demand for efficient processing of large heterogeneous graphs using a cluster of compute nodes in the Cloud. One open issue is how to effectively partition a large graph to process complex graph operations efficiently. In this paper, we present VB-Partitioner - a distributed data partitioning model and algorithms for efficient processing of graph operations over large-scale graphs in the Cloud. Our VB-Partitioner has three salient features. First, it introduces vertex blocks (VBs) and extended vertex blocks (EVBs) as the building blocks for semantic partitioning of large graphs. Second, VB-Partitioner utilizes vertex block grouping algorithms to place those vertex blocks that have high correlation in graph structure into the same partition. Third, VB-Partitioner employs a VB-partition guided query partitioning model to speed up the parallel processing of graph pattern queries by reducing the amount of inter-partition query processing. We conduct extensive experiments on several real-world graphs with millions of vertices and billions of edges. Our results show that VB-Partitioner significantly outperforms the popular random block-based data partitioner in terms of query latency and scalability over large-scale graphs.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131209322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 43