
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer
D. E. Shaw, J. P. Grossman, Joseph A. Bank, Brannon Batson, J. A. Butts, Jack C. Chao, Martin M. Deneroff, R. Dror, Amos Even, Christopher H. Fenton, Anthony Forte, Joseph Gagliardo, Gennette Gill, Brian Greskamp, R. Ho, D. Ierardi, Lev Iserovich, J. Kuskin, Richard H. Larson, T. Layman, L. Lee, Adam K. Lerer, Chester Li, Daniel Killebrew, Kenneth M. Mackenzie, Shark Yeuk-Hai Mok, Mark A. Moraes, Rolf Mueller, Lawrence J. Nociolo, Jon L. Peticolas, Terry Quan, D. Ramot, J. Salmon, D. Scarpazza, U. Schafer, Naseer Siddique, Christopher W. Snyder, Jochen Spengler, P. T. P. Tang, Michael Theobald, Horia Toma, Brian Towles, B. Vitale, Stanley C. Wang, C. Young
Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for fine-grained event-driven operation, which improves performance by increasing the overlap of computation with communication, and also allows a wider range of algorithms to run efficiently, enabling many new software-based optimizations. A 512-node Anton 2 machine, currently in operation, is up to ten times faster than Anton 1 with the same number of nodes, greatly expanding the reach of all-atom biomolecular simulations. Anton 2 is the first platform to achieve simulation rates of multiple microseconds of physical time per day for systems with millions of atoms. Demonstrating strong scaling, the machine simulates a standard 23,558-atom benchmark system at a rate of 85 μs/day -- 180 times faster than any commodity hardware platform or general-purpose supercomputer.
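As a quick back-of-the-envelope check on those headline rates, the arithmetic below (added for illustration; the inputs are the paper's reported numbers, everything else is derived) shows what 85 μs/day and a 180× edge imply for millisecond-scale simulations:

```python
# Illustrative arithmetic on the reported simulation rates.
anton2_rate = 85.0                    # us of physical time per day (23,558-atom benchmark)
speedup = 180.0                       # reported edge over the fastest general-purpose platform

commodity_rate = anton2_rate / speedup           # ~0.47 us/day implied elsewhere
days_per_ms_anton2 = 1000.0 / anton2_rate        # ~11.8 days to simulate 1 ms
days_per_ms_commodity = 1000.0 / commodity_rate  # ~2118 days, i.e. roughly 5.8 years

print(f"{commodity_rate:.2f} us/day; 1 ms takes {days_per_ms_anton2:.1f} "
      f"vs {days_per_ms_commodity:.0f} days")
```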
Citations: 455
Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications
Arash Ashari, N. Sedaghati, John Eisenlohr, S. Parthasarathy, P. Sadayappan
Sparse matrix-vector multiplication (SpMV) is a widely used computational kernel. The most commonly used format for a sparse matrix is CSR (Compressed Sparse Row), but a number of other representations have recently been developed that achieve higher SpMV performance. However, the alternative representations typically impose a significant preprocessing overhead. While a high preprocessing overhead can be amortized for applications requiring many iterative invocations of SpMV that use the same matrix, it is not always feasible -- for instance when analyzing large dynamically evolving graphs. This paper presents ACSR, an adaptive SpMV algorithm that uses the standard CSR format but reduces thread divergence by combining rows into groups (bins) which have a similar number of non-zero elements. Further, for rows in bins that span a wide range of non-zero counts, dynamic parallelism is leveraged. A significant benefit of ACSR over other proposed SpMV approaches is that it works directly with the standard CSR format, and thus avoids significant preprocessing overheads. A CUDA implementation of ACSR is shown to outperform SpMV implementations in the NVIDIA CUSP and cuSPARSE libraries on a set of sparse matrices representing power-law graphs. We also demonstrate the use of ACSR for the analysis of dynamic graphs, where the improvement over extant approaches is even higher.
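The core ACSR idea, grouping CSR rows by non-zero count so that threads assigned to the same group do similar work, can be sketched as follows. This is a minimal NumPy sketch of the binning step only, with invented bin edges; the paper's implementation is a CUDA kernel per bin plus dynamic parallelism for heavy rows:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def bin_rows_by_nnz(csr, edges=(1, 4, 16, 64, 1 << 30)):
    """Group row indices by non-zero count (ACSR's binning step): rows in the
    same bin do a similar amount of work, reducing thread divergence."""
    nnz = np.diff(csr.indptr)          # non-zeros per row, read directly from CSR
    bins, lo = [], 0
    for hi in edges:
        bins.append(np.where((nnz >= lo) & (nnz < hi))[0])
        lo = hi
    return bins

A = sparse_random(1000, 1000, density=0.01, format="csr")
for i, rows in enumerate(bin_rows_by_nnz(A)):
    print(f"bin {i}: {rows.size} rows")
```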
Citations: 135
Slim Fly: A Cost Effective Low-Diameter Network Topology
Maciej Besta, T. Hoefler
We introduce a high-performance cost-effective network topology called Slim Fly that approaches the theoretically optimal network diameter. Slim Fly is based on graphs that approximate the solution to the degree-diameter problem. We analyze Slim Fly and compare it to both traditional and state-of-the-art networks. Our analysis shows that Slim Fly has significant advantages over other topologies in latency, bandwidth, resiliency, cost, and power consumption. Finally, we propose deadlock-free routing schemes and physical layouts for large computing centres as well as a detailed cost and power model. Slim Fly enables constructing cost-effective and highly resilient data center and HPC networks that offer low latency and high bandwidth under different HPC workloads such as stencil or graph computations.
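For context on "approaches the theoretically optimal network diameter": the Moore bound caps how many routers a network of given radix and diameter can contain. The sketch below computes that standard bound (textbook graph theory, not code from the paper):

```python
def moore_bound(degree, diameter):
    """Maximum number of vertices in any graph with the given degree and
    diameter: 1 + d * sum_{i=0}^{D-1} (d-1)^i (the Moore bound)."""
    total, layer = 1, degree
    for _ in range(diameter):
        total += layer
        layer *= degree - 1
    return total

# A diameter-2 topology built from radix-32 routers can never exceed:
print(moore_bound(32, 2))   # 1025 routers; Slim Fly's graphs come close to this cap
```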
Citations: 243
Parallel Bayesian Network Structure Learning for Genome-Scale Gene Networks
Sanchit Misra, Md. Vasimuddin, K. Pamnany, Sriram P. Chockalingam, Yong Dong, Min Xie, M. Aluru, S. Aluru
Learning Bayesian networks is NP-hard. Even with recent progress in heuristic and parallel algorithms, modeling capabilities still fall short of the scale of the problems encountered. In this paper, we present a massively parallel method for Bayesian network structure learning, and demonstrate its capability by constructing genome-scale gene networks of the model plant Arabidopsis thaliana from over 168.5 million gene expression values. We report strong scaling efficiency of 75% and demonstrate scaling to 1.57 million cores of the Tianhe-2 supercomputer. Our results constitute increases of three and five orders of magnitude over previously published results in the scale of data analyzed and computations performed, respectively. We achieve this through algorithmic innovations, using efficient techniques to distribute work across all compute nodes, all available processors and coprocessors on each node, all available threads on each processor and coprocessor, and vectorization techniques to maximize single thread performance.
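The 75% strong-scaling figure uses the usual definition: efficiency is speedup divided by the increase in core count at fixed problem size. A small sketch with placeholder timings (the run times below are hypothetical, chosen only to land near the reported efficiency):

```python
def strong_scaling_efficiency(t_ref, p_ref, t_p, p):
    """Fixed problem size: efficiency = (t_ref / t_p) / (p / p_ref)."""
    return (t_ref / t_p) / (p / p_ref)

# Hypothetical timings: 1000 s on 1024 cores vs 1.0 s on ~1.37M cores
print(strong_scaling_efficiency(1000.0, 1024, 1.0, 1_370_000))  # ~0.75
```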
Citations: 16
Understanding the Effects of Communication and Coordination on Checkpointing at Scale
Kurt B. Ferreira, Patrick M. Widener, Scott Levy, D. Arnold, T. Hoefler
Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols designed to reduce log volumes to significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
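The cascading-delay effect is easy to reproduce in a toy model: stagger local checkpoints across processes and make each receive wait for its sender. The sketch below is a deliberately simplified stand-in for the paper's simulator, with made-up costs:

```python
import random

def simulate(nprocs=64, nsteps=200, ckpt_interval=20, ckpt_cost=0.5, work=1.0):
    """Toy model: lockstep computation with nearest-neighbor messages on a ring.
    Uncoordinated local checkpoints delay a process, and every receiver must
    wait for its senders, so one delay cascades across the machine."""
    clock = [0.0] * nprocs
    phase = [random.randrange(ckpt_interval) for _ in range(nprocs)]  # staggered
    for step in range(nsteps):
        clock = [c + work + (ckpt_cost if (step + phase[p]) % ckpt_interval == 0 else 0.0)
                 for p, c in enumerate(clock)]
        # a receive cannot complete before its neighbors reach the same step
        clock = [max(clock[p], clock[p - 1], clock[(p + 1) % nprocs])
                 for p in range(nprocs)]
    return max(clock)

baseline = 200 * 1.0 + (200 // 20) * 0.5   # what one isolated process would take
print(simulate(), "vs isolated", baseline)
```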
Citations: 24
MC-Checker: Detecting Memory Consistency Errors in MPI One-Sided Applications
Zhezhe Chen, James Dinan, Zhen Tang, P. Balaji, Hua Zhong, Jun Wei, Tao Huang, Feng Qin
One-sided communication decouples data movement and synchronization by providing support for asynchronous reads and updates of distributed shared data. While such interfaces can be extremely efficient, they also impose challenges in properly performing asynchronous accesses to shared data. This paper presents MC-Checker, a new tool that detects memory consistency errors in MPI one-sided applications. MC-Checker first performs online instrumentation and captures relevant dynamic events, such as one-sided communications and load/store operations. MC-Checker then performs analysis to detect memory consistency errors. When found, errors are reported along with useful diagnostic information. Experiments indicate that MC-Checker is effective at detecting and diagnosing memory consistency bugs in MPI one-sided applications, with low overhead, ranging from 24.6% to 71.1%, with an average of 45.2%.
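For context, the class of bug targeted here can be shown in a few lines of mpi4py: a remote Put and a local store touch the same window location in the same synchronization epoch, which MPI leaves undefined. This two-rank program illustrates the error pattern, not MC-Checker itself:

```python
# Run with: mpiexec -n 2 python rma_race.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(1, dtype="i")
win = MPI.Win.Create(buf, comm=comm)    # expose buf for one-sided access

win.Fence()                             # open an access/exposure epoch
if rank == 0:
    one = np.ones(1, dtype="i")
    win.Put(one, target_rank=1)         # asynchronous remote update of rank 1
elif rank == 1:
    buf[0] = 42                         # local store to the same location in the
                                        # same epoch: the memory consistency error
win.Fence()                             # close the epoch

if rank == 1:
    print(buf[0])                       # undefined: may print 1 or 42
win.Free()
```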
Citations: 21
Microbank: Architecting Through-Silicon Interposer-Based Main Memory Systems
Y. Son, O. Seongil, Hyunggyun Yang, Daejin Jung, Jung Ho Ahn, John Kim, Jangwoo Kim, Jae W. Lee
Through-Silicon Interposer (TSI) has recently been proposed to provide high memory bandwidth and improve energy efficiency of the main memory system. However, the impact of TSI on main memory system architecture has not been well explored. While TSI improves the I/O energy efficiency, we show that it results in an unbalanced memory system design in terms of energy efficiency as the core DRAM dominates overall energy consumption. To balance and enhance the energy efficiency of a TSI-based memory system, we propose μbank, a novel DRAM device organization in which each bank is partitioned into multiple smaller banks (or μbanks) that operate independently like conventional banks with minimal area overhead. The μbank organization significantly increases the amount of bank-level parallelism to improve the performance and energy efficiency of the TSI-based memory system. The massive number of μbanks reduces bank conflicts, hence simplifying the memory system design. We evaluated a sophisticated prediction-based DRAM page-management policy, which can improve performance by up to 20.5% in a conventional memory system without μbanks. However, a μbank-based design does not require such a complex page-management policy and a simple open-page policy is often sufficient -- achieving within 5% of a perfect predictor. Our proposed μbank-based memory system improves the IPC and system energy-delay product by 1.62× and 4.80×, respectively, for memory-intensive SPEC 2006 benchmarks on average, over the baseline DDR3-based memory system.
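Why many small banks reduce conflicts can be seen with a birthday-problem estimate: with r concurrent requests spread uniformly over B independent banks, the probability that none collide is the product of (1 - i/B) for i from 0 to r-1. A quick sketch under this simple uniform-traffic assumption (not the paper's evaluation methodology):

```python
def prob_no_conflict(requests, banks):
    """P(all concurrent requests land in distinct banks), uniform addressing."""
    p = 1.0
    for i in range(requests):
        p *= 1.0 - i / banks
    return p

for banks in (16, 64, 256, 1024):     # finer partitioning, as with smaller banks
    print(f"{banks:5d} banks: P(no conflict) = {prob_no_conflict(16, banks):.3f}")
```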
Citations: 21
A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters
M. Noack, Florian Wende, T. Steinke, F. Cordes
Standard offload programming models for the Xeon Phi, e.g. Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence a limited number of coprocessors. Scaling applications across a Xeon Phi cluster/supercomputer thus requires hybrid programming approaches, usually MPI+X. In this work, we present a framework based on heterogeneous active messages (HAM-Offload) that provides the means to offload work to local and remote (co)processors using a unified offload API. Since HAM-Offload provides similar primitives as current local offload frameworks, existing applications can be easily ported to overcome the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.
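The paper's central point is one offload call site regardless of where the target lives. Below is a conceptual Python sketch of that shape using local executor pools as stand-ins; HAM-Offload itself is C++ over active messages, and every name here is invented:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

class OffloadTarget:
    """Hypothetical unified handle: the same call site whether the 'device'
    is a local coprocessor or a remote node reached via active messages."""
    def __init__(self, executor):
        self._pool = executor

    def offload(self, fn, *args):
        return self._pool.submit(fn, *args)   # asynchronous: returns a future

def kernel(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    local = OffloadTarget(ThreadPoolExecutor(max_workers=4))    # intra-node
    remote = OffloadTarget(ProcessPoolExecutor(max_workers=2))  # inter-node stand-in
    futures = [t.offload(kernel, 100_000) for t in (local, remote)]
    print([f.result() for f in futures])      # identical API, different targets
```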
Citations: 16
Scaling MapReduce Vertically and Horizontally
I. El-Helw, Rutger F. H. Hofman, H. Bal
Glasswing is a MapReduce framework that uses OpenCL to exploit multi-core CPUs and accelerators. However, compute device capabilities may vary significantly and require targeted optimization. Similarly, the availability of resources such as memory, storage and interconnects can severely impact overall job performance. In this paper, we present and analyze how MapReduce applications can improve their horizontal and vertical scalability by using a well controlled mixture of coarse- and fine-grained parallelism. Specifically, we discuss the Glasswing pipeline and its ability to overlap computation, communication, memory transfers and disk access. Additionally, we show how Glasswing can adapt to the distinct capabilities of a variety of compute devices by employing fine-grained parallelism. We experimentally evaluated the performance of five MapReduce applications and show that Glasswing outperforms Hadoop on a 64-node multi-core CPU cluster by factors between 1.2 and 4, and factors from 20 to 30 on a 23-node GPU cluster. Similarly, we show that Glasswing is at least 1.5 times faster than GPMR on the GPU cluster.
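The overlap described here, fetching the next chunk while the current one is processed, is the classic bounded producer/consumer pipeline. A minimal threads-and-queue sketch of the idea follows (Glasswing does this with OpenCL command queues and several pipeline stages; this is only the skeleton):

```python
import threading, queue

def pipeline(chunks, map_fn, depth=2):
    """Overlap chunk production (an I/O stand-in) with computation on the chunks."""
    q = queue.Queue(maxsize=depth)       # bounded: producer runs only `depth` ahead

    def producer():
        for c in chunks:
            q.put(c)                     # stands in for a disk read / PCIe transfer
        q.put(None)                      # sentinel: no more work

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (c := q.get()) is not None:
        results.append(map_fn(c))        # compute here overlaps the next put above
    return results

print(pipeline(range(8), lambda c: c * c))
```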
Citations: 14
High-Performance Computation of Distributed-Memory Parallel 3D Voronoi and Delaunay Tessellation
T. Peterka, D. Morozov, C. L. Phillips
Computing a Voronoi or Delaunay tessellation from a set of points is a core part of the analysis of many simulated and measured datasets: N-body simulations, molecular dynamics codes, and LIDAR point clouds are just a few examples. Such computational geometry methods are common in data analysis and visualization, but as the scale of simulations and observations surpasses billions of particles, the existing serial and shared memory algorithms no longer suffice. A distributed-memory scalable parallel algorithm is the only feasible approach. The primary contribution of this paper is a new parallel Delaunay and Voronoi tessellation algorithm that automatically determines which neighbor points need to be exchanged among the subdomains of a spatial decomposition. Other contributions include periodic and wall boundary conditions, comparison of our method using two popular serial libraries, and application to numerous science datasets.
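A toy version of the decomposition problem: each block must pull in a halo of neighboring points before triangulating, or cells near its boundary come out wrong. The sketch below uses a fixed halo width for simplicity, whereas the paper's algorithm determines the required neighbor exchange automatically:

```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
pts = rng.random((2000, 3))                        # points in the unit cube

halo = 0.1                                         # fixed ghost-zone width (toy choice)
left_own = pts[pts[:, 0] < 0.5]                    # points this block owns
left_in  = pts[pts[:, 0] < 0.5 + halo]             # owned points plus ghosts
right_in = pts[pts[:, 0] >= 0.5 - halo]

tri_left, tri_right = Delaunay(left_in), Delaunay(right_in)
print(f"left block: {len(left_own)} owned, {len(left_in)} triangulated")
# Each block would keep only cells whose owning vertex lies inside its region.
```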
Citations: 30