
2014 Hardware-Software Co-Design for High Performance Computing: Latest Publications

Using a Complementary Emulation-Simulation Co-Design Approach to Assess Application Readiness for Processing-in-Memory Systems
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.5
George Stelle, Stephen L. Olivier, Dylan T. Stark, Arun Rodrigues, K. Hemmert
Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope with these changes but thrive on them. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.
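The locality contrast at the center of the study is easy to picture. Below is a minimal sketch, with a hypothetical vault count and layout functions rather than anything from the paper's emulator: under a round-robin, locality-oblivious layout nearly every neighbor access crosses PIM vaults, while a blocked, locality-aware layout keeps neighbors in the same vault.

```c
#include <stdio.h>

/* Hypothetical PIM organization: N array elements spread across NVAULTS
 * memory vaults, each with its own near-memory compute. */
#define NVAULTS 16
#define N (1 << 20)

/* Locality-oblivious: round-robin assignment, so a stencil touching
 * neighbors i-1 and i crosses vaults on almost every access. */
static int vault_oblivious(int i) { return i % NVAULTS; }

/* Locality-aware: contiguous blocks per vault, so neighbor accesses
 * leave the local vault only at block boundaries. */
static int vault_aware(int i) { return i / (N / NVAULTS); }

int main(void) {
    long cross_oblivious = 0, cross_aware = 0;
    for (int i = 1; i < N; i++) {
        if (vault_oblivious(i) != vault_oblivious(i - 1)) cross_oblivious++;
        if (vault_aware(i) != vault_aware(i - 1)) cross_aware++;
    }
    printf("vault-crossing neighbor pairs: oblivious=%ld aware=%ld\n",
           cross_oblivious, cross_aware);
    return 0;
}
```

On this toy layout the oblivious scheme crosses vaults on every neighbor pair while the blocked scheme crosses only 15 times, a caricature of the data-movement gap the emulation and simulation studies measure on real kernels.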
Citations: 6
Performance and Energy Evaluation of CoMD on Intel Xeon Phi Co-processors
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.12
Gary Lawson, M. Sosonkina, Yuzhong Shen
Molecular dynamics simulations are used extensively in science and engineering. Co-Design Molecular Dynamics (CoMD) is a proxy application that reflects the workload characteristics of production molecular dynamics software. In particular, CoMD is computationally intensive, with more than 90% of execution time spent calculating inter-atomic force potentials. Hence, this application is an ideal candidate for acceleration with the Intel Xeon Phi, which offers high theoretical computational performance at low energy consumption. In this work, the kernel computing Embedded Atom Model (EAM) forces is adapted to utilize Intel Xeon Phi acceleration. Performance and energy are measured in experiments that vary thread affinity, thread count, problem size, node count, and the number of Xeon Phis per node. Dynamic voltage and frequency scaling (DVFS) is used to reduce host-side power draw during the Xeon Phi-accelerated phases of the application. Test results are compared against the original (host-only) multithreaded implementation, and energy savings as high as 30% are observed.
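Host-side throttling of this kind can be driven through the standard Linux cpufreq sysfs files; the sketch below shows the mechanism with illustrative frequency values, and is not necessarily the DVFS path the authors used.

```c
#include <stdio.h>

/* Cap a host core's frequency through the Linux cpufreq sysfs interface
 * while work is offloaded to the coprocessor, then restore it.
 * Requires root and a cpufreq-capable driver; values are illustrative. */
static int set_max_freq_khz(int cpu, long khz) {
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%ld", khz);
    return fclose(f);
}

int main(void) {
    set_max_freq_khz(0, 1200000);  /* throttle the host before the offload */
    /* ... run the EAM force kernel on the Xeon Phi here ... */
    set_max_freq_khz(0, 2600000);  /* restore once the offload returns */
    return 0;
}
```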
Citations: 9
An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.10
Hao Ji, M. Sosonkina, Yaohang Li
In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of various matrix operations in BCG, we identify the main performance bottleneck as the construction of new search direction matrices. Replacing the QR decomposition with the eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enable the BCG implementation to handle linear systems whose large, sparse coefficient matrices cannot fit in the GPU memory. The hybrid scheme offloads matrix operations to GPU processors while helping to hide the CPU-GPU memory transaction overhead. We compare the performance of our BCG implementation with a CPU implementation that uses Intel Xeon Phi coprocessors in automatic offload mode. With a sufficient number of right-hand sides, the CPU-GPU implementation of BCG reaches a speedup of 2.61 over the CPU-only implementation, significantly higher than that of the CPU-Intel Xeon Phi implementation.
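The small-matrix trick can be written out explicitly. The following is a standard Löwdin-style construction consistent with the abstract, though the exact matrix the authors decompose may differ: for a block of search directions P with n rows and k columns, n much larger than k, only a k-by-k symmetric eigenproblem is needed, and the two tall operations are dense products that map well onto the GPU.

```latex
\begin{align*}
M &= P^{\top} P \in \mathbb{R}^{k \times k}
     && \text{(one tall GEMM, on the GPU)} \\
M &= V \Lambda V^{\top}
     && \text{(small symmetric eigenproblem, on the CPU)} \\
\tilde{P} &= P V \Lambda^{-1/2}
     && \text{(one more tall GEMM, on the GPU)} \\
\tilde{P}^{\top} \tilde{P}
  &= \Lambda^{-1/2} V^{\top} (P^{\top} P) V \Lambda^{-1/2}
   = \Lambda^{-1/2} \Lambda \Lambda^{-1/2} = I
\end{align*}
```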
Citations: 6
Power Profiling of a Reduced Data Movement Algorithm for Neutron Cross Section Data in Monte Carlo Simulations
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.9
John R. Tramm, Kazutomo Yoshii, A. Siegel
Current Monte Carlo neutron transport applications use continuous energy cross section data to provide the statistical foundation for particle trajectories. This "classical" algorithm requires storage and random access of very large data structures. Recently, Forget et al. [1] reported on a fundamentally new approach, based on multipole expansions, that distills cross section data down to a more abstract mathematical format. Their formulation greatly reduces memory storage and improves data locality at the cost of increased floating point computation. In the present study we determine the hardware performance parameters, including power usage, of the multipole algorithm relative to the classical continuous energy algorithm. This study gauges the suitability of both algorithms for use on next-generation high performance computing platforms.
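For context, the classical algorithm's hot path is a search-and-interpolate over a very large sorted energy grid; a minimal sketch (one nuclide, one reaction channel) shows why it is dominated by random memory access rather than arithmetic. The multipole formulation replaces this table walk with the evaluation of a small number of pole terms, trading memory traffic for floating point work.

```c
#include <stddef.h>

/* Classical continuous-energy lookup: binary search over a large sorted
 * energy grid, then linear interpolation. The random accesses into these
 * big tables are what make the baseline memory-bound. */
double xs_lookup(const double *egrid, const double *xs, size_t n, double E) {
    size_t lo = 0, hi = n - 1;        /* assumes egrid[0] <= E <= egrid[n-1] */
    while (hi - lo > 1) {             /* binary search for the bracketing pair */
        size_t mid = lo + (hi - lo) / 2;
        if (egrid[mid] > E) hi = mid; else lo = mid;
    }
    double f = (E - egrid[lo]) / (egrid[hi] - egrid[lo]);
    return xs[lo] + f * (xs[hi] - xs[lo]);   /* interpolated cross section */
}
```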
Citations: 2
Application Characterization Using Oxbow Toolkit and PADS Infrastructure
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.11
S. Sreepathi, M. Grodowitz, Robert V. Lim, Philip Taffet, P. Roth, J. Meredith, Seyong Lee, Dong Li, J. Vetter
Characterizing the behavior of a scientific application and its associated proxy application is essential for determining whether the proxy application actually mimics the full application. To support our ongoing characterization activities, we have developed the Oxbow toolkit and an associated data store infrastructure for collecting, storing, and querying this characterization information. This paper presents recent updates to the Oxbow toolkit and introduces the Oxbow project's Performance Analytics Data Store (PADS). To demonstrate the insights the toolkit and data store make possible, we compare the characterizations of several full and proxy applications, along with the High Performance Linpack (HPL) and High Performance Conjugate Gradient (HPCG) benchmarks. Using techniques such as cluster visualizations of PADS data across many experiments, we found unexpected similarities and differences between proxy applications, and a greater similarity of proxy applications to HPCG than to HPL along many dimensions.
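As a toy illustration of the kind of comparison such characterization supports (this is not Oxbow's data format or its actual metric), each application can be reduced to a feature vector, here an instruction mix, and compared numerically before clustering:

```c
#include <math.h>
#include <stdio.h>

/* Cosine similarity between two instruction-mix vectors: one simple
 * building block for clustering application characterizations. */
static double cosine(const double *a, const double *b, int n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void) {
    /* made-up fractions of {load, store, fp, branch, other} instructions */
    double full_app[] = {0.35, 0.15, 0.30, 0.10, 0.10};
    double proxy[]    = {0.33, 0.14, 0.34, 0.09, 0.10};
    printf("proxy vs. full-app similarity: %.3f\n", cosine(full_app, proxy, 5));
    return 0;
}
```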
Citations: 14
Abstract Machine Models and Proxy Architectures for Exascale Computing
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.4
J. Ang, R. Barrett, R. Benner, D. Burke, Cy Chan, Jeanine E. Cook, D. Donofrio, S. Hammond, K. Hemmert, Suzanne M. Kelly, H. Le, V. Leung, D. Resnick, Arun Rodrigues, J. Shalf, Dylan T. Stark, D. Unat, N. Wright
To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to application developers and system software only those aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
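A proxy architecture can be thought of as an abstract machine model with numbers attached. The sketch below is a deliberately tiny example of the idea; every parameter value is a placeholder, not a prediction for any real machine.

```c
#include <stdio.h>

/* A toy proxy architecture: a handful of named parameters that hardware
 * architects and application developers can argue about concretely. */
typedef struct {
    double cores;           /* cores per node                  */
    double gflops_per_core; /* peak GF/s per core              */
    double mem_bw_gbs;      /* sustained GB/s from main memory */
} proxy_arch;

/* Roofline-style lower bound on kernel time from work and traffic. */
static double est_time(const proxy_arch *m, double gflop, double gbyte) {
    double t_compute = gflop / (m->cores * m->gflops_per_core);
    double t_memory  = gbyte / m->mem_bw_gbs;
    return t_compute > t_memory ? t_compute : t_memory;
}

int main(void) {
    proxy_arch node = { 64.0, 16.0, 400.0 };   /* placeholder values */
    /* 500 GF over 2000 GB of traffic: bandwidth-bound on this model */
    printf("estimated kernel time: %.2f s\n", est_time(&node, 500.0, 2000.0));
    return 0;
}
```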
Citations: 64
Design and Analysis of a 32-bit Embedded High-Performance Cluster Optimized for Energy and Performance
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.7
Michael F. Cloutier, Chad Paradis, Vincent M. Weaver
A growing number of supercomputers are being built using processors with low-power embedded ancestry, rather than traditional high-performance cores. In order to evaluate this approach we investigate the energy and performance tradeoffs found with ten different 32-bit ARM development boards while running the HPL Linpack and STREAM benchmarks. Based on these results (and other practical concerns) we chose the Raspberry Pi as the basis for a power-aware embedded cluster computing testbed. Each node of the cluster is instrumented with power measurement circuitry so that detailed cluster-wide power measurements can be obtained, enabling power/performance co-design experiments. While our cluster lags recent x86 machines in performance, its power, visualization, and thermal features make it an excellent low-cost platform for education and experimentation.
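The bandwidth half of the comparison rests on the STREAM triad kernel, a[i] = b[i] + q*c[i]. A stripped-down probe in that spirit, not the official STREAM benchmark, looks like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L   /* 10M doubles per array: ~240 MB total */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    const double q = 3.0;
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];   /* triad: 24 bytes of traffic per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("triad bandwidth: %.2f GB/s\n", 3.0 * N * sizeof(double) / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```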
Citations: 37
mPPM, Viewed as a Co-Design Effort
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.13
P. Woodward, J. Jayaraj, R. Barrett
The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynamics problems of interest in astrophysics, including supersonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have co-evolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applications of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.
Citations: 3
Toward Efficient Programmer-Managed Two-Level Memory Hierarchies in Exascale Computers
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.8
Mitesh R. Meswani, G. Loh, S. Blagodurov, D. Roberts, John Slice, Mike Ignatowski
Future exascale systems will require very aggressive memory systems simultaneously delivering huge storage capacities and multi-TB/s bandwidths. To achieve the bandwidth targets, in-package, die-stacked memory technologies will likely be necessary. However, these integrated memories do not provide enough capacity to achieve the overall per-node memory size requirements. As a result, conventional off-package memory (e.g., DIMMs) will still be needed. This creates a "two-level memory" (TLM) organization where a portion of the machine's memory space provides high bandwidth, and the remainder provides capacity at a lower level of performance. Effective use of such a heterogeneous memory organization may require the co-design of the software applications along with the advancements in memory architecture. In this paper, we explore the efficacy of programmer-driven approaches to managing a TLM system, using three Exascale proxy applications as case studies.
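One concrete way to express the placement decision the paper studies is a dual-allocator scheme. The sketch below uses the memkind library, which postdates this paper, purely as a stand-in for a programmer-managed TLM API; it is not the interface the authors evaluate. Build with -lmemkind on a system that exposes high-bandwidth memory.

```c
#include <memkind.h>  /* high-bandwidth-memory allocator; a later, real
                         library used here only to illustrate the idea */
#include <stdlib.h>

int main(void) {
    size_t hot_n  = 1u << 20;   /* bandwidth-critical working set  */
    size_t cold_n = 1u << 24;   /* large, rarely touched bulk data */

    /* Hot data goes to in-package, high-bandwidth memory... */
    double *hot = memkind_malloc(MEMKIND_HBW, hot_n * sizeof *hot);
    /* ...while capacity data stays in conventional off-package DRAM. */
    double *cold = malloc(cold_n * sizeof *cold);
    if (!hot || !cold) return 1;

    for (size_t i = 0; i < hot_n; i++)
        hot[i] = 0.0;           /* streamed repeatedly in the fast tier */

    memkind_free(MEMKIND_HBW, hot);
    free(cold);
    return 0;
}
```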
Citations: 26
An Evaluation of Threaded Models for a Classical MD Proxy Application
Pub Date: 2014-11-16 | DOI: 10.1109/Co-HPC.2014.6
Pietro Cicotti, S. Mniszewski, L. Carrington
Exascale systems will have many-core nodes, less memory capacity per core than today's systems, and a large degree of performance variability between cores. All these conditions challenge bulk synchronous SPMD models in which execution is typically synchronous and communication is based on buffers and ghost regions. We explore the design of a multithreaded MD code to evaluate several tradeoffs that arise when converting an MPI application into a hybrid multithreaded application, to address the aforementioned constraints of future architectures. Using OpenMP and PThreads, we implemented several variants of CoMD, a molecular dynamics proxy application. We found that in CoMD, duplicating some of the work to avoid race conditions is an easier and more scalable solution than using atomic updates; that data allocation and placement can be controlled to some extent with a hybrid MPI+threads approach, though an explicit NUMA API to control locality may be desirable; and finally that dynamically scheduling the work within a process can mitigate the impact of performance variability among cores and preserve most of the performance, especially when compared to bulk synchronous implementations such as the MPI reference.
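The duplication-versus-atomics finding is easy to make concrete with a schematic OpenMP force loop. In the sketch below, pair_force is a stand-in for the real interaction and is antisymmetric, so both variants accumulate the same totals; only the synchronization strategy differs.

```c
#include <string.h>

/* Schematic, antisymmetric pair interaction:
 * pair_force(i, j) == -pair_force(j, i). */
static double pair_force(int i, int j) { return (double)(i - j) * 1e-6; }

/* (a) Atomic updates: each pair is computed once, but the conflicting
 * writes into f[] must be serialized with atomics. */
void accumulate_atomic(double *f, int n) {
    memset(f, 0, (size_t)n * sizeof *f);
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double fij = pair_force(i, j);
            #pragma omp atomic
            f[i] += fij;
            #pragma omp atomic
            f[j] -= fij;
        }
}

/* (b) Duplicated work: every pair is computed twice, but each f[i] is
 * written by exactly one thread, so no synchronization is needed. */
void accumulate_duplicated(double *f, int n) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        double fi = 0.0;
        for (int j = 0; j < n; j++)
            if (j != i) fi += pair_force(i, j);
        f[i] = fi;
    }
}
```

Variant (b) spends extra flops to buy contention-free writes, the tradeoff the paper found easier and more scalable in CoMD.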
Citations: 14