Using a Complementary Emulation-Simulation Co-Design Approach to Assess Application Readiness for Processing-in-Memory Systems
George Stelle, Stephen L. Olivier, Dylan T. Stark, Arun Rodrigues, K. Hemmert
DOI: 10.1109/Co-HPC.2014.5
Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope with but thrive on these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to increase memory speed and capacity relative to existing technologies, and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.
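The abstract contrasts locality-aware and locality-oblivious execution without showing code. As a minimal sketch, the difference amounts to whether each block of data is placed in the memory of the PIM device that computes on it; the pim_* helpers below are hypothetical stand-ins for whatever allocation and dispatch interface the emulator actually exposes.

```c
#include <stdlib.h>

enum { NDEV = 16, BLOCK = 1 << 16 };

/* Hypothetical: allocate a block inside a given PIM device's memory stack. */
static double *pim_alloc_near(int dev) {
    (void)dev;                  /* a real runtime would pick the target stack */
    return calloc(BLOCK, sizeof(double));
}

/* Hypothetical: run a compute kernel on a given PIM device. */
static void pim_kernel(int dev, double *b) {
    (void)dev;
    for (int i = 0; i < BLOCK; i++) b[i] = 2.0 * b[i] + 1.0;
}

int main(void) {
    /* Locality-aware: each device computes on data placed in its own
     * memory stack, so no traffic crosses the chip-to-chip network. */
    for (int d = 0; d < NDEV; d++) {
        double *b = pim_alloc_near(d);
        pim_kernel(d, b);
        free(b);
    }

    /* Locality-oblivious: data is placed with no regard to which device
     * computes on it, so most accesses are remote. */
    for (int d = 0; d < NDEV; d++) {
        double *b = pim_alloc_near(rand() % NDEV);
        pim_kernel(d, b);
        free(b);
    }
    return 0;
}
```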
{"title":"Using a Complementary Emulation-Simulation Co-Design Approach to Assess Application Readiness for Processing-in-Memory Systems","authors":"George Stelle, Stephen L. Olivier, Dylan T. Stark, Arun Rodrigues, K. Hemmert","doi":"10.1109/Co-HPC.2014.5","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.5","url":null,"abstract":"Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope but thrive with these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129538611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular dynamics simulations are used extensively in science and engineering. Co-Design Molecular Dynamics (CoMD) is a proxy application that reflects the workload characteristics of production molecular dynamics software. In particular, CoMD is computationally intensive, with more than 90% of its execution time spent calculating inter-atomic force potentials. Hence, the application is an ideal candidate for acceleration with the Intel Xeon Phi, which offers high theoretical computational performance with low energy consumption. In this work, the kernel computing Embedded Atom Model (EAM) forces is adapted to use Intel Xeon Phi acceleration. Performance and energy are measured in experiments that vary thread affinity, thread count, problem size, node count, and the number of Xeon Phi coprocessors per node. Dynamic voltage and frequency scaling (DVFS) is used to reduce host-side power draw during the Xeon Phi-accelerated phases of the application. Test results are compared against the original (host-only) multithreaded implementation, and energy savings as high as 30% are observed.
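The abstract does not include the adapted kernel; the following is only a schematic of the offload pattern it describes, using the Intel compiler's Language Extensions for Offload pragmas that Xeon Phi ports of this era commonly relied on. The embedding function and array names are simplified placeholders, not CoMD's EAM code.

```c
/* Mark the function so a coprocessor-side version is compiled as well. */
__attribute__((target(mic)))
static double embedding_force(double rho) {
    return -2.0 * rho;           /* placeholder for the tabulated EAM term */
}

void eam_forces_offload(int natoms, const double *rho, double *force) {
    /* Copy rho in, run the loop on coprocessor 0 with its own OpenMP
     * threads, and copy the forces back. While this phase runs on the Phi
     * the host is nearly idle, which is what makes host-side DVFS pay off. */
    #pragma offload target(mic:0) in(rho : length(natoms)) \
                                  out(force : length(natoms))
    {
        #pragma omp parallel for
        for (int i = 0; i < natoms; i++)
            force[i] = embedding_force(rho[i]);
    }
}
```

Thread count and placement on the coprocessor would then be swept through the OpenMP runtime (e.g., OMP_NUM_THREADS and KMP_AFFINITY=compact|scatter|balanced), matching the thread-count and affinity parameters varied in the experiments.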
{"title":"Performance and Energy Evaluation of CoMD on Intel Xeon Phi Co-processors","authors":"Gary Lawson, M. Sosonkina, Yuzhong Shen","doi":"10.1109/Co-HPC.2014.12","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.12","url":null,"abstract":"Molecular dynamics simulations are used extensively in science and engineering. Co-Design Molecular Dynamics (CoMD) is a proxy application that reflects the workload characteristics of production molecular dynamics software. In particular, CoMD is computationally intensive with 90+% of execution time spent to calculate inter-atomic force potentials. Hence, this application is an ideal candidate for acceleration with the Intel Xeon Phi because it has high theoretical computational performance with low energy consumption. In this work, the kernel computing Embedded Atom model (EAM) forces is adapted to utilize the Intel Xeon Phi acceleration. Performance and energy are measured in the experiments that vary thread affinity, thread count, problem size, node count, and the number of Xeon Phi's per node. Dynamic voltage and frequency scaling (DVFS) is used to reduce host-side power draw during Xeon Phi accelerated phases of the application. Test results are compared against the original (host-only) implementation that uses multithreading, and energy savings as high as 30% are observed.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124481244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of the matrix operations in BCG, we identify the construction of new search direction matrices as the main performance bottleneck. Replacing the QR decomposition with an eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enable the BCG implementation to handle linear systems with large, sparse coefficient matrices that cannot fit in GPU memory. The hybrid scheme offloads matrix operations to the GPU while helping to hide the CPU-GPU memory transfer overhead. We compare the performance of our CPU-GPU implementation with that of a CPU implementation using Intel Xeon Phi coprocessors in automatic offload mode. With a sufficient number of right-hand sides, the CPU-GPU implementation of BCG reaches a speedup of 2.61 over the CPU-only implementation, significantly higher than that of the CPU-Intel Xeon Phi implementation.
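The abstract does not spell out the eigendecomposition step. A plausible reading, sketched below under that assumption, is the standard Gram-matrix construction: for a block of search directions P (n x s), form G = P^T P, eigendecompose the small s x s matrix G = Q L Q^T, and rescale P <- P Q L^{-1/2} so that P^T P = I. The two large GEMMs map well to the GPU, while the s x s eigensolve is cheap enough to stay on the CPU.

```c
#include <math.h>
#include <stdlib.h>
#include <cblas.h>
#include <lapacke.h>

void orthogonalize_block(int n, int s, double *P /* n x s, row-major */) {
    double *G = malloc((size_t)s * s * sizeof *G);
    double *w = malloc((size_t)s * sizeof *w);
    double *T = malloc((size_t)n * s * sizeof *T);

    /* G = P^T P: an s x s Gram matrix; for large n this GEMM is GPU-friendly. */
    cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, s, s, n,
                1.0, P, s, P, s, 0.0, G, s);

    /* G = Q diag(w) Q^T; on exit G holds the eigenvectors Q (small, cheap). */
    LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', s, G, s, w);

    /* Scale column j of Q by 1/sqrt(w[j]) to build Q L^{-1/2}. */
    for (int j = 0; j < s; j++) {
        double inv = 1.0 / sqrt(w[j]);
        for (int i = 0; i < s; i++) G[i * s + j] *= inv;
    }

    /* P <- P (Q L^{-1/2}): the second large GEMM. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, s, s,
                1.0, P, s, G, s, 0.0, T, s);
    for (size_t i = 0; i < (size_t)n * s; i++) P[i] = T[i];

    free(G); free(w); free(T);
}
```

The cost advantage is that the eigensolve touches only an s x s matrix (s = number of right-hand sides), whereas QR factorizes the full n x s block.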
{"title":"An Implementation of Block Conjugate Gradient Algorithm on CPU-GPU Processors","authors":"Hao Ji, M. Sosonkina, Yaohang Li","doi":"10.1109/Co-HPC.2014.10","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.10","url":null,"abstract":"In this paper, we investigate the implementation of the Block Conjugate Gradient (BCG) algorithm on CPU-GPU processors. By analyzing the performance of various matrix operations in BCG, we identify the main performance bottleneck in constructing new search direction matrices. Replacing the QR decomposition by eigendecomposition of a small matrix remedies the problem by reducing the computational cost of generating orthogonal search directions. Moreover, a hybrid (offload) computing scheme is designed to enables the BCG implementation to handle linear systems with large, sparse coefficient matrices that cannot fit in the GPU memory. The hybrid scheme offloads matrix operations to GPU processors while helps hide the CPU-GPU memory transaction overhead. We compare the performance of our BCG implementation with the one on CPU with Intel Xeon Phi coprocessors using the automatic offload mode. With sufficient number of right hand sides, the CPU-GPU implementation of BCG can reach speedup of 2.61 over the CPU-only implementation, which is significantly higher than that of the CPU-Intel Xeon Phi implementation.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121347490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current Monte Carlo neutron transport applications use continuous energy cross section data to provide the statistical foundation for particle trajectories. This "classical" algorithm requires storage and random access of very large data structures. Recently, Forget et al. [1] reported on a fundamentally new approach, based on multipole expansions, that distills cross section data down to a more abstract mathematical format. Their formulation greatly reduces memory storage and improves data locality at the cost of increased floating point computation. In the present study, we determine the hardware performance parameters, including power usage, of the multipole algorithm relative to the classical continuous energy algorithm, in order to gauge the suitability of both algorithms for next-generation high performance computing platforms.
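The abstract does not say how power was sampled; one common mechanism on recent Intel hardware is the RAPL package-energy counter exposed through the Linux powercap interface, sketched here. Bracketing a lookup kernel with two readings yields joules consumed; dividing by wall time yields average watts. The kernel body below is a placeholder workload, not the real cross section lookup.

```c
#include <stdio.h>

static long long read_energy_uj(void) {
    /* Package 0 energy in microjoules; counter wraparound is ignored here. */
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) { if (fscanf(f, "%lld", &uj) != 1) uj = -1; fclose(f); }
    return uj;
}

static double xs_lookup_kernel(void) {     /* stand-in for the real lookups */
    double s = 0.0;
    for (long i = 1; i < 50000000L; i++) s += 1.0 / (double)i;
    return s;
}

int main(void) {
    long long e0 = read_energy_uj();
    double sink = xs_lookup_kernel();
    long long e1 = read_energy_uj();
    printf("checksum %.6f, package energy %.3f J\n", sink, (e1 - e0) / 1e6);
    return 0;
}
```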
{"title":"Power Profiling of a Reduced Data Movement Algorithm for Neutron Cross Section Data in Monte Carlo Simulations","authors":"John R. Tramm, Kazutomo Yoshii, A. Siegel","doi":"10.1109/Co-HPC.2014.9","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.9","url":null,"abstract":"Current Monte Carlo neutron transport applications use continuous energy cross section data to provide the statistical foundation for particle trajectories. This \"classical\" algorithm requires storage and random access of very large data structures. Recently, Forget et al.[1] reported on a fundamentally new approach, based on multipole expansions, that distills cross section data down to a more abstract mathematical format. Their formulation greatly reduces memory storage and improves data locality at the cost of also increasing floating point computation. In the present study we determine the hardware performance parameters, including power usage, of the multipole algorithm relative to the classical continuous energy algorithm. This study is done to guage the suitability of both algorithms for use on next-generation high performance computing platforms.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128335247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Sreepathi, M. Grodowitz, Robert V. Lim, Philip Taffet, P. Roth, J. Meredith, Seyong Lee, Dong Li, J. Vetter
Characterizing the behavior of a scientific application and its associated proxy application is essential for determining whether the proxy application actually does mimic the full application. To support our ongoing characterization activities, we have developed the Oxbow toolkit and an associated data store infrastructure for collecting, storing, and querying this characterization information. This paper presents recent updates to the Oxbow toolkit and introduces the Oxbow project's Performance Analytics Data Store (PADS). To demonstrate the insights made possible by the toolkit and data store, we compare the characterizations of several full and proxy applications, along with the High Performance Linpack (HPL) and High Performance Conjugate Gradient (HPCG) benchmarks. Using techniques such as cluster visualizations of PADS data across many experiments, we found unexpected similarities and differences between proxy applications, and a greater similarity of proxy applications to HPCG than to HPL along many dimensions.
{"title":"Application Characterization Using Oxbow Toolkit and PADS Infrastructure","authors":"S. Sreepathi, M. Grodowitz, Robert V. Lim, Philip Taffet, P. Roth, J. Meredith, Seyong Lee, Dong Li, J. Vetter","doi":"10.1109/Co-HPC.2014.11","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.11","url":null,"abstract":"Characterizing the behavior of a scientific application and its associated proxy application is essential for determining whether the proxy application actually does mimic the full application. To support our ongoing characterization activities, we have developed the Oxbow toolkit and an associated data store infrastructure for collecting, storing, and querying this characterization information. This paper presents recent updates to the Oxbow toolkit and introduces the Oxbow project's Performance Analytics Data Store (PADS). To demonstrate the possible insights when using the toolkit and data store, we compare the characterizations of several full and proxy applications, along with the High Performance Linpack (HPL) and High Performance Conjugate Gradient (HPCG) benchmarks. Using techniques such as cluster visualizations of PADS data across many experiments, we found that the results show unexpected similarities and differences between proxy applications, and a greater similarity of proxy applications to HPCG than to HPL along many dimensions.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133311722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Ang, R. Barrett, R. Benner, D. Burke, Cy Chan, Jeanine E. Cook, D. Donofrio, S. Hammond, K. Hemmert, Suzanne M. Kelly, H. Le, V. Leung, D. Resnick, Arun Rodrigues, J. Shalf, Dylan T. Stark, D. Unat, N. Wright
To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they allow for application performance analysis, system software development, and exploration of hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
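The report's models are prose and diagrams, but the idea of a proxy architecture, an abstract machine model plus concrete parameters, can be captured as a plain parameter record like the illustrative sketch below. Every number is a made-up placeholder, not a value from the report.

```c
/* Parameter record for a hypothetical heterogeneous exascale-era node. */
typedef struct {
    int    ncores_throughput;        /* many slow, energy-efficient cores  */
    int    ncores_latency;           /* few fat, latency-optimized cores   */
    double ghz_throughput, ghz_latency;
    double dram_gb, stacked_gb;      /* capacity: off- vs in-package       */
    double dram_gbs, stacked_gbs;    /* bandwidth of each memory level     */
    double nic_gbs;                  /* injection bandwidth to the network */
} proxy_arch_t;

/* One made-up instantiation an application team might reason about. */
static const proxy_arch_t example_node = {
    .ncores_throughput = 256, .ncores_latency = 8,
    .ghz_throughput = 1.0,    .ghz_latency = 2.5,
    .dram_gb  = 128.0, .stacked_gb  = 16.0,
    .dram_gbs = 90.0,  .stacked_gbs = 1000.0,
    .nic_gbs  = 25.0,
};
```

Even this crude record supports back-of-envelope co-design questions, such as whether a kernel's working set fits in stacked_gb or whether its arithmetic intensity can keep the throughput cores busy at dram_gbs.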
{"title":"Abstract Machine Models and Proxy Architectures for Exascale Computing","authors":"J. Ang, R. Barrett, R. Benner, D. Burke, Cy Chan, Jeanine E. Cook, D. Donofrio, S. Hammond, K. Hemmert, Suzanne M. Kelly, H. Le, V. Leung, D. Resnick, Arun Rodrigues, J. Shalf, Dylan T. Stark, D. Unat, N. Wright","doi":"10.1109/Co-HPC.2014.4","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.4","url":null,"abstract":"To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114768095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael F. Cloutier, Chad Paradis, Vincent M. Weaver
A growing number of supercomputers are being built using processors with low-power embedded ancestry, rather than traditional high-performance cores. In order to evaluate this approach we investigate the energy and performance tradeoffs found with ten different 32-bit ARM development boards while running the HPL Linpack and STREAM benchmarks. Based on these results (and other practical concerns) we chose the Raspberry Pi as a basis for a power-aware embedded cluster computing testbed. Each node of the cluster is instrumented with power measurement circuitry so that detailed cluster-wide power measurements can be obtained, enabling power/performance co-design experiments. While our cluster lags recent x86 machines in performance, the power, visualization, and thermal features make it an excellent low-cost platform for education and experimentation.
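As a worked example of the figure of merit such a testbed produces, the sketch below combines cluster-wide power samples with an HPL result to get FLOPS per watt. The numbers are invented for illustration, not measurements from the paper.

```c
#include <stdio.h>

int main(void) {
    double gflops = 1.14;       /* hypothetical HPL result for the cluster */
    double samples_w[] = { 23.8, 24.1, 23.9, 24.3 };  /* node-summed power */
    int n = sizeof samples_w / sizeof samples_w[0];

    double avg_w = 0.0;         /* average draw over the benchmark run */
    for (int i = 0; i < n; i++) avg_w += samples_w[i];
    avg_w /= n;

    printf("%.2f GFLOPS at %.1f W -> %.1f MFLOPS/W\n",
           gflops, avg_w, 1000.0 * gflops / avg_w);
    return 0;
}
```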
{"title":"Design and Analysis of a 32-bit Embedded High-Performance Cluster Optimized for Energy and Performance","authors":"Michael F. Cloutier, Chad Paradis, Vincent M. Weaver","doi":"10.1109/Co-HPC.2014.7","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.7","url":null,"abstract":"A growing number of supercomputers are being built using processors with low-power embedded ancestry, rather than traditional high-performance cores. In order to evaluate this approach we investigate the energy and performance tradeoffs found with ten different 32-bit ARM development boards while running the HPL Linpack and STREAM benchmarks.Based on these results (and other practical concerns) we chose the Raspberry Pi as a basis for a power-aware embedded cluster computing testbed. Each node of the cluster is instrumented with power measurement circuitry so that detailed cluster-wide power measurements can be obtained, enabling power / performance co-design experiments.While our cluster lags recent x86 machines in performance, the power, visualization, and thermal features make it an excellent low-cost platform for education and experimentation.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114807802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynamics problems of interest in astrophysics, including supersonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have co-evolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applications of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.
{"title":"mPPM, Viewed as a Co-Design Effort","authors":"P. Woodward, J. Jayaraj, R. Barrett","doi":"10.1109/Co-HPC.2014.13","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.13","url":null,"abstract":"The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynam-ics problems of interest in astrophysics, including super-sonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have co-evolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applica-tions of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115491302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mitesh R. Meswani, G. Loh, S. Blagodurov, D. Roberts, John Slice, Mike Ignatowski
Future exascale systems will require very aggressive memory systems simultaneously delivering huge storage capacities and multi-TB/s bandwidths. To achieve the bandwidth targets, in-package, die-stacked memory technologies will likely be necessary. However, these integrated memories do not provide enough capacity to meet the overall per-node memory size requirements. As a result, conventional off-package memory (e.g., DIMMs) will still be needed. This creates a "two-level memory" (TLM) organization, in which a portion of the machine's memory space provides high bandwidth and the remainder provides capacity at a lower level of performance. Effective use of such a heterogeneous memory organization may require co-design of software applications along with advancements in memory architecture. In this paper, we explore the efficacy of programmer-driven approaches to managing a TLM system, using three exascale proxy applications as case studies.
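A minimal sketch of what "programmer-driven" management of a two-level memory could look like follows. The tlm_alloc interface is hypothetical (the paper predates widely used placement APIs such as memkind); the point is that the programmer, not the hardware, decides which arrays earn scarce in-package bandwidth.

```c
#include <stdlib.h>

typedef enum { TLM_FAST, TLM_CAPACITY } tlm_level_t;

/* Hypothetical placement-aware allocator: fast in-package memory is scarce,
 * so a real runtime would fall back to off-package DIMMs when it runs out.
 * This stub just forwards to malloc. */
static void *tlm_alloc(size_t bytes, tlm_level_t level) {
    (void)level;               /* a real runtime would pick the NUMA target */
    return malloc(bytes);
}

int main(void) {
    size_t n = 1 << 20;
    /* Streamed, bandwidth-bound state goes in the fast level... */
    double *fields  = tlm_alloc(n * sizeof(double), TLM_FAST);
    /* ...while rarely touched staging buffers live in capacity memory. */
    double *history = tlm_alloc(16 * n * sizeof(double), TLM_CAPACITY);
    free(fields); free(history);
    return 0;
}
```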
{"title":"Toward Efficient Programmer-Managed Two-Level Memory Hierarchies in Exascale Computers","authors":"Mitesh R. Meswani, G. Loh, S. Blagodurov, D. Roberts, John Slice, Mike Ignatowski","doi":"10.1109/Co-HPC.2014.8","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.8","url":null,"abstract":"Future exascale systems will require very aggressive memory systems simultaneously delivering huge storage capacities and multi-TB/s bandwidths. To achieve the bandwidth targets, in-package, die-stacked memory technologies will likely be necessary. However, these integrated memories do not provide enough capacity to achieve the overall per-node memory size requirements. As a result, conventional off-package memory (e.g., DIMMs) will still be needed. This creates a \"two-level memory\" (TLM) organization where a portion of the machine's memory space provides high bandwidth, and the remainder provides capacity at a lower level of performance. Effective use of such a heterogeneous memory organization may require the co-design of the software applications along with the advancements in memory architecture. In this paper, we explore the efficacy of programmer-driven approaches to managing a TLM system, using three Exascale proxy applications as case studies.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128373595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exascale systems will have many-core nodes, less memory capacity per core than today's systems, and a large degree of performance variability between cores. All these conditions challenge bulk synchronous SPMD models, in which execution is typically synchronous and communication is based on buffers and ghost regions. We explore the design of a multithreaded MD code to evaluate several tradeoffs that arise when converting an MPI application into a hybrid multithreaded application, to address the aforementioned constraints of future architectures. Using OpenMP and PThreads, we implemented several variants of CoMD, a molecular dynamics proxy application. We found that in CoMD, duplicating some of the work to avoid race conditions is an easier and more scalable solution than using atomic updates; that data allocation and placement can be controlled to some extent with a hybrid MPI+threads approach, though an explicit NUMA API to control locality may be desirable; and finally that dynamically scheduling the work within a process can mitigate the impact of performance variability among cores and preserve most of the performance, especially when compared to bulk synchronous implementations such as the MPI reference.
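The duplication-versus-atomics tradeoff reported above can be made concrete with a simplified pair-force loop. This is a sketch, not CoMD's code (which uses cell lists rather than this all-pairs form, and a real force law rather than the placeholder expression); the dynamic schedule echoes the third finding about tolerating per-core performance variability.

```c
#include <math.h>

/* Option A: exploit Newton's third law and visit each pair once. Every
 * atom's force is updated by many threads, so both updates need atomics.
 * Assumes force[] is zeroed by the caller. */
void forces_atomics(int n, const double *x, double *force) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double f = 1.0 / (1e-9 + fabs(x[j] - x[i]));
            #pragma omp atomic
            force[i] += f;
            #pragma omp atomic
            force[j] -= f;
        }
    }
}

/* Option B: duplicate the pair computation so each thread writes only the
 * atoms it owns -- roughly twice the flops, but no synchronization at all.
 * This is the "duplicated work" alternative the study found easier to get
 * right and more scalable than atomic updates. */
void forces_duplicated(int n, const double *x, double *force) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; i++) {
        double fi = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double f = 1.0 / (1e-9 + fabs(x[j] - x[i]));
            fi += (j > i) ? f : -f;   /* sign matches option A's pair update */
        }
        force[i] = fi;
    }
}
```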
{"title":"An Evaluation of Threaded Models for a Classical MD Proxy Application","authors":"Pietro Cicotti, S. Mniszewski, L. Carrington","doi":"10.1109/Co-HPC.2014.6","DOIUrl":"https://doi.org/10.1109/Co-HPC.2014.6","url":null,"abstract":"Exascale systems will have many-core nodes, less memory capacity per core than today's systems, and a large degree of performance variability between cores. All these conditions challenge bulk synchronous SPMD models in which execution is typically synchronous and communication is based on buffers and ghost regions.We explore the design of a multithreaded MD code to evaluate several tradeoffs that arise when converting an MPI application into a hybrid multithreaded application, to address the aforementioned constraints of future architectures.Using OpenMP and PThreads, we implemented several variants of CoMD, a molecular dynamics proxy application. We found that in CoMD, duplicating some of the work to avoid race conditions is an easier and more scalable solution than using atomic updates; that data allocation and placement can be controlled to some extent with a hybrid MPI+threads approach, though an explicit NUMA API to control locality may be desirable; and finally that dynamically scheduling the work within a process can mitigate the impact of performance variability among cores and preserve most of the performance, especially when compared to bulk synchronous implementations such as the MPI reference.","PeriodicalId":136638,"journal":{"name":"2014 Hardware-Software Co-Design for High Performance Computing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126861534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}