One of the most fascinating issues in modern condensed matter physics is to understand strongly-correlated electronic structures and to propose novel device designs based on them for a future of reduced carbon-dioxide emissions. Among the numerical approaches developed for strongly-correlated electrons, the density matrix renormalization group (DMRG) is widely accepted as the most promising scheme, compared to Monte Carlo and exact diagonalization, in terms of accuracy and accessible system size. In fact, DMRG resolves long one-dimensional chain-like quantum systems almost perfectly. In this paper, we suggest an extension of the approach toward higher-dimensional systems using high-performance computing techniques. The computational core of DMRG is the diagonalization of a huge, non-uniform sparse matrix. To parallelize this part efficiently, we implement communication step doubling together with reuse of the mid-point data between the two doubled steps, avoiding the severe bottleneck of the all-to-all communication that the diagonalization requires. The technique remains effective on clusters with more than 1000 cores and offers a reliable way to explore two-dimensional strongly-correlated systems.
{"title":"Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D Quantum strongly-correlated systems","authors":"S. Yamada, Toshiyuki Imamura, M. Machida","doi":"10.1145/2063384.2063467","DOIUrl":"https://doi.org/10.1145/2063384.2063467","url":null,"abstract":"One of the most fascinating issues in modern condensed matter physics is to understand highly-correlated electronic structures and propose their novel device designs toward the reduced carbon-dioxide future. Among various developed numerical approaches for highly-correlated electrons, the density matrix renormalization group (DMRG) has been widely accepted as the most promising numerical scheme compared to Monte Carlo and exact diagonalization in terms of accuracy and accessible system size. In fact, DMRG almost perfectly resolves one-dimensional chain like long quantum systems. In this paper, we suggest its extended approach toward higher-dimensional systems by high-performance computing techniques. The computing target in DMRG is a huge non-uniform sparse matrix diagonalization. In order to efficiently parallelize the part, we implement communication step doubling together with reuse of the mid-point data between the doubled two steps to avoid severe bottleneck of all-to-all communications essential for the diagonalization. The technique is successful even for clusters composed of more than 1000 cores and offers a trustworthy exploration way for two-dimensional highly-correlated systems.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"265 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123107271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang
Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, even though concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time, reducing completion time while maintaining server utilization and fairness. A window-wide coordination concept is introduced for this purpose. We present the proposed I/O coordination algorithm together with an analysis of its average completion time. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46% and provide higher I/O bandwidth than the default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.
{"title":"Server-side I/O coordination for parallel file systems","authors":"Huaiming Song, Yanlong Yin, Xian-He Sun, R. Thakur, S. Lang","doi":"10.1145/2063384.2063407","DOIUrl":"https://doi.org/10.1145/2063384.2063407","url":null,"abstract":"Parallel file systems have become a common component of modern high-end computers to mask the ever-increasing gap between disk data access speed and CPU computing power. However, while working well for certain applications, current parallel file systems lack the ability to effectively handle concurrent I/O requests with data synchronization needs, whereas concurrent I/O is the norm in data-intensive applications. Recognizing that an I/O request will not complete until all involved file servers in the parallel file system have completed their parts, in this paper we propose a server-side I/O coordination scheme for parallel file systems. The basic idea is to coordinate file servers to serve one application at a time in order to reduce the completion time, and in the meantime maintain the server utilization and fairness. A window-wide coordination concept is introduced to serve our purpose. We present the proposed I/O coordination algorithm and its corresponding analysis of average completion time in this study. We also implement a prototype of the proposed scheme under the PVFS2 file system and MPI-IO environment. Experimental results demonstrate that the proposed scheme can reduce average completion time by 8% to 46%, and provide higher I/O bandwidth than that of default data access strategies adopted by PVFS2 for heavy I/O workloads. Experimental results also show that the server-side I/O coordination scheme has good scalability.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116590747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are significantly faster than 2D matrix multiplication (MM) and LU factorization, by up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive novel LogP-based performance models for rectangular broadcasts and reductions. Using these, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.
{"title":"Improving communication performance in dense linear algebra via topology aware collectives","authors":"Edgar Solomonik, A. Bhatele, J. Demmel","doi":"10.1145/2063384.2063487","DOIUrl":"https://doi.org/10.1145/2063384.2063487","url":null,"abstract":"Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129079990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead of the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement over the original on a state-of-the-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, which is, to the best of our knowledge, the largest scale-free graph analyzed by any in-memory parallel eigensolver.
{"title":"A scalable eigensolver for large scale-free graphs using 2D graph partitioning","authors":"A. Yoo, A. Baker, R. Pearce, V. Henson","doi":"10.1145/2063384.2063469","DOIUrl":"https://doi.org/10.1145/2063384.2063469","url":null,"abstract":"Eigensolvers are important tools for analyzing and mining useful information from scale-free graphs. Such graphs are used in many applications and can be extremely large. Unfortunately, existing parallel eigensolvers do not scale well for these graphs due to the high communication overhead in the parallel matrix-vector multiplication (MatVec). We develop a MatVec algorithm based on 2D edge partitioning that significantly reduces the communication costs and embed it into a popular eigensolver library. We demonstrate that the enhanced eigensolver can attain two orders of magnitude performance improvement compared to the original on a state-of-art massively parallel machine. We illustrate the performance of the embedded MatVec by computing eigenvalues of a scale-free graph with 300 million vertices and 5 billion edges, the largest scale-free graph analyzed by any in-memory parallel eigensolver, to the best of our knowledge.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133138209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) of accommodating multiple concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism for accommodating a large number of requests iteratively. Finally, we show via numerical results that RRA constructs schedules that accommodate a significantly larger number of requests than other, seemingly efficient, heuristics.
{"title":"End-to-end network QoS via scheduling of flexible resource reservation requests","authors":"Sushant Sharma, D. Katramatos, Dantong Yu","doi":"10.1145/2063384.2063475","DOIUrl":"https://doi.org/10.1145/2063384.2063475","url":null,"abstract":"Modern data-intensive applications move vast amounts of data between multiple locations around the world. To enable predictable and reliable data transfers, next generation networks allow such applications to reserve network resources for exclusive use. In this paper, we solve an important problem (called SMR3) to accommodate multiple and concurrent network reservation requests between a pair of end sites. Given the varying availability of bandwidth within the network, our goal is to accommodate as many reservation requests as possible while minimizing the total time needed to complete the data transfers. First, we prove that SMR3 is an NP-hard problem. Then, we solve it by developing a polynomial-time heuristic called RRA. The RRA algorithm hinges on an efficient mechanism to accommodate large number of requests in an iterative manner. Finally, we show via numerical results that RRA constructs schedules that accommodate significantly larger number of requests compared to other, seemingly efficient, heuristics.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124710782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka
The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting the patterns formed in solidified metals is indispensable. The phase-field method is the most powerful approach known for simulating micro-scale dendritic growth during solidification in a binary alloy. To describe solidification realistically, however, a phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to this heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods succeeded only in describing simple shapes. Our new simulation techniques achieve unprecedentedly large scales, sufficient for handling the complex dendritic structures required in materials science. Our simulations on the GPU-rich TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology demonstrate good weak scaling and achieve 1.017 PFlops in single precision for our largest configuration, using 4,000 GPUs along with 16,000 CPU cores.
{"title":"Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer","authors":"T. Shimokawabe, T. Aoki, T. Takaki, Toshio Endo, A. Yamanaka, N. Maruyama, Akira Nukada, S. Matsuoka","doi":"10.1145/2063384.2063388","DOIUrl":"https://doi.org/10.1145/2063384.2063388","url":null,"abstract":"The mechanical properties of metal materials largely depend on their intrinsic internal microstructures. To develop engineering materials with the expected properties, predicting patterns in solidified metals would be indispensable. The phase-field simulation is the most powerful method known to simulate the micro-scale dendritic growth during solidification in a binary alloy. To evaluate the realistic description of solidification, however, phase-field simulation requires computing a large number of complex nonlinear terms over a fine-grained grid. Due to such heavy computational demand, previous work on simulating three-dimensional solidification with phase-field methods was successful only in describing simple shapes. Our new simulation techniques achieved scales unprecedentedly large, sufficient for handling complex dendritic structures required in material science. Our simulations on the GPU-rich TSUBAME 2.0 super- computer at the Tokyo Institute of Technology have demonstrated good weak scaling and achieved 1.017 PFlops in single precision for our largest configuration, using 4,000 CPUs along with 16,000 CPU cores.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129498360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.
{"title":"BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2063384.2063429","DOIUrl":"https://doi.org/10.1145/2063384.2063429","url":null,"abstract":"Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running scientific applications. Given the dynamic nature of IaaS clouds and the long runtime and resource utilization of such applications, an efficient checkpoint-restart mechanism becomes paramount in this context. This paper proposes a solution to the aforementioned challenge that aims at minimizing the storage space and performance overhead of checkpoint-restart. We introduce an approach that leverages virtual machine (VM) disk-image multi-snapshotting and multi-deployment inside checkpoint-restart protocols running at guest level in order to efficiently capture and potentially roll back the complete state of the application, including file system modifications. Experiments on the G5K testbed show substantial improvement for MPI applications over existing approaches, both for the case when customized checkpointing is available at application level and the case when it needs to be handled at process level.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115786039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges both in collecting these files and in archiving them in a consistent manner. It has become commonplace to develop a custom utility for each system, tailored specifically to that system. In computer centers that contain multiple systems, each system then has its own utility for gathering and archiving log files. Each time a new log file is produced, the utility must be modified; each modification risks introducing errors and takes time to make. Eliminating this per-system effort is precisely the purpose of logjam. Once installed, the code only requires modification when new features are needed. A configuration file identifies each log file, where to harvest it, and how to archive it. Adding a new log file is as simple as defining it in the configuration file, and testing can be performed in the production environment.
{"title":"Logjam: A scalable unified log file archiver","authors":"N. Cardo","doi":"10.1145/2063348.2063379","DOIUrl":"https://doi.org/10.1145/2063348.2063379","url":null,"abstract":"Log files are a necessary record of events on any system. However, as systems scale, so does the volume of data captured. To complicate matters, this data can be distributed across all nodes within the system. This creates challenges in ways to obtain these files as well as archiving them in a consistent manner. It has become commonplace to develop a custom written utility for each system that is tailored specifically to that system. For computer centers that contain multiple systems, each system would have their own respective utility for gathering and archiving log files. Each time a new log file is produced, a modification to the utility is necessary. With each modification, risk of errors could be introduced as well as spending time to introduce that change. This is precisely the purpose of logjam. Once installed, the code only requires modification when new features are required. A configuration file is used to identify each log file as well as where to harvest it and how to archive it. Adding a new log file is as simple as defining it in a configuration file and testing can be performed in the production environment.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121358234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Mei, Yanhua Sun, G. Zheng, Eric J. Bohm, L. Kalé, James C. Phillips, Christopher B. Harrison
A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, managing the large memory footprint, and obtaining good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the Charm++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (relative to 6,720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.
{"title":"Enabling and scaling biomolecular simulations of 100 million atoms on petascale machines with a multicore-optimized message-driven runtime","authors":"Chao Mei, Yanhua Sun, G. Zheng, Eric J. Bohm, L. Kalé, James C. Phillips, Christopher B. Harrison","doi":"10.1145/2063384.2063466","DOIUrl":"https://doi.org/10.1145/2063384.2063466","url":null,"abstract":"A 100-million-atom biomolecular simulation with NAMD is one of the three benchmarks for the NSF-funded sustainable petascale machine. Simulating this large molecular system on a petascale machine presents great challenges, including handling I/O, large memory footprint and getting good strong-scaling results. In this paper, we present parallel I/O techniques to enable the simulation. A new SMP model is designed to efficiently utilize ubiquitous wide multicore clusters by extending the Charm++ asynchronous message-driven runtime. We exploit node-aware techniques to optimize both the application and the underlying SMP runtime. Hierarchical load balancing is further exploited to scale NAMD to the full Jaguar PF Cray XT5 (224,076 cores) at Oak Ridge National Laboratory, both with and without PME full electrostatics, achieving 93% parallel efficiency (vs 6720 cores) at 9 ms per step for a simple cutoff calculation. Excellent scaling is also obtained on 65,536 cores of the Intrepid Blue Gene/P at Argonne National Laboratory.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129137709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We fundamentally reconsider the implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture, comprising multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single-node version in which the CPU part is parallelized using OpenMP and the GPU part via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single-node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.
{"title":"Scalable fast multipole methods on distributed heterogeneous architectures","authors":"Qi Hu, N. Gumerov, R. Duraiswami","doi":"10.1145/2063384.2063432","DOIUrl":"https://doi.org/10.1145/2063384.2063432","url":null,"abstract":"We fundamentally reconsider implementation of the Fast Multipole Method (FMM) on a computing node with a heterogeneous CPU-GPU architecture with multicore CPU(s) and one or more GPU accelerators, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs a fast N-body sum using a spatial decomposition and is often used in a time-stepping or iterative loop. Using the observation that the local summation and the analysis-based translation parts of the FMM are independent, we map these respectively to the GPUs and CPUs. Careful analysis of the FMM is performed to distribute work optimally between the multicore CPUs and the GPU accelerators. We first develop a single node version where the CPU part is parallelized using OpenMP and the GPU version via CUDA. New parallel algorithms for creating FMM data structures are presented together with load balancing strategies for the single node and distributed multiple-node versions. Our implementation can perform the N-body sum for 128M particles on 16 nodes in 4.23 seconds, a performance not achieved by others in the literature on such clusters.","PeriodicalId":358797,"journal":{"name":"2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}