
Proceedings of Workshops of HPC Asia: Latest Publications

Investigating the performance and productivity of DASH using the Cowichan problems
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176366
K. Fürlinger, R. Kowalewski, Tobias Fuchs, Benedikt Lehmann
DASH is a new realization of the PGAS (Partitioned Global Address Space) programming model in the form of a C++ template library. Instead of using a custom compiler, DASH provides expressive programming constructs using C++ abstraction mechanisms and offers distributed data structures and parallel algorithms that follow the concepts employed by the C++ standard template library (STL). In this paper we evaluate the performance and productivity of DASH by comparing our implementation of a set of benchmark programs with those developed by expert programmers in Intel Cilk, Intel TBB (Threading Building Blocks), Go and Chapel. We perform a comparison on shared memory multiprocessor systems ranging from moderately parallel multicore systems to a 64-core manycore system. We additionally perform a scalability study on a distributed memory system on up to 20 nodes (800 cores). Our results demonstrate that DASH offers productivity that is comparable with the best established programming systems for shared memory and also achieves comparable or better performance. Our results on multi-node systems show that DASH scales well and achieves excellent performance.
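As a rough illustration of the STL-like style the abstract describes, the following is a minimal DASH-flavored sketch: a distributed array is allocated across all units, each unit updates its local portion, and a global STL-style algorithm runs over the whole container. The exact API names (libdash.h, dash::init, dash::Array, the .lbegin()/.lend() local iterators, dash::min_element) are recalled from DASH documentation and should be treated as assumptions, not an excerpt from the benchmarked codes.

```cpp
// Minimal DASH-style sketch (not from the paper).
#include <libdash.h>   // assumed DASH convenience header

int main(int argc, char* argv[]) {
  dash::init(&argc, &argv);                   // start the DASH/DART runtime

  dash::Array<int> arr(1000);                 // array distributed over all units
  dash::fill(arr.begin(), arr.end(), 0);      // collective, STL-style fill

  // Owner computes: each unit touches only the elements it stores locally.
  for (auto it = arr.lbegin(); it != arr.lend(); ++it)
    *it += 1;
  dash::barrier();                            // wait until all units are done

  // STL-style algorithm over the *global* range.
  auto gmin = dash::min_element(arr.begin(), arr.end());
  (void)gmin;

  dash::finalize();
  return 0;
}
```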
Citations: 1
Optimizing a particle-in-cell code on Intel knights landing
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176376
Minhua Wen, Min Chen, James Lin
The particle-in-cell (PIC) method is one of the mainstream algorithms in laser plasma research. However, the programming effort needed to achieve high performance with PIC codes on the Intel Knights Landing (KNL) processor is a widespread concern among laser plasma researchers. We took VLPL-S, the PIC code developed at Shanghai Jiao Tong University, as an example to address this concern. We applied three types of optimization: compute-oriented optimizations, parallel I/O, and dynamic load balancing. We evaluated the optimized VLPL-S code with real test cases on the KNL. The experimental results show that our optimizations achieve a 1.53X speedup in overall performance, and that performance on the KNL is 1.77X faster than on a two-socket Intel Xeon E5-2697v4 node. The optimizations we developed for the VLPL-S code can be applied to other PIC codes.
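The abstract does not show the VLPL-S kernels; as a generic, hedged illustration of the kind of compute-oriented optimization mentioned (threading plus vectorization of the particle push with OpenMP), a structure-of-arrays particle update might look like the sketch below. The data layout, field values, and kernel itself are placeholders, not the VLPL-S implementation.

```cpp
#include <cstddef>

// Structure-of-arrays particle storage: contiguous per-component arrays
// vectorize far better on KNL's 512-bit SIMD units than an array of structs.
struct Particles {
  std::size_t n;
  double *x, *y, *px, *py;   // positions and momenta
};

// Hypothetical push kernel: advance all particles in a fixed field (ex, ey)
// by one time step dt. Threads over particles, SIMD within each chunk.
void push(Particles& p, double ex, double ey, double dt) {
  #pragma omp parallel for simd
  for (std::size_t i = 0; i < p.n; ++i) {
    p.px[i] += ex * dt;
    p.py[i] += ey * dt;
    p.x[i]  += p.px[i] * dt;
    p.y[i]  += p.py[i] * dt;
  }
}
```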
Citations: 1
Recent experiences in using MPI-3 RMA in the DASH PGAS runtime
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176367
Joseph Schuchart, R. Kowalewski, Karl Fuerlinger
The Partitioned Global Address Space (PGAS) programming model has become a viable alternative to traditional message passing using MPI. The DASH project provides a PGAS abstraction entirely based on C++11. The underlying DASH RunTime, DART, provides communication and management functionality transparently to the user. In order to facilitate incremental transitions of existing MPI-parallel codes, the development of DART has focused on creating a PGAS runtime based on the MPI-3 RMA standard. From an MPI-RMA user perspective, this paper outlines our recent experiences in the development of DART and presents insights into issues that we faced and how we attempted to solve them, including issues surrounding memory allocation and memory consistency as well as communication latencies. We implemented a set of benchmarks for global memory allocation latency in the framework of the OSU micro-benchmark suite and present results for allocation and communication latency measurements of different global memory allocation strategies under three different MPI implementations.
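As a minimal sketch of the MPI-3 RMA operations discussed (collective window allocation plus passive-target puts), the fragment below times MPI_Win_allocate and performs a flushed MPI_Put. It is an illustration in the spirit of the latency measurements, not the DART or OSU benchmark code.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Collective allocation of globally accessible memory (one double per rank).
  double* base = nullptr;
  MPI_Win win;
  double t0 = MPI_Wtime();
  MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  double alloc_time = MPI_Wtime() - t0;

  // Passive-target epoch: every rank writes its rank into its right neighbour.
  MPI_Win_lock_all(0, win);
  double val = (double)rank;
  int target = (rank + 1) % size;
  MPI_Put(&val, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
  MPI_Win_flush(target, win);          // ensure remote completion
  MPI_Win_unlock_all(win);

  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("window allocation took %.3f us\n", alloc_time * 1e6);

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```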
Citations: 3
Linkage of XcalableMP and Python languages for high productivity on HPC cluster system: application to graph order/degree problem
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176369
M. Nakao, H. Murai, T. Boku, M. Sato
When developing applications on high-performance computing (HPC) cluster systems, Partitioned Global Address Space (PGAS) languages are used because of their high productivity and performance. However, to develop such applications even more efficiently, it is also important to be able to combine a PGAS language with other languages rather than using a single PGAS language alone. We have designed the XcalableMP (XMP) PGAS language and developed Omni Compiler as an XMP compiler. In this paper, we report on the development of linkage functions between XMP and {C, Fortran, or Python} for Omni Compiler. Furthermore, as a functional example of interworking between XMP and Python, we discuss the development of an application for the graph order/degree problem. Specifically, we parallelized the application's all-pairs shortest path search over the vertices using XMP. When the results of the XMP version and the original Python version were compared, we found that XMP was 21% faster than the original Python on a single CPU core. Moreover, when running the application on an HPC cluster system with 1,280 CPU cores across 64 compute nodes, we achieved 921 times better performance than on a single CPU core.
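As a hedged sketch of how a per-vertex search loop could be distributed with XMP directives (shown in XMP's C base language), the fragment below spreads the source vertices over the nodes and reduces the maximum eccentricity. The bfs_eccentricity helper, the vertex count, and the max-reduction kernel are hypothetical placeholders; this is not the actual order/degree application nor its Python linkage.

```c
#define NV 4096                          /* number of vertices (placeholder) */
#pragma xmp nodes p(*)
#pragma xmp template t(0:NV-1)
#pragma xmp distribute t(block) onto p

int ecc[NV];
#pragma xmp align ecc[i] with t(i)

extern int bfs_eccentricity(int src);    /* hypothetical BFS helper */

int compute_diameter(void) {
  int diameter = 0;
  /* Each node processes only the source vertices mapped to it by the
     template; the reduction clause combines the per-node maxima. */
  #pragma xmp loop on t(i) reduction(max:diameter)
  for (int i = 0; i < NV; i++) {
    ecc[i] = bfs_eccentricity(i);
    if (ecc[i] > diameter) diameter = ecc[i];
  }
  return diameter;
}
```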
Citations: 2
OpenMP-based parallel implementation of matrix-matrix multiplication on the intel knights landing
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176374
Roktaek Lim, Yeongha Lee, Raehyun Kim, Jaeyoung Choi
The second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL), has emerged with a 2D tile mesh architecture. Implementing general matrix-matrix multiplication on a new architecture is an important exercise. To date, there has not been a sufficient description of a parallel implementation of general matrix-matrix multiplication. In this study, we describe a parallel implementation of double-precision general matrix-matrix multiplication (DGEMM) with OpenMP on the KNL. The implementation is based on blocked matrix-matrix multiplication. We propose a method for choosing the cache block sizes and discuss the parallelism within the DGEMM implementation. We show that the performance of DGEMM varies with the thread-affinity environment variables. We conducted performance experiments with the Intel Xeon Phi 7210 and 7250, and these experiments validate our method.
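As a minimal sketch of blocked matrix-matrix multiplication with OpenMP in the spirit of the paper, consider the fragment below; the block sizes MB/NB/KB stand in for the cache-block parameters the authors tune (the values shown are placeholders), and the sketch omits the packing, micro-kernel, and prefetching that a production DGEMM would need.

```cpp
#include <algorithm>
#include <cstddef>

// C (MxN) += A (MxK) * B (KxN), all row-major. MB/NB/KB are cache block
// sizes; choosing them so the working set fits in cache is the tuning
// problem the paper addresses (values here are placeholders).
constexpr std::size_t MB = 64, NB = 64, KB = 256;

void dgemm_blocked(std::size_t M, std::size_t N, std::size_t K,
                   const double* A, const double* B, double* C) {
  // Different (ib, jb) blocks update disjoint tiles of C, so the two
  // collapsed loops can be shared among threads without races.
  #pragma omp parallel for collapse(2) schedule(static)
  for (std::size_t ib = 0; ib < M; ib += MB)
    for (std::size_t jb = 0; jb < N; jb += NB)
      for (std::size_t kb = 0; kb < K; kb += KB) {
        std::size_t imax = std::min(ib + MB, M);
        std::size_t jmax = std::min(jb + NB, N);
        std::size_t kmax = std::min(kb + KB, K);
        for (std::size_t i = ib; i < imax; ++i)
          for (std::size_t k = kb; k < kmax; ++k) {
            double a = A[i * K + k];
            for (std::size_t j = jb; j < jmax; ++j)
              C[i * N + j] += a * B[k * N + j];
          }
      }
}
```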
Citations: 9
Performance evaluation for omni XcalableMP compiler on many-core cluster system based on knights landing
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176372
M. Nakao, H. Murai, T. Boku, M. Sato
To reduce the programming cost on cluster systems, Partitioned Global Address Space (PGAS) languages are used. We have designed the XcalableMP (XMP) PGAS language and developed the Omni XMP compiler (Omni compiler) for XMP. In the present study, we evaluated the performance of the Omni compiler on Oakforest-PACS, a cluster system based on Knights Landing, and on a general Linux cluster system. We performed performance tuning of the Omni compiler using a Lattice QCD mini-application and some mathematical functions appearing in that application. As a result, the tuned Omni compiler outperformed the untuned version on both systems. Furthermore, we compared the performance of MPI and OpenMP (MPI+OpenMP), an existing programming model, with that of XMP using the tuned Omni compiler. The results showed that the Lattice QCD mini-application written in XMP achieved more than 94% of the performance of the implementation written in MPI+OpenMP.
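For readers unfamiliar with XMP, a hedged sketch of the directive style involved (in XMP's C base language) is shown below: a distributed axpy followed by a global dot product, the kind of array kernel that appears inside such mathematical functions. It is illustrative only and is not taken from the Lattice QCD mini-application; the array size and names are placeholders.

```c
#define N 100000
#pragma xmp nodes p(*)
#pragma xmp template t(0:N-1)
#pragma xmp distribute t(block) onto p

double x[N], y[N];
#pragma xmp align x[i] with t(i)
#pragma xmp align y[i] with t(i)

/* y = y + a*x on the distributed arrays, then a global dot product. */
double axpy_dot(double a) {
  double dot = 0.0;

  #pragma xmp loop on t(i)
  for (int i = 0; i < N; i++)
    y[i] += a * x[i];

  /* The reduction clause sums the per-node partial results. */
  #pragma xmp loop on t(i) reduction(+:dot)
  for (int i = 0; i < N; i++)
    dot += x[i] * y[i];

  return dot;
}
```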
Citations: 1
Optimizing two-electron repulsion integral calculation on knights landing architecture
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176371
Yingqi Tian, B. Suo, Yingjin Ma, Zhong Jin
In this paper, we introduce the optimization methods we used to accelerate the two-electron repulsion integral calculation on the Knights Landing architecture. We developed a schedule for parallelism and vectorization, and we compared two different methods for calculating the lower incomplete gamma function. Our optimization achieved a 1.7x speedup on KNL over the CPU platform.
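Two standard ways of evaluating the lower incomplete gamma function γ(a,x), a power series and a continued fraction, are sketched below as a hedged illustration of the kind of comparison the abstract mentions: the series converges fastest for x < a+1 and the continued fraction for x > a+1. This is textbook-style scalar code, not the authors' vectorized KNL implementation.

```cpp
#include <cmath>

// Lower incomplete gamma via the power series
//   gamma(a,x) = exp(-x) * x^a * sum_{n>=0} x^n / (a (a+1) ... (a+n)).
// Preferred for x < a + 1.
double lower_gamma_series(double a, double x) {
  double ap = a, term = 1.0 / a, sum = term;
  for (int n = 0; n < 500; ++n) {
    ap += 1.0;
    term *= x / ap;
    sum += term;
    if (std::fabs(term) < std::fabs(sum) * 1e-15) break;
  }
  return sum * std::exp(-x + a * std::log(x));
}

// Lower incomplete gamma via a continued fraction for the *upper* function
// (modified Lentz algorithm), then gamma(a,x) = Gamma(a) - Gamma(a,x).
// Preferred for x > a + 1.
double lower_gamma_contfrac(double a, double x) {
  const double tiny = 1e-300;
  double b = x + 1.0 - a, c = 1.0 / tiny, d = 1.0 / b, h = d;
  for (int i = 1; i <= 500; ++i) {
    double an = -i * (i - a);
    b += 2.0;
    d = an * d + b; if (std::fabs(d) < tiny) d = tiny;
    c = b + an / c; if (std::fabs(c) < tiny) c = tiny;
    d = 1.0 / d;
    double delta = c * d;
    h *= delta;
    if (std::fabs(delta - 1.0) < 1e-15) break;
  }
  double upper = std::exp(-x + a * std::log(x)) * h;   // Gamma(a,x)
  return std::tgamma(a) - upper;
}
```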
Citations: 0
Scaling collectives on large clusters using Intel(R) architecture processors and fabric
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176373
Masashi Horikoshi, L. Meadows, Tom Elken, P. Sivakumar, E. Mascarenhas, James Erwin, D. Durnov, Alexander Sannikov, T. Hanawa, T. Boku
This paper provides results on scaling Barrier and Allreduce to 8192 nodes on a cluster of Intel® Xeon Phi™ processors installed at the University of Tokyo and the University of Tsukuba. We describe the effects of OS and platform noise on the performance of these collectives, and provide ways to minimize the noise as well as to isolate it to specific cores. We present results showing that Barrier and Allreduce scale well when noise is reduced. At 4096 nodes we achieved a latency of 94 usec (a 7.1x speedup over the baseline) for a one-rank-per-node Barrier and 145 usec (a 3.3x speedup) for Allreduce at the 16-byte (16B) message size.
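A minimal sketch of the kind of latency measurement described (a timed loop over MPI_Barrier and a 16-byte MPI_Allreduce) is shown below; the iteration count and the reported statistic are placeholders, not the configuration used in the paper.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int iters = 1000;
  double sendbuf[2] = {1.0, 2.0}, recvbuf[2];   // 2 doubles = 16 bytes

  // Barrier latency: average time per call over a timed loop.
  MPI_Barrier(MPI_COMM_WORLD);
  double t0 = MPI_Wtime();
  for (int i = 0; i < iters; ++i)
    MPI_Barrier(MPI_COMM_WORLD);
  double barrier_us = (MPI_Wtime() - t0) / iters * 1e6;

  // Allreduce latency at the 16-byte message size.
  MPI_Barrier(MPI_COMM_WORLD);
  t0 = MPI_Wtime();
  for (int i = 0; i < iters; ++i)
    MPI_Allreduce(sendbuf, recvbuf, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  double allreduce_us = (MPI_Wtime() - t0) / iters * 1e6;

  if (rank == 0)
    std::printf("Barrier: %.2f us, 16B Allreduce: %.2f us\n",
                barrier_us, allreduce_us);

  MPI_Finalize();
  return 0;
}
```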
Citations: 1
Towards a parallel algebraic multigrid solver using PGAS
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176368
Niclas Jansson, E. Laure
The Algebraic Multigrid (AMG) method has over the years developed into an efficient tool for solving unstructured linear systems. The need to solve large industrial problems discretized on unstructured meshes has been a key motivation for devising a parallel AMG method. Despite some success, the key part of the AMG algorithm, the coarsening step, is far from trivial to parallelize efficiently. We introduce a novel parallelization of the inherently sequential Ruge-Stüben coarsening algorithm that retains most of the good interpolation properties of the original method. Our parallelization is based on the Partitioned Global Address Space (PGAS) abstraction, which greatly simplifies the parallelization compared with traditional message-passing implementations. The coarsening algorithm and solver are described in detail, and a performance study on a Cray XC40 is presented.
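To make the coarsening step concrete, here is a hedged, purely serial sketch of the classical Ruge-Stüben first pass (greedy C/F splitting driven by the number of strong influences). The parallel PGAS formulation in the paper necessarily differs, and the strength-of-connection sets are assumed to be given.

```cpp
#include <vector>
#include <queue>
#include <cstddef>
#include <utility>

// Serial first pass of Ruge-Stueben C/F splitting (simplified).
// strong_to[i]   : points that i strongly influences.
// strong_from[i] : points that strongly influence i.
// Returns one flag per point: 1 = coarse (C), 0 = fine (F).
std::vector<int> rs_first_pass(
    const std::vector<std::vector<int>>& strong_to,
    const std::vector<std::vector<int>>& strong_from) {
  const std::size_t n = strong_to.size();
  std::vector<int> state(n, -1);                 // -1 undecided, 1 C, 0 F
  std::vector<int> measure(n);
  for (std::size_t i = 0; i < n; ++i)
    measure[i] = (int)strong_to[i].size();       // lambda_i = |S_i^T|

  // Max-heap of (measure, point); stale entries are skipped when popped.
  std::priority_queue<std::pair<int, std::size_t>> heap;
  for (std::size_t i = 0; i < n; ++i) heap.push({measure[i], i});

  while (!heap.empty()) {
    auto [m, i] = heap.top(); heap.pop();
    if (state[i] != -1 || m != measure[i]) continue;   // decided or stale
    state[i] = 1;                                      // make i a C point
    for (int j : strong_to[i]) {
      if (state[j] != -1) continue;
      state[j] = 0;                                    // dependents become F
      // New F points make their undecided influencers more attractive as C.
      for (int k : strong_from[j])
        if (state[k] == -1) heap.push({++measure[k], (std::size_t)k});
    }
  }
  for (std::size_t i = 0; i < n; ++i)
    if (state[i] == -1) state[i] = 0;                  // leftovers become F
  return state;
}
```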
Citations: 1
Performance evaluation for a hydrodynamics application in XcalableACC PGAS language for accelerated clusters
Pub Date : 2018-01-31 DOI: 10.1145/3176364.3176365
Akihiro Tabuchi, M. Nakao, H. Murai, T. Boku, M. Sato
Clusters equipped with accelerators such as GPUs and MICs are widely used. To use these clusters, programmers write their applications by combining MPI with one of the accelerator programming models such as CUDA or OpenACC. The accelerator programming part is becoming easier thanks to the directive-based OpenACC, but the complex distributed-memory programming required by MPI means that programming is still difficult. To simplify the programming process, XcalableACC (XACC) has been proposed as an "orthogonal" integration of the PGAS language XcalableMP (XMP) and OpenACC. XACC provides the original XMP and OpenACC features, as well as extensions for communication between accelerator memories. In this study, we implemented the hydrodynamics mini-application Clover-Leaf in XACC and evaluated the usability of XACC in terms of its performance and productivity. According to the performance evaluation, the XACC version achieved 87--95% of the performance of the MPI+CUDA version and 93--101% of the MPI+OpenACC version with strong scaling, and 88--91% of the MPI+CUDA version and 94--97% of the MPI+OpenACC version with weak scaling. In particular, the halo exchange time was better with XACC than with MPI+OpenACC in some cases because the Omni XACC runtime is written in MPI and CUDA and is well tuned. The productivity evaluation showed that the application could be implemented with only small changes compared with the serial version. These results demonstrate that XACC is a practical programming language for science applications.
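As a hedged illustration of the MPI+OpenACC style that XACC builds on, the fragment below shows a device-resident stencil update whose halo rows are exchanged through host copies. The array names, the simple five-point update, and the neighbour ranks are placeholders, not Clover-Leaf code; XACC itself would replace the explicit MPI calls and update directives with its communication directives.

```cpp
#include <mpi.h>

// One hypothetical iteration on a (local_ny + 2) x nx slab: interior update
// on the accelerator, then halo exchange with the upper/lower neighbours.
// Assumes u and unew were already placed on the device by an enclosing
// OpenACC data region (hence "present").
void step(double* u, double* unew, int nx, int local_ny,
          int up, int down, MPI_Comm comm) {
  #pragma acc data present(u[0:(local_ny + 2) * nx], unew[0:(local_ny + 2) * nx])
  {
    #pragma acc parallel loop collapse(2)
    for (int j = 1; j <= local_ny; ++j)
      for (int i = 1; i < nx - 1; ++i)
        unew[j * nx + i] = 0.25 * (u[(j - 1) * nx + i] + u[(j + 1) * nx + i] +
                                   u[j * nx + i - 1]  + u[j * nx + i + 1]);

    // Copy the boundary rows to the host, exchange them via MPI, and push
    // the received halo rows back to the device.
    #pragma acc update host(unew[nx:nx], unew[local_ny * nx:nx])
    MPI_Sendrecv(&unew[nx], nx, MPI_DOUBLE, up, 0,
                 &unew[(local_ny + 1) * nx], nx, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&unew[local_ny * nx], nx, MPI_DOUBLE, down, 1,
                 &unew[0], nx, MPI_DOUBLE, up, 1,
                 comm, MPI_STATUS_IGNORE);
    #pragma acc update device(unew[0:nx], unew[(local_ny + 1) * nx:nx])
  }
}
```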
Citations: 1