
2011 Symposium on Application Accelerators in High-Performance Computing: Latest Publications

A Study of the Performance of Multifluid PPM Gas Dynamics on CPUs and GPUs
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.27
Pei-Hung Lin, J. Jayaraj, P. Woodward
The potential for GPUs and many-core CPUs to support high performance computation in the area of computational fluid dynamics (CFD) is explored quantitatively through the example of the PPM gas dynamics code with PPB multifluid volume fraction advection. This code has already been implemented on the IBM Cell processor and run at full scale on the Los Alamos Roadrunner machine. That implementation involved a complete restructuring of the code that has been described in detail elsewhere. Here the lessons learned from that work are exploited to take advantage of today's latest generations of multi-core CPUs and many-core GPUs. The operations performed by this code are characterized in detail after first being decomposed into a series of individual code kernels to allow an implementation on GPUs. Careful implementations of this code for both CPUs and GPUs are then contrasted from a performance point of view. In addition, a single kernel that has many of the characteristics of the full application on CPUs has been built into a full, standalone, scalable parallel application. This single-kernel application shows the GPU at its best. In contrast, the full multifluid gas dynamics application brings into play computational requirements that highlight the essential differences in CPU and GPU designs today and the different programming strategies needed to achieve the best performance for applications of this type on the two devices. The single-kernel application code performs extremely well on both platforms. This application is not limited by main memory bandwidth on either device; instead, it is limited only by the computational capability of each. In this case, the GPU has the advantage, because it has more computational cores. The full multifluid gas dynamics code is, however, of necessity memory bandwidth limited on the GPU, while it is still computational capability limited on the CPU.
We believe that these codes provide a useful context for quantifying the costs and benefits of design decisions for these powerful new computing devices. Suggestions for improvements in both devices and codes based upon this work are offered in our conclusions.
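The compute-bound versus memory-bandwidth-bound distinction drawn above can be made concrete with a roofline-style estimate. This is a minimal sketch with hypothetical peak numbers, not measurements from the paper:

```python
def attainable_gflops(flops_per_byte, peak_gflops, peak_gb_per_s):
    """Roofline model: a kernel's attainable rate is capped either by the
    device's peak compute rate or by bandwidth times arithmetic intensity."""
    return min(peak_gflops, flops_per_byte * peak_gb_per_s)

def limiting_resource(flops_per_byte, peak_gflops, peak_gb_per_s):
    """Classify a kernel the way the abstract does: computational-capability
    limited or main-memory-bandwidth limited."""
    if flops_per_byte * peak_gb_per_s >= peak_gflops:
        return "compute"
    return "bandwidth"
```

A kernel that performs many flops per byte fetched (like the single-kernel application) lands on the compute side of the ridge; a code with lower arithmetic intensity falls on the bandwidth side.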
Citations: 3
Real-Time Object Tracking System on FPGAs
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.22
S. Liu, Alexandros Papakonstantinou, Hongjun Wang, Deming Chen
Object tracking is an important task in computer vision applications. One of the crucial challenges is the real-time speed requirement. In this paper we implement an object tracking system in reconfigurable hardware using an efficient parallel architecture. In our implementation, we adopt a background subtraction based algorithm. The designed object tracker exploits hardware parallelism to achieve high system speed. We also propose a dual object region search technique to further boost the performance of our system under complex tracking conditions. For our hardware implementation we use the Altera Stratix III EP3SL340H1152C2 FPGA device. We compare the proposed FPGA-based implementation with the software implementation running on a 2.2 GHz processor. The observed speedup can reach more than 100X for complex video inputs.
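As an illustration only (not the authors' FPGA design), the background-subtraction step and the object-region extraction it feeds can be sketched in a few lines; `frame` and `background` are hypothetical grayscale images stored as nested lists:

```python
def subtract_background(frame, background, threshold):
    """Mark pixels whose absolute difference from the background model
    exceeds the threshold as foreground (True)."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]

def bounding_box(mask):
    """Bounding box (min_row, min_col, max_row, max_col) of the foreground
    pixels, or None if no object was detected."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))
```

In the hardware version, both steps are pixel-parallel, which is what the FPGA exploits.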
Citations: 46
Implications of Memory-Efficiency on Sparse Matrix-Vector Multiplication
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.24
Shweta Jain, Robin Pottathuparambil, R. Sass
Sparse matrix-vector multiplication is an important operation for many iterative solvers. However, peak performance is limited by the fact that the commonly used algorithm alternates between compute-bound and memory-bound steps. This paper proposes a novel data structure and an FPGA-based hardware core that eliminate the limitations imposed by memory.
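For reference, the commonly used baseline is sparse matrix-vector multiplication over a compressed-sparse-row (CSR) structure; its irregular, memory-bound inner loop is what motivates the proposed data structure. A minimal sketch:

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for a sparse matrix A in CSR form: values holds the
    nonzeros row by row, col_idx their column indices, and row_ptr[r]
    the offset where row r begins."""
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            s += values[k] * x[col_idx[k]]  # indirect, cache-unfriendly load
        y.append(s)
    return y
```

The indirect access `x[col_idx[k]]` is the memory bottleneck the paper targets.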
Citations: 5
Porting Optimized GPU Kernels to a Multi-core CPU: Computational Quantum Chemistry Application Example
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.8
Dong Ye, Alexey Titov, V. Kindratenko, Ivan S. Ufimtsev, Todd J. Martinez
We investigate techniques for optimizing a multi-core CPU code back-ported from a highly optimized GPU kernel. We show that common sub-expression elimination and loop unrolling optimization techniques improve code performance on the GPU, but not on the CPU. On the other hand, register reuse and loop merging are effective on the CPU, and in combination they improve performance of the ported code by 16%.
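The two GPU-side transformations named above can be illustrated on a hypothetical polynomial kernel (illustrative only; the paper's kernels are quantum-chemistry integrals, not this toy):

```python
def poly_naive(xs, a, b, c):
    # x * x is computed twice per element
    return [a * (x * x) + b * (x * x) + c for x in xs]

def poly_cse(xs, a, b, c):
    # common sub-expression elimination: x * x computed once and reused
    out = []
    for x in xs:
        xx = x * x
        out.append(a * xx + b * xx + c)
    return out

def poly_cse_unrolled2(xs, a, b, c):
    # two-way manual loop unrolling on top of the CSE version
    out = []
    i, n = 0, len(xs)
    while i + 1 < n:
        x0, x1 = xs[i], xs[i + 1]
        xx0, xx1 = x0 * x0, x1 * x1
        out.append(a * xx0 + b * xx0 + c)
        out.append(a * xx1 + b * xx1 + c)
        i += 2
    if i < n:  # remainder iteration for odd-length input
        xx = xs[i] * xs[i]
        out.append(a * xx + b * xx + c)
    return out
```

All three variants compute the same result; the paper's point is that the payoff of such rewrites differs sharply between GPU and CPU.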
Citations: 8
Experience Applying Fortran GPU Compilers to Numerical Weather Prediction
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.9
T. Henderson, J. Middlecoff, J. Rosinski, M. Govett, P. Madden
Graphics Processing Units (GPUs) have enabled significant improvements in computational performance compared to traditional CPUs in several application domains. Until recently, GPUs have been programmed using C/C++ based methods such as CUDA (NVIDIA) and OpenCL (NVIDIA and AMD). Using these approaches, Fortran Numerical Weather Prediction (NWP) codes would have to be completely re-written to take full advantage of GPU performance gains. Emerging commercial Fortran compilers allow NWP codes to take advantage of GPU processing power with much less software development effort. The Non-hydrostatic Icosahedral Model (NIM) is a prototype dynamical core for global NWP. We use NIM to examine Fortran directive-based GPU compilers, evaluating code porting effort and computational performance.
Citations: 24
A Class of Hybrid LAPACK Algorithms for Multicore and GPU Architectures
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.18
Mitchel D. Horton, S. Tomov, J. Dongarra
Three out of the top four supercomputers in the November 2010 TOP500 list of the world's most powerful supercomputers use NVIDIA GPUs to accelerate computations. Ninety-five systems from the list are using processors with six or more cores. Three hundred sixty-five systems use quad-core processor-based systems. Thirty-seven systems are using dual-core processors. The large-scale enabling of hybrid graphics processing unit (GPU)-based multicore platforms for computational science by developing fundamental numerical libraries (in particular, libraries in the area of dense linear algebra) for them has been underway for some time. We present a class of algorithms based largely on software infrastructures that have already been developed for homogeneous multicores and hybrid GPU-based computing. The algorithms extend what is currently available in the Matrix Algebra on GPU and Multicore Architectures (MAGMA) library for performing Cholesky, QR, and LU factorizations using a single core or socket and a single GPU. The extensions occur in two areas. First, panels factored on the CPU using LAPACK are, instead, done in parallel using a highly optimized, dynamically scheduled asynchronous algorithm on some number of CPU cores. Second, the remaining CPU cores are used to update the rightmost panels of the matrix in parallel.
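The division of labor described above (panel factorization on CPU cores, trailing-matrix update elsewhere) can be sketched with a blocked right-looking LU without pivoting. This is a pure-Python illustration of the splitting, not MAGMA's implementation:

```python
def blocked_lu(a, nb):
    """In-place blocked LU (no pivoting) on a list-of-lists matrix.
    After return, a holds unit-lower L below the diagonal and U on/above it.
    The comments mark which phase each device handles in the hybrid scheme."""
    n = len(a)
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Panel factorization: narrow, latency-sensitive -> CPU cores
        for k in range(k0, k1):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Trailing-matrix update: wide, throughput-bound -> GPU
        for k in range(k0, k1):
            for i in range(k + 1, n):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
    return a
```

The block width `nb` controls the CPU/GPU work split; the update phase dominates the flop count, which is why it is mapped to the accelerator.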
Citations: 39
A First Analysis of a Dynamic Memory Allocation Controller (DMAC) Core
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.23
Y. Rajasekhar, R. Sass
Networking performance continues to grow, but processor clock frequencies have stalled. Likewise, the latency to primary memory is not expected to improve dramatically either. This is leading computer architects to reconsider the networking subsystem and the roles and responsibilities of hardware and the operating system. This paper presents the first component of a new networking subsystem in which the hardware is responsible for buffering messages, when necessary, without interrupting or involving the operating system. The design is presented and its functionality is demonstrated. The core on an FPGA is exercised with a synthetic stream of messages, and the results show that the analytical performance model and the measured performance agree.
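The core's role (queueing arriving messages at wire speed without involving the OS) can be modeled in software as a bounded FIFO. The class name and the drop-on-overflow policy below are illustrative assumptions, not the DMAC design:

```python
from collections import deque

class MessageBuffer:
    """Software model of a fixed-capacity hardware message buffer:
    messages are queued on arrival with no OS involvement; arrivals
    beyond capacity are counted as drops (an assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def arrive(self, msg):
        """Accept a message if space remains; report success."""
        if len(self.queue) < self.capacity:
            self.queue.append(msg)
            return True
        self.dropped += 1
        return False

    def drain(self):
        """Hand the oldest buffered message to the host, FIFO order."""
        return self.queue.popleft() if self.queue else None
```

A model like this is also the natural starting point for the kind of analytical performance model the paper validates against measurement.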
Citations: 2
Python for Development of OpenMP and CUDA Kernels for Multidimensional Data
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.26
B. Vacaliuc, D. Patlolla, E. D'Azevedo, G. Davidson, John K. Munro Jr, T. Evans, W. Joubert, Z. Bell
Design of data structures for high performance computing (HPC) is one of the principal challenges facing researchers looking to utilize heterogeneous computing machinery. Heterogeneous systems derive cost, power, and speed efficiency by being composed of the appropriate hardware for the task. Yet, each type of processor requires a specific organization of the application state in order to achieve peak performance. Discovering this and refactoring the code can be a challenging and time-consuming task for the researcher, as the data structures and the computational model must be co-designed. We present a methodology that uses Python as the environment in which to explore tradeoffs in both the data structure design and the code executing on the computation accelerator. Our method enables multi-dimensional arrays to be used effectively in any target environment. We have chosen to focus on OpenMP and CUDA environments, thus exploring the development of optimized kernels for the two most common classes of computing hardware available today: multi-core CPU and GPU. Python's large palette of file and network access routines, its associative indexing syntax, and its support for common HPC environments make it relevant for diverse hardware ranging from laptops through computing clusters to the highest performance supercomputers. Our work enables researchers to accelerate the development of their codes on the computing hardware of their choice.
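One concrete co-design issue in this setting is agreeing on a linearization of multidimensional arrays between Python host code and the flat buffer an OpenMP or CUDA kernel receives. A minimal sketch of the row-major (C-order) mapping, with hypothetical names:

```python
def flatten_index(i, j, k, nj, nk):
    """Row-major (C-order) linear index of element (i, j, k) in an
    ni x nj x nk array: the same arithmetic a CUDA or OpenMP kernel
    performs on its flat device buffer."""
    return (i * nj + j) * nk + k

def flatten_3d(a):
    """Flatten a nested 3-D list into the 1-D buffer a kernel would see,
    in the same row-major order as flatten_index."""
    return [x for plane in a for row in plane for x in row]
```

Prototyping the index arithmetic in Python makes it cheap to test before it is frozen into a kernel.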
Citations: 1
Non-serial Polyadic Dynamic Programming on a Data-Parallel Many-core Architecture
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.25
M. Moazeni, M. Sarrafzadeh, A. Bui
Dynamic Programming (DP) is a method for efficiently solving a broad range of search and optimization problems. As a result, techniques for managing large-scale DP problems are often critical to the performance of many applications. DP algorithms are often hard to parallelize. In this paper, we address the challenge of exploiting fine-grained parallelism on a family of DP algorithms known as non-serial polyadic. We use an abstract formulation of non-serial polyadic DP derived from RNA secondary structure prediction and matrix parenthesization, two well-known and important problems from this family. We present a load balancing algorithm that achieves the best overall performance with this type of workload on many-core architectures. A divide-and-conquer approach previously used on multi-core architectures is compared against an iterative version. To evaluate these approaches, the algorithm was implemented on three NVIDIA GPUs using CUDA. We achieved up to 10 GFLOP/s performance and up to 228x speedup over the single-threaded CPU implementation. Moreover, the iterative approach results in up to 3.92x speedup over the divide-and-conquer approach.
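Of the two problems named, matrix parenthesization is the easier to sketch. The anti-diagonal (wavefront) fill order below is what exposes the data parallelism: every cell on a diagonal depends only on earlier diagonals, so a many-core device can compute a whole diagonal at once. A pure-Python sketch, not the CUDA kernels from the paper:

```python
def matrix_chain_cost(dims):
    """Minimum scalar multiplications to parenthesize a chain of matrices,
    where matrix i has shape dims[i] x dims[i+1]. Cells are filled in
    anti-diagonal order; all cells on one diagonal are independent."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for d in range(1, n):          # diagonal index = chain length - 1
        for i in range(n - d):     # these cells could run in parallel
            j = i + d
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j))
    return cost[0][n - 1]
```

The inner loop over `i` is the unit of parallelism; the non-serial polyadic structure shows up in each cell combining two earlier sub-results.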
Citations: 1
GPU Performance Comparison for Accelerated Radar Data Processing
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.14
C. Fallen, B.V.C. Bellamy, G. Newby, B. Watkins
Radar is a data-intensive measurement technique often requiring significant processing to make full use of the received signal. However, computing capacity is limited at remote or mobile radar installations, thereby limiting the radar data products available for real-time decisions. We used graphics processing units (GPUs) to accelerate processing of high-resolution phase-coded radar data from the Modular UHF Ionosphere Radar (MUIR) at the High-frequency Active Auroral Research Program (HAARP) facility in Gakona, Alaska. Previously, this data could not be processed on-site in sufficient time to be useful for decisions made during active experiment campaigns, nor could the data be uploaded for off-site processing to high-performance computing (HPC) resources at the Arctic Region Supercomputing Center (ARSC) in Fairbanks. In this paper, we present a radar data-processing performance comparison of a workstation equipped with dual NVIDIA GeForce GTX 480 GPU accelerator cards and a node from ARSC's PACMAN cluster equipped with dual NVIDIA Tesla M2050 cards. Both platforms meet performance requirements, are relatively inexpensive, and could operate effectively at remote observatories such as HAARP.
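Processing phase-coded radar data centers on pulse compression: correlating the received samples against the transmitted code with a matched filter. A pure-Python sketch using a Barker-13 code (illustrative only; the MUIR codes and the GPU implementation are not described here):

```python
def pulse_compress(received, code):
    """Sliding correlation of the received samples with the transmitted
    phase code (a matched filter); the peak marks the echo delay."""
    n = len(received) - len(code) + 1
    return [sum(received[t + k] * code[k] for k in range(len(code)))
            for t in range(n)]

# Barker-13 phase code: its autocorrelation sidelobes are at most 1.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
```

Each output lag is independent, which is why this step maps so naturally onto a GPU.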
Citations: 10
Journal: 2011 Symposium on Application Accelerators in High-Performance Computing