
2011 Symposium on Application Accelerators in High-Performance Computing: Latest Publications

Evaluation of GPU Architectures Using Spiking Neural Networks
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.20
V. Pallipuram, M. Bhuiyan, M. C. Smith
During recent years General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choices of performance-centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These architectures not only offer a plethora of features for application developers to explore, but their radically different architectures call for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements, namely the Izhikevich, Wilson, Morris Lecar (ML), and the Hodgkin Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with an implementation that exhausts all optimization techniques available with OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with the application performance. In addition to the comparative study effort, our implementations were able to achieve speed-ups of 857.3x and 658.51x on the Fermi and Radeon architectures respectively for the most compute-intensive HH model with a dense network containing 9.72 million neurons. The final outcome of this research is a detailed architectural comparison of the two GPU architectures with a common programming platform.
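Of the four neuron models compared, the Izhikevich model is the simplest: each neuron carries only two state variables updated once per time step, which is what makes it map so naturally onto a data-parallel GPU kernel. As a rough illustration (plain single-threaded C, using the commonly published regular-spiking parameters rather than anything taken from the paper), the update looks like this:

```c
#include <stdio.h>

/* One forward-Euler step of the Izhikevich neuron model (regular-spiking
 * parameters). Each neuron carries two state variables: membrane potential
 * v (mV) and recovery variable u. */
static void izhikevich_step(double *v, double *u, double input, double dt)
{
    const double a = 0.02, b = 0.2, c = -65.0, d = 8.0;

    if (*v >= 30.0) {          /* spike threshold reached: reset */
        *v = c;
        *u += d;
    }
    *v += dt * (0.04 * (*v) * (*v) + 5.0 * (*v) + 140.0 - (*u) + input);
    *u += dt * (a * (b * (*v) - (*u)));
}

int main(void)
{
    double v = -65.0, u = 0.2 * -65.0;   /* resting state */
    for (int t = 0; t < 1000; ++t) {     /* 1000 steps of 1 ms */
        izhikevich_step(&v, &u, 10.0, 1.0);
        if (v >= 30.0)
            printf("spike at t = %d ms\n", t);
    }
    return 0;
}
```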
Citations: 12
QUonG: A GPU-based HPC System Dedicated to LQCD Computing
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.15
R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative targeted to develop a high performance computing system dedicated to Lattice QCD computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with last-generation GPUs. Its network mesh exploits the characteristics of the LQCD algorithm in the design of a point-to-point, high-performance, low-latency 3-d torus network to interconnect the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six full bidirectional off-board links running at 34 Gbps each, and implementing an RDMA protocol and an experimental direct network-to-GPU interface, enabling a significant access-latency reduction for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
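For context, the node addressing implied by a point-to-point 3-d torus can be sketched in a few lines: each node has exactly six neighbours, one per direction per axis, with periodic wrap-around at the mesh edges. The snippet below is only an illustrative sketch with an assumed 4x4x4 geometry, not the QUonG network logic:

```c
#include <stdio.h>

/* Example 3-d torus geometry (illustrative only). */
enum { NX = 4, NY = 4, NZ = 4 };

/* Linear rank from torus coordinates. */
static int rank_of(int x, int y, int z)
{
    return x + NX * (y + NY * z);
}

/* Neighbour rank along one axis with periodic (torus) wrap-around.
 * axis: 0 = x, 1 = y, 2 = z; dir: +1 or -1. */
static int neighbour(int rank, int axis, int dir)
{
    int x = rank % NX;
    int y = (rank / NX) % NY;
    int z = rank / (NX * NY);

    if (axis == 0) x = (x + dir + NX) % NX;
    if (axis == 1) y = (y + dir + NY) % NY;
    if (axis == 2) z = (z + dir + NZ) % NZ;
    return rank_of(x, y, z);
}

int main(void)
{
    int r = rank_of(3, 0, 2);   /* a node on the x-edge of the mesh */
    printf("node %d: +x neighbour = %d, -x neighbour = %d\n",
           r, neighbour(r, 0, +1), neighbour(r, 0, -1));
    return 0;
}
```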
Citations: 20
Design and Simulation of a Rectangular Meshotron Unit Prototype
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.21
C.L.S. Romeiro, Guilherme Campos, Arnaldo S. R. Oliveira
A novel application-specific hardware (ASH) unit was designed to form the building block of the Meshotron -- a parallelisation network for three-dimensional (3D) digital waveguide-mesh (DWM) room acoustic models. The rectangular mesh topology was selected. This ASH unit was tested using professional hardware simulation tools, assuming 32-bit integer data. Room impulse responses (RIR) were obtained for a set of small models under different test conditions, using both single-unit and multi-unit configurations. They proved exactly identical to those obtained using 3D DWM modelling software for the same models and test conditions, which validates the design.
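As background, the standard rectangular DWM update computes each junction's pressure as twice the average of its incoming port pulses and reflects each outgoing pulse as that pressure minus the corresponding incoming one. The sketch below shows this update for a single 3D junction in floating-point C purely for illustration; the ASH unit described in the paper operates on 32-bit integer data in hardware:

```c
#include <stdio.h>

#define PORTS 6   /* a 3-D rectangular mesh junction has six neighbours */

/* One scattering step of a lossless rectangular DWM junction:
 *   p_J      = (2 / N) * sum_i p_in[i]
 *   p_out[i] = p_J - p_in[i]
 * The outgoing pulses become the incoming pulses of the neighbouring
 * junctions on the next time step. */
static double scatter(const double p_in[PORTS], double p_out[PORTS])
{
    double sum = 0.0;
    for (int i = 0; i < PORTS; ++i)
        sum += p_in[i];

    double p_j = 2.0 * sum / PORTS;
    for (int i = 0; i < PORTS; ++i)
        p_out[i] = p_j - p_in[i];
    return p_j;   /* junction pressure, sampled to build the RIR */
}

int main(void)
{
    double in[PORTS]  = { 1.0, 0.0, 0.0, 0.0, 0.0, 0.0 };  /* impulse on one port */
    double out[PORTS];
    double pj = scatter(in, out);
    printf("junction pressure = %.4f, reflected pulse = %.4f\n", pj, out[0]);
    return 0;
}
```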
Citations: 0
Iterative Refinement on FPGAs
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.19
Jun Kyu Lee, G. D. Peterson
Achievable accuracy for mixed precision iterative refinement depends on the precisions supported by computing platforms. Even though the arithmetic unit precision can be flexible for programmable logic computing architectures (e.g. FPGAs), previous work rarely discusses the performance benefits due to enabling flexible achievable accuracy. Hence, we propose an iterative refinement approach on FPGAs which employs an arbitrary precision for the iterative refinement to obtain an arbitrary accuracy. We implement single processing elements for the refinement on the Xilinx XC5VLX110T and compare them to Xilinx XC6VSX475T for performance estimation. This paper shows that the performance is similar to the NVIDIA GTX480 when a user requires accuracies between single and double precision, but the implementation can also produce beyond double precision accuracy.
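The algorithm being mapped to the FPGA is classical mixed-precision iterative refinement: factor and solve in a lower precision, compute the residual and apply the correction in a higher precision, and repeat until the answer stops improving. A minimal sketch using single-precision solves and double-precision residuals on a tiny 2x2 system is shown below; the paper's contribution is allowing the "low" precision to be an arbitrary width, which standard C types cannot express:

```c
#include <stdio.h>

/* Mixed-precision iterative refinement on a tiny 2x2 system:
 * the solve runs in single precision, the residual and the update
 * accumulate in double precision. */

/* Low-precision direct solve of A*d = r (Cramer's rule, float). */
static void solve_low(const float A[2][2], const float r[2], float d[2])
{
    float det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    d[0] = (r[0] * A[1][1] - A[0][1] * r[1]) / det;
    d[1] = (A[0][0] * r[1] - r[0] * A[1][0]) / det;
}

int main(void)
{
    double A[2][2]  = { { 4.0, 1.0 }, { 1.0, 3.0 } };
    double b[2]     = { 1.0, 2.0 };
    double x[2]     = { 0.0, 0.0 };
    float  Af[2][2] = { { 4.0f, 1.0f }, { 1.0f, 3.0f } };

    for (int it = 0; it < 10; ++it) {
        /* High-precision residual r = b - A*x. */
        double r[2] = {
            b[0] - (A[0][0] * x[0] + A[0][1] * x[1]),
            b[1] - (A[1][0] * x[0] + A[1][1] * x[1]),
        };
        /* Low-precision correction d, then high-precision update. */
        float rf[2] = { (float)r[0], (float)r[1] };
        float d[2];
        solve_low(Af, rf, d);
        x[0] += d[0];
        x[1] += d[1];
        printf("iter %d: x = (%.15f, %.15f)\n", it, x[0], x[1]);
    }
    return 0;
}
```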
Citations: 8
Accelerating a Climate Physics Model with OpenCL
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.17
F. Zafar, D. Ghosh, Lawrence Sebald, Shujia Zhou
Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real world application, a solar radiation model component widely used in climate and weather models, we show that OpenCL multi-threaded programming and execution model can dramatically increase performance even on CPU architectures. Our preliminary investigation indicates that low-level vector instructions and code representations in OpenCL contribute to dramatic performance improvement over the serial version when compared with the execution of the serial code compiled across various compilers on multiple platforms with auto vectorization flags. However, the portability of OpenCL implementations needs to improve, even for CPU architectures.
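Targeting a CPU with OpenCL only requires the host code to request a CPU device at setup time; the kernels themselves are unchanged. The minimal host-side sketch below enumerates the CPU devices of the first available platform (it assumes an installed OpenCL runtime and the <CL/cl.h> header, and is not taken from the paper's implementation):

```c
/* Minimal OpenCL host sketch: list the CPU devices of the first platform.
 * Build with e.g.:  cc opencl_cpu.c -lOpenCL   (requires an OpenCL runtime) */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_uint nplat = 0;
    if (clGetPlatformIDs(1, &platform, &nplat) != CL_SUCCESS || nplat == 0) {
        fprintf(stderr, "no OpenCL platform found\n");
        return 1;
    }

    cl_device_id devices[8];
    cl_uint ndev = 0;
    /* Ask specifically for CPU devices: the same kernels that run on a GPU
     * can be enqueued on these. */
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 8, devices, &ndev) != CL_SUCCESS) {
        fprintf(stderr, "no CPU OpenCL device found\n");
        return 1;
    }

    for (cl_uint i = 0; i < ndev; ++i) {
        char name[256];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("CPU device %u: %s\n", i, name);
    }
    return 0;
}
```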
Citations: 5
Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.11
Nicholas Moore, M. Leeser, L. King
For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large arbitrarily-sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation approach employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15 when compared to an optimized multi-threaded implementation running on a 3.33 GHz four core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and shared memory usage. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.
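The computation being accelerated is, at its core, a correlation of one template against every window position of a small search space, with the best-scoring position retained. A plain-C reference of that inner loop is sketched below; the sizes are placeholders, not the 156-by-116 templates or 703-position search spaces of the real application, and the scoring is simplified to an un-normalised correlation:

```c
#include <stdio.h>

/* Illustrative sizes only (the real application uses much larger,
 * arbitrarily-sized templates). */
enum { IMG_H = 16, IMG_W = 16, TPL_H = 4, TPL_W = 4 };

/* Un-normalised cross-correlation of a template against one window
 * position (oy, ox) of the image. */
static double correlate_at(const double img[IMG_H][IMG_W],
                           const double tpl[TPL_H][TPL_W], int oy, int ox)
{
    double s = 0.0;
    for (int y = 0; y < TPL_H; ++y)
        for (int x = 0; x < TPL_W; ++x)
            s += img[oy + y][ox + x] * tpl[y][x];
    return s;
}

int main(void)
{
    double img[IMG_H][IMG_W] = { { 0 } };
    double tpl[TPL_H][TPL_W];

    /* Plant the template at (5, 7) so the search has something to find. */
    for (int y = 0; y < TPL_H; ++y)
        for (int x = 0; x < TPL_W; ++x) {
            tpl[y][x] = y + x + 1;
            img[5 + y][7 + x] = tpl[y][x];
        }

    /* Exhaustive sliding-window search: best score wins. On the GPU each
     * window position (and tile of the template) maps to parallel work. */
    int best_y = 0, best_x = 0;
    double best = -1.0;
    for (int oy = 0; oy + TPL_H <= IMG_H; ++oy)
        for (int ox = 0; ox + TPL_W <= IMG_W; ++ox) {
            double s = correlate_at(img, tpl, oy, ox);
            if (s > best) { best = s; best_y = oy; best_x = ox; }
        }

    printf("best match at (%d, %d), score %.1f\n", best_y, best_x, best);
    return 0;
}
```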
Citations: 4
Quantum Chemical Many-Body Theory on Heterogeneous Nodes
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.28
A. Eugene DePrince III, J. Hammond
The iterative solution of the coupled-cluster with single and double excitations (CCSD) equations is a very time-consuming component of the "gold standard" in quantum chemistry, the CCSD(T) method. In an effort to accelerate accurate quantum mechanical calculations, we explore two implementation strategies for the iterative solution of the CC equations on graphics processing units (GPUs). We consider a communication-avoiding algorithm for the spin-free coupled cluster doubles (CCD) equations followed by a low-storage algorithm for the spin-free CCSD equations. In the communication-avoiding algorithm, the entire iterative procedure for the CCD method is performed on the GPU, resulting in accelerations of a factor of 4-5 relative to the pure CPU algorithm. The low-storage CCSD algorithm requires that a minimum of 4o^2v^2 + 2ov elements be stored on the device, where o and v represent the number of orbitals occupied and unoccupied in the reference configuration, respectively. The algorithm masks the transfer time for copying large amounts of data to the GPU by overlapping GPU and CPU computations. The per-iteration costs of this hybrid GPU/CPU algorithm are up to 4.06 times less than those of the pure CPU algorithm and up to 10.63 times less than those of the CCSD implementation found in the Molpro electronic structure package. These results provide insight into how to organize communication and computation so as to maximize utilization of a GPU and a multicore CPU at the same time.
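The quoted storage floor of 4o^2v^2 + 2ov elements translates directly into a device-memory requirement once a precision is fixed. The short sketch below evaluates it in double precision for an arbitrary, illustrative choice of o and v (not values from the paper's benchmarks):

```c
#include <stdio.h>

/* Minimum device-side storage of the low-storage CCSD algorithm:
 * 4*o^2*v^2 + 2*o*v elements, here assumed to be double precision. */
static double ccsd_storage_gib(long o, long v)
{
    double elements = 4.0 * (double)o * o * (double)v * v + 2.0 * (double)o * v;
    return elements * sizeof(double) / (1024.0 * 1024.0 * 1024.0);
}

int main(void)
{
    /* Illustrative orbital-space sizes only. */
    long o = 30, v = 200;
    printf("o = %ld, v = %ld  ->  %.2f GiB on the device\n",
           o, v, ccsd_storage_gib(o, v));
    return 0;
}
```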
Citations: 4
On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.29
Mayank Daga, Ashwin M. Aji, Wu-chun Feng
The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7 to 6.0-fold improvement in the data-transfer time, when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance by 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more powerful GPU cores.
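A simple way to see why a 1.7- to 6.0-fold transfer improvement matters is a two-term time model: end-to-end time is transfer time plus compute time, and fusing the GPU onto the die shrinks only the first term. The sketch below evaluates that toy model with made-up timings; it illustrates the argument, not the paper's measurements:

```c
#include <stdio.h>

/* Toy model: how a data-transfer speedup translates into end-to-end speedup
 * when a GPU task splits its time between PCIe transfer and compute.
 * All figures are illustrative, not measurements from the paper. */
static double end_to_end_speedup(double t_transfer, double t_compute,
                                 double transfer_speedup)
{
    double before = t_transfer + t_compute;
    double after  = t_transfer / transfer_speedup + t_compute;
    return before / after;
}

int main(void)
{
    double t_xfer = 8.0, t_comp = 2.0;     /* ms, made-up transfer-bound split */
    double fusion_gain[2] = { 1.7, 6.0 };  /* transfer improvements reported for Fusion */

    for (int i = 0; i < 2; ++i)
        printf("transfer %.1fx faster -> application %.2fx faster\n",
               fusion_gain[i],
               end_to_end_speedup(t_xfer, t_comp, fusion_gain[i]));
    return 0;
}
```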
Citations: 138
G-NetMon: A GPU-accelerated Network Performance Monitoring System
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.10
Wenji Wu, P. DeMar, D. Holmgren, Amitoj Singh
At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that our G-NetMon can rapidly detect sub-optimal bulk data movements.
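The data parallelism referred to comes from flow records being independent of one another: each record contributes its byte count to a per-site aggregate, so records can be processed in parallel and the aggregates combined afterwards. The serial C sketch below illustrates that aggregation with a made-up record layout and site list, not the system's actual flow format:

```c
#include <stdio.h>

/* A made-up, minimal flow record: which collaboration site it belongs to
 * and how many bytes it carried. */
struct flow_rec {
    int       site;   /* index of the remote collaboration site */
    long long bytes;  /* payload bytes observed for this flow   */
};

enum { NSITES = 3 };

int main(void)
{
    struct flow_rec flows[] = {
        { 0, 1200000000LL }, { 1,  80000000LL }, { 0,  900000000LL },
        { 2,   50000000LL }, { 1,  60000000LL }, { 0, 1500000000LL },
    };
    long long per_site[NSITES] = { 0 };

    /* Each record is independent, so on a GPU this loop becomes a
     * parallel reduction keyed by site. */
    for (int i = 0; i < (int)(sizeof(flows) / sizeof(flows[0])); ++i)
        per_site[flows[i].site] += flows[i].bytes;

    for (int s = 0; s < NSITES; ++s)
        printf("site %d: %.2f GB transferred\n", s, per_site[s] / 1e9);
    return 0;
}
```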
Citations: 2
Application of Graphics Processing Units (GPUs) to the Study of Non-linear Dynamics of the Exciton Bose-Einstein Condensate in a Semiconductor Quantum Well
Pub Date : 2011-07-19 DOI: 10.1109/SAAHPC.2011.32
A. Gothandaraman, S. Sadatian, Michal Faryniarz, O. Berman, G. Kolmakov
In this paper, we explore the use of Graphics Processing Units (GPUs) to solve numerically the nonlinear Gross-Pitaevskii equation with an external potential. Our implementation uses NVIDIA's Compute Unified Device Architecture (CUDA) programming paradigm and demonstrates a speedup of 190x on an NVIDIA Tesla C2050 (Fermi) GPU compared to an optimized software implementation on a single-core of an Intel Xeon 5500-series processor. We apply the developed technique to the study of Bose-Einstein condensation (BEC) of excitons in semiconductor nanostructures. The technique is also applicable to the studies of atomic condensates, quantized vortices in quantum fluids, propagation of light pulses in optical wave guides, and ocean wave dynamics.
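The equation being integrated is the Gross-Pitaevskii equation, i dpsi/dt = [-(1/2) d^2/dx^2 + V + g|psi|^2] psi in dimensionless units (hbar = m = 1). The sketch below takes one-dimensional forward-Euler finite-difference steps purely to illustrate the structure of the problem; the grid size, time step, and harmonic potential are arbitrary choices, and the paper's CUDA implementation uses its own numerical scheme:

```c
/* Build with e.g.:  cc gpe.c -lm */
#include <stdio.h>
#include <complex.h>

/* One explicit finite-difference step of the 1-D Gross-Pitaevskii equation
 * in dimensionless units (hbar = m = 1):
 *   i dpsi/dt = [ -0.5 d^2/dx^2 + V(x) + g |psi|^2 ] psi
 * Forward Euler is used only for illustration. */

enum { N = 256 };
static const double DX = 0.1, DT = 0.001, G = 1.0;

static void gpe_step(double complex psi[N], const double V[N])
{
    double complex next[N];
    for (int i = 0; i < N; ++i) {
        double complex lap =
            (psi[(i + 1) % N] - 2.0 * psi[i] + psi[(i + N - 1) % N]) / (DX * DX);
        double dens = creal(psi[i]) * creal(psi[i]) + cimag(psi[i]) * cimag(psi[i]);
        double complex rhs = -0.5 * lap + (V[i] + G * dens) * psi[i];
        next[i] = psi[i] - I * DT * rhs;   /* forward-Euler in time */
    }
    for (int i = 0; i < N; ++i)
        psi[i] = next[i];
}

int main(void)
{
    double complex psi[N];
    double V[N];
    for (int i = 0; i < N; ++i) {
        double x = (i - N / 2) * DX;
        V[i]   = 0.5 * x * x;              /* harmonic external potential */
        psi[i] = cexp(-x * x);             /* Gaussian initial condition  */
    }

    for (int s = 0; s < 100; ++s)
        gpe_step(psi, V);

    double norm = 0.0;
    for (int i = 0; i < N; ++i)
        norm += (creal(psi[i]) * creal(psi[i]) + cimag(psi[i]) * cimag(psi[i])) * DX;
    printf("norm after 100 steps: %.6f\n", norm);
    return 0;
}
```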
Citations: 4