During recent years, General-Purpose Graphics Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choice of performance-centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These devices not only offer a plethora of features for application developers to explore, but their radically different architectures also call for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements: the Izhikevich, Wilson, Morris-Lecar (ML), and Hodgkin-Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with implementations that exhaust all optimization techniques available in OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with application performance. In addition to the comparative study, our implementations achieve speed-ups of 857.3x and 658.51x on the Fermi and Radeon architectures, respectively, for the most compute-intensive HH model with a dense network containing 9.72 million neurons.
The final outcome of this research is a detailed comparison of the two GPU architectures under a common programming platform.
"Evaluation of GPU Architectures Using Spiking Neural Networks," V. Pallipuram, M. Bhuiyan, M. C. Smith. 2011 Symposium on Application Accelerators in High-Performance Computing. DOI: 10.1109/SAAHPC.2011.20.
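Of the four neuron models named in the abstract above, the Izhikevich model is the simplest and maps naturally onto a data-parallel update, one work-item per neuron. A minimal sketch of that update is below, using Euler integration and the standard regular-spiking parameters (a, b, c, d) from Izhikevich's published model; the paper's actual OpenCL kernels, network connectivity, and optimizations are not reproduced here.

```python
import numpy as np

def izhikevich_step(v, u, I, dt=1.0, a=0.02, b=0.2, c=-65.0, d=8.0):
    """One Euler step of the Izhikevich model for a vector of neurons:
        v' = 0.04*v^2 + 5*v + 140 - u + I
        u' = a*(b*v - u)
    A neuron spikes when v >= 30 mV; it is then reset to (c, u + d)."""
    v = v + dt * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
    u = u + dt * a * (b * v - u)
    fired = v >= 30.0
    v = np.where(fired, c, v)
    u = np.where(fired, u + d, u)
    return v, u, fired

# Simulate 1000 ms of a small population with a constant input current.
n = 1000
v = np.full(n, -65.0)
u = 0.2 * v
spikes = 0
for _ in range(1000):
    v, u, fired = izhikevich_step(v, u, I=10.0)
    spikes += int(fired.sum())
```

Because each neuron's state update is independent (coupling enters only through the input current I), the loop body vectorizes directly into a GPU kernel over the neuron index.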
R. Ammendola, A. Biagioni, O. Frezza, F. Lo Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
QUonG is an INFN (Istituto Nazionale di Fisica Nucleare) initiative targeted at developing a high-performance computing system dedicated to Lattice QCD computations. QUonG is a massively parallel computing platform that leverages commodity multi-core processors coupled with latest-generation GPUs. Its network mesh exploits the characteristics of the LQCD algorithm in the design of a point-to-point, high-performance, low-latency 3-D torus network interconnecting the computing nodes. The network is built upon the APEnet+ project: it consists of an FPGA-based PCI Express board exposing six fully bidirectional off-board links running at 34 Gbps each, implementing an RDMA protocol and an experimental direct network-to-GPU interface that significantly reduces access latency for inter-node data transfers. The final shape of a complete QUonG deployment is an assembly of standard 42U racks, each capable of 60 TFlops of peak performance, at a cost of 5 k€/TFlops and an estimated power consumption of 25 kW/rack. A first QUonG system prototype is expected to be delivered at the end of 2011.
"QUonG: A GPU-based HPC System Dedicated to LQCD Computing." DOI: 10.1109/SAAHPC.2011.15.
C.L.S. Romeiro, Guilherme Campos, Arnaldo S. R. Oliveira
A novel application-specific hardware (ASH) unit was designed to form the building block of the Meshotron -- a parallelisation network for three-dimensional (3D) digital waveguide-mesh (DWM) room acoustic models. The rectangular mesh topology was adopted. The ASH unit was tested using professional hardware simulation tools, assuming 32-bit integer data. Room impulse responses (RIRs) were obtained for a set of small models under different test conditions, using both single-unit and multi-unit configurations. They proved identical to those obtained using 3D DWM modelling software for the same models and test conditions, which validates the design.
"Design and Simulation of a Rectangular Meshotron Unit Prototype." DOI: 10.1109/SAAHPC.2011.21.
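For readers unfamiliar with DWM modelling, the rectangular-mesh update is a simple stencil: each interior junction's pressure is a scaled sum of its neighbours' previous pressures minus its own pressure two steps back. The sketch below computes an RIR on a tiny floating-point mesh with pressure clamped to zero at the boundary; it illustrates the standard rectangular K-DWM recursion, not the paper's 32-bit-integer hardware arithmetic or its actual boundary model.

```python
import numpy as np

def dwm_rir(shape=(10, 10, 10), src=(2, 2, 2), rcv=(7, 7, 7), steps=50):
    """Rectangular 3-D DWM: interior junctions follow
        p[n] = (2/N) * sum(neighbour p[n-1]) - p[n-2],  N = 6 in 3-D,
    with pressure clamped to zero on the outer boundary."""
    p_prev = np.zeros(shape)
    p_curr = np.zeros(shape)
    p_curr[src] = 1.0                     # unit impulse excitation
    rir = [p_curr[rcv]]
    for _ in range(steps - 1):
        nbr = np.zeros(shape)
        nbr[1:-1, 1:-1, 1:-1] = (         # 6-neighbour stencil sum
            p_curr[:-2, 1:-1, 1:-1] + p_curr[2:, 1:-1, 1:-1] +
            p_curr[1:-1, :-2, 1:-1] + p_curr[1:-1, 2:, 1:-1] +
            p_curr[1:-1, 1:-1, :-2] + p_curr[1:-1, 1:-1, 2:]
        )
        p_next = (2.0 / 6.0) * nbr - p_prev
        p_prev, p_curr = p_curr, p_next
        rir.append(p_curr[rcv])
    return np.array(rir)

rir = dwm_rir()
```

The first arrival at the receiver occurs after a number of steps equal to the source-receiver Manhattan distance (15 here), which is a quick sanity check on any mesh implementation.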
The achievable accuracy of mixed-precision iterative refinement depends on the precisions supported by the computing platform. Even though arithmetic-unit precision can be made flexible on programmable-logic computing architectures (e.g., FPGAs), previous work rarely discusses the performance benefits of this flexibility. Hence, we propose an iterative refinement approach on FPGAs that employs arbitrary precision in the refinement to obtain arbitrary accuracy. We implement single processing elements for the refinement on the Xilinx XC5VLX110T and compare them to the Xilinx XC6VSX475T for performance estimation. This paper shows that the performance is similar to that of the NVIDIA GTX480 when a user requires accuracies between single and double precision, but the implementation can also deliver beyond-double-precision accuracy.
"Iterative Refinement on FPGAs," Jun Kyu Lee, G. D. Peterson. DOI: 10.1109/SAAHPC.2011.19.
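The generic mixed-precision scheme the abstract builds on is: factor and solve in low precision, compute the residual in high precision, solve for a correction, and repeat. A minimal sketch using float32 as the "low" precision and float64 as the "high" precision (the FPGA design generalizes this to arbitrary precisions, which numpy cannot reproduce):

```python
import numpy as np

def mixed_precision_refine(A, b, iters=10):
    """Mixed-precision iterative refinement: solve in float32,
    refine with float64 residuals."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                              # residual in high precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x = x + d                                  # correction step
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 50)) + 50 * np.eye(50)   # well-conditioned system
x_true = rng.standard_normal(50)
b = A @ x_true
x = mixed_precision_refine(A, b)
```

For a well-conditioned system this recovers full double-precision accuracy even though every linear solve runs in single precision; the achievable accuracy is set by the residual precision, which is exactly the knob the FPGA approach makes arbitrary.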
Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real-world application, a solar radiation model component widely used in climate and weather models, we show that the OpenCL multi-threaded programming and execution model can dramatically increase performance even on CPU architectures. Our preliminary investigation indicates that low-level vector instructions and code representations in OpenCL contribute to a dramatic performance improvement over the serial version, compared with the serial code compiled by various compilers on multiple platforms with auto-vectorization flags. However, the portability of OpenCL implementations needs to improve, even for CPU architectures.
"Accelerating a Climate Physics Model with OpenCL," F. Zafar, D. Ghosh, Lawrence Sebald, Shujia Zhou. DOI: 10.1109/SAAHPC.2011.17.
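The performance contrast the abstract describes, a scalar loop versus the same arithmetic expressed so it maps onto vector lanes or OpenCL work-items, can be illustrated on a toy kernel. The attenuation formula below is hypothetical (the actual solar radiation component is not described in the abstract); the point is only that the two formulations are element-for-element equivalent.

```python
import numpy as np

def attenuate_serial(flux, optical_depth, albedo):
    """Scalar loop over elements (the style a serial compiler must
    auto-vectorize)."""
    out = np.empty_like(flux)
    for i in range(flux.size):
        out[i] = flux[i] * np.exp(-optical_depth[i]) * (1.0 - albedo[i])
    return out

def attenuate_vectorized(flux, optical_depth, albedo):
    """Same arithmetic on whole arrays -- the data-parallel style that
    maps directly to OpenCL work-items or SIMD lanes."""
    return flux * np.exp(-optical_depth) * (1.0 - albedo)

rng = np.random.default_rng(2)
f, tau, a = rng.random(1000), rng.random(1000), rng.random(1000)
same = np.allclose(attenuate_serial(f, tau, a), attenuate_vectorized(f, tau, a))
```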
For some classes of problems, the NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering -- small fixed templates of a known size applied to a much larger image -- the application considered here uses large, arbitrarily sized templates, up to 156-by-116 pixels, with small search spaces containing no more than 703 window positions per template. Our CUDA implementation employs template tiling and problem-specific kernel compilation to achieve speedups of up to 15x compared to an optimized multi-threaded implementation running on a 3.33 GHz four-core Intel Nehalem processor. Tiling the template enables exploiting the parallelism within the computation and the use of shared memory. At the same time, problem-specific kernel compilation allows greater levels of adaptability than would otherwise be possible.
"Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation," Nicholas Moore, M. Leeser, L. King. DOI: 10.1109/SAAHPC.2011.11.
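The core computation the abstract describes, scoring a large template against a small, explicit set of window positions, can be sketched with normalised cross-correlation. This is a reference formulation of the problem shape (large template, few hundred candidate positions), not the paper's tiled CUDA kernel; the image sizes and search ranges below are illustrative.

```python
import numpy as np

def best_match(image, template, search_rows, search_cols):
    """Exhaustive normalised cross-correlation over an explicit, small set
    of candidate window positions."""
    t = template - template.mean()
    tnorm = np.sqrt((t * t).sum())
    best, best_pos = -2.0, None
    th, tw = template.shape
    for r in search_rows:
        for c in search_cols:
            w = image[r:r + th, c:c + tw]
            wz = w - w.mean()
            denom = np.sqrt((wz * wz).sum()) * tnorm
            score = (wz * t).sum() / denom if denom > 0 else -1.0
            if score > best:
                best, best_pos = score, (r, c)
    return best_pos, best

rng = np.random.default_rng(3)
image = rng.random((300, 300))
template = image[120:120 + 78, 80:80 + 58]          # crop as ground truth
pos, score = best_match(image, template, range(110, 131), range(70, 91))
```

Every (position, template-pixel) pair is an independent multiply-accumulate, which is the parallelism the paper's template tiling exposes; runtime kernel compilation then lets the kernel be specialized to each template's actual dimensions.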
The iterative solution of the coupled-cluster with single and double excitations (CCSD) equations is a very time-consuming component of the "gold standard" in quantum chemistry, the CCSD(T) method. In an effort to accelerate accurate quantum mechanical calculations, we explore two implementation strategies for the iterative solution of the CC equations on graphics processing units (GPUs). We consider a communication-avoiding algorithm for the spin-free coupled-cluster doubles (CCD) equations, followed by a low-storage algorithm for the spin-free CCSD equations. In the communication-avoiding algorithm, the entire iterative procedure for the CCD method is performed on the GPU, resulting in accelerations of a factor of 4-5 relative to the pure CPU algorithm. The low-storage CCSD algorithm requires that a minimum of 4o^2v^2 + 2ov elements be stored on the device, where o and v represent the number of orbitals occupied and unoccupied in the reference configuration, respectively. The algorithm masks the transfer time for copying large amounts of data to the GPU by overlapping GPU and CPU computations. The per-iteration costs of this hybrid GPU/CPU algorithm are up to 4.06 times less than those of the pure CPU algorithm and up to 10.63 times less than those of the CCSD implementation found in the Molpro electronic structure package. These results provide insight into how to organize communication and computation so as to maximize utilization of a GPU and a multicore CPU at the same time.
"Quantum Chemical Many-Body Theory on Heterogeneous Nodes," A. Eugene DePrince III, J. Hammond. DOI: 10.1109/SAAHPC.2011.28.
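The storage bound 4o^2v^2 + 2ov from the abstract above translates directly into a device-memory estimate. The helper below evaluates it for double-precision amplitudes; the example orbital counts are illustrative, not taken from the paper's test systems.

```python
def ccsd_device_storage_gib(o, v, bytes_per_elem=8):
    """Minimum device storage for the low-storage CCSD algorithm:
    4*o^2*v^2 + 2*o*v elements (per the abstract), in GiB for
    8-byte double-precision amplitudes."""
    elems = 4 * o**2 * v**2 + 2 * o * v
    return elems * bytes_per_elem / 2**30

# e.g. o = 50 occupied, v = 500 virtual orbitals (illustrative sizes)
gib = ccsd_device_storage_gib(50, 500)
print(round(gib, 2))  # 18.63 GiB
```

Since the bound grows as o^2 v^2, modest increases in system size quickly exceed a single GPU's memory, which is why the algorithm overlaps host-device transfers with computation rather than keeping everything resident.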
The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7- to 6.0-fold improvement in data-transfer time compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance by 3.5-fold over the discrete AMD Radeon HD 5870 GPU with its 1600 more-powerful GPU cores.
"On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing," Mayank Daga, Ashwin M. Aji, Wu-chun Feng. DOI: 10.1109/SAAHPC.2011.29.
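The reduction kernel benchmark mentioned in the abstract above follows a tree-shaped access pattern: each pass halves the active range, adding the upper half into the lower half, so the whole reduction takes log2(n) parallel passes. A minimal sketch of that pattern (the pattern only; the paper's OpenCL kernels and benchmark harness are not reproduced):

```python
import numpy as np

def tree_reduce(x):
    """Pairwise (tree) reduction: pad to a power of two, then halve the
    active range each pass, adding the upper half into the lower half --
    each pass corresponds to one parallel step of a GPU reduction kernel."""
    x = np.array(x, dtype=np.float64)
    n = 1
    while n < x.size:
        n *= 2
    x = np.concatenate([x, np.zeros(n - x.size)])
    while n > 1:
        n //= 2
        x[:n] += x[n:2 * n]
    return x[0]

total = tree_reduce(np.arange(1, 101))
print(total)  # 5050.0
```

Reduction is exactly the kind of kernel where a fused design can win despite fewer cores: the data-transfer cost dominates the arithmetic, so removing the PCIe hop matters more than raw core count.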
At Fermilab, we have prototyped a GPU-accelerated network performance monitoring system, called G-NetMon, to support large-scale scientific collaborations. In this work, we explore new opportunities in network traffic monitoring and analysis with GPUs. Our system exploits the data parallelism that exists within network flow data to provide fast analysis of bulk data movement between Fermilab and collaboration sites. Experiments demonstrate that G-NetMon can rapidly detect sub-optimal bulk data movements.
"G-NetMon: A GPU-accelerated Network Performance Monitoring System," Wenji Wu, P. DeMar, D. Holmgren, Amitoj Singh. DOI: 10.1109/SAAHPC.2011.10.
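The data parallelism the abstract refers to lies in flow records: each record can be processed independently and aggregated per site. The sketch below flags sites whose aggregate throughput falls below a threshold; the record layout, field names, and the 1 Gbps threshold are all hypothetical, since the abstract does not specify G-NetMon's detection criteria.

```python
from collections import defaultdict

def flag_slow_sites(flows, min_gbps=1.0):
    """Aggregate per-site throughput from (site, bytes, seconds) flow
    records and flag sites whose bulk transfers fall below min_gbps.
    Record layout and threshold are hypothetical."""
    site_bytes = defaultdict(int)
    site_secs = defaultdict(float)
    for site, nbytes, duration_s in flows:
        site_bytes[site] += nbytes
        site_secs[site] += duration_s
    slow = []
    for site in site_bytes:
        gbps = site_bytes[site] * 8 / 1e9 / site_secs[site]
        if gbps < min_gbps:
            slow.append(site)
    return sorted(slow)

flows = [("siteA", 10**12, 800.0),   # 10 Gbps  -> fine
         ("siteB", 10**10, 400.0)]   # 0.2 Gbps -> sub-optimal
slow = flag_slow_sites(flows)
print(slow)  # ['siteB']
```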
A. Gothandaraman, S. Sadatian, Michal Faryniarz, O. Berman, G. Kolmakov
In this paper, we explore the use of Graphics Processing Units (GPUs) to solve numerically the nonlinear Gross-Pitaevskii equation with an external potential. Our implementation uses NVIDIA's Compute Unified Device Architecture (CUDA) programming paradigm and demonstrates a speedup of 190x on an NVIDIA Tesla C2050 (Fermi) GPU compared to an optimized software implementation on a single core of an Intel Xeon 5500-series processor. We apply the developed technique to the study of Bose-Einstein condensation (BEC) of excitons in semiconductor nanostructures. The technique is also applicable to the study of atomic condensates, quantized vortices in quantum fluids, propagation of light pulses in optical waveguides, and ocean wave dynamics.
"Application of Graphics Processing Units (GPUs) to the Study of Non-linear Dynamics of the Exciton Bose-Einstein Condensate in a Semiconductor Quantum Well." DOI: 10.1109/SAAHPC.2011.32.
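A standard numerical scheme for the Gross-Pitaevskii equation, and one that maps well to GPUs because it is built from FFTs and pointwise phase factors, is the split-step Fourier method. The abstract does not name the scheme actually used in the paper, so the 1-D sketch below (with hbar = m = 1) is only a representative approach.

```python
import numpy as np

def gpe_split_step(psi, V, g, dx, dt, steps):
    """Split-step Fourier integration of the 1-D Gross-Pitaevskii equation
        i psi_t = -(1/2) psi_xx + (V + g |psi|^2) psi
    alternating a pointwise nonlinear phase (real space) with the
    kinetic phase (Fourier space). Both half-steps are unitary, so the
    wavefunction norm is conserved."""
    n = psi.size
    k = 2 * np.pi * np.fft.fftfreq(n, d=dx)
    kinetic = np.exp(-0.5j * dt * k**2)
    for _ in range(steps):
        psi = psi * np.exp(-1j * dt * (V + g * np.abs(psi)**2))
        psi = np.fft.ifft(kinetic * np.fft.fft(psi))
    return psi

# Evolve a normalized Gaussian packet; the scheme preserves the norm.
n, L = 256, 20.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
dx = x[1] - x[0]
psi = np.exp(-x**2).astype(complex)
psi /= np.sqrt((np.abs(psi)**2).sum() * dx)
out = gpe_split_step(psi, V=np.zeros(n), g=1.0, dx=dx, dt=0.005, steps=200)
norm = (np.abs(out)**2).sum() * dx
```

Each time step is two FFTs plus two elementwise complex multiplies, all embarrassingly parallel across grid points, which is consistent with the large GPU speedup the paper reports for this class of solver.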