Poster: High-Speed Decision Making on Live Petabyte Data Streams
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.218 | Pages: 1404
W. Badgett, K. Biery, C. Green, J. Kowalkowski, K. Maeshima, M. Paterno, R. Roser
High Energy Physics has a long history of coping with cutting-edge data rates in its efforts to extract meaning from experimental data. The quantity of data from planned future experiments that must be analyzed in near-real time, so that the scientifically interesting data can be filtered and stored efficiently, has driven the development of sophisticated techniques that leverage technologies such as MPI, OpenMP, and Intel TBB. We trace the evolution of data collection, triggering, and filtering from the Tevatron experiments of the 1990s to future Intensity Frontier and Cosmic Frontier experiments, and show how the requirements of upcoming experiments lead us to develop high-performance streaming triggerless DAQ systems.
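To make the filtering pattern concrete, here is a minimal sketch of a streaming trigger/filter stage built on one of the technologies the abstract names, Intel TBB (assuming the oneTBB API). The Event structure, the synthetic input, and the passes_trigger energy cut are hypothetical stand-ins for illustration, not the experiments' actual reconstruction code.

```cpp
#include <tbb/parallel_pipeline.h>

#include <atomic>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>

struct Event {
    std::vector<float> hits;  // raw detector readout (synthetic here)
    float energy = 0.f;       // filled in by the filter stage
};

int main() {
    const std::size_t n_events = 100000;  // synthetic stream length
    std::size_t produced = 0;
    std::atomic<std::size_t> kept{0};
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(0.f, 1.f);

    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/16,
        // Stage 1 (serial): ingest the next event from the stream.
        tbb::make_filter<void, std::shared_ptr<Event>>(
            tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::shared_ptr<Event> {
                if (produced++ >= n_events) { fc.stop(); return nullptr; }
                auto ev = std::make_shared<Event>();
                ev->hits.resize(64);
                for (auto& h : ev->hits) h = u(rng);
                return ev;
            }) &
        // Stage 2 (parallel): "reconstruct" and apply the trigger decision.
        tbb::make_filter<std::shared_ptr<Event>, std::shared_ptr<Event>>(
            tbb::filter_mode::parallel,
            [](std::shared_ptr<Event> ev) -> std::shared_ptr<Event> {
                for (float h : ev->hits) ev->energy += h;
                const bool passes_trigger = ev->energy > 36.f;  // hypothetical cut
                return passes_trigger ? ev : nullptr;
            }) &
        // Stage 3 (serial): keep only the scientifically interesting events.
        tbb::make_filter<std::shared_ptr<Event>, void>(
            tbb::filter_mode::serial_in_order,
            [&](std::shared_ptr<Event> ev) { if (ev) ++kept; }));

    std::printf("kept %zu of %zu events\n", kept.load(), n_events);
}
```

The parallel middle stage is where the per-event filtering work scales across cores, while the serial in-order end stages preserve the stream order of the accepted events.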
{"title":"Poster: High-Speed Decision Making on Live Petabyte Data Streams","authors":"W. Badgett, K. Biery, C. Green, J. Kowalkowski, K. Maeshima, M. Paterno, R. Roser","doi":"10.1109/SC.Companion.2012.218","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.218","url":null,"abstract":"High Energy Physics has a long history of coping with cutting-edge data rates in its efforts to extract meaning from experimental data. The quantity of data from planned future experiments that must be analyzed practically in real-time to enable efficient filtering and storage of the scientifically interesting data has driven the development of sophisticated techniques which leverage technologies such as MPI, OpenMP and Intel TBB. We show the evolution of data collection, triggering and filtering from the 1990s with TeVatron experiments into the future of Intensity Frontier and Cosmic Frontier experiments and show how the requirements of upcoming experiments lead us to the development of high-performance streaming triggerless DAQ systems.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"22 1","pages":"1404-1404"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84406460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many-Core Accelerated LIBOR Swaption Portfolio Pricing
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.143 | Pages: 1185-1192
Jörg Lotze, P. Sutton, Hicham Lahlou
This paper describes the acceleration of a Monte-Carlo algorithm for pricing a LIBOR swaption portfolio using multi-core CPUs and GPUs. Speedups of up to 305x are achieved on two Nvidia Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved using the Xcelerit platform: writing sequential, high-level C++ code and adopting a simple dataflow programming model. It avoids the complexity involved in using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.
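For readers unfamiliar with the underlying pricer, below is a sequential baseline sketch of a one-factor, log-Euler LIBOR market model Monte-Carlo swaption pricer of the kind being accelerated. The flat 5% initial curve, flat 20% volatilities, strike, and tenor are illustrative assumptions; this is neither the paper's portfolio nor the Xcelerit code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int N = 20;           // number of accrual periods in the curve
    const int expiry = 4;       // option expiry index, T_e = expiry * tau
    const double tau = 0.5;     // accrual period length (years)
    const double K = 0.05;      // fixed-leg strike
    const long paths = 100000;  // Monte-Carlo paths

    std::mt19937_64 rng(7);
    std::normal_distribution<double> gauss(0.0, 1.0);

    double sum = 0.0;
    for (long p = 0; p < paths; ++p) {
        std::vector<double> L(N, 0.05);            // flat initial forwards
        const std::vector<double> sigma(N, 0.20);  // flat volatilities
        double numeraire = 1.0;  // rolling money-market account (spot measure)

        // Log-Euler evolution of all live forwards up to the expiry.
        for (int t = 0; t < expiry; ++t) {
            const double dW = std::sqrt(tau) * gauss(rng);
            numeraire *= 1.0 + tau * L[t];  // compound at the fixing rate
            // Descending j so the drift sum sees pre-update forwards.
            for (int j = N - 1; j > t; --j) {
                double drift = 0.0;  // spot-measure drift sum
                for (int k = t + 1; k <= j; ++k)
                    drift += tau * sigma[k] * L[k] / (1.0 + tau * L[k]);
                L[j] *= std::exp(sigma[j] * (drift - 0.5 * sigma[j]) * tau
                                 + sigma[j] * dW);
            }
        }

        // Swap rate and annuity over [T_e, T_N] from the realized curve.
        double df = 1.0, annuity = 0.0;
        for (int j = expiry; j < N; ++j) {
            df /= 1.0 + tau * L[j];
            annuity += tau * df;
        }
        const double swap_rate = (1.0 - df) / annuity;
        // Payer swaption payoff at T_e, discounted by the numeraire.
        sum += std::max(swap_rate - K, 0.0) * annuity / numeraire;
    }
    std::printf("payer swaption price ~= %.6f per unit notional\n",
                sum / paths);
}
```

Each path is independent, which is what makes the algorithm such a good fit for the dataflow model and for many-core offload.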
{"title":"Many-Core Accelerated LIBOR Swaption Portfolio Pricing","authors":"Jörg Lotze, P. Sutton, Hicham Lahlou","doi":"10.1109/SC.Companion.2012.143","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.143","url":null,"abstract":"This paper describes the acceleration of a MonteCarlo algorithm for pricing a LIBOR swaption portfolio using multi-core CPUs and GPUs. Speedups of up to 305x are achieved on two Nvidia Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved by using the Xcelerit platform - writing sequential, high-level C++ code and adopting a simple dataflow programming model. It avoids the complexity involved when using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit platform implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"30 1","pages":"1185-1192"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83404584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Autonomic Modeling of Data-Driven Application Behavior
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.277 | Pages: 1485-1486
S. Monteiro, G. Bronevetsky, Marc Casas
The computational behavior of large-scale data-driven applications is a complex function of their input, configuration settings, and underlying system architecture. The difficulty of predicting the behavior of these applications makes it challenging to optimize their performance and to schedule them onto compute resources, and manually diagnosing performance problems and reconfiguring resource settings is infeasible and inefficient. We thus need autonomic optimization techniques that observe the application, learn from the observations, and then accurately predict application behavior across different systems and load scenarios. This work presents a modular modeling approach for complex data-driven applications based on statistical techniques. These techniques capture important characteristics of the input data, the consequent dynamic application behavior, and system properties to predict application behavior with minimal human intervention. The work demonstrates how to adaptively structure and configure the models based on the observed complexity of application behavior under different input and execution scenarios.
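As a toy illustration of the observe-learn-predict loop meant here, the sketch below fits a least-squares model to observed runs and predicts the runtime of an unseen configuration. The single linear feature and the data points are assumptions for illustration; the poster's modular statistical models are considerably richer.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // (input size in MB, observed runtime in s) from hypothetical runs.
    const std::vector<std::pair<double, double>> runs = {
        {64, 1.9}, {128, 3.8}, {256, 7.4}, {512, 15.1}, {1024, 29.8}};

    // Ordinary least squares for runtime = a + b * size.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(runs.size());
    for (const auto& [x, y] : runs) {
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double a = (sy - b * sx) / n;                          // intercept

    // Predict an unobserved configuration from the learned model.
    const double unseen = 2048;
    std::printf("predicted runtime for %.0f MB: %.1f s\n",
                unseen, a + b * unseen);
}
```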
{"title":"Abstract: Autonomic Modeling of Data-Driven Application Behavior","authors":"S. Monteiro, G. Bronevetsky, Marc Casas","doi":"10.1109/SC.Companion.2012.277","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.277","url":null,"abstract":"Computational behavior of large-scale data driven applications is a complex function of their input, configuration settings, and underlying system architecture. Difficulty in predicting the behavior of these applications makes it challenging to optimize their performance and schedule them onto compute resources. However, manually diagnosing performance problems and reconfiguring resource settings to improve application performance is infeasible and inefficient. We thus need autonomic optimization techniques that observe the application, learn from the observations, and subsequently successfully predict application behavior across different systems and load scenarios. This work presents a modular modeling approach for complex data-driven applications using statistical techniques. These techniques capture important characteristics of input data, consequent dynamic application behavior and system properties to predict application behavior with minimum human intervention. The work demonstrates how to adaptively structure and configure the models based on the observed complexity of application behavior in different input and execution scenarios.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"38 1","pages":"1485-1486"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85631644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Scalable Fast Multipole Methods for Vortex Element Methods
Pub Date: 2012-11-10 | DOI: 10.1109/SC.COMPANION.2012.221 | Pages: 1408
Qi Hu, N. Gumerov, Rio Yokota, L. Barba, R. Duraiswami
We use a particle-based method to simulate incompressible flows, with the Fast Multipole Method (FMM) accelerating the calculation of particle interactions. The most time-consuming kernels, the Biot-Savart equation and the stretching term of the vorticity equation, are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while divergence-free far-field computation is ensured automatically. Based on this formulation, and on our previous work on a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows, including turbulence, on heterogeneous architectures. Our work for this poster focuses on the computational perspective: our implementation can perform one time step of the velocity and stretching computations for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.
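For reference, the two kernels named above are, in their standard vortex-method form (the textbook statement, not the authors' two-potential reformulation):

```latex
\begin{align}
  \mathbf{u}(\mathbf{x}) &= \frac{1}{4\pi}\int
    \frac{\boldsymbol{\omega}(\mathbf{x}')\times(\mathbf{x}-\mathbf{x}')}
         {\lvert \mathbf{x}-\mathbf{x}' \rvert^{3}}\,\mathrm{d}\mathbf{x}'
    && \text{(Biot--Savart)}\\
  \frac{\mathrm{D}\boldsymbol{\omega}}{\mathrm{D}t} &=
    (\boldsymbol{\omega}\cdot\nabla)\,\mathbf{u}
    && \text{(vortex stretching)}
\end{align}
```

Evaluating either kernel directly over N particles costs O(N^2) work per time step; the FMM brings this to O(N), and the authors' reformulation lets both kernels be served by two Laplace scalar potentials rather than six.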
{"title":"Abstract: Scalable Fast Multipole Methods for Vortex Element Methods","authors":"Qi Hu, N. Gumerov, Rio Yokota, L. Barba, R. Duraiswami","doi":"10.1109/SC.COMPANION.2012.221","DOIUrl":"https://doi.org/10.1109/SC.COMPANION.2012.221","url":null,"abstract":"We use a particle-based method to simulate incompressible flows, where the Fast Multipole Method (FMM) is used to accelerate the calculation of particle interactions. The most time-consuming kernels-the Biot-Savart equation and stretching term of the vorticity equation-are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while automatically ensuring divergence-free far-field computation. Based on this formulation, and on our previous work for a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows including turbulence on heterogeneous architectures. Our work for this poster focuses on the computation perspective and our implementation can perform one time step of the velocity+stretching for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"34 1","pages":"1408-1408"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86476942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Case Study: LRZ Liquid Cooling, Energy Management, Contract Specialities
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.123 | Pages: 962-992
Herbert Huber, A. Auweter, T. Wilde, G. Meijer, Charles Archer, Torsten Bloth, Achim Bomelburg, S. Waitz
This presentation explores energy management, liquid cooling, and heat re-use, as well as contract specialities, at the Leibniz-Rechenzentrum (LRZ).
{"title":"Case Study: LRZ Liquid Cooling, Energy Management, Contract Specialities","authors":"Herbert Huber, A. Auweter, T. Wilde, G. Meijer, Charles Archer, Torsten Bloth, Achim Bomelburg, S. Waitz","doi":"10.1109/SC.Companion.2012.123","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.123","url":null,"abstract":"This presentation explores energy management, liquid cooling and heat re-use as well as contract specialities for LRZ: Leibniz-Rechenzentrum.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"49 1","pages":"962-992"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82231010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Auto-Tuning of Parallel IO Parameters for HDF5 Applications
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.236 | Pages: 1430
Babak Behzad, Joey Huchette, Huong Luu, R. Aydt, Q. Koziol, Prabhat, S. Byna, M. Chaarawi, Yushu Yao
Parallel I/O is an unavoidable part of modern high-performance computing (HPC), but its system-wide dependencies mean it has eluded optimization across platforms and applications. This can introduce bottlenecks in otherwise computationally efficient code, especially as scientific computing becomes increasingly data-driven. Various studies have shown that dramatic improvements are possible when the parameters are set appropriately. However, because the HPC I/O stack has multiple layers, each with its own optimization parameters, and because a test run has a nontrivial execution time, finding the optimal parameter values is a very complex problem. Additionally, optimal sets do not necessarily translate between use cases, since I/O performance can depend strongly on the individual application, the problem size, and the compute platform being used. Tunable parameters are exposed primarily at three levels of the I/O stack: the system, middleware, and high-level data-organization layers. HPC systems need a parallel file system, such as Lustre, to intelligently store data in a parallelized fashion. Middleware communication layers, such as MPI-IO, support this kind of parallel I/O and offer a variety of optimizations, such as collective buffering. Scientists and application developers often use HDF5, a high-level cross-platform I/O library that offers a hierarchical object-database representation of scientific data.
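A minimal sketch of where parameters at each of the three layers are actually set, assuming MPI-IO over Lustre with the parallel HDF5 C API; the specific values (stripe settings, aggregator count, alignment) are illustrative assumptions, not tuned results from this work:

```cpp
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // System layer: Lustre striping, passed down as MPI-IO hints.
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");     // stripe over 16 OSTs
    MPI_Info_set(info, "striping_unit", "1048576");  // 1 MiB stripe size

    // Middleware layer: MPI-IO collective buffering (ROMIO hints).
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_nodes", "8");             // 8 aggregator nodes

    // High-level layer: HDF5 file-access properties.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);    // parallel HDF5 driver
    H5Pset_alignment(fapl, 1, 1048576);              // align objects to 1 MiB

    hid_t file = H5Fcreate("tuned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... dataset creation and collective H5Dwrite calls go here ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

The auto-tuning problem is precisely that each of these knobs interacts with the others and with the application's access pattern, so good settings cannot be chosen layer by layer in isolation.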
{"title":"Abstract: Auto-Tuning of Parallel IO Parameters for HDF5 Applications","authors":"Babak Behzad, Joey Huchette, Huong Luu, R. Aydt, Q. Koziol, Prabhat, S. Byna, M. Chaarawi, Yushu Yao","doi":"10.1109/SC.Companion.2012.236","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.236","url":null,"abstract":"Parallel I/O is an unavoidable part of modern high-performance computing (HPC), but its system-wide dependencies means it has eluded optimization across platforms and applications. This can introduce bottlenecks in otherwise computationally efficient code, especially as scientific computing becomes increasingly data-driven. Various studies have shown that dramatic improvements are possible when the parameters are set appropriately. However, as a result of having multiple layers in the HPC I/O stack - each with its own optimization parameters-and nontrivial execution time for a test run, finding the optimal parameter values is a very complex problem. Additionally, optimal sets do not necessarily translate between use cases, since tuning I/O performance can be highly dependent on the individual application, the problem size, and the compute platform being used. Tunable parameters are exposed primarily at three levels in the I/O stack: the system, middleware, and high-level data-organization layers. HPC systems need a parallel file system, such as Lustre, to intelligently store data in a parallelized fashion. Middleware communication layers, such as MPI-IO, support this kind of parallel I/O and offer a variety of optimizations, such as collective buffering. Scientists and application developers often use HDF5, a high-level cross-platform I/O library that offers a hierarchical object-database representation of scientific data.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"48 1","pages":"1430-1430"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81694451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bytes and BTUs: Keys to a Net Zero
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.125 | Pages: 1018-1039
S. Hammond
{"title":"Bytes and BTUs: Keys to a Net Zero","authors":"S. Hammond","doi":"10.1109/SC.Companion.2012.125","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.125","url":null,"abstract":"","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"10 1","pages":"1018-1039"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81980075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Local File Accesses for FUSE-Based Distributed Storage
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.104 | Pages: 760-765
Shun Ishiguro, J. Murakami, Y. Oyama, O. Tatebe
Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when these systems are mounted on a client that uses FUSE, they suffer from I/O overhead caused by extra memory copies and context switches during local file access. This overhead is not small, is especially pronounced for local file access, and can significantly degrade the performance of data-intensive applications running on distributed file systems that aggressively use local storage. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments using the Gfarm distributed file system.
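To see where the extra copies and context switches come from, here is a minimal passthrough read handler, assuming the libfuse 2.x high-level API. It is a generic illustration, not the paper's mechanism: the kernel forwards the application's read to this user-space daemon, the daemon fills the buffer, and the kernel copies that buffer back to the application.

```cpp
#define FUSE_USE_VERSION 26
#include <fuse.h>

#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// Open the backing file and stash the descriptor. A complete file system
// would also implement getattr, readdir, etc., and would map the FUSE path
// onto a backing directory; both are elided in this sketch.
static int pt_open(const char* path, struct fuse_file_info* fi) {
    int fd = open(path, fi->flags);
    if (fd < 0) return -errno;
    fi->fh = static_cast<uint64_t>(fd);
    return 0;
}

// The read path that costs the extra hops: the application's read() is
// routed by the kernel to this daemon, pread() fills the daemon's buffer,
// and the kernel then copies that buffer into the application's buffer --
// two copies and two context switches per local read.
static int pt_read(const char* path, char* buf, size_t size, off_t off,
                   struct fuse_file_info* fi) {
    (void)path;
    ssize_t n = pread(static_cast<int>(fi->fh), buf, size, off);
    return n < 0 ? -errno : static_cast<int>(n);
}

static struct fuse_operations pt_ops = {};

int main(int argc, char* argv[]) {
    pt_ops.open = pt_open;
    pt_ops.read = pt_read;
    return fuse_main(argc, argv, &pt_ops, nullptr);
}
```

The paper's mechanism targets exactly this round trip for files that are physically local to the client.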
{"title":"Optimizing Local File Accesses for FUSE-Based Distributed Storage","authors":"Shun Ishiguro, J. Murakami, Y. Oyama, O. Tatebe","doi":"10.1109/SC.Companion.2012.104","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.104","url":null,"abstract":"Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when these systems are mounted on a client that uses FUSE, they suffer from I/O overhead caused by extra memory copies and context switches during local file access. Overhead imposed by FUSE on file systems is not small and becomes more pronounced during local file access. This overhead may significantly degrade the performance of data-intensive applications running with distributed file systems that aggressively use local storage. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We then incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments, using the Gfarm distributed file system.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"368 1","pages":"760-765"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76608769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Exploring Design Space of a 3D Stacked Vector Cache
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.270 | Pages: 1475-1476
Ryusuke Egawa, J. Tada, Yusuke Endo, H. Takizawa, Hiroaki Kobayashi
Although 3D integration technologies using through-silicon vias (TSVs) are expected to overcome the memory and power wall problems in future microprocessor design, there are no mature EDA tools for designing 3D integrated VLSIs, and the effects of 3D integration on microprocessor design have not been discussed thoroughly. Given this situation, this paper presents a design approach for 3D stacked cache memories using existing EDA tools, and shows an early performance evaluation of 3D stacked cache memories for vector processors.
{"title":"Abstract: Exploring Design Space of a 3D Stacked Vector Cache","authors":"Ryusuke Egawa, J. Tada, Yusuke Endo, H. Takizawa, Hiroaki Kobayashi","doi":"10.1109/SC.Companion.2012.270","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.270","url":null,"abstract":"Although 3D integration technologies with through silicon vias (TSVs) have expected to overcome the memory and power wall problems in the future microprocessor design, there is no promising EDA tools to design 3D integrated VLSIs. In addition, effects of 3D integration on microprocessor design have not been discussed well. Under this situation, this paper presents design approach of 3D stacked cache memories using existing EDA tools, and shows early performances evaluation of 3D stacked cache memories for vector processors.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"81 1","pages":"1475-1476"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90036785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The long term impact of codesign
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.357 | Pages: 2212-2246
A. Gara
{"title":"The long term impact of codesign","authors":"A. Gara","doi":"10.1109/SC.Companion.2012.357","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.357","url":null,"abstract":"","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"70 1-2","pages":"2212-2246"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72624775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}