Poster: High-Speed Decision Making on Live Petabyte Data Streams
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.218 | Pages: 1404
W. Badgett, K. Biery, C. Green, J. Kowalkowski, K. Maeshima, M. Paterno, R. Roser
High Energy Physics has a long history of coping with cutting-edge data rates in its efforts to extract meaning from experimental data. The quantity of data from planned future experiments that must be analyzed in near-real time, so that the scientifically interesting data can be filtered and stored efficiently, has driven the development of sophisticated techniques that leverage technologies such as MPI, OpenMP, and Intel TBB. We trace the evolution of data collection, triggering, and filtering from the Tevatron experiments of the 1990s to future Intensity Frontier and Cosmic Frontier experiments, and show how the requirements of upcoming experiments lead us to develop high-performance streaming triggerless DAQ systems.
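To make the filtering pattern concrete, here is a minimal sketch of a streaming trigger/filter stage built on one of the technologies the abstract names, Intel TBB (assuming the oneTBB API). The Event structure, the synthetic input, and the passes_trigger energy cut are hypothetical stand-ins for illustration, not the experiments' actual reconstruction code.

```cpp
#include <tbb/parallel_pipeline.h>

#include <atomic>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>

struct Event {
    std::vector<float> hits;  // raw detector readout (synthetic here)
    float energy = 0.f;       // filled in by the filter stage
};

int main() {
    const std::size_t n_events = 100000;  // synthetic stream length
    std::size_t produced = 0;
    std::atomic<std::size_t> kept{0};
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> u(0.f, 1.f);

    tbb::parallel_pipeline(
        /*max_number_of_live_tokens=*/16,
        // Stage 1 (serial): ingest the next event from the stream.
        tbb::make_filter<void, std::shared_ptr<Event>>(
            tbb::filter_mode::serial_in_order,
            [&](tbb::flow_control& fc) -> std::shared_ptr<Event> {
                if (produced++ >= n_events) { fc.stop(); return nullptr; }
                auto ev = std::make_shared<Event>();
                ev->hits.resize(64);
                for (auto& h : ev->hits) h = u(rng);
                return ev;
            }) &
        // Stage 2 (parallel): "reconstruct" and apply the trigger decision.
        tbb::make_filter<std::shared_ptr<Event>, std::shared_ptr<Event>>(
            tbb::filter_mode::parallel,
            [](std::shared_ptr<Event> ev) -> std::shared_ptr<Event> {
                for (float h : ev->hits) ev->energy += h;
                const bool passes_trigger = ev->energy > 36.f;  // hypothetical cut
                return passes_trigger ? ev : nullptr;
            }) &
        // Stage 3 (serial): keep only the scientifically interesting events.
        tbb::make_filter<std::shared_ptr<Event>, void>(
            tbb::filter_mode::serial_in_order,
            [&](std::shared_ptr<Event> ev) { if (ev) ++kept; }));

    std::printf("kept %zu of %zu events\n", kept.load(), n_events);
}
```

The parallel middle stage is where the per-event filtering work scales across cores, while the serial in-order end stages preserve the stream order of the accepted events.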
{"title":"Poster: High-Speed Decision Making on Live Petabyte Data Streams","authors":"W. Badgett, K. Biery, C. Green, J. Kowalkowski, K. Maeshima, M. Paterno, R. Roser","doi":"10.1109/SC.Companion.2012.218","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.218","url":null,"abstract":"High Energy Physics has a long history of coping with cutting-edge data rates in its efforts to extract meaning from experimental data. The quantity of data from planned future experiments that must be analyzed practically in real-time to enable efficient filtering and storage of the scientifically interesting data has driven the development of sophisticated techniques which leverage technologies such as MPI, OpenMP and Intel TBB. We show the evolution of data collection, triggering and filtering from the 1990s with TeVatron experiments into the future of Intensity Frontier and Cosmic Frontier experiments and show how the requirements of upcoming experiments lead us to the development of high-performance streaming triggerless DAQ systems.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"22 1","pages":"1404-1404"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84406460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many-Core Accelerated LIBOR Swaption Portfolio Pricing
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.143 | Pages: 1185-1192
Jörg Lotze, P. Sutton, Hicham Lahlou
This paper describes the acceleration of a Monte-Carlo algorithm for pricing a LIBOR swaption portfolio using multi-core CPUs and GPUs. Speedups of up to 305x are achieved on two Nvidia Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved using the Xcelerit platform: writing sequential, high-level C++ code and adopting a simple dataflow programming model. It avoids the complexity involved in using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.
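For readers unfamiliar with the underlying pricer, below is a sequential baseline sketch of a one-factor, log-Euler LIBOR market model Monte-Carlo swaption pricer of the kind being accelerated. The flat 5% initial curve, flat 20% volatilities, strike, and tenor are illustrative assumptions; this is neither the paper's portfolio nor the Xcelerit code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int N = 20;           // number of accrual periods in the curve
    const int expiry = 4;       // option expiry index, T_e = expiry * tau
    const double tau = 0.5;     // accrual period length (years)
    const double K = 0.05;      // fixed-leg strike
    const long paths = 100000;  // Monte-Carlo paths

    std::mt19937_64 rng(7);
    std::normal_distribution<double> gauss(0.0, 1.0);

    double sum = 0.0;
    for (long p = 0; p < paths; ++p) {
        std::vector<double> L(N, 0.05);            // flat initial forwards
        const std::vector<double> sigma(N, 0.20);  // flat volatilities
        double numeraire = 1.0;  // rolling money-market account (spot measure)

        // Log-Euler evolution of all live forwards up to the expiry.
        for (int t = 0; t < expiry; ++t) {
            const double dW = std::sqrt(tau) * gauss(rng);
            numeraire *= 1.0 + tau * L[t];  // compound at the fixing rate
            // Descending j so the drift sum sees pre-update forwards.
            for (int j = N - 1; j > t; --j) {
                double drift = 0.0;  // spot-measure drift sum
                for (int k = t + 1; k <= j; ++k)
                    drift += tau * sigma[k] * L[k] / (1.0 + tau * L[k]);
                L[j] *= std::exp(sigma[j] * (drift - 0.5 * sigma[j]) * tau
                                 + sigma[j] * dW);
            }
        }

        // Swap rate and annuity over [T_e, T_N] from the realized curve.
        double df = 1.0, annuity = 0.0;
        for (int j = expiry; j < N; ++j) {
            df /= 1.0 + tau * L[j];
            annuity += tau * df;
        }
        const double swap_rate = (1.0 - df) / annuity;
        // Payer swaption payoff at T_e, discounted by the numeraire.
        sum += std::max(swap_rate - K, 0.0) * annuity / numeraire;
    }
    std::printf("payer swaption price ~= %.6f per unit notional\n",
                sum / paths);
}
```

Each path is independent, which is what makes the algorithm such a good fit for the dataflow model and for many-core offload.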
{"title":"Many-Core Accelerated LIBOR Swaption Portfolio Pricing","authors":"Jörg Lotze, P. Sutton, Hicham Lahlou","doi":"10.1109/SC.Companion.2012.143","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.143","url":null,"abstract":"This paper describes the acceleration of a MonteCarlo algorithm for pricing a LIBOR swaption portfolio using multi-core CPUs and GPUs. Speedups of up to 305x are achieved on two Nvidia Tesla M2050 GPUs and up to 20.8x on two Intel Xeon E5620 CPUs, compared to a sequential CPU implementation. This performance is achieved by using the Xcelerit platform - writing sequential, high-level C++ code and adopting a simple dataflow programming model. It avoids the complexity involved when using low-level high-performance computing frameworks such as OpenMP, OpenCL, CUDA, or SIMD intrinsics. The paper provides an overview of the Xcelerit platform, details how high performance is achieved through various automatic optimisation and parallelisation techniques, and shows how the tool can be used to implement portable accelerated Monte-Carlo algorithms in finance. It illustrates the implementation of the Monte-Carlo LIBOR swaption portfolio pricer and gives performance results. A comparison of the Xcelerit platform implementation with an equivalent low-level CUDA version shows that the overhead introduced is less than 1.5% in all scenarios.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"30 1","pages":"1185-1192"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83404584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Autonomic Modeling of Data-Driven Application Behavior
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.277 | Pages: 1485-1486
S. Monteiro, G. Bronevetsky, Marc Casas
The computational behavior of large-scale data-driven applications is a complex function of their input, configuration settings, and underlying system architecture. The difficulty of predicting the behavior of these applications makes it challenging to optimize their performance and to schedule them onto compute resources, and manually diagnosing performance problems and reconfiguring resource settings is infeasible and inefficient. We thus need autonomic optimization techniques that observe the application, learn from the observations, and then accurately predict application behavior across different systems and load scenarios. This work presents a modular modeling approach for complex data-driven applications based on statistical techniques. These techniques capture important characteristics of the input data, the consequent dynamic application behavior, and system properties to predict application behavior with minimal human intervention. The work demonstrates how to adaptively structure and configure the models based on the observed complexity of application behavior under different input and execution scenarios.
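As a toy illustration of the observe-learn-predict loop meant here, the sketch below fits a least-squares model to observed runs and predicts the runtime of an unseen configuration. The single linear feature and the data points are assumptions for illustration; the poster's modular statistical models are considerably richer.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

int main() {
    // (input size in MB, observed runtime in s) from hypothetical runs.
    const std::vector<std::pair<double, double>> runs = {
        {64, 1.9}, {128, 3.8}, {256, 7.4}, {512, 15.1}, {1024, 29.8}};

    // Ordinary least squares for runtime = a + b * size.
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const double n = static_cast<double>(runs.size());
    for (const auto& [x, y] : runs) {
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    const double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double a = (sy - b * sx) / n;                          // intercept

    // Predict an unobserved configuration from the learned model.
    const double unseen = 2048;
    std::printf("predicted runtime for %.0f MB: %.1f s\n",
                unseen, a + b * unseen);
}
```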
{"title":"Abstract: Autonomic Modeling of Data-Driven Application Behavior","authors":"S. Monteiro, G. Bronevetsky, Marc Casas","doi":"10.1109/SC.Companion.2012.277","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.277","url":null,"abstract":"Computational behavior of large-scale data driven applications is a complex function of their input, configuration settings, and underlying system architecture. Difficulty in predicting the behavior of these applications makes it challenging to optimize their performance and schedule them onto compute resources. However, manually diagnosing performance problems and reconfiguring resource settings to improve application performance is infeasible and inefficient. We thus need autonomic optimization techniques that observe the application, learn from the observations, and subsequently successfully predict application behavior across different systems and load scenarios. This work presents a modular modeling approach for complex data-driven applications using statistical techniques. These techniques capture important characteristics of input data, consequent dynamic application behavior and system properties to predict application behavior with minimum human intervention. The work demonstrates how to adaptively structure and configure the models based on the observed complexity of application behavior in different input and execution scenarios.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"38 1","pages":"1485-1486"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85631644","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Scalable Fast Multipole Methods for Vortex Element Methods
Pub Date: 2012-11-10 | DOI: 10.1109/SC.COMPANION.2012.221 | Pages: 1408
Qi Hu, N. Gumerov, Rio Yokota, L. Barba, R. Duraiswami
We use a particle-based method to simulate incompressible flows, with the Fast Multipole Method (FMM) accelerating the calculation of particle interactions. The most time-consuming kernels, the Biot-Savart equation and the stretching term of the vorticity equation, are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while divergence-free far-field computation is ensured automatically. Based on this formulation, and on our previous work on a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows, including turbulence, on heterogeneous architectures. Our work for this poster focuses on the computational perspective: our implementation can perform one time step of the velocity and stretching computations for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.
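For reference, the two kernels named above are, in their standard vortex-method form (the textbook statement, not the authors' two-potential reformulation):

```latex
\begin{align}
  \mathbf{u}(\mathbf{x}) &= \frac{1}{4\pi}\int
    \frac{\boldsymbol{\omega}(\mathbf{x}')\times(\mathbf{x}-\mathbf{x}')}
         {\lvert \mathbf{x}-\mathbf{x}' \rvert^{3}}\,\mathrm{d}\mathbf{x}'
    && \text{(Biot--Savart)}\\
  \frac{\mathrm{D}\boldsymbol{\omega}}{\mathrm{D}t} &=
    (\boldsymbol{\omega}\cdot\nabla)\,\mathbf{u}
    && \text{(vortex stretching)}
\end{align}
```

Evaluating either kernel directly over N particles costs O(N^2) work per time step; the FMM brings this to O(N), and the authors' reformulation lets both kernels be served by two Laplace scalar potentials rather than six.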
{"title":"Abstract: Scalable Fast Multipole Methods for Vortex Element Methods","authors":"Qi Hu, N. Gumerov, Rio Yokota, L. Barba, R. Duraiswami","doi":"10.1109/SC.COMPANION.2012.221","DOIUrl":"https://doi.org/10.1109/SC.COMPANION.2012.221","url":null,"abstract":"We use a particle-based method to simulate incompressible flows, where the Fast Multipole Method (FMM) is used to accelerate the calculation of particle interactions. The most time-consuming kernels-the Biot-Savart equation and stretching term of the vorticity equation-are mathematically reformulated so that only two Laplace scalar potentials are used instead of six, while automatically ensuring divergence-free far-field computation. Based on this formulation, and on our previous work for a scalar heterogeneous FMM algorithm, we develop a new FMM-based vortex method capable of simulating general flows including turbulence on heterogeneous architectures. Our work for this poster focuses on the computation perspective and our implementation can perform one time step of the velocity+stretching for one billion particles on 32 nodes in 55.9 seconds, which yields 49.12 Tflop/s.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"34 1","pages":"1408-1408"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86476942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Case Study: LRZ Liquid Cooling, Energy Management, Contract Specialities
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.123 | Pages: 962-992
Herbert Huber, A. Auweter, T. Wilde, G. Meijer, Charles Archer, Torsten Bloth, Achim Bomelburg, S. Waitz
This presentation explores energy management, liquid cooling, and heat re-use, as well as contract specialities, at the Leibniz-Rechenzentrum (LRZ).
{"title":"Case Study: LRZ Liquid Cooling, Energy Management, Contract Specialities","authors":"Herbert Huber, A. Auweter, T. Wilde, G. Meijer, Charles Archer, Torsten Bloth, Achim Bomelburg, S. Waitz","doi":"10.1109/SC.Companion.2012.123","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.123","url":null,"abstract":"This presentation explores energy management, liquid cooling and heat re-use as well as contract specialities for LRZ: Leibniz-Rechenzentrum.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"49 1","pages":"962-992"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82231010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Auto-Tuning of Parallel IO Parameters for HDF5 Applications
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.236 | Pages: 1430
Babak Behzad, Joey Huchette, Huong Luu, R. Aydt, Q. Koziol, Prabhat, S. Byna, M. Chaarawi, Yushu Yao
Parallel I/O is an unavoidable part of modern high-performance computing (HPC), but its system-wide dependencies mean it has eluded optimization across platforms and applications. This can introduce bottlenecks in otherwise computationally efficient code, especially as scientific computing becomes increasingly data-driven. Various studies have shown that dramatic improvements are possible when the parameters are set appropriately. However, because the HPC I/O stack has multiple layers, each with its own optimization parameters, and because a test run has a nontrivial execution time, finding the optimal parameter values is a very complex problem. Additionally, optimal sets do not necessarily translate between use cases, since I/O performance can depend strongly on the individual application, the problem size, and the compute platform being used. Tunable parameters are exposed primarily at three levels of the I/O stack: the system, middleware, and high-level data-organization layers. HPC systems need a parallel file system, such as Lustre, to intelligently store data in a parallelized fashion. Middleware communication layers, such as MPI-IO, support this kind of parallel I/O and offer a variety of optimizations, such as collective buffering. Scientists and application developers often use HDF5, a high-level cross-platform I/O library that offers a hierarchical object-database representation of scientific data.
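A minimal sketch of where parameters at each of the three layers are actually set, assuming MPI-IO over Lustre with the parallel HDF5 C API; the specific values (stripe settings, aggregator count, alignment) are illustrative assumptions, not tuned results from this work:

```cpp
#include <hdf5.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // System layer: Lustre striping, passed down as MPI-IO hints.
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "16");     // stripe over 16 OSTs
    MPI_Info_set(info, "striping_unit", "1048576");  // 1 MiB stripe size

    // Middleware layer: MPI-IO collective buffering (ROMIO hints).
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_nodes", "8");             // 8 aggregator nodes

    // High-level layer: HDF5 file-access properties.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);    // parallel HDF5 driver
    H5Pset_alignment(fapl, 1, 1048576);              // align objects to 1 MiB

    hid_t file = H5Fcreate("tuned.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    /* ... dataset creation and collective H5Dwrite calls go here ... */
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

The auto-tuning problem is precisely that each of these knobs interacts with the others and with the application's access pattern, so good settings cannot be chosen layer by layer in isolation.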
{"title":"Abstract: Auto-Tuning of Parallel IO Parameters for HDF5 Applications","authors":"Babak Behzad, Joey Huchette, Huong Luu, R. Aydt, Q. Koziol, Prabhat, S. Byna, M. Chaarawi, Yushu Yao","doi":"10.1109/SC.Companion.2012.236","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.236","url":null,"abstract":"Parallel I/O is an unavoidable part of modern high-performance computing (HPC), but its system-wide dependencies means it has eluded optimization across platforms and applications. This can introduce bottlenecks in otherwise computationally efficient code, especially as scientific computing becomes increasingly data-driven. Various studies have shown that dramatic improvements are possible when the parameters are set appropriately. However, as a result of having multiple layers in the HPC I/O stack - each with its own optimization parameters-and nontrivial execution time for a test run, finding the optimal parameter values is a very complex problem. Additionally, optimal sets do not necessarily translate between use cases, since tuning I/O performance can be highly dependent on the individual application, the problem size, and the compute platform being used. Tunable parameters are exposed primarily at three levels in the I/O stack: the system, middleware, and high-level data-organization layers. HPC systems need a parallel file system, such as Lustre, to intelligently store data in a parallelized fashion. Middleware communication layers, such as MPI-IO, support this kind of parallel I/O and offer a variety of optimizations, such as collective buffering. Scientists and application developers often use HDF5, a high-level cross-platform I/O library that offers a hierarchical object-database representation of scientific data.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"48 1","pages":"1430-1430"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81694451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bytes and BTUs: Keys to a Net Zero
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.125 | Pages: 1018-1039
S. Hammond
{"title":"Bytes and BTUs: Keys to a Net Zero","authors":"S. Hammond","doi":"10.1109/SC.Companion.2012.125","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.125","url":null,"abstract":"","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"10 1","pages":"1018-1039"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81980075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing Local File Accesses for FUSE-Based Distributed Storage
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.104 | Pages: 760-765
Shun Ishiguro, J. Murakami, Y. Oyama, O. Tatebe
Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when these systems are mounted on a client that uses FUSE, they suffer from I/O overhead caused by extra memory copies and context switches during local file access. This overhead is not small, is especially pronounced for local file access, and can significantly degrade the performance of data-intensive applications running on distributed file systems that aggressively use local storage. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments using the Gfarm distributed file system.
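To see where the extra copies and context switches come from, here is a minimal passthrough read handler, assuming the libfuse 2.x high-level API. It is a generic illustration, not the paper's mechanism: the kernel forwards the application's read to this user-space daemon, the daemon fills the buffer, and the kernel copies that buffer back to the application.

```cpp
#define FUSE_USE_VERSION 26
#include <fuse.h>

#include <cerrno>
#include <cstdint>
#include <fcntl.h>
#include <unistd.h>

// Open the backing file and stash the descriptor. A complete file system
// would also implement getattr, readdir, etc., and would map the FUSE path
// onto a backing directory; both are elided in this sketch.
static int pt_open(const char* path, struct fuse_file_info* fi) {
    int fd = open(path, fi->flags);
    if (fd < 0) return -errno;
    fi->fh = static_cast<uint64_t>(fd);
    return 0;
}

// The read path that costs the extra hops: the application's read() is
// routed by the kernel to this daemon, pread() fills the daemon's buffer,
// and the kernel then copies that buffer into the application's buffer --
// two copies and two context switches per local read.
static int pt_read(const char* path, char* buf, size_t size, off_t off,
                   struct fuse_file_info* fi) {
    (void)path;
    ssize_t n = pread(static_cast<int>(fi->fh), buf, size, off);
    return n < 0 ? -errno : static_cast<int>(n);
}

static struct fuse_operations pt_ops = {};

int main(int argc, char* argv[]) {
    pt_ops.open = pt_open;
    pt_ops.read = pt_read;
    return fuse_main(argc, argv, &pt_ops, nullptr);
}
```

The paper's mechanism targets exactly this round trip for files that are physically local to the client.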
{"title":"Optimizing Local File Accesses for FUSE-Based Distributed Storage","authors":"Shun Ishiguro, J. Murakami, Y. Oyama, O. Tatebe","doi":"10.1109/SC.Companion.2012.104","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.104","url":null,"abstract":"Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when these systems are mounted on a client that uses FUSE, they suffer from I/O overhead caused by extra memory copies and context switches during local file access. Overhead imposed by FUSE on file systems is not small and becomes more pronounced during local file access. This overhead may significantly degrade the performance of data-intensive applications running with distributed file systems that aggressively use local storage. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We then incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments, using the Gfarm distributed file system.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"368 1","pages":"760-765"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76608769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Exploring Design Space of a 3D Stacked Vector Cache
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.270 | Pages: 1475-1476
Ryusuke Egawa, J. Tada, Yusuke Endo, H. Takizawa, Hiroaki Kobayashi
Although 3D integration technologies using through-silicon vias (TSVs) are expected to overcome the memory and power wall problems in future microprocessor design, there are no mature EDA tools for designing 3D integrated VLSIs, and the effects of 3D integration on microprocessor design have not been discussed thoroughly. Given this situation, this paper presents a design approach for 3D stacked cache memories using existing EDA tools, and shows an early performance evaluation of 3D stacked cache memories for vector processors.
{"title":"Abstract: Exploring Design Space of a 3D Stacked Vector Cache","authors":"Ryusuke Egawa, J. Tada, Yusuke Endo, H. Takizawa, Hiroaki Kobayashi","doi":"10.1109/SC.Companion.2012.270","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.270","url":null,"abstract":"Although 3D integration technologies with through silicon vias (TSVs) have expected to overcome the memory and power wall problems in the future microprocessor design, there is no promising EDA tools to design 3D integrated VLSIs. In addition, effects of 3D integration on microprocessor design have not been discussed well. Under this situation, this paper presents design approach of 3D stacked cache memories using existing EDA tools, and shows early performances evaluation of 3D stacked cache memories for vector processors.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"81 1","pages":"1475-1476"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90036785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The long term impact of codesign
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.357 | Pages: 2212-2246
A. Gara
{"title":"The long term impact of codesign","authors":"A. Gara","doi":"10.1109/SC.Companion.2012.357","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.357","url":null,"abstract":"","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"70 1-2","pages":"2212-2246"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72624775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}