K. Bowers, B. Albright, B. Bergen, L. Yin, K. Barker, D. Kerbyson
We demonstrate the outstanding performance and scalability of the VPIC kinetic plasma modeling code on the heterogeneous IBM Roadrunner supercomputer at Los Alamos National Laboratory. VPIC is a three-dimensional, relativistic, electromagnetic, particle-in-cell (PIC) code that self-consistently evolves a kinetic plasma. VPIC simulations of laser plasma interaction were conducted at unprecedented fidelity and scale (up to 1.0 × 10^12 particles on as many as 136 × 10^6 voxels) to model accurately the particle trapping physics occurring within a laser-driven hohlraum in an inertial confinement fusion experiment. During a parameter study of laser reflectivity as a function of laser intensity under experimentally realizable hohlraum conditions, we measured sustained performance exceeding 0.374 Pflop/s (single precision), with the inner loop itself achieving 0.488 Pflop/s (s.p.). Given the increasing importance of data motion limitations, it is notable that this was measured in a PIC calculation, a technique that typically requires more data motion per computation than other techniques often used to demonstrate supercomputer performance (such as dense matrix calculations, molecular dynamics N-body calculations, and Monte Carlo calculations). This capability opens up the exciting possibility of using VPIC to model, from first principles, an issue critical to the success of the multi-billion-dollar DOE/NNSA National Ignition Facility.
{"title":"0.374 Pflop/s trillion-particle kinetic modeling of laser plasma interaction on roadrunner","authors":"K. Bowers, B. Albright, B. Bergen, L. Yin, K. Barker, D. Kerbyson","doi":"10.1109/SC.2008.5222734","DOIUrl":"https://doi.org/10.1109/SC.2008.5222734","url":null,"abstract":"We demonstrate the outstanding performance and scalability of the VPIC kinetic plasma modeling code on the heterogeneous IBM Roadrunner supercomputer at Los Alamos National Laboratory. VPIC is a three-dimensional, relativistic, electromagnetic, particle-in-cell (PIC) code that self-consistently evolves a kinetic plasma. VPIC simulations of laser plasma interaction were conducted at unprecedented fidelity and scale-up to 1.0 times 1012 particles on as many as 136 times 106 voxels-to model accurately the particle trapping physics occurring within a laser-driven hohlraum in an inertial confinement fusion experiment. During a parameter study of laser reflectivity as a function of laser intensity under experimentally realizable hohlraum conditions, we measured sustained performance exceeding 0.374 Pflop/s (s.p.) with the inner loop itself achieving 0.488 Pflop/s (s.p.). Given the increasing importance of data motion limitations, it is notable that this was measured in a PIC calculation-a technique that typically requires more data motion per computation than other techniques (such as dense matrix calculations, molecular dynamics N-body calculations and Monte-Carlo calculations) often used to demonstrate supercomputer performance. This capability opens up the exciting possibility of using VPIC to model, from first-principles, an issue critical to the success of the multi-billion dollar DOE/NNSA National Ignition Facility.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123369913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data, and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and, in case of mismatch, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by exposing the application to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.
{"title":"Analysis of application heartbeats: Learning structural and temporal features in time series data for identification of performance problems","authors":"Emma S. Buneci, D. Reed","doi":"10.1109/SC.2008.5219753","DOIUrl":"https://doi.org/10.1109/SC.2008.5219753","url":null,"abstract":"Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and in case of mismatches, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by application exposure to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125160752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Much of the GPU performance "hype" has focused on tightly coupled applications with small memory bandwidth requirements, e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; effective programming methodologies for exploiting it, however, have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed strided memory accesses. Applied to real applications, our kernel achieves orders-of-magnitude improvements in power- and cost-versus-performance metrics. The off-card bandwidth limitation remains an issue; it can be alleviated somewhat by confining application kernels to the card, though the ideal solution is faster GPU interfaces.
{"title":"Bandwidth intensive 3-D FFT kernel for GPUs using CUDA","authors":"Akira Nukada, Y. Ogata, Toshio Endo, S. Matsuoka","doi":"10.1145/1413370.1413376","DOIUrl":"https://doi.org/10.1145/1413370.1413376","url":null,"abstract":"Most GPU performance ldquohypesrdquo have focused around tightly-coupled applications with small memory bandwidth requirements e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, being more than three times faster than any existing FFT implementations on GPUs including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed stride memory accesses. Our kernel applied to real applications achieves orders of magnitude boost in power&cost vs. performance metrics. The off-card bandwidth limitation is still an issue, which could be alleviated somewhat with application kernels confinement within the card, while ideal solution being facilitation of faster GPU interfaces.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114787768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gregory L. Lee, D. Ahn, D. Arnold, B. Supinski, M. LeGendre, B. Miller, M. Schulz, B. Liblit
Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an InfiniBand cluster and results at up to 208K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
{"title":"Lessons learned at 208K: Towards debugging millions of cores","authors":"Gregory L. Lee, D. Ahn, D. Arnold, B. Supinski, M. LeGendre, B. Miller, M. Schulz, B. Liblit","doi":"10.1109/SC.2008.5218557","DOIUrl":"https://doi.org/10.1109/SC.2008.5218557","url":null,"abstract":"Petascale systems will present several new challenges to performance and correctness tools. Such machines may contain millions of cores, requiring that tools use scalable data structures and analysis algorithms to collect and to process application data. In addition, at such scales, each tool itself will become a large parallel application - already, debugging the full Blue-Gene/L (BG/L) installation at the Lawrence Livermore National Laboratory requires employing 1664 tool daemons. To reach such sizes and beyond, tools must use a scalable communication infrastructure and manage their own tool processes efficiently. Some system resources, such as the file system, may also become tool bottlenecks. In this paper, we present challenges to petascale tool development, using the stack trace analysis tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered at thousands of tasks on an Infiniband cluster and results up to 208 K processes on BG/L to identify current scalability issues as well as challenges that will be faced at the petascale. We then present implemented solutions to these challenges and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129044204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Sampath, Santi S. Adavani, H. Sundar, I. Lashuk, G. Biros
In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations (PDEs) involving second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partitioning of multilevel octree forests. We describe a bottom-up algorithm for constructing the coarser multigrid levels. The input is an arbitrary 2:1 balanced octree-based mesh representing the fine-level mesh. The output is a set of octrees and meshes that are used in the multigrid sweeps. Also, we describe matrix-free implementations for the discretized PDE operators and the intergrid transfer operations. We present results on up to 4096 CPUs on the Cray XT3 ("BigBen"), the Intel 64 system ("Abe"), and the Sun Constellation Linux cluster ("Ranger").
{"title":"Dendro: Parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees","authors":"R. Sampath, Santi S. Adavani, H. Sundar, I. Lashuk, G. Biros","doi":"10.1109/SC.2008.5218558","DOIUrl":"https://doi.org/10.1109/SC.2008.5218558","url":null,"abstract":"In this article, we present Dendro, a suite of parallel algorithms for the discretization and solution of partial differential equations (PDEs) involving second-order elliptic operators. Dendro uses trilinear finite element discretizations constructed using octrees. Dendro, comprises four main modules: a bottom-up octree generation and 2:1 balancing module, a meshing module, a geometric multiplicative multigrid module, and a module for adaptive mesh refinement (AMR). Here, we focus on the multigrid and AMR modules. The key features of Dendro are coarsening/refinement, inter-octree transfers of scalar and vector fields, and parallel partition of multilevel octree forests. We describe a bottom-up algorithm for constructing the coarser multigrid levels. The input is an arbitrary 2:1 balanced octree-based mesh, representing the fine level mesh. The output is a set of octrees and meshes that are used in the multigrid sweeps. Also, we describe matrix-free implementations for the discretized PDE operators and the intergrid transfer operations. We present results on up to 4096 CPUs on the Cray XT3 (ldquoBigBenrdquo), the Intel 64 system (ldquoAberdquo), and the Sun Constellation Linux cluster (ldquoRangerrdquo).","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131675062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Schlosser, Michael P. Ryan, Ricardo Taborda-Rios, J. C. López-Hernández, D. O'Hallaron, J. Bielak
Large-scale earthquake simulation requires source datasets which describe the highly heterogeneous physical characteristics of the earth in the region under simulation. Physical characteristic datasets are the first stage in a simulation pipeline which includes mesh generation, partitioning, solving, and visualization. In practice, the data is produced in an ad hoc fashion for each set of experiments, which has several significant shortcomings, including lower performance, decreased repeatability and comparability, and a longer time to science, an increasingly important metric. As a solution to these problems, we propose a new approach for providing scientific data to ground motion simulations, in which ground model datasets are fully materialized into octrees stored on disk, which can be queried more efficiently (by up to two orders of magnitude) than the underlying community velocity model programs. While octrees have long been used to store spatial datasets, they have not yet been used at the scale we propose. We further propose that these datasets can be provided as a service, either over the Internet or, more likely, in a data center or supercomputing center in which the simulations take place. Since constructing these octrees is itself a challenge, we present three data-parallel techniques for efficiently building them, which can significantly decrease the build time from days or weeks to hours using commodity clusters. This approach typifies a broader shift toward "science as a service" techniques in which scientific computation and storage services become more tightly intertwined.
{"title":"Materialized community ground models for large-scale earthquake simulation","authors":"S. Schlosser, Michael P. Ryan, Ricardo Taborda-Rios, J. C. López-Hernández, D. O'Hallaron, J. Bielak","doi":"10.1109/SC.2008.5215657","DOIUrl":"https://doi.org/10.1109/SC.2008.5215657","url":null,"abstract":"Large-scale earthquake simulation requires source datasets which describe the highly heterogeneous physical characteristics of the earth in the region under simulation. Physical characteristic datasets are the first stage in a simulation pipeline which includes mesh generation, partitioning, solving, and visualization. In practice, the data is produced in an ad-hoc fashion for each set of experiments, which has several significant shortcomings including lower performance, decreased repeatability and comparability, and a longer time to science, an increasingly important metric. As a solution to these problems, we propose a new approach for providing scientific data to ground motion simulations, in which ground model datasets are fully materialized into octress stored on disk, which can be more efficiently queried (by up to two orders of magnitude) than the underlying community velocity model programs. While octrees have long been used to store spatial datasets, they have not yet been used at the scale we propose. We further propose that these datasets can be provided as a service, either over the Internet or, more likely, in a data center or supercomputing center in which the simulations take place. Since constructing these octrees is itself a challenge, we present three data-parallel techniques for efficiently building them, which can significantly decrease the build time from days or weeks to hours using commodity clusters. This approach typifies a broader shift toward science as a service techniques in which scientific computation and storage services become more tightly intertwined.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116722868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marek Wieczorek, Stefan Podlipnig, R. Prodan, T. Fahringer
Grid economy models have long been considered a promising alternative to classical Grid resource management, due to their dynamic and decentralized nature, and because the financial valuation of resources and services is inherent in any such model. In particular, auction models are widely used in existing Grid research, as they are easy to implement and have been shown to successfully manage resource allocation on the Grid market. The focus of the current work is on workflow scheduling in a Grid resource allocation model based on Continuous Double Auctions (CDAs). We analyze different scheduling strategies that can be applied by the user to execute workflows in such an environment, and try to identify general behavioral patterns that can lead to fast and cheap workflow execution. In the experimental study, we show that under certain circumstances some benefit can be gained by applying an "aggressive" scheduling strategy.
{"title":"Applying double auctions for scheduling of workflows on the Grid","authors":"Marek Wieczorek, Stefan Podlipnig, R. Prodan, T. Fahringer","doi":"10.1109/SC.2008.5218071","DOIUrl":"https://doi.org/10.1109/SC.2008.5218071","url":null,"abstract":"Grid economy models have long been considered as a promising alternative for the classical Grid resource management, due to their dynamic and decentralized nature, and because the financial valuation of resources and services is inherent in any such model. In particular, auction models are widely used in the existing Grid research, as they are easy to implement and are shown to successfully manage resource allocation on the Grid market. The focus on the current work is on workflow scheduling in the Grid resource allocation model based on Continuous Double Auctions (CDA). We analyze different scheduling strategies that can be applied by the user to execute workflows in such an environment, and try to identify the general behavioral patterns that can lead to a fast and cheap workflow execution. In the experimental study, we show that under certain circumstances some benefit can be gained by applying an ldquoaggressiverdquo scheduling strategy.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128211641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For wide-area high-performance applications, lightpaths provide 10 Gbps connectivity, and multi-core hosts with PCI-Express can drive such data rates. However, sustaining such end-to-end application throughputs across connections of thousands of miles remains challenging, and current performance studies of such solutions are very limited. We present an experimental study of two solutions for achieving such throughputs, based on: (a) 10 Gbps Ethernet with TCP/IP transport protocols, and (b) InfiniBand and its wide-area extensions. For both, we generate performance profiles over 10 Gbps connections of lengths up to 8600 miles, and discuss the components, complexity, and limitations of sustaining such throughputs using different connections and host configurations. Our results indicate that the InfiniBand solution is better suited to applications with a single large flow, while the 10GigE solution is better for those with multiple competing flows.
{"title":"Wide-area performance profiling of 10GigE and InfiniBand technologies","authors":"N. Rao, Weikuan Yu, W. Wing, S. Poole, J. Vetter","doi":"10.1109/SC.2008.5214435","DOIUrl":"https://doi.org/10.1109/SC.2008.5214435","url":null,"abstract":"For wide-area high-performance applications, light-paths provide 10Gbps connectivity, and multi-core hosts with PCI-Express can drive such data rates. However, sustaining such end-to-end application throughputs across connections of thousands of miles remains challenging, and the current performance studies of such solutions are very limited. We present an experimental study of two solutions to achieve such throughputs based on: (a) 10Gbps Ethernet with TCP/IP transport protocols, and (b) InfiniBand and its wide-area extensions. For both, we generate performance profiles over 10Gbps connections of lengths up to 8600 miles, and discuss the components, complexity, and limitations of sustaining such throughputs, using different connections and host configurations. Our results indicate that IB solution is better suited for applications with a single large flow, and 10GigE solution is better for those with multiple competing flows.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121528468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.
{"title":"Using server-to-server communication in parallel file systems to simplify consistency and improve performance","authors":"P. Carns, B. Settlemyer, W. Ligon","doi":"10.1145/1413370.1413377","DOIUrl":"https://doi.org/10.1145/1413370.1413377","url":null,"abstract":"The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128179391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emerging class of adaptive, real-time, data-driven applications poses a significant problem for today's HPC systems. In general, it is extremely difficult for queuing-system-controlled HPC resources to make and guarantee a tightly bounded prediction regarding the time at which a newly submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guaranteeing predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate the job's progress such that the job meets its deadline without requiring exclusive access to resources, even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely used applications, WRF and BLAST, on an 8-core server shows that our approach is predictable and meets deadlines with an average error of 3.4% while achieving high overall utilization.
{"title":"Feedback-controlled resource sharing for predictable eScience","authors":"Sang-Min Park, M. Humphrey","doi":"10.1145/1413370.1413384","DOIUrl":"https://doi.org/10.1145/1413370.1413384","url":null,"abstract":"The emerging class of adaptive, real-time, data-driven applications is a significant problem for today's HPC systems. In general, it is extremely difficult for queuing-system-controlled HPC resources to make and guarantee a tightly-bounded prediction regarding the time at which a newly-submitted application will execute. While a reservation-based approach partially addresses the problem, it can create severe resource under-utilization (unused reservations, necessary scheduled idle slots, underutilized reservations, etc.) that resource providers are eager to avoid. In contrast, this paper presents a fundamentally different approach to guarantee predictable execution. By creating a virtualized application layer called the performance container, and opportunistically multiplexing concurrent performance containers through the application of formal feedback control theory, we regulate the job's progress such that the job meets its deadline without requiring exclusive access to resources even in the presence of a wide class of unexpected disturbances. Our evaluation using two widely-used applications, WRF and BLAST, on an 8-core server show our approach is predictable and meets deadlines with 3.4 % of errors on average while achieving high overall utilization.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130516676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}