
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis (Latest Publications)

IndexFS: Scaling File System Metadata Performance with Stateless Caching and Bulk Insertion
Kai Ren, Qing Zheng, Swapnil Patil, Garth A. Gibson
The growing size of modern storage systems is expected to exceed billions of objects, making metadata scalability critical to overall performance. Many existing distributed file systems focus only on providing highly parallel, fast access to file data and lack a scalable metadata service. In this paper, we introduce a middleware design called IndexFS that adds support to existing file systems such as PVFS, Lustre, and HDFS for scalable, high-performance operations on metadata and small files. IndexFS uses a table-based architecture that incrementally partitions the namespace on a per-directory basis, preserving server and disk locality for small directories. An optimized log-structured layout is used to store metadata and small files efficiently. We also propose two client-based, storm-free caching techniques: bulk namespace insertion for creation-intensive workloads such as N-N checkpointing, and stateless consistent metadata caching for hot-spot mitigation. By combining these techniques, we have demonstrated IndexFS scaling to 128 metadata servers. Experiments show our out-of-core metadata throughput outperforming existing solutions such as PVFS, Lustre, and HDFS by 50% to two orders of magnitude.
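The per-directory incremental partitioning described above can be sketched in a few lines. This is a toy model in the spirit of IndexFS's GIGA+-style splitting, not the paper's actual implementation; the class, threshold, and hash choice are all illustrative:

```python
# Illustrative sketch of per-directory incremental namespace partitioning:
# small directories stay on one server (preserving locality); a directory
# doubles its partition count only as it grows.
import hashlib

SPLIT_THRESHOLD = 4  # entries per partition before splitting (made up for the demo)

class Directory:
    def __init__(self):
        self.num_partitions = 1   # small directories live on a single server
        self.entries = set()

    def server_for(self, name, num_servers):
        # Hash the file name into one of this directory's active partitions;
        # each partition maps onto a metadata server.
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return (h % self.num_partitions) % num_servers

    def insert(self, name, num_servers):
        self.entries.add(name)
        # Incrementally double the partition count as the directory grows,
        # capped at the number of available metadata servers.
        if len(self.entries) > SPLIT_THRESHOLD * self.num_partitions:
            self.num_partitions = min(self.num_partitions * 2, num_servers)
        return self.server_for(name, num_servers)

d = Directory()
for i in range(20):
    d.insert(f"file{i}", num_servers=8)
print(d.num_partitions)  # → 8
```

The point of the cap and the doubling schedule is that a directory only pays the cost of distribution once it is large enough to benefit from it.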
DOI: 10.1109/SC.2014.25 · Published: 2014-11-16
Citations: 146
Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster
I. Yamazaki, S. Rajamanickam, E. Boman, M. Hoemmen, M. Heroux, S. Tomov
Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication-avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In this paper, we extend these studies with two major contributions. First, we present our implementation of a CA variant of the Generalized Minimum Residual (GMRES) method, called CA-GMRES, for solving nonsymmetric linear systems of equations on a hybrid CPU/GPU cluster. Our performance results on up to 120 GPUs show that CA-GMRES gives a speedup of up to 2.5x in total solution time over standard GMRES on a hybrid cluster with twelve Intel Xeon CPUs and three Nvidia Fermi GPUs on each node. We then outline a domain decomposition framework that introduces a family of preconditioners suitable for CA Krylov methods. Our preconditioners do not incur any additional communication and allow easy reuse of existing algorithms and software for the subdomain solves. Experimental results on the hybrid CPU/GPU cluster demonstrate that CA-GMRES with preconditioning achieves a speedup of up to 7.4x over CA-GMRES without preconditioning, and of up to 1.7x over GMRES with preconditioning, in total solution time. These results confirm the potential of our framework to develop a practical and effective preconditioned CA Krylov method.
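The core communication-avoiding idea can be illustrated with a matrix-powers kernel: generate several Krylov basis vectors in one pass and orthogonalize them as a block, replacing the per-vector reductions of classical Arnoldi. A minimal NumPy sketch, using the simple monomial basis (real CA-GMRES implementations use better-conditioned bases such as Newton polynomials):

```python
import numpy as np

def matrix_powers_basis(A, v, s):
    # Generate s+1 Krylov basis vectors [v, Av, ..., A^s v] in one sweep.
    # In a distributed setting this is the "matrix powers kernel": all
    # communication for s steps happens up front, not once per step.
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def block_orthogonalize(V):
    # One QR factorization (one global reduction in parallel) replaces
    # s rounds of inner products in classical Gram-Schmidt/Arnoldi.
    Q, _ = np.linalg.qr(V)
    return Q

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
Q = block_orthogonalize(matrix_powers_basis(A, rng.standard_normal(50), s=4))
# Columns of Q form an orthonormal Krylov basis: Q.T @ Q ≈ I
```

This sketch ignores the numerical-stability and preconditioning machinery that the paper addresses; it only shows where the communication savings come from.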
DOI: 10.1109/SC.2014.81 · Published: 2014-11-16
Citations: 29
Orion: Scaling Genomic Sequence Matching with Fine-Grained Parallelization
K. Mahadik, S. Chaterji, Bowen Zhou, Milind Kulkarni, S. Bagchi
Gene sequencing instruments are producing huge volumes of data, straining the capabilities of current database searching algorithms and hindering the efforts of researchers analyzing large collections of data to obtain greater insights. In the space of parallel genomic sequence search, most of the popular software packages, like mpiBLAST, use the database segmentation approach, wherein the entire database is sharded and searched on different nodes. However, this approach does not scale well with the increasing length of individual query sequences or with the rapid growth in size of sequence databases. In this paper, we propose a fine-grained parallelism technique, called Orion, that divides the input query into an adaptive number of fragments and shards the database. Our technique achieves higher parallelism (and hence speedup) and better load balancing than database sharding alone, while maintaining 100% accuracy. We show that it is 12.3X faster than mpiBLAST for solving a relevant comparative genomics problem.
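The query-fragmentation half of the idea is easy to sketch: cut a long query into overlapping pieces so that each (fragment, shard) pair becomes an independent work unit. This is an illustrative sketch, not Orion's actual splitting logic; in particular, Orion chooses the fragment count adaptively, while here it is a fixed parameter:

```python
def fragment_query(query, fragment_len, overlap):
    # Split a long query sequence into overlapping fragments so each
    # fragment can be matched against a database shard in parallel.
    # The overlap region guards against missing alignments that would
    # otherwise straddle a fragment boundary.
    step = fragment_len - overlap
    frags = []
    for start in range(0, max(len(query) - overlap, 1), step):
        frags.append((start, query[start:start + fragment_len]))
    return frags

# Each (fragment, shard) pair is an independent search task, giving
# parallelism proportional to num_fragments * num_shards.
frags = fragment_query("ACGTACGTACGT", fragment_len=6, overlap=2)
```

Hits found in a fragment would then be mapped back to query coordinates using the recorded `start` offset.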
DOI: 10.1109/SC.2014.42 · Published: 2014-11-16
Citations: 22
Pardicle: Parallel Approximate Density-Based Clustering
Md. Mostofa Ali Patwary, N. Satish, N. Sundaram, F. Manne, S. Habib, P. Dubey
DBSCAN is a widely used density-based clustering algorithm for particle data, well known for its ability to isolate arbitrarily shaped clusters and to filter noise data. The algorithm is super-linear (O(n log n)) and computationally expensive for large datasets. Given the need for speed, we propose a fast heuristic algorithm for DBSCAN using density-based sampling, which performs equally well in quality compared to exact algorithms but is more than an order of magnitude faster. Our experiments on astrophysics and synthetic massive datasets (8.5 billion numbers) show that our approximate algorithm is up to 56× faster than exact algorithms with almost identical quality (Omega-Index ≥ 0.99). We develop a new parallel DBSCAN algorithm, which uses dynamic partitioning to improve load balancing and locality. We demonstrate near-linear speedup on shared-memory (15× using 16 cores, single-node Intel® Xeon® processor) and distributed-memory (3917× using 4096 cores, multi-node) computers, with a 2× additional performance improvement using Intel® Xeon Phi™ coprocessors. Additionally, existing exact algorithms can achieve up to 3.4 times speedup using dynamic partitioning.
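The intuition behind density-based sampling is that dense regions carry redundant points: thinning them barely changes which clusters DBSCAN finds, while sparse regions must be kept intact. A minimal grid-based sketch of that idea (the paper's actual sampling scheme differs; cell size and per-cell budget here are invented parameters):

```python
import random
from collections import defaultdict

def density_based_sample(points, cell_size, target_per_cell, seed=0):
    # Bin 2-D points into grid cells, then downsample only the dense
    # cells. Dense clusters survive with fewer representatives, while
    # sparse (border/noise) structure is preserved exactly -- so an
    # approximate DBSCAN run on the sample stays close to the exact one.
    rng = random.Random(seed)
    cells = defaultdict(list)
    for p in points:
        cells[(int(p[0] // cell_size), int(p[1] // cell_size))].append(p)
    sample = []
    for pts in cells.values():
        if len(pts) <= target_per_cell:
            sample.extend(pts)          # sparse cell: keep everything
        else:
            sample.extend(rng.sample(pts, target_per_cell))  # dense: thin
    return sample

dense = [(0.1 * i, 0.1 * j) for i in range(10) for j in range(10)]  # one dense cell
sample = density_based_sample(dense + [(5.5, 5.5)], cell_size=1.0, target_per_cell=10)
```

In this example the 100-point dense cell is reduced to 10 points while the lone outlier at (5.5, 5.5) is kept, so the sample has 11 points.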
DOI: 10.1109/SC.2014.51 · Published: 2014-11-16
Citations: 36
High-Productivity Framework on GPU-Rich Supercomputers for Operational Weather Prediction Code ASUCA
T. Shimokawabe, T. Aoki, Naoyuki Onodera
The weather prediction code demands large computational performance to achieve fast, high-resolution simulations. Skillful programming techniques are required to obtain good parallel efficiency on GPU supercomputers. Our framework-based weather prediction code ASUCA has achieved good scalability while hiding the complicated implementation and optimizations required for distributed GPUs, contributing to increased maintainability. ASUCA is a next-generation, high-resolution meso-scale atmospheric model being developed by the Japan Meteorological Agency. Our framework automatically translates user-written stencil functions that update grid points and generates both GPU and CPU codes. User-written codes are parallelized by MPI with intra-node GPU peer-to-peer direct access. These codes can easily utilize optimizations such as an overlapping technique that hides communication overhead behind computation. Our simulations on the GPU-rich supercomputer TSUBAME 2.5 at the Tokyo Institute of Technology have demonstrated good strong and weak scalability, achieving 209.6 TFlops in single precision for our largest model using 4,108 NVIDIA K20X GPUs.
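The "user-written stencil function" style can be illustrated with a toy framework: the user supplies a pure function of a grid point and its neighbors, and the framework decides how to apply it. This is a hypothetical NumPy sketch of the programming model, not ASUCA's actual kernel or code generator (which emits GPU and CPU code from the same definition):

```python
import numpy as np

def diffuse(c, n, s, e, w, nu):
    # User-written stencil kernel: a pure function of the center point (c)
    # and its four neighbors. Kernel name and the diffusion update are
    # illustrative, not taken from ASUCA.
    return c + nu * (n + s + e + w - 4.0 * c)

def apply_stencil(grid, kernel, **params):
    # Toy "framework": applies the kernel to all interior points using
    # shifted array views. A real framework would translate the same
    # kernel into CUDA for GPUs or vectorized loops for CPUs.
    g = grid
    out = g.copy()
    out[1:-1, 1:-1] = kernel(g[1:-1, 1:-1],   # center
                             g[:-2, 1:-1],    # north
                             g[2:, 1:-1],     # south
                             g[1:-1, 2:],     # east
                             g[1:-1, :-2],    # west
                             **params)
    return out

grid = np.zeros((5, 5))
grid[2, 2] = 1.0
step1 = apply_stencil(grid, diffuse, nu=0.1)  # one explicit diffusion step
```

Because the kernel is side-effect free and only touches fixed neighbor offsets, the framework can freely choose the execution strategy, which is what makes the GPU/CPU code generation transparent to the user.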
DOI: 10.1109/SC.2014.26 · Published: 2014-11-16
Citations: 24
24.77 Pflops on a Gravitational Tree-Code to Simulate the Milky Way Galaxy with 18600 GPUs
J. Bédorf, E. Gaburov, M. Fujii, Keigo Nitadori, T. Ishiyama, S. Zwart
We have simulated, for the first time, the long-term evolution of the Milky Way Galaxy using 51 billion particles on the Swiss Piz Daint supercomputer with our N-body gravitational tree-code Bonsai. Herein, we describe the scientific motivation and numerical algorithms. The Milky Way model was simulated for 6 billion years, during which the bar structure and spiral arms were fully formed. This improves upon previous simulations by using 1000 times more particles, and provides a wealth of new data that can be directly compared with observations. We also report the scalability on both the Swiss Piz Daint and the US ORNL Titan. On Piz Daint the parallel efficiency of Bonsai was above 95%. The highest performance was achieved with a 242 billion particle Milky Way model using 18600 GPUs on Titan, thereby reaching a sustained GPU and application performance of 33.49 Pflops and 24.77 Pflops respectively.
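Gravitational tree-codes reduce the O(N²) direct-sum force calculation to roughly O(N log N) with a multipole acceptance criterion: a distant cell of particles is replaced by a single point mass at its center of mass. A minimal Barnes-Hut-style sketch of that test (illustrative only; Bonsai's actual acceptance criterion and tree traversal are more sophisticated):

```python
import math

def use_cell_approximation(cell_size, cell_com, particle, theta=0.5):
    # Barnes-Hut-style opening-angle test: if the cell subtends a small
    # enough angle (size/distance < theta) as seen from the particle,
    # its whole mass can be treated as one point at the center of mass
    # instead of recursing into the cell's children.
    d = math.dist(cell_com, particle)
    return d > 0 and cell_size / d < theta
```

Smaller `theta` means more tree cells are opened, trading speed for accuracy; `theta = 0` degenerates to the exact direct sum.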
DOI: 10.1109/SC.2014.10 · Published: 2014-11-16
Citations: 57
Physics-Based Urban Earthquake Simulation Enhanced by 10.7 BlnDOF × 30 K Time-Step Unstructured FE Non-Linear Seismic Wave Simulation
T. Ichimura, K. Fujita, Seizo Tanaka, M. Hori, Lalith Maddegedara, Y. Shizawa, Hiroshi Kobayashi
With the aim of dramatically improving the reliability of urban earthquake response analyses, we developed an unstructured 3-D finite-element-based MPI-OpenMP hybrid seismic wave amplification simulation code, GAMERA. On the K computer, GAMERA was able to achieve a size-up efficiency of 87.1% up to the full K computer. Next, we applied GAMERA to a physics-based urban earthquake response analysis for Tokyo. Using 294,912 CPU cores of the K computer for 11 h 32 min, we analyzed the 3-D non-linear ground motion of a 10.7 BlnDOF problem with 30 K time steps. Finally, we analyzed the stochastic response of 13,275 building structures in the domain, considering uncertainty in structural parameters, using 80,000 CPU cores of the K computer for 3 h 56 min. Although a large amount of computer resources is needed presently, such analyses can change the quality of disaster estimations and are expected to become standard in the future.
DOI: 10.1109/SC.2014.7 · Published: 2014-11-16
Citations: 56
Quantitatively Modeling Application Resilience with the Data Vulnerability Factor
Li Yu, Dong Li, Sparsh Mittal, J. Vetter
Recent strategies to improve the observable resilience of applications require the ability to classify vulnerabilities of individual components (e.g., data structures, instructions) of an application and then selectively apply protection mechanisms to its critical components. To facilitate this vulnerability classification, it is important to have accurate, quantitative techniques that can be applied uniformly and automatically across real-world applications. Traditional methods cannot effectively quantify vulnerability because they lack a holistic view of system resilience and come with prohibitive evaluation costs. In this paper, we introduce a data-driven, practical methodology to analyze these application vulnerabilities using a novel resilience metric: the data vulnerability factor (DVF). DVF integrates knowledge from both the application and the target hardware into the calculation. To calculate DVF, we extend a performance modeling language to provide a structured, fast modeling solution. We evaluate our methodology on six representative computational kernels and demonstrate the significance of DVF by quantifying the impact of algorithm optimization on vulnerability and the effectiveness of specific hardware protection mechanisms.
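The flavor of such a metric can be sketched as "expected errors striking a data structure while it matters, weighted by how likely such an error is to corrupt the result." The sketch below is one plausible reading, with invented parameter names; the paper's actual DVF formulation (which folds in application access behavior via a modeling language) differs:

```python
def vulnerability_score(fit_rate_per_bit, size_bits, resident_seconds,
                        corruption_prob):
    # Hedged sketch of a data-vulnerability-style metric, NOT the paper's
    # exact DVF formula. FIT = failures per 10^9 device-hours, a standard
    # hardware reliability unit. All parameters are illustrative.
    hours = resident_seconds / 3600.0
    raw_bit_errors = fit_rate_per_bit * size_bits * (hours / 1e9)
    # Weight raw error events by the chance that one actually corrupts
    # the application's output (e.g., estimated via fault injection).
    return raw_bit_errors * corruption_prob

# Compare two data structures of an application: a large but rarely
# consequential buffer vs. a small but critical index structure.
buffer_score = vulnerability_score(100.0, 8e9, 3600.0, 0.01)
index_score = vulnerability_score(100.0, 8e6, 3600.0, 0.90)
```

A classification like this is what lets selective protection (e.g., ECC or replication) be spent on the components where it pays off most.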
DOI: 10.1109/SC.2014.62 · Published: 2014-11-16
Citations: 35
Omnisc'IO: A Grammar-Based Approach to Spatial and Temporal I/O Patterns Prediction
Matthieu Dorier, Shadi Ibrahim, Gabriel Antoniu, R. Ross
The increasing gap between the computation performance of post-petascale machines and the performance of their I/O subsystems has motivated many I/O optimizations, including prefetching, caching, and scheduling techniques. In order to further improve these techniques, modeling and predicting the spatial and temporal I/O patterns of HPC applications as they run has become crucial. In this paper we present Omnisc'IO, an approach that builds a grammar-based model of the I/O behavior of HPC applications and uses it to predict when future I/O operations will occur, and where and how much data will be accessed. Omnisc'IO is transparently integrated into the POSIX and MPI I/O stacks and does not require any modification in applications or higher-level I/O libraries. It works without any prior knowledge of the application and converges to accurate predictions within only a couple of iterations. Its implementation is efficient in both computation time and memory footprint.
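Omnisc'IO learns a grammar of the observed I/O event stream (a Sequitur-style online model) and predicts the next symbol from it. The sketch below illustrates the same learn-online/predict-next idea with a much simpler fixed-order context table instead of a grammar; the class and API are invented for the example:

```python
from collections import defaultdict, Counter

class NextOpPredictor:
    # Toy stand-in for a grammar-based I/O model: record which I/O
    # "symbol" (operation type, or a discretized offset/size) tends to
    # follow each length-`order` context, and predict the most frequent
    # successor of the current context. Learning is fully online, with
    # no prior knowledge of the application.
    def __init__(self, order=2):
        self.order = order
        self.table = defaultdict(Counter)
        self.context = ()

    def observe(self, op):
        # Credit `op` as a successor of the current context, then slide
        # the context window forward.
        if len(self.context) == self.order:
            self.table[self.context][op] += 1
        self.context = (self.context + (op,))[-self.order:]

    def predict(self):
        counts = self.table.get(self.context)
        return counts.most_common(1)[0][0] if counts else None

p = NextOpPredictor(order=2)
for op in ["open", "read", "read", "write"] * 5:
    p.observe(op)
# After a few repetitions of the pattern, the model predicts the next
# operation; in a real system that prediction would drive prefetching
# or scheduling decisions.
```

A grammar-based model generalizes this by discovering nested repeated structure (loops of loops), which is why it converges quickly on the regular I/O phases typical of HPC codes.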
DOI: 10.1109/SC.2014.56 · Published: 2014-11-16
Citations: 43
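The Omnisc'IO abstract above describes modeling an application's I/O behavior as a grammar whose rules capture repeated patterns in the stream of I/O operations. The following is a minimal offline sketch of that idea using Re-Pair-style digram replacement; it is not the paper's method (Omnisc'IO builds its grammar online with a Sequitur-derived algorithm), and `build_grammar` and the symbol alphabet are illustrative assumptions:

```python
from collections import Counter

def build_grammar(trace):
    """Compress a symbol sequence into a straight-line grammar by
    repeatedly replacing the most frequent repeated digram with a
    fresh nonterminal. Offline Re-Pair-style sketch only; it merely
    approximates Omnisc'IO's online Sequitur-based construction."""
    seq = list(trace)
    rules, next_id = {}, 0
    while len(seq) >= 2:
        digrams = Counter(zip(seq, seq[1:]))
        pair, count = digrams.most_common(1)[0]
        if count < 2:          # no digram repeats: grammar is final
            break
        nt = f"R{next_id}"
        next_id += 1
        rules[nt] = pair
        out, i = [], 0
        while i < len(seq):    # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules
```

Once a trace such as the operation sequence `"abab"` collapses into nonterminals, the symbol that follows the current position inside a rule body serves as the prediction for the next I/O operation.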
Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers
A. Heinecke, Alexander Breuer, Sebastian Rettenberger, M. Bader, A. Gabriel, C. Pelties, A. Bode, W. Barth, Xiangke Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey
We present an end-to-end optimization of the innovative Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) software SeisSol targeting Intel® Xeon Phi coprocessor platforms, achieving unprecedented earthquake model complexity through coupled simulation of full frictional sliding and seismic wave propagation. SeisSol exploits unstructured meshes to flexibly adapt for complicated geometries in realistic geological models. Seismic wave propagation is solved simultaneously with earthquake faulting in a multiphysical manner leading to a heterogeneous solver structure. Our architecture aware optimizations deliver up to 50% of peak performance, and introduce an efficient compute-communication overlapping scheme shadowing the multiphysics computations. SeisSol delivers near-optimal weak scaling, reaching 8.6 DP-PFLOPS on 8,192 nodes of the Tianhe-2 supercomputer. Our performance model projects reaching 18 -- 20 DP-PFLOPS on the full Tianhe-2 machine. Of special relevance to modern civil engineering needs, our pioneering simulation of the 1992 Landers earthquake shows highly detailed rupture evolution and ground motion at frequencies up to 10 Hz.
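The "compute-communication overlapping scheme" mentioned in the abstract follows a standard pattern: start the halo exchange asynchronously, do the work that needs no remote data while it runs, and finish the boundary work once the exchange completes. A conceptual sketch, with Python threads standing in for SeisSol's nonblocking MPI on Xeon Phi and all function names hypothetical:

```python
import threading

def overlapped_timestep(compute_interior, exchange_halo, compute_boundary):
    """One time step with compute-communication overlap. Conceptual
    sketch only -- SeisSol implements this with nonblocking MPI, not
    Python threads."""
    xfer = threading.Thread(target=exchange_halo)
    xfer.start()          # communication proceeds in the background
    compute_interior()    # interior update shadows the transfer
    xfer.join()           # boundary cells need the remote halo data
    compute_boundary()
```

The design point is that only the boundary update is serialized behind the exchange; the (much larger) interior update hides the communication latency.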
{"title":"Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers","authors":"A. Heinecke, Alexander Breuer, Sebastian Rettenberger, M. Bader, A. Gabriel, C. Pelties, A. Bode, W. Barth, Xiangke Liao, K. Vaidyanathan, M. Smelyanskiy, P. Dubey","doi":"10.1109/SC.2014.6","DOIUrl":"https://doi.org/10.1109/SC.2014.6","url":null,"abstract":"We present an end-to-end optimization of the innovative Arbitrary high-order DERivative Discontinuous Galerkin (ADER-DG) software SeisSol targeting Intel® Xeon Phi coprocessor platforms, achieving unprecedented earthquake model complexity through coupled simulation of full frictional sliding and seismic wave propagation. SeisSol exploits unstructured meshes to flexibly adapt for complicated geometries in realistic geological models. Seismic wave propagation is solved simultaneously with earthquake faulting in a multiphysical manner leading to a heterogeneous solver structure. Our architecture aware optimizations deliver up to 50% of peak performance, and introduce an efficient compute-communication overlapping scheme shadowing the multiphysics computations. SeisSol delivers near-optimal weak scaling, reaching 8.6 DP-PFLOPS on 8,192 nodes of the Tianhe-2 supercomputer. Our performance model projects reaching 18 -- 20 DP-PFLOPS on the full Tianhe-2 machine. 
Of special relevance to modern civil engineering needs, our pioneering simulation of the 1992 Landers earthquake shows highly detailed rupture evolution and ground motion at frequencies up to 10 Hz.","PeriodicalId":275261,"journal":{"name":"SC14: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"680 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127584541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 119
SC14: International Conference for High Performance Computing, Networking, Storage and Analysis