A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.101 | Pages: 741-745
M. Laurenzano, Joshua Peraza, L. Carrington, Ananta Tiwari, W. A. Ward, R. Campbell
In order to achieve a high level of performance, data-intensive applications such as the real-time processing of surveillance feeds from unmanned aerial vehicles will require the strategic application of multi/many-core processors and coprocessors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program design decisions, memory traces gathered through binary instrumentation can be used to understand the low-level interactions between a data-intensive code and the memory subsystem of a multi-core processor or many-core coprocessor. Toward this end, this paper introduces threading support for PMaC's Efficient Binary Instrumentation Toolkit for Linux/x86 (PEBIL) and compares PEBIL's threading model to those of two other popular Linux/x86 binary instrumentation platforms - Pin and Dyninst - on both theoretical and empirical grounds. The empirical comparisons are based on experiments that collect memory address traces for the OpenMP-threaded implementations of the NASA Advanced Supercomputing Parallel Benchmarks (NPBs). This work shows that the overhead of collecting full memory address traces for multithreaded programs is higher in PEBIL (7.7x) than in Pin (4.7x), both of which are significantly lower than in Dyninst (897x). It also shows that PEBIL, uniquely, can exploit interval-based sampling of a memory address trace by rapidly disabling and re-enabling instrumentation at the transitions into and out of sampling periods, significantly decreasing the overhead of memory address trace collection. For collecting the memory address streams of each of the NPBs at a 10% sampling rate, PEBIL incurs an average slowdown of 2.9x, compared to 4.4x with Pin and 897x with Dyninst.
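To make the sampling mechanism concrete, here is a minimal C sketch of interval-based trace sampling. The hook names (instrumentation_enable/disable, record_address) are illustrative placeholders, not PEBIL's actual API; PEBIL realizes the toggling by patching instrumented code in and out at interval boundaries.

    #include <stdint.h>

    /* Hypothetical hooks standing in for the real mechanism. */
    void instrumentation_enable(void);
    void instrumentation_disable(void);
    void record_address(uintptr_t addr);     /* append to a per-thread buffer */

    #define OPS_ON  1000000UL                /* memory ops traced per interval */
    #define OPS_OFF 9000000UL                /* ops skipped per interval: ~10% rate */

    static __thread uint64_t counter = 0;    /* per-thread state, no locks */

    /* Invoked for each instrumented memory access while tracing is enabled. */
    void trace_access(uintptr_t addr) {
        record_address(addr);
        if (++counter >= OPS_ON) {
            counter = 0;
            instrumentation_disable();       /* drop to near-native speed */
        }
    }

    /* Invoked from a lightweight counting stub while tracing is disabled. */
    void count_access(void) {
        if (++counter >= OPS_OFF) {
            counter = 0;
            instrumentation_enable();        /* start the next sampling period */
        }
    }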
{"title":"A Static Binary Instrumentation Threading Model for Fast Memory Trace Collection","authors":"M. Laurenzano, Joshua Peraza, L. Carrington, Ananta Tiwari, W. A. Ward, R. Campbell","doi":"10.1109/SC.Companion.2012.101","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.101","url":null,"abstract":"In order to achieve a high level of performance, data intensive applications such as the real-time processing of surveillance feeds from unmanned aerial vehicles will require the strategic application of multi/many-core processors and coprocessors using a hybrid of inter-process message passing (e.g. MPI and SHMEM) and intra-process threading (e.g. pthreads and OpenMP). To facilitate program design decisions, memory traces gathered through binary instrumentation can be used to understand the low-level interactions between a data intensive code and the memory subsystem of a multi-core processor or many-core co-processor. Toward this end, this paper introduces the addition of threading support for PMaCs Efficient Binary Instrumentation Toolkit for Linux/x86 (PEBIL) and compares PEBILs threading model to the threading models of two other popular Linux/x86 binary instrumentation platforms - Pin and Dyninst - on both theoretical and empirical grounds. The empirical comparisons are based on experiments which collect memory address traces for the OpenMP-threaded implementations of the NASA Advanced Supercomputing Parallel Benchmarks (NPBs). This work shows that the overhead of collecting full memory address traces for multithreaded programs is higher in PEBIL (7.7x) than in Pin (4.7x), both of which are significantly lower than Dyninst (897x). This work also shows that PEBIL, uniquely, is able to take advantage of interval-based sampling of a memory address trace by rapidly disabling and re-enabling instrumentation at the transitions into and out of sampling periods in order to achieve significant decreases in the overhead of memory address trace collection. For collecting the memory address streams of each of the NPBs at a 10% sampling rate, PEBIL incurs an average slowdown of 2.9x compared to 4.4x with Pin and 897x with Dyninst.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"584 1","pages":"741-745"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77215158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Multi-Instance Learning Approach for Mapping the Slums of the World
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.117 | Pages: 833-837
Ranga Raju Vatsavai
Remote sensing imagery is widely used to map thematic classes such as forests, crops, and other natural and man-made objects on the Earth's surface. With the availability of very high-resolution satellite imagery, it is now possible to identify complex patterns such as formal and informal (slum) settlements. However, the single-instance learning algorithms predominantly used in thematic classification are not sufficient for recognizing complex settlement patterns. Newer multi-instance learning schemes, on the other hand, are useful for recognizing complex structures in images, but they are computationally expensive. In this paper, we present an adaptation of a multi-instance learning algorithm for informal settlement classification and its efficient implementation on shared memory architectures. Experimental evaluation shows that this approach is scalable as well as more accurate than commonly used single-instance learning algorithms.
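As an illustration of the multi-instance setting (a generic decision rule, not the paper's specific algorithm), the sketch below treats an image tile as a bag of instance feature vectors and labels the bag positive if any instance scores above a threshold, parallelized over bags with OpenMP; instance_score is a hypothetical per-instance classifier.

    #include <stddef.h>

    /* Hypothetical per-instance classifier score. */
    float instance_score(const float *x, int dim);

    typedef struct {
        const float *features;   /* n_instances x dim, row-major */
        int n_instances;
        int dim;
    } Bag;

    void classify_bags(const Bag *bags, int n_bags, float thresh, int *labels) {
        #pragma omp parallel for schedule(dynamic)   /* bags vary in size */
        for (int b = 0; b < n_bags; b++) {
            int positive = 0;
            for (int i = 0; i < bags[b].n_instances && !positive; i++)
                positive = instance_score(bags[b].features + (size_t)i * bags[b].dim,
                                          bags[b].dim) > thresh;
            labels[b] = positive;   /* bag label: any positive instance */
        }
    }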
{"title":"Scalable Multi-Instance Learning Approach for Mapping the Slums of the World","authors":"Ranga Raju Vatsavai","doi":"10.1109/SC.Companion.2012.117","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.117","url":null,"abstract":"Remote sensing imagery is widely used in mapping thematic classes, such as, forests, crops, forests and other natural and man-made objects on the Earth. With the availability of very high-resolution satellite imagery, it is now possible to identify complex patterns such as formal and informal (slums) settlements. However, predominantly used single-instance learning algorithms that are widely used in thematic classification are not sufficient for recognizing complex settlement patterns. On the other hand, newer multi-instance learning schemes are useful in recognizing complex structures in images, but they are computationally expensive. In this paper, we present an adaptation of a multi-instance learning algorithm for informal settlement classification and its efficient implementation on shared memory architectures. Experimental evaluation shows that this approach is scalable and as well as accurate than commonly used single-instance learning algorithms.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"6 1","pages":"833-837"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86259966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.87 | Pages: 626-632
Yonghong Yan, J. Kemp, Xiaonan Tian, A. Malik, B. Chapman
For many scientific applications, dense matrix multiplication is one of the most important and computation-intensive linear algebra operations. Efficient matrix multiplication on high performance and parallel computers requires optimizing how matrices are decomposed and exchanged between computational nodes, to reduce communication and synchronization overhead, as well as efficiently exploiting the memory hierarchy within a node to improve both spatial and temporal data locality. In this paper, we present our studies of the performance, cache behavior, and energy efficiency of multiple parallel matrix multiplication algorithms on a multicore desktop computer and a medium-size shared memory machine, both considered reference node sizes for building medium- and large-scale computational clusters for high performance computing in industry and national laboratories. Our results highlight both performance and energy efficiencies, and also provide insight into the memory and resource pressures of these algorithms. We hope this will help users choose the appropriate implementation for their specific data sets when composing larger-scale scientific applications that use parallel matrix multiplication kernels on a node.
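For reference, one classic variant such a study compares is a blocked (tiled) multiplication parallelized with OpenMP. The sketch below assumes row-major storage, a matrix order divisible by the block size, and a block size of 64 as a typical cache-friendly starting point.

    #include <stddef.h>

    #define BS 64   /* block size tuned to cache; a common starting point */

    /* Blocked matrix multiply, C += A*B; each thread owns disjoint C blocks,
       so no synchronization on C is needed. */
    void matmul_blocked(int n, const double *A, const double *B, double *C) {
        #pragma omp parallel for collapse(2)
        for (int ii = 0; ii < n; ii += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int kk = 0; kk < n; kk += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[(size_t)i * n + k];   /* reused across j */
                            for (int j = jj; j < jj + BS; j++)
                                C[(size_t)i * n + j] += a * B[(size_t)k * n + j];
                        }
    }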
{"title":"Performance and Power Characteristics of Matrix Multiplication Algorithms on Multicore and Shared Memory Machines","authors":"Yonghong Yan, J. Kemp, Xiaonan Tian, A. Malik, B. Chapman","doi":"10.1109/SC.Companion.2012.87","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.87","url":null,"abstract":"For many scientific applications, dense matrix multiplication is one of the most important and computation intensive linear algebra operations. An efficient matrix multiplication on high performance and parallel computers requires optimizations on how matrices are decomposed and exchanged between computational nodes to reduce communication and synchronization overhead, as well as to efficiently exploit the memory hierarchy within a node to improve both spatial and temporal data locality. In this paper, we presented our studies of performance, cache behavior, and energy efficiency of multiple parallel matrix multiplication algorithms on a multicore desktop computer and a medium-size shared memory machine, both being considered as referenced sizes of nodes to create a medium- and largescale computational clusters for high performance computing used in industry and national laboratories. Our results highlight both the performance and energy efficiencies, and also provide implications on the memory and resources pressures of those algorithms. We hope this could help users choose the appropriate implementations according to their specific data sets when composing larger-scale scientific applications that use parallel matrix multiplication kernels on a node.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"56 1","pages":"626-632"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89024911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Experiences with OpenMP, PGI, HMPP and OpenACC Directives on ISO/TTI Kernels
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.95 | Pages: 691-700
Sayan Ghosh, Terrence Liao, H. Calandra, B. Chapman
GPUs are steadily becoming ubiquitous devices in High Performance Computing, as their ability to improve the performance per watt of compute-intensive algorithms relative to multicore CPUs has been established. The primary shortcoming of a GPU is usability, since vendor-specific APIs are quite different from existing programming languages, and optimizing applications requires substantial knowledge of the device and programming interface. Hence, a growing number of higher-level programming models now target GPUs to alleviate this problem. The ultimate goal of a high-level model is to expose an easy-to-use interface through which the user can offload compute-intensive portions of code (kernels) to the GPU and tune the code to the target accelerator, maximizing overall performance with reduced development effort. In this paper, we share our experiences with three notable high-level directive-based GPU programming models - PGI, CAPS HMPP and OpenACC (from CAPS and PGI) - on an Nvidia M2090 GPU. We analyze their performance and programmability on Isotropic (ISO) and Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components of the Reverse Time Migration (RTM) application used in oil and gas exploration for seismic imaging of the subsurface. When ported to a single GPU using these directives, we observe an average 1.5-1.8x performance improvement for both ISO and TTI kernels, compared with optimized multi-threaded CPU implementations using OpenMP.
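To give a flavor of the directive-based approach, here is a generic 2-D isotropic finite-difference time step annotated with OpenACC. It is a simplified stand-in for the paper's 3-D ISO/TTI RTM kernels, assuming vel holds the squared velocity field and dt2 the squared time step.

    /* One wave-equation time step: p1 holds the previous-previous wavefield
       on entry and the new wavefield on exit. */
    void iso_step(int nx, int ny, const float *restrict p0,
                  float *restrict p1, const float *restrict vel, float dt2)
    {
        #pragma acc parallel loop collapse(2) \
                    copyin(p0[0:nx*ny], vel[0:nx*ny]) copy(p1[0:nx*ny])
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++) {
                int c = i * ny + j;
                float lap = p0[c - ny] + p0[c + ny] + p0[c - 1] + p0[c + 1]
                          - 4.0f * p0[c];                /* 5-point Laplacian */
                p1[c] = 2.0f * p0[c] - p1[c] + dt2 * vel[c] * lap;
            }
    }

In production code the copyin/copy clauses would be hoisted into an enclosing data region so the wavefields stay resident on the device across time steps.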
{"title":"Experiences with OpenMP, PGI, HMPP and OpenACC Directives on ISO/TTI Kernels","authors":"Sayan Ghosh, Terrence Liao, H. Calandra, B. Chapman","doi":"10.1109/SC.Companion.2012.95","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.95","url":null,"abstract":"GPUs are slowly becoming ubiquitous devices in High Performance Computing, as their capabilities to enhance the performance per watt of compute intensive algorithms as compared to multicore CPUs have been identified. The primary shortcoming of a GPU is usability, since vendor specific APIs are quite different from existing programming languages, and it requires a substantial knowledge of the device and programming interface to optimize applications. Hence, lately a growing number of higher level programming models are targeting GPUs to alleviate this problem. The ultimate goal for a high-level model is to expose an easy-to-use interface for the user to offload compute intensive portions of code (kernels) to the GPU, and tune the code according to the target accelerator to maximize overall performance with a reduced development effort. In this paper, we share our experiences of three of the notable high-level directive based GPU programming models - PGI, CAPS and OpenACC (from CAPS and PGI) on an Nvidia M2090 GPU. We analyze their performance and programmability against Isotropic (ISO)/Tilted Transversely Isotropic (TTI) finite difference kernels, which are primary components in the Reverse Time Migration (RTM) application used by oil and gas exploration for seismic imaging of the sub-surface. When ported to a single GPU using the mentioned directives, we observe an average 1.5-1.8x improvement in performance for both ISO and TTI kernels, when compared with optimized multi-threaded CPU implementations using OpenMP.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"1 1","pages":"691-700"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90149910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Cascaded TCP: BIG Throughput for BIG DATA Applications in Distributed HPC
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.229 | Pages: 1420-1421
Umar Kalim, M. Gardner, Eric J. Brown, Wu-chun Feng
Saturating high-capacity, high-latency paths is a challenge with vanilla TCP implementations, primarily because congestion-control algorithms adapt window sizes as acknowledgements are received. With large latencies, the congestion-control algorithms take longer to respond to network conditions (e.g., congestion), resulting in lower aggregate throughput. We argue that throughput can be improved by reducing the impact of large end-to-end latencies through layer-4 relays along the path. Such relays enable a cascade of TCP connections, each with lower latency, resulting in better aggregate throughput. This would directly benefit typical applications as well as BIG DATA applications in distributed HPC. We present empirical results supporting our hypothesis.
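A minimal sketch of the relay idea (not the authors' implementation): accept a TCP connection and splice its byte stream onto a fresh TCP connection to the next hop, so that each leg runs its own congestion-control loop over a shorter RTT. IPv4 only, single connection, error handling trimmed.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int forward(int from, int to) {          /* move one chunk of bytes */
        char buf[1 << 16];
        ssize_t n = read(from, buf, sizeof buf);
        if (n <= 0) return -1;
        return write(to, buf, (size_t)n) == n ? 0 : -1;
    }

    int relay(uint16_t listen_port, const char *next_hop, uint16_t next_port) {
        struct sockaddr_in a = {0}, b = {0};
        int ls = socket(AF_INET, SOCK_STREAM, 0);
        a.sin_family = AF_INET;
        a.sin_port = htons(listen_port);
        a.sin_addr.s_addr = INADDR_ANY;
        bind(ls, (struct sockaddr *)&a, sizeof a);
        listen(ls, 1);
        int up = accept(ls, NULL, NULL);            /* inbound leg */

        int down = socket(AF_INET, SOCK_STREAM, 0); /* outbound leg to next relay */
        b.sin_family = AF_INET;
        b.sin_port = htons(next_port);
        inet_pton(AF_INET, next_hop, &b.sin_addr);
        connect(down, (struct sockaddr *)&b, sizeof b);

        for (;;) {                                  /* forward both directions */
            fd_set rd;
            FD_ZERO(&rd); FD_SET(up, &rd); FD_SET(down, &rd);
            int mx = (up > down ? up : down) + 1;
            if (select(mx, &rd, NULL, NULL, NULL) < 0) break;
            if (FD_ISSET(up, &rd) && forward(up, down) < 0) break;
            if (FD_ISSET(down, &rd) && forward(down, up) < 0) break;
        }
        close(up); close(down); close(ls);
        return 0;
    }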
{"title":"Abstract: Cascaded TCP: BIG Throughput for BIG DATA Applications in Distributed HPC","authors":"Umar Kalim, M. Gardner, Eric J. Brown, Wu-chun Feng","doi":"10.1109/SC.Companion.2012.229","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.229","url":null,"abstract":"Saturating high capacity and high latency paths is a challenge with vanilla TCP implementations. This is primarily due to congestion-control algorithms which adapt window sizes when acknowledgements are received. With large latencies, the congestion-control algorithms have to wait longer to respond to network conditions (e.g., congestion), and thus result in less aggregate throughput. We argue that throughput can be improved if we reduce the impact of large end-to-end latencies by introducing layer-4 relays along the path. Such relays would enable a cascade of TCP connections, each with lower latency, resulting in better aggregate throughput. This would directly benefit typical applications as well as BIG DATA applications in distributed HPC. We present empirical results supporting our hypothesis.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"21 1","pages":"1420-1421"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83239690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Poster: GPU Accelerated Ultrasonic Tomography Using Propagation and Backpropagation Method
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.249 | Pages: 1447
P. Bello, Yuanwei Jin, E. Lu
This paper develops an implementation strategy and method to accelerate the propagation and backpropagation (PBP) tomographic imaging algorithm using Graphics Processing Units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to develop our parallelized algorithm, since the CUDA model lets the user interact with GPU resources more efficiently than traditional shader methods. The results show an improvement of more than 80x over the C/C++ version of the algorithm, and 515x over the MATLAB version, while achieving high-quality imaging in both cases. We test different CUDA kernel configurations in order to measure changes in the processing time of our algorithm. By examining the acceleration rate and the image quality, we develop an optimal kernel configuration that maximizes the throughput of the CUDA implementation of the PBP method.
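The kernel-configuration search can be illustrated with a simple CUDA timing sweep over block sizes using CUDA events; pbp_kernel below is a trivial placeholder for the actual PBP update kernel, which is not shown in the abstract.

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Placeholder standing in for the real PBP update kernel. */
    __global__ void pbp_kernel(float *field, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) field[i] *= 1.0001f;
    }

    /* Time the kernel at each power-of-two block size and report. */
    void sweep_block_sizes(float *d_field, int n) {
        for (int block = 32; block <= 1024; block *= 2) {
            int grid = (n + block - 1) / block;
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0); cudaEventCreate(&t1);
            cudaEventRecord(t0);
            pbp_kernel<<<grid, block>>>(d_field, n);
            cudaEventRecord(t1);
            cudaEventSynchronize(t1);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("block=%4d grid=%7d time=%8.3f ms\n", block, grid, ms);
            cudaEventDestroy(t0); cudaEventDestroy(t1);
        }
    }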
{"title":"Poster: GPU Accelerated Ultrasonic Tomography Using Propagation and Backpropagation Method","authors":"P. Bello, Yuanwei Jin, E. Lu","doi":"10.1109/SC.Companion.2012.249","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.249","url":null,"abstract":"This paper develops implementation strategy and method to accelerate the propagation and backpropagation (PBP) tomographic imaging algorithm using Graphic Processing Units (GPUs). The Compute Unified Device Architecture (CUDA) programming model is used to develop our parallelized algorithm since the CUDA model allows the user to interact with the GPU resources more efficiently than traditional shader methods. The results show an improvement of more than 80x when compared to the C/C++ version of the algorithm, and 515x when compared to the MATLAB version while achieving high quality imaging for both cases. We test different CUDA kernel configurations in order to measure changes in the processing-time of our algorithm. By examining the acceleration rate and the image quality, we develop an optimal kernel configuration that maximizes the throughput of CUDA implementation for the PBP method.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"os-44 1","pages":"1447"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87235876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Network-Aware Object Storage Service
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.79 | Pages: 556-561
Shigetoshi Yokoyama, Nobukazu Yoshioka, Motonobu Ichimura
This study describes a trial implementation of a network-aware object storage service. For scientific applications that need huge amounts of remotely stored data, the cloud infrastructure provides a `cluster as a service' capability together with an inter-cloud object storage service. The scientific applications move from locations with constrained resources to locations where they can be executed practically. For this to perform well, the inter-cloud object storage service has to be network-aware.
{"title":"A Network-Aware Object Storage Service","authors":"Shigetoshi Yokoyama, Nobukazu Yoshioka, Motonobu Ichimura","doi":"10.1109/SC.Companion.2012.79","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.79","url":null,"abstract":"This study describes a trial for establishing a network-aware object storage service. For scientific applications that need huge amounts of remotely stored data, the cloud infrastructure has functionalities to provide a service called `cluster as a service' and an inter-cloud object storage service. The scientific applications move from locations with constrained resources to locations where they can be executed practically. The inter-cloud object storage service has to be network-aware in order to perform well.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"74 1","pages":"556-561"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80624741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: Advances in Gyrokinetic Particle in Cell Simulation for Fusion Plasmas to Extreme Scale
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.243 | Pages: 1439-1440
Bei Wang, S. Ethier, W. Tang, K. Ibrahim, Kamesh Madduri, Samuel Williams, L. Oliker, T. Williams
The gyrokinetic Particle-in-Cell (PIC) method has been successfully applied in studies of low-frequency microturbulence in magnetic fusion plasmas. While the excellent scaling of PIC codes on modern computing platforms is well established, significant challenges remain in achieving high on-chip concurrency on the new path to exascale systems. Addressing these issues requires dealing with the basic gather-scatter operation and the relatively low computational intensity of the PIC method. Significant advances have been made in optimizing gather-scatter operations in the gyrokinetic PIC method for next-generation multi-core CPU and GPU architectures. In particular, we report on new techniques that improve locality, reduce memory conflicts, and efficiently utilize shared memory on GPUs. Performance benchmarks on two high-end computing platforms - the IBM BlueGene/Q (Mira) system at the Argonne Leadership Computing Facility (ALCF) and the Cray XK6 (Titan Dev) with the latest GPUs at the Oak Ridge Leadership Computing Facility (OLCF) - will be presented.
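One standard conflict-avoidance technique in this space (used here purely as an illustration, not necessarily the paper's actual data layout) is to privatize the grid per thread during the charge-deposition scatter and reduce afterwards, trading memory for the elimination of atomic updates. A 1-D sketch with OpenMP:

    #include <stdlib.h>

    /* Linear-weighting charge deposition; assumes particle positions x[p]
       lie in [0, (ng-1)*dx) so that cell c+1 stays in bounds. */
    void deposit(const double *x, const double *w, int np,
                 double *grid, int ng, double dx) {
        #pragma omp parallel
        {
            double *priv = calloc(ng, sizeof *priv);  /* private grid: no atomics */
            #pragma omp for nowait
            for (int p = 0; p < np; p++) {
                int    c = (int)(x[p] / dx);          /* cell index */
                double f = x[p] / dx - c;             /* fractional offset */
                priv[c]     += w[p] * (1.0 - f);
                priv[c + 1] += w[p] * f;
            }
            #pragma omp critical                      /* reduce private grids */
            for (int g = 0; g < ng; g++) grid[g] += priv[g];
            free(priv);
        }
    }

On GPUs the analogous trick is to keep the private copy (or a tiled portion of it) in shared memory per thread block and reduce to global memory once per block.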
{"title":"Abstract: Advances in Gyrokinetic Particle in Cell Simulation for Fusion Plasmas to Extreme Scale","authors":"Bei Wang, S. Ethier, W. Tang, K. Ibrahim, Kamesh Madduri, Samuel Williams, L. Oliker, T. Williams","doi":"10.1109/SC.Companion.2012.243","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.243","url":null,"abstract":"The Gyrokinetic Particle-in-cell (PIC) method has been successfully applied in studies of low-frequency microturbulence in magnetic fusion plasmas. While the excellent scaling of PIC codes on modern computing platforms is well established, significant challenges remain in achieving high on-chip concurrency for the new path to exascale systems. In addressing associated issues, it is necessary to deal with the basic gather-scatter operation and the relatively low computational intensity in the PIC method. Significant advancements have been achieved in optimizing gather-scatter operations in the gyrokinetic PIC method for next-generation multi-core CPU and GPU architectures. In particular, we will report on new techniques that improve locality, reduce memory conflict, and efficiently utilize shared memory on GPU's. Performance benchmarks on two high-end computing platforms -- the IBM BlueGene/Q (Mira) system at the Argonne Leadership Computing Facility (ALCF) and the Cray XK6 (Titan Dev) with the latest GPU at Oak Ridge Leadership Computing Facility (OLCF) - will be presented.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"15 1","pages":"1439-1440"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90656932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Abstract: An Exascale Workload Study
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.261 | Pages: 1463-1464
Prasanna Balaprakash, Darius Buntinas, Anthony Chan, Apala Guha, Rinku Gupta, S. Narayanan, A. Chien, P. Hovland, B. Norris
Amdahl's law has been one of the factors shaping speedup in high performance computing over the last few decades. While Amdahl's approach of optimizing the common case (the 10% of the code where 90% of the execution time is spent) has worked very well in the past, new challenges related to emerging exascale heterogeneous architectures, combined with stringent power and energy limitations, require a new architectural paradigm. The 10x10 approach is an effort in this direction. In this poster, we describe our initial steps and methodologies for defining and actualizing the 10x10 approach.
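To make the ceiling concrete, a quick worked instance of Amdahl's law under the abstract's 90/10 split: if a fraction p of the execution time is accelerated by a factor s, the overall speedup is S = 1 / ((1 - p) + p/s). With p = 0.9, even an arbitrarily large s caps the overall speedup at 1/(1 - 0.9) = 10x, and a realistic s = 10 yields only 1/(0.1 + 0.09), roughly 5.3x. The unaccelerated 10% quickly dominates, which is why approaches like 10x10 look beyond a single hot region of the code.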
{"title":"Abstract: An Exascale Workload Study","authors":"Prasanna Balaprakash, Darius Buntinas, Anthony Chan, Apala Guha, Rinku Gupta, S. Narayanan, A. Chien, P. Hovland, B. Norris","doi":"10.1109/SC.Companion.2012.261","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.261","url":null,"abstract":"Amdahl's law has been one of the factors influencing speedup in high performance computing over the last few decades. While Amdahl's approach of optimizing (10% of the code is where 90% of the execution time is spent) has worked very well in the past, new challenges related to emerging exascale heterogeneous architectures, combined with stringent power and energy limitations, require a new architectural paradigm. The 10x10 approach is an effort in this direction. In this poster, we describe our initial steps and methodologies for defining and actualizing the 10x10 approach.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"8 1","pages":"1463-1464"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89936793","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel Unstructured Mesh Infrastructure
Pub Date: 2012-11-10 | DOI: 10.1109/SC.Companion.2012.135 | Pages: 1124-1132
E. Seol, Cameron W. Smith, D. Ibanez, M. Shephard
Two software packages from the Department of Energy (DOE) Office of Science's Scientific Discovery through Advanced Computing (SciDAC) Frameworks, Algorithms, and Scalable Technologies for Mathematics (FASTMath) institute are presented: the Parallel Unstructured Mesh Infrastructure (PUMI) and Partitioning using Mesh Adjacencies (ParMA).
{"title":"A Parallel Unstructured Mesh Infrastructure","authors":"E. Seol, Cameron W. Smith, D. Ibanez, M. Shephard","doi":"10.1109/SC.Companion.2012.135","DOIUrl":"https://doi.org/10.1109/SC.Companion.2012.135","url":null,"abstract":"Two Department of Energy (DOE) office of Science's Scientific Discovery through Advanced Computing (SciDAC) Frameworks, Algorithms, and Scalable Technologies for Mathematics (FASTMath) software packages, Parallel Unstructured Mesh Infrastructure (PUMI) and Partitioning using Mesh Adjacencies (ParMA), are presented.","PeriodicalId":6346,"journal":{"name":"2012 SC Companion: High Performance Computing, Networking Storage and Analysis","volume":"26 1","pages":"1124-1132"},"PeriodicalIF":0.0,"publicationDate":"2012-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90905673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}