PGX.D: a fast distributed graph processing engine
Sungpack Hong, Siegfried Depner, Thomas Manhardt, J. V. D. Lugt, Merijn Verstraaten, Hassan Chafi
SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, 2015-11-15. DOI: https://doi.org/10.1145/2807591.2807620
Abstract: Graph analysis is a powerful method in data analysis. Although several frameworks have been proposed for processing large graph instances in distributed environments, their performance is much lower than that of efficient single-machine implementations given enough memory. In this paper, we present PGX.D, a fast distributed graph processing system. We show that PGX.D significantly outperforms other distributed graph systems such as GraphLab (3x -- 90x). Furthermore, PGX.D on 4 to 16 machines is also faster than an implementation optimized for single-machine execution. Using a fast cooperative context-switching mechanism, we implement PGX.D as a low-overhead, bandwidth-efficient communication framework that supports remote data-pulling patterns. Moreover, PGX.D achieves large traffic reduction and good workload balance by applying selective ghost nodes, edge partitioning, and edge chunking transparently to the user. Our analysis confirms that each of these features is indeed crucial for the overall performance of certain kinds of graph algorithms. Finally, we advocate the use of balanced beefy clusters where the aggregate sustained random DRAM-access bandwidth is matched with the bandwidth of the underlying interconnection fabric.
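The abstract credits part of PGX.D's workload balance to edge chunking. As a rough illustration of the idea (not PGX.D's actual implementation; the function name and chunk size are hypothetical), the sketch below splits each vertex's edge list into bounded-size chunks so that a skewed, high-degree vertex becomes several evenly sized work units:

```python
# Hypothetical sketch of edge chunking: the edge list of a high-degree
# vertex is split into fixed-size chunks so that no single worker
# processes a disproportionate share of edges.

def chunk_edges(adjacency, chunk_size=4):
    """Split each vertex's edge list into chunks of at most chunk_size."""
    chunks = []
    for vertex, neighbors in adjacency.items():
        for start in range(0, len(neighbors), chunk_size):
            chunks.append((vertex, neighbors[start:start + chunk_size]))
    return chunks

# A skewed toy graph: vertex 0 has many more edges than the others.
graph = {0: list(range(1, 11)), 1: [0, 2], 2: [0]}
work_units = chunk_edges(graph, chunk_size=4)

# Vertex 0 contributes three chunks (10 edges, 4 per chunk); the
# low-degree vertices contribute one chunk each -> 5 even work units.
print(len(work_units))  # 5
```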
A kernel-independent FMM in general dimensions
William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, G. Biros
DOI: https://doi.org/10.1145/2807591.2807647
Abstract: We introduce a general-dimensional, kernel-independent, algebraic fast multipole method and apply it to kernel regression. The motivation for this work is the approximation of kernel matrices, which appear in mathematical physics, approximation theory, non-parametric statistics, and machine learning. Existing fast multipole methods are asymptotically optimal, but the underlying constants scale quite badly with the ambient space dimension. We introduce a method that mitigates this shortcoming; it only requires kernel evaluations and scales well with the problem size, the number of processors, and the ambient dimension---as long as the intrinsic dimension of the dataset is small. We test the performance of our method on several synthetic datasets. As a highlight, our largest run was on an image dataset with 10 million points in 246 dimensions.
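For context on the workload, a naive Nadaraya-Watson kernel regression shows the dense kernel-matrix products that a fast multipole method approximates in near-linear time. This O(N·M) sketch is purely illustrative and not the paper's algorithm; the Gaussian kernel and bandwidth h are assumptions:

```python
import numpy as np

def gaussian_kernel(x, y, h=0.5):
    # Pairwise kernel matrix K[i, j] = exp(-|x_i - y_j|^2 / (2 h^2)).
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h * h))

def kernel_regress(x_train, y_train, x_query, h=0.5):
    # Nadaraya-Watson estimate: kernel-weighted average of targets.
    # The dense K below is exactly what an FMM avoids forming.
    K = gaussian_kernel(x_query, x_train, h)
    return (K @ y_train) / K.sum(axis=1)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=200)
pred = kernel_regress(x, y, x[:5])
print(pred.shape)  # (5,)
```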
Network endpoint congestion control for fine-grained communication
Nan Jiang, Larry R. Dennison, W. Dally
DOI: https://doi.org/10.1145/2807591.2807600
Abstract: Endpoint congestion in HPC networks creates tree saturation that is detrimental to performance. Endpoint congestion can be alleviated by reducing the injection rate of traffic sources, but doing so requires fast reaction times to avoid congestion buildup. Congestion control becomes more challenging as application communication shifts from the traditional two-sided model to the potentially fine-grained, one-sided communication embodied by various global address space programming models. Existing hardware solutions, such as Explicit Congestion Notification (ECN) and the Speculative Reservation Protocol (SRP), either react too slowly or incur too much overhead for small messages. In this study we present two new endpoint congestion-control protocols, Small-Message SRP (SMSRP) and Last-Hop Reservation Protocol (LHRP), both targeted specifically at small messages. Experiments show they can quickly respond to endpoint congestion and prevent tree saturation in the network. Under congestion-free traffic conditions, the new protocols generate minimal overhead, with performance comparable to networks with no endpoint congestion control.
STELLA: a domain-specific tool for structured grid methods in weather and climate models
Tobias Gysi, C. Osuna, O. Fuhrer, Mauro Bianco, T. Schulthess
DOI: https://doi.org/10.1145/2807591.2807627
Abstract: Many high-performance computing applications that solve partial differential equations (PDEs) belong to the class of kernels that apply stencils on structured grids. Due to the disparity between floating-point throughput and main-memory bandwidth, these codes typically achieve only a low fraction of peak performance. Unfortunately, stencil-computation optimization techniques are often hardware-dependent and lead to a significant increase in code complexity. We present STELLA, a domain-specific tool that eases the burden on the application developer by separating the architecture-dependent implementation strategy from the user code; it targets multi- and manycore processors. Using the example of a numerical weather prediction and regional climate model (COSMO), we demonstrate the usefulness of STELLA for a real-world production code. The dynamical core based on STELLA achieves speedups of 1.8x (CPU) and 5.8x (GPU) with respect to the legacy code while reducing the complexity of the user code.
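A minimal example of the kernel class STELLA targets is a 5-point Laplacian stencil on a structured grid. This plain NumPy version merely stands in for the hand-tuned CPU/GPU backends a DSL like STELLA would generate; it is a sketch, not STELLA code:

```python
import numpy as np

def laplacian(u, h=1.0):
    """Apply the 5-point Laplacian stencil to the interior of a 2D field."""
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (
        u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
        - 4.0 * u[1:-1, 1:-1]
    ) / (h * h)
    return out

u = np.fromfunction(lambda i, j: i * i + j * j, (8, 8))  # u = x^2 + y^2
lap = laplacian(u)
# Analytically, the Laplacian of x^2 + y^2 is 4 everywhere, and the
# 5-point stencil reproduces that exactly for a quadratic field.
print(lap[3, 3])  # 4.0
```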
Particle tracking in open simulation laboratories
Kalin Kanov, R. Burns
DOI: https://doi.org/10.1145/2807591.2807645
Abstract: Particle tracking along streamlines and pathlines is a common scientific analysis technique with demanding data, computation, and communication requirements. It has been studied in the context of high-performance computing due to the difficulty of parallelizing it efficiently and its high communication and computational load. In this paper, we study efficient evaluation methods for particle tracking in open simulation laboratories. Simulation laboratories have a fundamentally different architecture from today's supercomputers and provide publicly available analysis functionality. We focus on the I/O demands of particle tracking for numerical simulation datasets hundreds of terabytes in size. We compare data-parallel and task-parallel approaches to particle advection and show scalability results on data-intensive workloads from a live production environment. We have developed particle tracking capabilities for the Johns Hopkins Turbulence Databases, which store computational fluid dynamics simulation data, including forced isotropic turbulence, magnetohydrodynamics, channel-flow turbulence, and homogeneous buoyancy-driven turbulence.
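The advection at the heart of particle tracking amounts to numerically integrating dx/dt = v(x, t) for a batch of seed points. The toy below uses a second-order Runge-Kutta (midpoint) step with an analytic rotation field standing in for database-served turbulence data; all names are illustrative, not the databases' API:

```python
import numpy as np

def velocity(points, t):
    # Rigid rotation about the origin: v = (-y, x).
    return np.stack([-points[:, 1], points[:, 0]], axis=1)

def advect(points, t0, t1, dt, v=velocity):
    """Midpoint (RK2) integration of particle positions along pathlines."""
    t = t0
    while t < t1:
        step = min(dt, t1 - t)
        mid = points + 0.5 * step * v(points, t)
        points = points + step * v(mid, t + 0.5 * step)
        t += step
    return points

seeds = np.array([[1.0, 0.0], [0.0, 2.0]])
final = advect(seeds, 0.0, 2.0 * np.pi, 1e-3)
print(np.round(final, 2))
```

Integrating over one full rotation period returns the particles close to their seed positions, a handy sanity check on the integrator's accuracy.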
C2-bound: a capacity and concurrency driven analytical model for many-core design
Yuhang Liu, Xian-He Sun
DOI: https://doi.org/10.1145/2807591.2807641
Abstract: In this paper, we propose C2-Bound, a data-driven analytical model that incorporates both memory capacity and data-access concurrency factors to optimize many-core design. C2-Bound combines a newly proposed latency model, concurrent average memory access time (C-AMAT), with the well-known memory-bounded speedup model (Sun-Ni's law). In contrast to traditional chip designs that lack the notion of memory concurrency and memory capacity, the C2-Bound model finds that memory-bound factors significantly impact the optimal number of cores and their optimal silicon-area allocations, especially for data-intensive applications with a non-parallelizable sequential portion. Our model is therefore valuable to the design of new-generation many-core architectures that target big-data processing, where working sets are usually larger than in conventional scientific computing. These findings are evidenced by our detailed simulations, which show that with C2-Bound the design space can be narrowed down by up to four orders of magnitude. C2-Bound analytic results can be used either in reconfigurable hardware environments or by software designers for scheduling, partitioning, and allocating resources among diverse applications.
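One ingredient of C2-Bound, Sun-Ni's memory-bounded speedup, can be written down directly. The sketch below assumes the textbook formulation with sequential fraction f and a memory-capacity-driven workload scaling factor g(n); it is not the paper's full model, which also folds in C-AMAT:

```python
# Sun-Ni's memory-bounded speedup on n cores:
#
#   S(n) = (f + (1 - f) * g(n)) / (f + (1 - f) * g(n) / n)
#
# g(n) = 1 recovers Amdahl's law (fixed workload) and g(n) = n
# recovers Gustafson's law (workload scaled with core count).

def sun_ni_speedup(f, n, g):
    scaled = (1.0 - f) * g(n)
    return (f + scaled) / (f + scaled / n)

f = 0.05  # 5% sequential portion
print(round(sun_ni_speedup(f, 64, lambda n: 1), 2))  # Amdahl bound
print(round(sun_ni_speedup(f, 64, lambda n: n), 2))  # Gustafson bound
```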
VOCL-FT: introducing techniques for efficient soft error coprocessor recovery
Antonio J. Peña, Wesley Bland, P. Balaji
DOI: https://doi.org/10.1145/2807591.2807640
Abstract: Popular accelerator programming models rely on offloading computation and the corresponding data transfers to coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks of an efficient fault-tolerant system for accelerators. Although we leverage our techniques to protect against detected but uncorrected ECC errors in device memory in OpenCL-accelerated applications, coprocessor reliability solutions based on different error detectors and similar API semantics can directly adopt the techniques we propose. Adding error detection and protection involves a tradeoff between runtime overhead and recovery time. Although the optimal configuration depends on the particular application, the length of the run, the error rate, and the temporary-storage speed, our test cases reveal a good balance with significantly reduced runtime overheads.
ScaAnalyzer: a tool to identify memory scalability bottlenecks in parallel programs
Xu Liu, Bo Wu
DOI: https://doi.org/10.1145/2807591.2807648
Abstract: It is difficult to scale parallel programs on systems that employ a large number of cores. To identify scalability bottlenecks, existing tools principally pinpoint poor thread-synchronization strategies or unnecessary data communication. The memory subsystem is one of the key contributors to poor parallel scaling on multicore machines; state-of-the-art tools, however, either lack sophisticated capabilities or are incapable of pinpointing scalability bottlenecks arising from it. To address this issue, we develop ScaAnalyzer, a tool that pinpoints scaling losses due to the poor memory-access behavior of parallel programs. ScaAnalyzer collects, attributes, and analyzes memory-related metrics during program execution while incurring very low overhead. It provides high-level, detailed guidance to programmers for scalability optimization. We demonstrate the utility of ScaAnalyzer with case studies of three parallel programs. For each benchmark, ScaAnalyzer identifies scalability bottlenecks caused by poor memory-access behavior and provides optimization guidance that yields significant improvements in scalability.
Adaptive and transparent cache bypassing for GPUs
Ang Li, Gert-Jan van den Braak, Akash Kumar, H. Corporaal
DOI: https://doi.org/10.1145/2807591.2807606
Abstract: Over the last decade, GPUs have been widely adopted for general-purpose applications. To capture on-chip locality for these applications, modern GPUs integrate a multilevel cache hierarchy in an attempt to reduce the amount and latency of the massive, and sometimes irregular, memory accesses. However, inferior performance is frequently attained due to serious congestion in the caches resulting from the huge number of concurrent threads. In this paper, we propose a novel compile-time framework for adaptive and transparent cache bypassing on GPUs. It uses a simple yet effective approach to control the bypass degree to match the size of applications' runtime footprints. We validate the design on seven GPU platforms, covering all existing GPU generations, using 16 applications from widely used GPU benchmarks. Experiments show that our design can significantly mitigate the negative impact of small cache sizes and improve overall performance. We analyze the performance across different platforms and applications, and propose optimization guidelines on how to use GPU caches efficiently.
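As a hypothetical illustration of matching the bypass degree to the runtime footprint (not the paper's compiler heuristic; all names and sizes are invented for the example), one could cache only as many threads' working sets as fit in the cache and route the remaining threads' loads around it:

```python
# Toy bypass-degree heuristic: if the combined per-thread footprints
# exceed the cache, only a subset of threads gets cached and the rest
# bypass, avoiding cache thrashing under massive concurrency.

def bypass_degree(num_threads, footprint_per_thread, cache_size):
    """Fraction of threads whose loads should bypass the cache."""
    fit = cache_size // footprint_per_thread  # threads the cache can hold
    cached = min(num_threads, fit)
    return (num_threads - cached) / num_threads

# 48 KB L1, 1024 resident threads each touching 256 B: only 192
# threads fit, so roughly 81% of threads are directed around the cache.
print(bypass_degree(1024, 256, 48 * 1024))
```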
HipMer: an extreme-scale de novo genome assembler
E. Georganas, A. Buluç, J. Chapman, S. Hofmeyr, Chaitanya Aluru, R. Egan, L. Oliker, D. Rokhsar, K. Yelick
DOI: https://doi.org/10.1145/2807591.2807664
Abstract: De novo whole-genome assembly reconstructs genomic sequences from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMer, the first high-quality end-to-end de novo assembler designed for extreme-scale analysis, via efficient parallelization of the Meraculous code. First, we significantly improve the scalability of parallel k-mer analysis for complex repetitive genomes that exhibit skewed frequency distributions. Next, we optimize the traversal of the de Bruijn graph of k-mers by employing a novel communication-avoiding parallel algorithm in a variety of use-case scenarios. Finally, we parallelize the Meraculous scaffolding modules by leveraging the one-sided communication capabilities of Unified Parallel C (UPC) while effectively mitigating load imbalance. Large-scale results on a Cray XC30 using grand-challenge genomes demonstrate efficient performance and scalability on thousands of cores. Overall, our pipeline accelerates Meraculous by orders of magnitude, enabling complete assembly of the human genome in just 8.4 minutes on 15K cores of the Cray XC30 and creating unprecedented capability for extreme-scale genomic analysis.
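The de Bruijn graph whose traversal HipMer parallelizes can be built in miniature by linking each k-mer's (k-1)-length prefix to its suffix. A toy single-process sketch, far from HipMer's distributed implementation:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # prefix -> suffix edge
    return graph

reads = ["ACGTAC", "GTACGT"]
g = de_bruijn(reads, 3)
# "AC" is the prefix of k-mer "ACG" in both reads, so it has two
# parallel edges to "CG"; repeated edges reflect k-mer multiplicity.
print(sorted(g["AC"]))  # ['CG', 'CG']
```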