Latest articles in ACM Sigplan Notices

vSensor
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178497
Xiongchao Tang, Jidong Zhai, Xuehai Qian, Bingsheng He, W. Xue, Wenguang Chen
Performance variance becomes increasingly challenging on current large-scale HPC systems. Even using a fixed number of computing nodes, the execution time of several runs can vary significantly. Many parallel programs executing on supercomputers suffer from such variance. Performance variance not only causes unpredictable performance requirement violations, but also makes it unintuitive to understand the program behavior. Despite prior efforts, efficient on-line detection of performance variance remains an open problem. In this paper, we propose vSensor, a novel approach for light-weight and on-line performance variance detection. The key insight is that, instead of solely relying on an external detector, the source code of a program itself could reveal the runtime performance characteristics. Specifically, many parallel programs contain code snippets that are executed repeatedly with an invariant quantity of work. Based on this observation, we use compiler techniques to automatically identify these fixed-workload snippets and use them as performance variance sensors (v-sensors) that enable effective detection. We evaluate vSensor with a variety of parallel programs on the Tianhe-2 system. Results show that vSensor can effectively detect performance variance on HPC systems. The performance overhead is smaller than 4% with up to 16,384 processes. In particular, with vSensor, we found a bad node with slow memory that slowed a program's performance by 21%. As a showcase, we also detected a severe network performance problem that caused a 3.37X slowdown for an HPC kernel program on the Tianhe-2 system.
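The v-sensor idea can be sketched outside the paper's compiler framework: time a code snippet whose workload is invariant across invocations, and flag iterations whose duration deviates from the norm. This is an illustrative Python sketch, not the authors' implementation (which identifies fixed-workload snippets automatically via compiler analysis); the deviation policy and thresholds here are assumptions.

```python
import time
import statistics

def fixed_workload():
    # A snippet whose amount of work is invariant across invocations,
    # playing the role of a compiler-identified v-sensor region.
    s = 0
    for i in range(100_000):
        s += i * i
    return s

def run_with_vsensor(iterations, threshold=3.0):
    """Time the fixed-workload snippet on each iteration and flag
    iterations whose duration deviates from the running median by more
    than `threshold` median absolute deviations (illustrative policy)."""
    timings, anomalies = [], []
    for it in range(iterations):
        t0 = time.perf_counter()
        fixed_workload()
        dt = time.perf_counter() - t0
        timings.append(dt)
        if len(timings) >= 5:
            med = statistics.median(timings)
            mad = statistics.median(abs(t - med) for t in timings) or 1e-12
            if abs(dt - med) > threshold * mad:
                anomalies.append(it)
    return timings, anomalies
```

Because the snippet's work is fixed, any large timing deviation points at the environment (a slow node, a noisy network) rather than the program itself.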
Pages: 124-136
Citations: 0
PAM
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178509
Yihan Sun, Daniel Ferizovic, Guy E. Blelloch
Ordered (key-value) maps are an important and widely-used data type for large-scale data processing frameworks. Beyond simple search, insertion and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical part of these frameworks. We describe an interface for ordered maps that is augmented to support fast range queries and sums, and introduce a parallel and concurrent library called PAM (Parallel Augmented Maps) that implements the interface. The interface includes a wide variety of functions on augmented maps ranging from basic insertion and deletion to more interesting functions such as union, intersection, filtering, extracting ranges, splitting, and range-sums. We describe algorithms for these functions that are efficient both in theory and practice. As examples of the use of the interface and the performance of PAM we apply the library to four applications: simple range sums, interval trees, 2D range trees, and ranked word index searching. The interface greatly simplifies the implementation of these data structures over direct implementations. Sequentially the code achieves performance that matches or exceeds existing libraries designed specially for a single application, and in parallel our implementation gets speedups ranging from 40 to 90 on 72 cores with 2-way hyperthreading.
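The core of an augmented map is a search tree whose nodes cache an aggregate (here, a sum) over their subtree, so range-sum queries touch O(depth) nodes instead of every key in the range. Below is a minimal single-threaded Python sketch of that idea, assuming an unbalanced BST and sum augmentation; PAM itself is a parallel C++ library with balanced trees and a much richer interface (union, intersection, filter, split).

```python
class Node:
    __slots__ = ("key", "val", "left", "right", "aug")
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None
        self.aug = val  # augmented value: sum of vals in this subtree

def _aug(n):
    return n.aug if n else 0

def insert(n, key, val):
    if n is None:
        return Node(key, val)
    if key < n.key:
        n.left = insert(n.left, key, val)
    elif key > n.key:
        n.right = insert(n.right, key, val)
    else:
        n.val = val
    n.aug = n.val + _aug(n.left) + _aug(n.right)  # restore invariant
    return n

def suffix_sum(n, lo):
    # Sum of vals with key >= lo, reusing whole-subtree sums.
    if n is None:
        return 0
    if n.key < lo:
        return suffix_sum(n.right, lo)
    return n.val + _aug(n.right) + suffix_sum(n.left, lo)

def prefix_sum(n, hi):
    # Sum of vals with key <= hi.
    if n is None:
        return 0
    if n.key > hi:
        return prefix_sum(n.left, hi)
    return n.val + _aug(n.left) + prefix_sum(n.right, hi)

def range_sum(n, lo, hi):
    if n is None or hi < lo:
        return 0
    if hi < n.key:
        return range_sum(n.left, lo, hi)
    if lo > n.key:
        return range_sum(n.right, lo, hi)
    return n.val + suffix_sum(n.left, lo) + prefix_sum(n.right, hi)
```

Swapping the sum for max, count, or an interval-tree aggregate gives the other applications the abstract mentions without changing the tree code.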
Pages: 290-304
Citations: 2
Featherlight on-the-fly false-sharing detection
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178499
Milind Chabbi, Shasha Wen, Xu Liu
Shared-memory parallel programs routinely suffer from false sharing---a performance degradation caused by different threads accessing different variables that reside on the same CPU cacheline and a...
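The pattern a false-sharing detector looks for can be stated concretely: distinct threads writing distinct byte offsets that map to the same cacheline. This is a toy trace-based sketch of that classification, not the paper's lightweight on-the-fly mechanism; the 64-byte line size and the trace format are assumptions.

```python
CACHELINE = 64  # assumed cacheline size in bytes

def find_false_sharing(trace):
    """Flag cachelines where different threads write different byte
    offsets of the same line (false sharing), while leaving lines
    where threads write the *same* offset (true sharing) unflagged.
    `trace` is an iterable of (thread_id, address, is_write) tuples."""
    lines = {}
    for tid, addr, is_write in trace:
        if not is_write:
            continue
        line = addr // CACHELINE
        lines.setdefault(line, set()).add((tid, addr % CACHELINE))
    suspects = []
    for line, writes in lines.items():
        threads = {tid for tid, _ in writes}
        offsets = {off for _, off in writes}
        if len(threads) > 1 and len(offsets) > 1:
            suspects.append(line)
    return suspects
```

A usual fix, once a line is flagged, is to pad or align the hot variables so each thread's data lands on its own cacheline.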
Citations: 3
Two concurrent data structures for efficient datalog query processing
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178525
Herbert Jordan, Bernhard Scholz, Pavle Subotić
In recent years, Datalog has gained popularity for the implementation of advanced data analysis. Applications benefit from Datalog's high-level, declarative syntax, and availability of efficient al...
Citations: 1
Lazygraph
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178508
Lei Wang, Liangji Zhuang, Junhang Chen, Huimin Cui, Fang Lv, Y. Liu, Xiaobing Feng
Replicas of a vertex play an important role in existing distributed graph processing systems: they allow a single vertex to be processed in parallel by multiple machines and let remote neighbors be accessed locally, without any remote access. However, replicas of vertices introduce a data coherency problem. Existing distributed graph systems treat the replicas of a vertex v as an atomic and indivisible vertex, and use an eager data coherency approach to guarantee replica atomicity. In the eager approach, any change to vertex data must be immediately communicated to all replicas of v, leading to frequent global synchronizations and communications. In this paper, we propose a lazy data coherency approach, called LazyAsync, which treats the replicas of a vertex as independent vertices and maintains data coherency by computation rather than by communication as in the existing eager approach. Our approach automatically selects some data coherency points from the graph algorithm and maintains a shared global view across all replicas only at such points, which means replicas may hold different local views between any two adjacent data coherency points. Based on PowerGraph, we develop a distributed graph processing system, LazyGraph, that implements the LazyAsync approach and exploits graph-aware optimizations.
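The eager-versus-lazy contrast can be made concrete with a toy model: replicas accumulate local deltas silently, and only at a coherency point is a single reconciliation performed to restore a shared global view. This is an illustrative Python sketch of the coherency protocol only, with an additive vertex update assumed; LazyGraph itself is a distributed system built on PowerGraph.

```python
class Replica:
    """One machine's replica of a vertex. Between coherency points it
    accumulates local contributions without sending any messages."""
    def __init__(self, value):
        self.value = value      # last agreed-upon global view
        self.pending = 0.0      # local delta not yet reconciled

    def local_update(self, delta):
        self.pending += delta   # lazy: no communication here

def coherency_point(replicas):
    # At a coherency point, every replica recomputes the same global
    # view from the exchanged deltas -- one round of communication
    # replaces the per-update broadcasts of the eager approach.
    total = sum(r.pending for r in replicas)
    base = replicas[0].value    # invariant: all equal after last sync
    for r in replicas:
        r.value = base + total
        r.pending = 0.0
```

Between two coherency points the replicas deliberately disagree; the algorithm only depends on them agreeing at the points the system selects.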
Pages: 276-289
Citations: 0
Layrub
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178528
Bo Liu, Wenbin Jiang, Hai Jin, Xuanhua Shi, Yang Ma
Growing accuracy and robustness of Deep Neural Network (DNN) models are accompanied by growing model capacity (going deeper or wider). However, the high memory requirements of those models make it difficult to execute the training process on one GPU. To address this, we first identify the memory usage characteristics of deep and wide convolutional networks, and demonstrate the opportunities for memory reuse at both the intra-layer and inter-layer levels. We then present Layrub, a runtime data placement strategy that orchestrates the execution of the training process. It achieves layer-centric reuse to reduce memory consumption for extreme-scale deep learning that cannot be run on one single GPU.
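Layer-centric reuse boils down to a pooling discipline: a layer borrows a buffer for its output and returns it as soon as no later layer needs it, so peak memory tracks the live set rather than the whole network. The sketch below is an assumed, simplified model in Python (the paper targets GPU training buffers); `BufferPool` and its accounting are illustrative names, not Layrub's API.

```python
class BufferPool:
    """Hands out buffers and reuses released ones, tracking the
    high-water mark of bytes actually allocated from the system."""
    def __init__(self):
        self.free = []   # (size, buffer) pairs available for reuse
        self.live = 0    # total bytes obtained from the allocator
        self.peak = 0    # high-water mark of `live`

    def acquire(self, size):
        # Prefer recycling a released buffer that is large enough.
        for i, (sz, buf) in enumerate(self.free):
            if sz >= size:
                self.free.pop(i)
                return buf
        self.live += size
        self.peak = max(self.peak, self.live)
        return bytearray(size)

    def release(self, buf):
        self.free.append((len(buf), buf))
```

For a chain of layers where each output dies once the next layer has consumed it, only two buffers are ever live at once, regardless of network depth.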
Pages: 405-406
Citations: 0
DisCVar
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178502
Harshitha Menon, K. Mohror
Aggressive technology scaling trends have made the hardware of high performance computing (HPC) systems more susceptible to faults. Some of these faults can lead to silent data corruption (SDC), and represent a serious problem because they alter the HPC simulation results. In this paper, we present a full-coverage, systematic methodology called DisCVar to identify critical variables in HPC applications for protection against SDC. DisCVar uses automatic differentiation (AD) to determine the sensitivity of the simulation output to errors in program variables. We empirically validate our approach in identifying vulnerable variables by comparing the results against a full-coverage code-level fault injection campaign. We find that our DisCVar correctly identifies the variables that are critical to ensure application SDC resilience with a high degree of accuracy compared to the results of the fault injection campaign. Additionally, DisCVar requires only two executions of the target program to generate results, whereas in our experiments we needed to perform millions of executions to get the same information from a fault injection campaign.
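The use of automatic differentiation for criticality can be illustrated with a tiny forward-mode AD type: seed each program variable in turn and record how strongly the output responds; variables with large derivative magnitudes are the ones most worth protecting. This is a minimal sketch of the AD idea only, assuming a toy `Dual` class; DisCVar operates on full HPC codes, not hand-written expressions.

```python
class Dual:
    """Forward-mode AD value: carries f and df/dx for the one seeded
    input, enough to rank variables by output sensitivity."""
    def __init__(self, v, d=0.0):
        self.v, self.d = v, d
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v + o.v, self.d + o.d)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.v * o.v, self.d * o.v + self.v * o.d)
    __rmul__ = __mul__

def sensitivities(f, inputs):
    # Seed each variable in turn (derivative 1 for it, 0 elsewhere)
    # and record |d output / d variable|. Larger magnitude = an error
    # in that variable perturbs the result more = more critical.
    out = {}
    for name in inputs:
        duals = {k: Dual(v, 1.0 if k == name else 0.0)
                 for k, v in inputs.items()}
        out[name] = abs(f(duals).d)
    return out
```

Note the cost profile the abstract highlights: one AD-instrumented run per seeding strategy, versus millions of runs for exhaustive fault injection.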
Pages: 195-206
Citations: 2
Juggler
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178492
M. Belviranli, Seyong Lee, J. Vetter, L. Bhuyan
Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, the existence of data dependences across thread blocks may significantly impact the speedup by requiring global synchronization across multiprocessors (SMs) inside the GPU. To efficiently run applications with interblock data dependences, we need fine-granular task-based execution models that will treat SMs inside a GPU as stand-alone parallel processing units. Such a scheme will enable faster execution by utilizing all internal computation elements inside the GPU and eliminating unnecessary waits during device-wide global barriers. In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, hence eliminating the need for kernel-wide global synchronization. Juggler requires no or little modification to the source code, and once launched, the runtime entirely runs on the GPU without relying on the host through the entire execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement against global barrier based implementation, with minimal runtime overhead.
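The contrast with barrier-based execution is the scheduling discipline: a task runs as soon as its own predecessors finish, rather than waiting for every task in the previous "level" at a global barrier. The sketch below models that discipline sequentially in Python, purely to show the dependence counting; Juggler itself runs tasks on GPU SMs via an in-device OpenMP 4.5 runtime, which this does not attempt to reproduce.

```python
from collections import deque

def run_tasks(deps, work):
    """Dependence-driven execution: `deps[t]` lists the tasks t waits
    on, `work[t]` is its body. A task becomes ready the moment its
    last predecessor finishes -- no device-wide barrier in between."""
    remaining = {t: len(d) for t, d in deps.items()}
    consumers = {}
    for t, ds in deps.items():
        for d in ds:
            consumers.setdefault(d, []).append(t)
    ready = deque(t for t, n in remaining.items() if n == 0)
    order = []
    while ready:
        t = ready.popleft()
        work[t]()
        order.append(t)
        for c in consumers.get(t, []):   # retire t, release consumers
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)
    return order
```

With per-task readiness, independent tasks from different "levels" can overlap on real hardware, which is where the speedup over global barriers comes from.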
Pages: 54-67
Citations: 2
swSpTRSV
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178513
Xinliang Wang, Weifeng Liu, W. Xue, Li Wu
Sparse triangular solve (SpTRSV) is one of the most important kernels in many real-world applications. Currently, much research on parallel SpTRSV focuses on level-set construction for reducing the number of inter-level synchronizations. However, the out-of-control data reuse and high cost for global memory or shared cache access in inter-level synchronization have been largely neglected in existing work. In this paper, we propose a novel data layout called Sparse Level Tile to make all data reuse under control, and design a Producer-Consumer pairing method to make any inter-level synchronization only happen in very fast register communication. We implement our data layout and algorithms on an SW26010 many-core processor, which is the main building-block of the current world fastest supercomputer Sunway Taihulight. The experimental results of testing all 2057 square matrices from the Florida Matrix Collection show that our method achieves an average speedup of 6.9 and the best speedup of 38.5 over parallel level-set method. Our method also outperforms the latest methods on a KNC many-core processor in 1856 matrices and the latest methods on a K80 GPU in 1672 matrices, respectively.
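The level-set construction the abstract refers to can be shown on a dense lower-triangular matrix: row i is placed one level after the deepest row it depends on, and all rows within a level are independent and could be solved in parallel. This is an illustrative Python sketch of the baseline level-set method, not swSpTRSV's Sparse Level Tile layout or its register-communication scheme.

```python
def level_sets(L):
    """Group rows of a lower-triangular system into dependence levels."""
    n = len(L)
    level = [0] * n
    for i in range(n):
        for j in range(i):
            if L[i][j] != 0:
                level[i] = max(level[i], level[j] + 1)
    sets = {}
    for i, lv in enumerate(level):
        sets.setdefault(lv, []).append(i)
    return [sets[lv] for lv in sorted(sets)]

def sptrsv(L, b):
    # Solve L x = b level by level; in a parallel implementation each
    # level's rows run concurrently, with one synchronization per level.
    n = len(L)
    x = [0.0] * n
    for rows in level_sets(L):
        for i in rows:
            s = sum(L[i][j] * x[j] for j in range(i) if L[i][j] != 0)
            x[i] = (b[i] - s) / L[i][i]
    return x
```

The inter-level synchronizations in this scheme are exactly the global-memory costs the paper's Producer-Consumer pairing is designed to replace with fast register communication.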
Pages: 338-353
Citations: 9
A persistent lock-free queue for non-volatile memory
Q1 Computer Science Pub Date : 2018-02-10 DOI: 10.1145/3200691.3178490
Michal Friedman, Maurice Herlihy, Virendra Marathe, Erez Petrank
Non-volatile memory is expected to coexist with (or even displace) volatile DRAM for main memory in upcoming architectures. This has led to increasing interest in the problem of designing and speci...
Citations: 9