
Latest publications from the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks
Syed M. A. H. Jafri, A. Hemani, K. Paul, Naeem Abbas
Today, machine learning based on neural networks has become mainstream in many application domains. A small subset of machine learning algorithms, called Convolutional Neural Networks (CNNs), are considered state-of-the-art for many applications (e.g. video/audio classification). The main challenge in implementing CNNs in embedded systems is their large computation, memory, and bandwidth requirements. To meet these demands, dedicated hardware accelerators have been proposed. Since memory is the major cost in CNNs, recent accelerators focus on reducing memory accesses. In particular, they exploit data locality using tiling, layer merging, or intra/inter feature map parallelism to reduce the memory footprint. However, they lack the flexibility to interleave or cascade these optimizations. Moreover, most existing accelerators do not exploit compression, which can simultaneously reduce memory requirements, increase throughput, and enhance energy efficiency. To tackle these limitations, we present a flexible accelerator called MOCHA. MOCHA has three features that differentiate it from the state-of-the-art: (i) the ability to compress inputs/kernels, (ii) the flexibility to interleave various optimizations, and (iii) the intelligence to automatically interleave and cascade the optimizations, depending on the dimensions of a specific CNN layer and the available resources. Post-layout synthesis results reveal that MOCHA provides up to 63% higher energy efficiency, up to 42% higher throughput, and up to 30% less storage than the next best accelerator, at the cost of 26-35% additional area.
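The abstract describes MOCHA's third feature as cascading optimizations until a layer's demands fit the available resources. The sketch below is purely illustrative of that idea; the layer tuple, the footprint formula, and the assumed compression and tiling ratios are this sketch's assumptions, not details taken from the paper.

```python
def pick_optimizations(layer, sram_bytes):
    """Illustrative rule-of-thumb planner in the spirit of MOCHA's
    'intelligence' feature: cascade optimizations until the layer's
    working set fits on-chip. All constants here are assumptions."""
    c_in, c_out, h, w, k = layer  # channels in/out, fmap size, kernel
    # 4-byte words: input fmaps + output fmaps + kernels (assumed layout).
    footprint = 4 * (c_in * h * w + c_out * h * w + c_in * c_out * k * k)
    plan = []
    if footprint > sram_bytes:
        plan.append("compress")       # feature (i): compress inputs/kernels
        footprint //= 2               # assume roughly 2x compression
    if footprint > sram_bytes:
        plan.append("tile")           # exploit data locality via tiling
        footprint //= 4               # assume tiling quarters the working set
    if footprint > sram_bytes:
        plan.append("layer-merge")    # cascade a further optimization
    return plan
```

A large early layer would trigger the whole cascade, while a small layer needs no optimization at all.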
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.59
Citations: 15
Fault-Tolerant Online Packet Scheduling on Parallel Channels
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.105
P. Garncarek, T. Jurdzinski, Krzysztof Lorys
We consider the problem of scheduling packets of different lengths via k directed parallel communication links. The links are prone to simultaneous errors: if an error occurs, all links are affected. Dynamic packet arrivals and errors are modelled by a worst-case adversary. The goal is to optimize the competitive throughput of online scheduling algorithms. Two types of failures are considered: jamming, when currently scheduled packets are simply not delivered, and crashes, when additionally the channel scheduler crashes, losing its current state. For the former, milder type of failures, we prove an upper bound on competitive throughput of 3/4 - 1/(4k) for odd values of k, and 3/4 - 1/(4k+4) for even values of k. On the constructive side, we design an online algorithm that, for packets of two different lengths, matches the upper bound on competitive throughput. For comparison, scheduling on independent channels, that is, when the adversary can cause errors on each channel independently, reaches a throughput of 1/2. This shows that scheduling under simultaneous jamming is provably more efficient than scheduling under channel-independent jamming. In the setting with crash failures we prove a general upper bound of (√5-1)/2 on competitive throughput and design an algorithm achieving it for packets of two different lengths. This result has two interesting implications. First, simultaneous crashes are significantly stronger than simultaneous jamming. Second, due to the above-mentioned upper bound of 1/2 on throughput under channel-independent errors, scheduling under simultaneous crashes is significantly stronger than under channel-independent crashes, similarly as in the case of jamming errors.
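A quick numeric check of the bounds quoted in the abstract, showing that for k ≥ 2 both jamming bounds sit strictly between the 1/2 achievable under channel-independent errors and 3/4:

```python
from math import sqrt

def jamming_throughput_bound(k: int) -> float:
    """Upper bound on competitive throughput under simultaneous jamming
    on k parallel channels, as stated in the abstract."""
    if k % 2 == 1:                      # odd k
        return 3 / 4 - 1 / (4 * k)
    return 3 / 4 - 1 / (4 * k + 4)      # even k

# Bound under simultaneous crashes: the inverse golden ratio, ~0.618.
CRASH_BOUND = (sqrt(5) - 1) / 2
```

For k = 1 the jamming bound degenerates to exactly 1/2; as k grows, both families approach 3/4 from below.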
Citations: 3
SlimSell: A Vectorizable Graph Representation for Breadth-First Search
Maciej Besta, Florian Marending, Edgar Solomonik, T. Hoefler
Vectorization and GPUs will profoundly change graph processing. Traditional graph algorithms tuned for 32- or 64-bit based memory accesses will be inefficient on architectures with 512-bit wide (or larger) instruction units, which are already present in the Intel Knights Landing (KNL) manycore CPU. Anticipating this shift, we propose SlimSell: a vectorizable graph representation to accelerate Breadth-First Search (BFS) based on sparse-matrix dense-vector (SpMV) products. SlimSell extends and combines the state-of-the-art SIMD-friendly Sell-C-σ matrix storage format with tropical, real, boolean, and sel-max semiring operations. The resulting design reduces the necessary storage (by up to 50%) and thus the pressure on the memory subsystem. We augment SlimSell with the SlimWork and SlimChunk schemes that reduce the amount of work and improve load balance, further accelerating BFS. We evaluate all the schemes on Intel Haswell multicore CPUs, the state-of-the-art Intel Xeon Phi KNL manycore CPUs, and NVIDIA Tesla GPUs. Our experiments indicate which semiring offers the highest speedups for BFS and illustrate that SlimSell accelerates a tuned Graph500 BFS code by up to 33%. This work shows that vectorization can secure high performance in BFS based on SpMV products; the proposed principles and designs can be extended to other graph algorithms.
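The algebraic formulation the abstract refers to, BFS as repeated SpMV products over a semiring, can be sketched in a few lines with a dense integer adjacency matrix standing in for the Sell-C-σ layout (which this sketch does not reproduce):

```python
import numpy as np

def bfs_spmv(adj: np.ndarray, root: int) -> np.ndarray:
    """BFS levels via repeated matrix-vector products over a
    boolean-style semiring (OR as addition, AND as multiplication).
    adj[i, j] = 1 means there is an edge from vertex i to vertex j."""
    n = adj.shape[0]
    level = np.full(n, -1, dtype=np.int64)   # -1 marks unvisited
    frontier = np.zeros(n, dtype=bool)
    frontier[root] = True
    depth = 0
    while frontier.any():
        level[frontier] = depth
        # y = A^T x: the next frontier is every unvisited vertex that
        # has an incoming edge from the current frontier.
        reached = adj.T @ frontier.astype(np.int64)
        frontier = (reached > 0) & (level == -1)
        depth += 1
    return level
```

On a path graph 0-1-2-3 rooted at 0, this yields levels [0, 1, 2, 3].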
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.93
Citations: 61
Enhancing Datacenter Resource Management through Temporal Logic Constraints
Hao He, Jiang Hu, D. D. Silva
Resource management of modern datacenters needs to consider multiple competing objectives that involve complex system interactions. In this work, Linear Temporal Logic (LTL) is adopted to describe such interactions, leveraging its ability to express complex properties. Further, LTL-based constraints are integrated with reinforcement learning, following recent progress in control synthesis theory. LTL-constrained reinforcement learning facilitates the desired balance among the competing objectives in managing resources for datacenters. The effectiveness of this new approach is demonstrated in two scenarios. In datacenter power management, the LTL-constrained manager reaches the best balance among power, performance, and battery stress compared to previous work and other alternative approaches. In multitenant job scheduling, 200 MapReduce jobs are emulated on the Amazon AWS cloud. The LTL-constrained scheduler achieves the best balance between system performance and fairness compared to several other methods, including three Hadoop schedulers.
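One way an LTL constraint can steer reinforcement learning is via a runtime monitor that penalizes the reward on violation. The property, monitor, and penalty below are hypothetical examples in the spirit of the approach; the paper's constraints and its control-synthesis machinery are far richer.

```python
class SafetyMonitor:
    """Toy monitor for one LTL safety property,
    G(high_power -> X throttle): whenever power draw is high, the very
    next step must throttle. Hypothetical property for illustration."""
    def __init__(self):
        self.pending = False          # saw high_power, awaiting throttle

    def step(self, high_power: bool, throttle: bool) -> bool:
        """Advance the monitor; return True if the trace still
        satisfies the property at this step."""
        ok = not self.pending or throttle
        self.pending = high_power
        return ok

def shaped_reward(base_reward: float, monitor_ok: bool,
                  penalty: float = 10.0) -> float:
    # LTL-constrained RL in spirit: constraint violations are penalized,
    # pushing the learned policy toward constraint-satisfying behavior.
    return base_reward - (0.0 if monitor_ok else penalty)
```

The RL agent then optimizes `shaped_reward` instead of the raw objective, trading some raw performance for constraint satisfaction.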
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.27
Citations: 3
Scalable Graph Traversal on Sunway TaihuLight with Ten Million Cores
Heng Lin, Xiongchao Tang, Bowen Yu, Youwei Zhuo, Wenguang Chen, Jidong Zhai, Wanwang Yin, Weimin Zheng
Interest has recently grown in efficiently analyzing unstructured data such as social network graphs and protein structures. A fundamental graph algorithm for such tasks is the Breadth-First Search (BFS) algorithm, the foundation for many other important graph algorithms such as calculating the shortest path or finding the maximum flow in graphs. In this paper, we share our experience of designing and implementing the BFS algorithm on Sunway TaihuLight, a newly released machine with 40,960 nodes and 10.6 million accelerator cores. It tops the Top500 list of June 2016 with a 93.01 petaflops Linpack performance [1]. Designed for extremely large-scale computation and power efficiency, processors on Sunway TaihuLight employ a unique heterogeneous many-core architecture and memory hierarchy. With its extremely large size, the machine provides both opportunities and challenges for implementing high-performance irregular algorithms such as BFS. We propose several techniques, including pipelined module mapping, contention-free data shuffling, and group-based message batching, to address the challenges of efficiently utilizing the features of this large-scale heterogeneous machine. We ultimately achieved 23755.7 giga-traversed edges per second (GTEPS), the best result among heterogeneous machines and the second overall on the Graph500 June 2016 list [2].
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.53
Citations: 38
A Parallel FastTrack Data Race Detector on Multi-core Systems
Y. Song, Yann-Hang Lee
Detecting data races in multithreaded programs is critical to ensure the correctness of the programs. To discover data races precisely without false alarms, dynamic detection approaches are often applied. However, the overhead of the existing dynamic detection approaches, even with recent innovations, is still substantially high. In this paper, we present a simple but efficient approach to parallelize data race detection on multicore SMP (Symmetric Multiprocessing) machines. In our approach, data access information needed for dynamic detection is collected at application threads and passed to detection threads. The access information is distributed in such a way that the operation performed by each detection thread is independent of that of the other detection threads. As a consequence, the overhead caused by locking operations in data race detection can be alleviated, and multiple cores can be fully utilized to speed up and scale up the detection. Furthermore, each detection thread deals with only its own assigned memory access region rather than the whole address space. The executions of detection threads can exploit the spatial locality of accesses, leading to improved cache performance. We have applied our parallel approach to the FastTrack algorithm and demonstrated the validity of our approach on an Intel Xeon machine. Our experimental results show that the parallel FastTrack detector, on average, runs 2.2 times faster than the original FastTrack detector on the 8-core machine.
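The region-based distribution the abstract describes can be sketched as a pure function from address to detector: because the same location always maps to the same detection thread, detectors share no per-address state and need no locks between them. The queue granularity, region size, and event tuple below are assumptions of this sketch, not the paper's implementation.

```python
import queue

NUM_DETECTORS = 4      # assumed detection-thread pool size
REGION_BITS = 12       # assumed 4 KiB memory regions

detector_queues = [queue.Queue() for _ in range(NUM_DETECTORS)]

def owner(addr: int) -> int:
    """Map an address to its detection thread. The mapping depends only
    on the address region, so all accesses to one location reach the
    same detector: the independence property the approach relies on."""
    return (addr >> REGION_BITS) % NUM_DETECTORS

def record_access(tid: int, addr: int, is_write: bool, epoch: int) -> None:
    """Called from application threads: enqueue the access and return.
    The FastTrack state update runs later on the owning detector thread,
    which scans only its own assigned regions."""
    detector_queues[owner(addr)].put((tid, addr, is_write, epoch))
```

Each detector thread then drains its own queue in a loop, updating FastTrack's per-location read/write metadata without synchronizing with the other detectors.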
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.87
Citations: 7
Significantly Improving Lossy Compression for Scientific Data Sets Based on Multidimensional Prediction and Error-Controlled Quantization
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.115
Dingwen Tao, S. Di, Zizhong Chen, F. Cappello
Today's HPC applications are producing extremely large amounts of data, such that data storage and analysis are becoming more challenging for scientific research. In this work, we design a new error-controlled lossy compression algorithm for large-scale scientific data. Our key contribution is significantly improving the prediction hitting rate (or prediction accuracy) for each data point based on its nearby data values along multiple dimensions. We derive a series of multilayer prediction formulas and their unified formula in the context of data compression. One serious challenge is that the data prediction has to be performed based on the preceding decompressed values during the compression in order to guarantee the error bounds, which may degrade the prediction accuracy in turn. We explore the best layer for the prediction by considering the impact of compression errors on the prediction accuracy. Moreover, we propose an adaptive error-controlled quantization encoder, which can further improve the prediction hitting rate considerably. The data size can be reduced significantly after performing the variable-length encoding because of the uneven distribution produced by our quantization encoder. We evaluate the new compressor on production scientific data sets and compare it with many other state-of-the-art compressors: GZIP, FPZIP, ZFP, SZ-1.1, and ISABELA. Experiments show that our compressor is the best in class, especially with regard to compression factors (or bit-rates) and compression errors (including RMSE, NRMSE, and PSNR). Our solution is better than the second-best solution by more than a 2x increase in the compression factor and 3.8x reduction in the normalized root mean squared error on average, with reasonable error bounds and user-desired bit-rates.
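The core loop the abstract describes, predicting from *previously decompressed* values so compressor and decompressor stay in sync, then quantizing the prediction error within the bound, can be sketched in 1-D. This is a single-layer, last-value stand-in for the paper's multilayer, multidimensional predictor, and it omits the variable-length encoding stage.

```python
import numpy as np

def compress(data, err_bound):
    """Error-bounded predictive quantization: each value is predicted
    from the last decompressed value, and the prediction error is
    quantized into an integer code with quantum 2 * err_bound, so the
    reconstruction error never exceeds err_bound."""
    codes = np.empty(len(data), dtype=np.int64)
    prev = 0.0
    for i, x in enumerate(data):
        code = round((x - prev) / (2 * err_bound))
        codes[i] = code
        prev += code * 2 * err_bound   # exactly what the decompressor sees
    return codes                        # entropy-code these in practice

def decompress(codes, err_bound):
    out = np.empty(len(codes))
    prev = 0.0
    for i, c in enumerate(codes):
        prev += c * 2 * err_bound
        out[i] = prev
    return out
```

Smooth data produces long runs of small codes (mostly zeros), which is what makes the subsequent variable-length encoding effective.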
Citations: 199
The SEPO Model of Computation to Enable Larger-Than-Memory Hash Tables for GPU-Accelerated Big Data Analytics
Pub Date : 2017-05-01 DOI: 10.1109/IPDPS.2017.122
Reza Mokhtari, M. Stumm
The massive parallelism and high memory bandwidth of GPUs are particularly well matched with the exigencies of Big Data analytics applications, for which many independent computations and high data throughput are prevalent. These applications often produce (intermediary or final) results in the form of key-value (KV) pairs, and hash tables are particularly well-suited for storing these KV pairs in memory. How such hash tables are implemented on GPUs, however, has a large impact on performance. Unfortunately, all hash table solutions designed for GPUs to date have limitations that prevent acceleration for Big Data analytics applications. In this paper, we present the design and implementation of a GPU-based hash table for efficiently storing the KV pairs of Big Data analytics applications. The hash table is able to grow beyond the size of available GPU memory without excessive performance penalties. Central to our hash table design is the SEPO model of computation, where the processing of individual tasks is selectively postponed when processing is expected to be inefficient. A performance evaluation on seven GPU-based Big Data analytics applications, each processing several gigabytes of input data, shows that our hash table allows the applications to achieve, on average, a speedup of 3.5 over their CPU-based multi-threaded implementations. This gain is realized despite having hash tables that grow up to four times larger than the size of available GPU memory.
Citations: 0
Generating Performance Models for Irregular Applications
Ryan D. Friese, Nathan R. Tallent, Abhinav Vishnu, D. Kerbyson, A. Hoisie
Many applications have irregular behavior — e.g., input-dependent solvers, irregular memory accesses, or unbiased branches — that cannot be captured using today's automated performance modeling techniques. We describe new hierarchical critical path analyses for the Palm model generation tool. To obtain a good tradeoff between model accuracy, generality, and generation cost, we combine static and dynamic analysis. To create a model's outer structure, we capture tasks along representative MPI critical paths. We create a histogram of critical tasks with parameterized task arguments and instance counts. To model each task, we identify hot instruction-level paths and model each path based on data flow, data locality, and microarchitectural constraints. We describe application models that generate accurate predictions for strong scaling when varying CPU speed, cache and memory speed, microarchitecture, and (with supervision) input data class. Our models' errors are usually below 8% and always below 13%.
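The modeling recipe above — a histogram of critical tasks with parameterized arguments and instance counts, each task given a cost model — can be sketched as a toy analytical predictor. The `predict_runtime` function, its serialized compute-plus-memory cost formula, and the sample histogram are illustrative assumptions, not the Palm tool's actual model:

```python
def predict_runtime(task_histogram, cpu_speed_ghz, mem_bw_gbs):
    """Predict total time by summing per-task costs over all instances.

    task_histogram: list of dicts with 'count' (instances), 'flops'
    (floating-point ops per instance), and 'bytes' (memory traffic
    per instance).
    """
    total = 0.0
    for task in task_histogram:
        compute_s = task["flops"] / (cpu_speed_ghz * 1e9)  # seconds of compute
        memory_s = task["bytes"] / (mem_bw_gbs * 1e9)      # seconds of memory traffic
        # Simple non-overlapping cost model: compute and memory serialize.
        total += task["count"] * (compute_s + memory_s)
    return total

# Hypothetical histogram for two critical tasks of an irregular application.
histogram = [
    {"count": 1000, "flops": 2e6, "bytes": 4e5},  # e.g. a solver kernel
    {"count": 200,  "flops": 5e5, "bytes": 8e6},  # e.g. an irregular gather
]

base = predict_runtime(histogram, cpu_speed_ghz=2.0, mem_bw_gbs=50.0)
faster_cpu = predict_runtime(histogram, cpu_speed_ghz=4.0, mem_bw_gbs=50.0)
```

Varying `cpu_speed_ghz` or `mem_bw_gbs` mimics the paper's what-if predictions across machine parameters; a compute-bound histogram responds mostly to the former, a memory-bound one to the latter.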
Citations: 13
Fault-Tolerant Robot Gathering Problems on Graphs With Arbitrary Appearing Times
S. Rajsbaum, Armando Castañeda, D. Flores-Peñaloza, Manuel Alcántara
The LOOK-COMPUTE-MOVE model for a set of autonomous robots has been thoroughly studied for over two decades. Each robot repeatedly LOOKS at its surroundings and obtains a snapshot containing the positions of all robots; based on this information, the robot COMPUTES a destination and then MOVES to it. Previous work assumed all robots are present at the beginning of the computation. What would be the effect of robots appearing asynchronously? This paper studies this question for problems of bringing the robots close together, and exposes an intimate connection with combinatorial topology. A central problem in the mobile robots area is the gathering problem. In its discrete version, the robots start at vertices in some graph G known to them, move towards the same vertex and stop. The paper shows that if robots are asynchronous and may crash, then gathering is impossible for any graph G with at least two vertices, even if robots can have unique IDs, remember the past, know the same names for the vertices of G and use an arbitrary number of lights to communicate with each other. Next, the paper studies two weaker variants of gathering: edge gathering and 1-gathering. For both problems we present possibility and impossibility results. The solvability of edge gathering is fully characterized: it is solvable for three or more robots on a given graph if and only if the graph is acyclic. Finally, general robot tasks in a graph are considered. A combinatorial topology characterization for the solvable tasks is presented, by a reduction of the asynchronous fault-tolerant LOOK-COMPUTE-MOVE model to a wait-free read/write shared-memory computing model, bringing together two areas that have been independently studied for a long time into a common theoretical foundation.
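The role of acyclicity can be illustrated with a minimal sketch of gathering on a tree under a synchronous, fault-free scheduler: every robot moves one edge toward a common root each round, which trivially converges because a tree has unique paths. The function name and the synchronous model are assumptions for illustration only; the paper's asynchronous, crash-prone setting is far weaker and is where the impossibility results bite:

```python
from collections import deque

def gather_on_tree(adj, positions, root):
    """Toy synchronous gathering on an acyclic graph (tree).

    adj: dict mapping each vertex to a list of neighbors.
    positions: list of robot vertices.
    Each round, every robot not at `root` moves one edge toward it.
    Returns the final positions and the number of rounds taken.
    """
    # BFS from the root to record each vertex's next hop toward the root.
    parent = {root: root}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)

    rounds = 0
    while any(p != root for p in positions):
        positions = [parent[p] for p in positions]  # one synchronous step
        rounds += 1
    return positions, rounds
```

On a path graph 0–1–2–3 with robots at vertices 0, 3, and 2 and root 1, all robots meet at vertex 1 after two rounds — the eccentricity of the root among occupied vertices.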
Citations: 8
Journal: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)