
Proceedings of the 48th International Conference on Parallel Processing: Latest Publications

A Practical, Scalable, Relaxed Priority Queue
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337911
Tingzhe Zhou, Maged M. Michael, Michael F. Spear
Priority queues are a fundamental data structure, and in highly concurrent software, scalable priority queues are an important building block. However, they have a fundamental bottleneck when extracting elements, because of the strict requirement that each extract() returns the highest priority element. In many workloads, this requirement can be relaxed, improving scalability. We introduce ZMSQ, a scalable relaxed priority queue. It is the first relaxed priority queue that supports each of the following important practical features: (i) guaranteed success of extraction when the queue is nonempty, (ii) blocking of idle consumers, (iii) memory-safety in non-garbage-collected environments, and (iv) relaxation accuracy that does not degrade as the thread count increases. In addition, our experiments show that ZMSQ is competitive with state-of-the-art prior algorithms, often significantly outperforming them.
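The relaxation the abstract describes can be pictured with a toy multi-queue design (a generic illustration, not ZMSQ's algorithm): elements are spread across several sub-heaps, and extract() pops the smaller of two randomly sampled tops, so it returns an element near, but not necessarily at, the global minimum, while still guaranteeing success whenever the queue is nonempty.

```python
import heapq
import random

class RelaxedPriorityQueue:
    """Toy multi-queue relaxation sketch (illustrative only, not ZMSQ).

    insert() pushes into a random sub-heap; extract() samples two
    sub-heaps and pops the smaller top, so the result is near the
    global minimum but not guaranteed to be it.
    """

    def __init__(self, num_queues=4, seed=None):
        self.queues = [[] for _ in range(num_queues)]
        self.rng = random.Random(seed)

    def insert(self, priority, item):
        heapq.heappush(self.rng.choice(self.queues), (priority, item))

    def extract(self):
        # Guaranteed to succeed when nonempty: fall back to scanning all
        # sub-heaps if the sampled ones happen to be empty.
        candidates = [q for q in self.rng.sample(self.queues,
                                                 min(2, len(self.queues))) if q]
        if not candidates:
            candidates = [q for q in self.queues if q]
            if not candidates:
                raise IndexError("extract from empty queue")
        best = min(candidates, key=lambda q: q[0])  # smaller sampled top
        return heapq.heappop(best)
```

With a single consumer this degenerates to an ordinary priority queue with bounded rank error; the point of the sketch is only that relaxing which top gets popped removes the single contended extraction point.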
Citations: 7
AVR: Reducing Memory Traffic with Approximate Value Reconstruction
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337824
Albin Eldstål-Damlin, P. Trancoso, I. Sourdis
This paper describes Approximate Value Reconstruction (AVR), an architecture for approximate memory compression. AVR reduces the memory traffic of applications that tolerate approximations in their dataset, thereby utilizing the available off-chip bandwidth more efficiently and significantly improving system performance and energy efficiency. AVR compresses memory blocks using low-latency downsampling that exploits similarities between neighboring values and achieves aggressive compression ratios, up to 16:1 in our implementation. The proposed AVR architecture supports our compression scheme, maximizing its effect and minimizing its overheads, by (i) co-locating compressed and uncompressed data in the Last Level Cache (LLC), (ii) efficiently handling LLC evictions, (iii) keeping track of badly compressed memory blocks, and (iv) avoiding LLC pollution with unwanted decompressed data. For applications that tolerate aggressive approximation in large fractions of their data, AVR reduces memory traffic by up to 70%, execution time by up to 55%, and energy costs by up to 20%, while introducing up to 1.2% error to the application output.
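The downsampling idea can be sketched in software (a simplified stand-in for AVR's hardware scheme; the 4:1 ratio, linear reconstruction, and function names are assumptions for illustration): keep every fourth value of a block and reconstruct the rest by interpolation, trading a bounded error for a smaller memory footprint.

```python
def compress(values, ratio=4):
    """Downsample a block: keep every `ratio`-th value as the compressed form."""
    return values[::ratio]

def decompress(samples, ratio, length):
    """Approximately reconstruct the block by linear interpolation
    between the retained samples (the tail clamps to the last sample)."""
    out = []
    for i in range(length):
        lo = i // ratio
        hi = min(lo + 1, len(samples) - 1)
        frac = (i % ratio) / ratio
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

For a block whose neighboring values are similar (here, a linear ramp), reconstruction is exact except near the block tail, mirroring how value similarity makes aggressive ratios tolerable.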
Citations: 3
Accelerating Long Read Alignment on Three Processors
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337918
Zonghao Feng, Shuang Qiu, Lipeng Wang, Qiong Luo
Sequence alignment is a fundamental task in bioinformatics, because many downstream applications rely on it. The recent emergence of third-generation sequencing technology requires new sequence alignment algorithms that handle longer read lengths as well as more sequencing errors. Furthermore, the rapidly increasing volume of sequence data calls for efficient analysis solutions. To address this need, we propose to utilize commodity parallel processors to perform long read alignment. Specifically, we propose manymap, an acceleration of the leading CPU-based long read aligner minimap2 on the CPU, the GPU, and the Intel Xeon Phi processor. We eliminate intra-loop data dependencies in the base-level alignment step of the original minimap2 by redesigning the memory layouts of its dynamic programming (DP) matrices. This change enables effective vectorization of the most time-consuming procedure in alignment. Additionally, we apply architecture-aware optimizations, such as utilizing the high-bandwidth memory on Xeon Phi and concurrent kernel execution on the GPU. We evaluate manymap against the extended minimap2 on a Xeon Gold 5115 CPU, a Tesla V100 GPU, and a Xeon Phi 7210 processor. Our results show that manymap outperforms minimap2 by up to 2.3 times on overall execution time and 4.5 times on the base-level alignment step.
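The abstract does not spell out the redesigned DP layout, but a standard way to remove intra-loop dependencies in alignment DP is an anti-diagonal sweep: every cell on one anti-diagonal depends only on the two previous diagonals, so all cells of a diagonal are mutually independent and could be filled by one vector instruction. A sketch using edit distance as a stand-in for the alignment recurrence (not minimap2's actual DP):

```python
def edit_distance_antidiagonal(a, b):
    """Edit distance computed one anti-diagonal at a time.

    Cell (i, j) lies on diagonal d = i + j and reads only cells on
    diagonals d-1 and d-2, so the inner loop over i carries no
    dependency and is vectorizable on real hardware.
    """
    m, n = len(a), len(b)
    prev2, prev1 = None, None
    for d in range(m + n + 1):
        cur = {}
        for i in range(max(0, d - n), min(d, m) + 1):
            j = d - i
            if i == 0:
                cur[i] = j                 # first row: j insertions
            elif j == 0:
                cur[i] = i                 # first column: i deletions
            else:
                sub = prev2[i - 1] + (a[i - 1] != b[j - 1])
                cur[i] = min(prev1[i - 1] + 1,   # deletion
                             prev1[i] + 1,       # insertion
                             sub)                # match / substitution
        prev2, prev1 = prev1, cur
    return prev1[m]
```

The row-major formulation of the same recurrence makes cell (i, j) wait on (i, j-1) in the same inner loop; the diagonal layout is precisely what breaks that chain.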
Citations: 21
BPP: A Realtime Block Access Pattern Mining Scheme for I/O Prediction
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337904
Chunjie Zhu, F. Wang, Binbing Hou
Block access patterns refer to regularities among accessed blocks and can be used to effectively enhance the intelligence of block storage systems. However, existing algorithms fail to uncover block access patterns efficiently: they either incur high time and space overhead or focus only on the simplest patterns, such as sequential ones. In this paper, we propose a realtime block access pattern mining scheme, called BPP, to mine block access patterns at run time with low time and space overhead for making efficient I/O predictions. To reduce this overhead, BPP classifies block access patterns into simple and compound ones based on the mining costs of different patterns, and differentiates the mining policies for the two classes. BPP also adopts a novel garbage cleaning policy, specially designed around the observed features of the obtained patterns, to accurately detect valueless patterns and remove them as early as possible. With this policy, BPP further reduces the space overhead of managing and utilizing the obtained patterns. To demonstrate the effect of BPP, we conduct a series of experiments with real-world workloads. The experimental results show that BPP significantly outperforms state-of-the-art I/O prediction schemes.
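As a toy illustration of mining the simplest kind of "simple" pattern the abstract mentions, the sketch below detects a fixed-stride sequential run in a block trace and predicts the next block (BPP's actual policies, compound patterns, and garbage cleaning are not modeled; the function name is an assumption):

```python
def predict_next_block(trace, window=3):
    """Predict the next block ID from a simple sequential pattern.

    If the last `window` accesses form a fixed-stride run (the simplest
    'simple' pattern), predict that the run continues; otherwise return
    None. Illustrative sketch only, not BPP's mining scheme.
    """
    if len(trace) < window:
        return None
    tail = trace[-window:]
    strides = {b - a for a, b in zip(tail, tail[1:])}
    if len(strides) == 1:          # a single repeated stride => a run
        return tail[-1] + strides.pop()
    return None
```

A prediction of None is where a realtime miner would fall back to searching for costlier compound patterns, which is exactly the cost/benefit split BPP formalizes.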
Citations: 2
QLEC
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337926
Ke Li, Haowei Huang, Xiaofeng Gao, Fan Wu, Guihai Chen
With the emergence of the Internet of Things (IoT), many battery-operated sensors are deployed in different applications to collect, process, and analyze useful information. In these applications, sensors are often grouped into different clusters to support higher scalability and better data aggregation. Clustering based on the energy distribution among nodes can reduce energy consumption and prolong the network lifespan. In this paper, we propose a machine-learning-based energy-efficient clustering algorithm named QLEC to select cluster heads in high-dimensional space and help non-cluster-head nodes route packets. QLEC first selects cluster heads based on their residual energy through successive rounds. In addition, we prove the optimal cluster number in a high-dimensional wireless network and adopt it in the QLEC algorithm. Furthermore, a Q-learning method is utilized to maximize the residual energy of the network while routing packets from sensors to the base station (BS). The energy-efficient clustering problem in high-dimensional space can be formulated as an NP-Complete problem, and QLEC is proved to solve it in running time O(kX), where k is the cluster number and X is the number of updates Q-learning needs to converge. Extensive simulations and experiments based on a large-scale dataset show that the proposed scheme outperforms a newly proposed FCM-based algorithm and k-means clustering in terms of network lifespan, packet delivery rate, and transmission latency. To the best of our knowledge, this is the first work adopting a Q-learning method for clustering problems in high-dimensional space.
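The Q-learning component relies on the standard tabular update rule, Q(s,a) ← Q(s,a) + α(r + γ·max_a' Q(s',a') − Q(s,a)). The sketch below shows that rule on a toy routing example where the reward is assumed (for illustration only; the paper's state and reward design are not given in the abstract) to reflect the residual energy of the chosen next hop:

```python
from collections import defaultdict

def make_q():
    """Q-table: state -> {action -> value}; unseen entries default to 0."""
    return defaultdict(dict)

def q_update(Q, s, a, reward, s_next, alpha=0.5, gamma=0.9):
    """One standard tabular Q-learning update; returns the new Q(s, a)."""
    best_next = max(Q[s_next].values(), default=0.0)
    old = Q[s].get(a, 0.0)
    Q[s][a] = old + alpha * (reward + gamma * best_next - old)
    return Q[s][a]
```

Repeated updates propagate the value of energy-rich routes backward from the base station, which is how a sensor learns to prefer next hops that preserve network-wide residual energy.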
Citations: 7
Approximate Code: A Cost-Effective Erasure Coding Framework for Tiered Video Storage in Cloud Systems
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337869
Huayi Jin, Chentao Wu, Xin Xie, Jie Li, M. Guo, Hao Lin, Jianfeng Zhang
Nowadays, massive video data generated by applications such as autonomous driving, news media, and security monitoring are stored in cloud storage systems. Meanwhile, erasure coding is a popular technique in cloud storage for providing both high reliability and low monetary cost, where triple disk failure tolerant arrays (3DFTs) are a typical choice. Therefore, how to minimize the storage cost of video data in 3DFTs is a challenge for cloud storage systems. Although there are several solutions, such as approximate storage techniques, they cannot guarantee low storage cost and high data reliability concurrently. To address this challenge, in this paper we propose Approximate Code, an erasure coding framework for tiered video storage in cloud systems. The key idea of Approximate Code is to distinguish important from unimportant data with different capabilities of fault tolerance. On one hand, for important data, Approximate Code provides triple parities to ensure high reliability. On the other hand, single/double parities are applied to unimportant data, which saves storage cost and accelerates the recovery process. To demonstrate the effectiveness of Approximate Code, we conduct several experiments on Hadoop systems. The results show that, compared to traditional 3DFTs using various erasure codes such as RS, LRC, STAR, and TIP-Code, Approximate Code reduces the number of parities by up to 55%, saves storage cost by up to 20.8%, and increases recovery speed by up to 4.7X when two nodes fail.
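The tiering idea can be illustrated with plain XOR parity (a simplified stand-in; real 3DFTs use codes such as RS or STAR, and the function names here are assumptions): important blocks receive three parities, unimportant ones a single parity, and a lost data block is recovered by XOR-ing the survivors with one parity.

```python
def xor_parity(blocks):
    """XOR equal-length byte strings into a single parity block."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def encode(blocks, important):
    """Tiered encoding sketch: 3 parity copies for important data,
    1 for unimportant data (simplified; not the paper's actual codes)."""
    n_parity = 3 if important else 1
    return blocks + [xor_parity(blocks)] * n_parity
```

With XOR, any single missing data block equals the XOR of the remaining data blocks and one parity, so the unimportant tier still survives one failure while paying one third of the parity cost.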
Citations: 13
BCL
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337912
Benjamin Brock, A. Buluç, K. Yelick
One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
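A data structure built purely from one-sided primitives can be sketched as follows (the `remote_get`/`remote_put` names and the local-array stand-in for remote memory are assumptions for illustration, not BCL's actual API): an open-addressing hash table whose every probe is a one-sided read and whose every insert is a one-sided write, with no cooperation needed from the process owning the memory.

```python
class FakeGlobalMemory:
    """Stands in for a remote memory segment reachable via one-sided ops."""
    def __init__(self, nslots):
        self.slots = [None] * nslots

    def remote_get(self, index):          # one-sided read
        return self.slots[index]

    def remote_put(self, index, value):   # one-sided write
        self.slots[index] = value

class OneSidedHashTable:
    """Open-addressing hash table using only get/put on 'remote' slots."""
    def __init__(self, mem):
        self.mem = mem
        self.n = len(mem.slots)

    def insert(self, key, value):
        h = hash(key) % self.n
        for probe in range(self.n):       # linear probing
            idx = (h + probe) % self.n
            slot = self.mem.remote_get(idx)
            if slot is None or slot[0] == key:
                self.mem.remote_put(idx, (key, value))
                return
        raise RuntimeError("table full")

    def lookup(self, key):
        h = hash(key) % self.n
        for probe in range(self.n):
            slot = self.mem.remote_get((h + probe) % self.n)
            if slot is None:
                return None
            if slot[0] == key:
                return slot[1]
        return None
```

A real one-sided implementation must additionally make the probe-then-put step atomic (e.g. with a remote compare-and-swap), which is part of what a library like BCL encapsulates.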
Citations: 1
swATOP
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337883
Wei Gao, Jiarui Fang, Wenlai Zhao, Jinzhe Yang, Long Wang, L. Gan, H. Fu, Guangwen Yang
Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves huge engineering effort, due to the variety of DL operator implementations and the complex programming skills required. Targeting the innovative many-core processor SW26010 adopted by the third-fastest supercomputer, Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic-intensive DL operators are expressed in an auto-tuning-friendly form based on tensorized primitives. By describing the algorithm of a DL operator in our domain-specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent from hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives that make sufficient use of the underlying hardware features. The hardware-agnostic optimization contains a scheduler, an intermediate representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design-space exploration, apply a set of programming techniques, discover a near-optimal solution, and generate the executable code. Our experiments show that swATOP brings significant performance improvement on DL operators in over 88% of cases, compared with the best handcrafted optimization. Compared to a black-box autotuner, tuning and code generation time can be reduced from days to minutes using swATOP.
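The design-space exploration step can be sketched generically (the cost model, tile sizes, and local-store limit below are hypothetical; swATOP's real tuner measures actual kernels on SW26010): enumerate candidate configurations, score each with a cost function, and keep the cheapest.

```python
def autotune(candidates, cost_fn):
    """Exhaustively search a small design space, keeping the cheapest config."""
    best_cfg, best_cost = None, float("inf")
    for cfg in candidates:
        c = cost_fn(cfg)
        if c < best_cost:
            best_cfg, best_cost = cfg, c
    return best_cfg, best_cost

def toy_cost(tile, size=96, local_store=64):
    """Hypothetical cost model: reject tiles that overflow a 64-element
    local store, and penalize tiles that do not divide the problem evenly."""
    if tile > local_store:
        return float("inf")
    return size / tile + size % tile
```

Real tuners replace `toy_cost` with measured kernel time and replace exhaustive search with guided search once the space grows beyond a few thousand points.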
Citations: 1
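The measure-and-select loop behind an auto-tuner like the one the swATOP abstract describes can be sketched in a few lines. Everything below — the blocked matmul kernel, the tile-size candidates, the function names — is a generic single-machine stand-in invented for illustration, not swATOP's DSL, its tensorized primitives, or SW26010 code:

```python
import timeit

def matmul_blocked(a, b, tile):
    """Blocked matrix multiply over nested lists; `tile` is the single
    schedule knob this toy tuner searches over."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c

def autotune(a, b, candidates=(4, 8, 16)):
    """Exhaustive design-space exploration: time every candidate schedule
    and keep the fastest. Real tuners prune the space with cost models
    and generate code, but the measure-and-select loop has this shape."""
    timings = {}
    for tile in candidates:
        timings[tile] = timeit.timeit(lambda: matmul_blocked(a, b, tile),
                                      number=2)
    return min(timings, key=timings.get), timings
```

A real tuner explores a far larger space (loop orders, vector widths, buffering schemes) and, as the abstract notes, separating hardware-agnostic search from hardware-dependent primitives is what keeps tuning time down.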
A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337837
David Troendle, T. Ta, B. Jang
The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphics Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue operations never suffer from retry overhead, because the atomic operation does not fail and the queue-empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only implementations based on traditional queues but also the state-of-the-art BFS implementations found in the literature, by a minimum of 1.26× and a maximum of 36.23×. We also observed that the scalability of our proposed queue is within 10% of the ideal linear speedup up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).
{"title":"A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs","authors":"David Troendle, T. Ta, B. Jang","doi":"10.1145/3337821.3337837","DOIUrl":"https://doi.org/10.1145/3337821.3337837","abstract":"The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphic Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue queue operations never suffer from retry overhead because the atomic operation does not fail and the queue empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only ones based on traditional queues but also the state-of-the-art BFS implementations found in the literature by a minimum of 1.26× and maximum of 36.23×. We also observed the scalability of our proposed queue is within 10% of the ideal linear speedup for up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114216579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
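Property (2) above — many queue entries for the cost of one atomic operation — comes down to letting a group's proxy bump a shared index once by the whole batch size. The sketch below is a CPU-side analogy under stated assumptions: a lock stands in for the GPU's hardware atomic (e.g. an atomicAdd on the head index), the class and method names are invented for illustration, and the wraparound and slot-reuse concerns of a real ring buffer are ignored:

```python
import threading

class BatchQueue:
    """Sketch of batched task claiming: one fetch-and-add style update
    reserves a contiguous range of slots for a whole thread group, so a
    k-entry dequeue costs the same as a 1-entry dequeue."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next unclaimed slot (consumer side)
        self.tail = 0   # next free slot (producer side)
        self._atomic = threading.Lock()  # emulates a single hardware atomic

    def enqueue_batch(self, items):
        # One "atomic" bump of tail claims slots for the whole batch.
        with self._atomic:
            start, self.tail = self.tail, self.tail + len(items)
        for i, item in enumerate(items):
            self.slots[start + i] = item

    def dequeue_batch(self, group_size):
        # The group's proxy claims up to group_size entries in one shot;
        # clamping at tail means an empty queue yields [] with no retry,
        # echoing the no-retry property the abstract describes.
        with self._atomic:
            avail = min(group_size, self.tail - self.head)
            start, self.head = self.head, self.head + avail
        return self.slots[start:start + avail]
```

The design point this illustrates is that contention scales with the number of atomic operations, not the number of entries moved, which is why electing one proxy per SIMT thread group pays off.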
Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA
Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337862
Chengxin Guo, Hong Chen, Feng Zhang, Cuiping Li
In data management systems, query processing on GPUs and on distributed clusters has proven to be an effective route to high efficiency. However, the PCIe data transfer overhead between CPUs and GPUs, and the communication cost between nodes in a distributed system, are usually bottlenecks that limit system performance. Recently, GPUDirect RDMA has been developed and has received a lot of attention. It combines the features of the RDMA and GPUDirect technologies, which provides new opportunities for optimizing query processing. In this paper, we revisit the join algorithm, one of the most important operators in query processing, with GPUDirect RDMA. Specifically, we explore the performance of the hash join and the sort-merge join with GPUDirect RDMA. We present a new design that uses GPUDirect RDMA to improve data communication in distributed join algorithms on multi-GPU clusters. We propose a series of techniques, including multi-layer data partitioning and adaptive selection of data communication paths across the available transmission channels. Experiments show that the proposed distributed join algorithms using GPUDirect RDMA achieve up to a 1.83x speedup over state-of-the-art distributed join algorithms. To the best of our knowledge, this is the first work on distributed GPU join algorithms. We believe the insights and implications of this study will shed light on future research using GPUDirect RDMA.
{"title":"Distributed Join Algorithms on Multi-GPU Clusters with GPUDirect RDMA","authors":"Chengxin Guo, Hong Chen, Feng Zhang, Cuiping Li","doi":"10.1145/3337821.3337862","DOIUrl":"https://doi.org/10.1145/3337821.3337862","abstract":"In data management systems, query processing on GPUs or distributed clusters have proven to be an effective method for high efficiency. However, the high PCIe data transfer overhead between CPUs and GPUs, and the communication cost between nodes in distributed systems are usually bottleneck for improving system performance. Recently, GPUDirect RDMA has been developed and has received a lot of attention. It contains the features of the RDMA and GPUDirect technologies, which provides new opportunities for optimizing query processing. In this paper, we revisit the join algorithm, one of the most important operators in query processing, with GPUDirect RDMA. Specifically, we explore the performance of the hash join and sort merge join with GPUDirect RDMA. We present a new design using GPUDirect RDMA to improve the data communication in distributed join algorithms on multi-GPU clusters. We propose a series of techniques, including multi-layer data partitioning, and adaptive data communication path selection for various transmission channels. Experiments show that the proposed distributed join algorithms using GPUDirect RDMA achieve up to 1.83x performance speedup compared to the state-of-the-art distributed join algorithms. To the best of our knowledge, this is the first work for distributed GPU join algorithms. We believe that the insights and implications in this study shall shed lights on future researches using GPUDirect RDMA.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123265021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
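The multi-layer partitioning idea in the abstract can be illustrated with a single-process radix-style hash join: split both relations by a hash of the join key, then join each partition pair independently — in a distributed setting, each pair is exactly what would be shipped to its owning node (e.g. over GPUDirect RDMA). The function and parameter names below are illustrative, not the paper's implementation:

```python
from collections import defaultdict

def partitioned_hash_join(r, s, key_r=0, key_s=0, partitions=4):
    """Two-phase join sketch: (1) partition both relations by a hash of
    the join key — the step a distributed system overlaps with shipping
    each partition pair to a node — then (2) run an independent
    build/probe hash join per partition."""
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for row in r:                                   # phase 1: partition
        r_parts[hash(row[key_r]) % partitions].append(row)
    for row in s:
        s_parts[hash(row[key_s]) % partitions].append(row)

    out = []
    for p in range(partitions):                     # phase 2: local joins
        build = defaultdict(list)
        for row in r_parts[p]:                      # build side
            build[row[key_r]].append(row)
        for row in s_parts[p]:                      # probe side
            for match in build[row[key_s]]:
                out.append(match + row)
    return out
```

Because rows with equal keys always hash to the same partition, the per-partition joins never need to see each other's data — which is what makes the communication pattern a one-shot shuffle rather than repeated exchanges.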