Accelerated Work Stealing
D. B. Larkins, John Snyder, James Dinan
DOI: 10.1145/3337821.3337878

Realizing scalable performance with irregular parallel applications is challenging on large-scale distributed-memory clusters. These applications typically require continuous, dynamic load balancing to maintain efficiency. Work stealing is a common approach to dynamic distributed load balancing; however, its use in conjunction with advanced network offload capabilities is not well understood. We present a distributed work-stealing system that is amenable to acceleration using the Portals 4 network programming interface. Our work shows that the structures provided by Portals to handle two-sided communication are general-purpose and can accelerate work stealing. We demonstrate the effectiveness of this approach using known benchmarks from computational chemistry and unbalanced tree search. Results show that Portals-accelerated work stealing can greatly reduce communication overhead, task acquisition time, and termination detection time.
The Case for Water-Immersion Computer Boards
M. Koibuchi, I. Fujiwara, Naoya Niwa, Tomohiro Totoki, S. Hirasawa
DOI: 10.1145/3337821.3337830

A key concern for a high-power processor is heat dissipation, which limits the power, and thus the operating frequency, of a chip so as not to exceed some temperature threshold. In particular, 3-D chip integration will further increase power density, requiring more efficient cooling technology. While air, Fluorinert, and mineral oil have traditionally been used as coolants, in this study we propose to directly use tap or natural water, owing to its superior thermal conductivity. We have developed "in-water computer" prototypes that rely on a parylene-film insulation coating. Our prototypes support direct water-immersion cooling by taking in and draining natural water, whereas existing cooling requires a secondary coolant (e.g., outside air in cold climates) to cool the primary coolant that contacts the chips. Our prototypes reduce the temperature of commodity processor chips by 20 degrees. Our analysis shows that in-water cooling increases the acceptable power density of chips, thus enabling higher operating frequencies. Through full-system simulation, our results show that water-immersion chip multiprocessors outperform counterpart water-pipe-cooled and oil-immersion chips by up to 14% and 4.5%, respectively, in terms of execution times of the NAS Parallel Benchmarks.
Fast Recovery Techniques for Erasure-coded Clusters in Non-uniform Traffic Network
Y. Bai, Zihan Xu, Haixia Wang, Dongsheng Wang
DOI: 10.1145/3337821.3337831

Many practical systems adopt erasure codes to ensure reliability and reduce storage overhead. However, erasure codes also suffer from poor recovery performance. In practice, network links, such as those in peer-to-peer and cross-datacenter networks, often have non-uniform bandwidth for various reasons. To reduce recovery time, we propose Parallel Pipeline Tree (PPT) and Parallel Pipeline Cross-Tree (PPCT) to speed up single-node and multiple-node recovery, respectively, in non-uniform traffic network environments. By exploiting the bandwidth gaps among links, PPT constructs a tree path based on bandwidth and pipelines the data in parallel. By sharing the traffic pressure of requesters with helpers, PPCT constructs a tree-like path and pipelines the data in parallel without additional helpers. We also theoretically analyze the effect of PPT and PPCT in uniform network environments. Experiments on geo-distributed Amazon EC2 show that the recovery-time reduction reaches up to 37.2% with PPCT over the traditional technique, and up to 89.2%, 76.4%, and 21.6% with PPT over the traditional technique, Partial-Parallel-Repair, and Repair Pipelining, respectively. PPT and PPCT significantly improve the recovery performance of erasure codes.
Refactoring and Optimizing WRF Model on Sunway TaihuLight
Kai Xu, Zhenya Song, Yuandong Chan, Shida Wang, Xiangxu Meng, Weiguo Liu, Wei Xue
DOI: 10.1145/3337821.3337923

The Weather Research and Forecasting (WRF) Model is one of the most widely used mesoscale numerical weather prediction systems and is designed for both atmospheric research and operational forecasting applications. However, it is extremely time-consuming: a single simulation can take researchers days to weeks as the simulation size scales up and computing demands grow. In this paper, we port and optimize the whole WRF model to the Sunway TaihuLight supercomputer at large scale. For the dynamic core of WRF, we present a domain-specific tool, SWSLL, a directive-based compiler tool for the Sunway many-core architecture that converts stencil computations into optimized parallel code. We also apply a decomposition strategy in SWSLL to improve memory locality and decrease the number of off-chip memory accesses. For the physical parameterizations, we explore thread-level parallelization using OpenACC directives, reorganizing data layouts and loops to achieve high performance. We present the algorithms and implementations and demonstrate the optimization of a real-world, complex atmospheric model on the Sunway TaihuLight supercomputer. Evaluation results show that for the widely used benchmark with a horizontal resolution of 2.5 km, a speedup of 4.7× is achieved for the whole WRF model using the proposed algorithms and optimization strategies. In terms of strong scalability, our implementation scales well to hundreds of thousands of heterogeneous cores on Sunway TaihuLight.
FlowCon
Wenjia Zheng, Michael Tynes, Henry Gorelick, Ying Mao, Long Cheng, Yantian Hou
DOI: 10.1145/3337821.3337868

An increasing number of companies are using data analytics to improve their products, services, and business processes. However, extracting knowledge from massive data sets requires nontrivial computational resources, so most businesses migrate their hardware needs to a remote cluster computing service (e.g., AWS) or to an in-house cluster facility that often runs at its resource capacity. In such scenarios, where jobs compete for available resources, using those resources effectively is essential for high-performance data analytics. Although cluster resource management is a fruitful research area that has made many advances (e.g., YARN, Kubernetes), few projects have investigated optimizations specifically for training multiple machine learning (ML) / deep learning (DL) models. In this work, we introduce FlowCon, a system that monitors the loss functions of ML/DL jobs at runtime and elastically adjusts their resource configurations accordingly. We present a detailed design and implementation of FlowCon and conduct extensive experiments on various DL models. Our experimental results show that FlowCon significantly improves DL job completion time and resource utilization efficiency compared to existing approaches. Specifically, FlowCon reduces completion time by up to 42.06% for a specific job without sacrificing the overall makespan, in the presence of various DL job workloads.
A Network-aware and Partition-based Resource Management Scheme for Data Stream Processing
Yidan Wang, Z. Tari, Xiaoran Huang, Albert Y. Zomaya
DOI: 10.1145/3337821.3337870

With the increasing demand for data-driven decision making, there is an urgent need to process geographically distributed data streams in real time. Existing scheduling and resource management schemes optimize stream processing performance with awareness of resources, quality of service, and network traffic. However, the correlation between network delay and inter-operator communication patterns is not well understood. In this study, we propose a network-aware, partition-based resource management scheme that copes with ever-changing network conditions and data communication in stream processing. The proposed approach applies operator fusion, considering the computational demand of individual operators and the inter-operator communication patterns, and maps the fused operators to clustered hosts with a weighted shortest-processing-time heuristic. Meanwhile, we establish a three-dimensional coordinate system that promptly reflects network conditions, real-time traffic, and resource availability. We evaluated the proposed approach against two benchmarks, and the results demonstrate its efficiency in throughput and resource utilization. We also conducted a case study, implementing a prototype system on top of the proposed approach that uses the stream processing paradigm for pedestrian behavior analysis: the application estimates the walking time for a given path according to real crowd traffic. The promising evaluation results further illustrate the efficiency of the proposed approach.
Massively Parallel ANS Decoding on GPUs
André Weißenberger, B. Schmidt
DOI: 10.1145/3337821.3337888

In recent years, graphics processors have enabled significant advances in the fields of big data and streamed deep learning. To keep pace with rapidly growing amounts of data and to achieve sufficient throughput rates, compression is a key part of many applications, including popular deep learning pipelines. However, as most of the respective APIs rely on CPU-based preprocessing for decoding, data decompression frequently becomes a bottleneck in accelerated compute systems. This establishes the need for efficient GPU-based decompression. Asymmetric numeral systems (ANS) represent a modern approach to entropy coding, combining superior compression ratios with high compression and decompression speeds. Concepts for parallelizing ANS decompression on GPUs have been published recently, but they exhibit only limited scalability in practical applications. In this paper, we present the first massively parallel, arbitrarily scalable approach to ANS decoding on GPUs, based on a novel overflow pattern. Our performance evaluation on three different CUDA-enabled GPUs (V100, TITAN V, GTX 1080) demonstrates speedups of up to 17× over 64 CPU threads, up to 31× over a high-performance SIMD-based solution, and up to 39× over Zstandard's entropy codec. Our implementation is publicly available at https://github.com/weissenberger/multians.
Controlled Asynchronous GVT: Accelerating Parallel Discrete Event Simulation on Many-Core Clusters
Ali Eker, B. Williams, K. Chiu, D. Ponomarev
DOI: 10.1145/3337821.3337927

In this paper, we investigate the performance of Parallel Discrete Event Simulation (PDES) on a cluster of many-core Intel KNL processors. Specifically, we analyze the impact of different Global Virtual Time (GVT) algorithms in this environment and contribute three significant results. First, we show that it is essential to isolate the thread performing MPI communication from the task of processing simulation events; otherwise, the simulation is significantly imbalanced and performs poorly. This applies to both synchronous and asynchronous GVT algorithms. Second, we demonstrate that a synchronous GVT algorithm based on barrier synchronization is the better choice for communication-dominated models, while asynchronous GVT based on Mattern's algorithm performs better in computation-dominated scenarios. Third, we propose the Controlled Asynchronous GVT (CA-GVT) algorithm, which selectively adds synchronization to Mattern-style GVT based on simulation conditions. We demonstrate that CA-GVT outperforms both barrier-based and Mattern's GVT, achieving about an 8% performance improvement on mixed computation-communication models. This is a reasonable improvement for a simple modification to a GVT algorithm.
CostPI
Jiahao Liu, Fang Wang, D. Feng
DOI: 10.1145/3337821.3337879

NVMe SSDs have been widely adopted to provide storage services in cloud platforms, where diverse workloads (latency-sensitive, throughput-oriented, and capacity-oriented) are colocated. To achieve performance isolation, existing solutions partition the shared SSD into multiple isolated regions and assign each workload a separate region. However, these solutions can result in inefficient resource utilization and imbalanced wear; more importantly, they cannot reduce the interference caused by embedded-cache contention. In this paper, we present CostPI, which improves isolation and resource utilization by providing latency-sensitive workloads with dedicated resources (data cache, mapping-table cache, and NAND flash) and providing throughput-oriented and capacity-oriented workloads with shared resources. Specifically, at the NVMe queue level, we present an SLO-aware arbitration mechanism that fetches requests from NVMe queues at different granularities according to workload SLOs. At the embedded-cache level, we use an asymmetric allocation scheme to partition the cache (both data cache and mapping-table cache); for different data-cache partitions, we adopt different cache policies to meet diverse workload requirements while reducing imbalanced wear. At the NAND flash level, we partition the hardware resources at channel granularity to enable the strongest isolation. Our experiments show that, for latency-sensitive workloads, CostPI reduces the average response time by up to 44.2%, the 99th-percentile response time by up to 89.5%, and the 99.9th-percentile response time by up to 88.5%. Meanwhile, CostPI increases resource utilization and reduces wear imbalance on the shared NVMe SSD.
An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform
Deguang Wang, Junzhong Shen, M. Wen, Chunyuan Zhang
DOI: 10.1145/3337821.3337846

Convolutional Neural Networks (CNNs) have achieved impressive performance on various computer vision tasks. To push performance further, complicated-connected CNN models (e.g., GoogLeNet and DenseNet) have recently been proposed and have achieved state-of-the-art results in image classification and segmentation. However, CNNs are computation- and memory-intensive, so it is important to develop hardware accelerators for both CNN inference and training. Owing to the high-performance, reconfigurable, and energy-efficient nature of Field-Programmable Gate Arrays (FPGAs), many FPGA-based accelerators have been proposed for CNNs, achieving high throughput and energy efficiency. However, the large number of parameters in complicated-connected CNN models exceeds the limited hardware resources of a single FPGA board, which cannot meet the memory and computation demands of mapping entire models. Accordingly, in this paper we propose a complete design flow for accelerating the inference of complicated-connected CNNs on a multi-FPGA platform, comprising DAG abstraction, mapping-scheme generation, and design space exploration. In addition, we propose a multi-FPGA system with flexible inter-FPGA communication to efficiently support our design flow. Experimental results on representative models show that the proposed multi-FPGA system achieves a throughput acceleration of up to 145.2× and 2.5× over CPU and GPU solutions, respectively, as well as an energy-efficiency improvement of up to 139.1× and 4.8× over multi-core CPU and GPU solutions.