Accelerated Work Stealing
D. B. Larkins, John Snyder, James Dinan
DOI: 10.1145/3337821.3337878

Realizing scalable performance with irregular parallel applications is challenging on large-scale distributed-memory clusters. These applications typically require continuous, dynamic load balancing to maintain efficiency. Work stealing is a common approach to dynamic distributed load balancing; however, its use in conjunction with advanced network offload capabilities is not well understood. We present a distributed work-stealing system that is amenable to acceleration using the Portals 4 network programming interface. Our work shows that the structures provided by Portals to handle two-sided communication are general-purpose and can accelerate work stealing. We demonstrate the effectiveness of this approach using known benchmarks from computational chemistry and unbalanced tree search. Results show that Portals-accelerated work stealing can greatly reduce communication overhead, task acquisition time, and termination detection time.
The Case for Water-Immersion Computer Boards
M. Koibuchi, I. Fujiwara, Naoya Niwa, Tomohiro Totoki, S. Hirasawa
DOI: 10.1145/3337821.3337830

A key concern for a high-power processor is heat dissipation, which limits the power, and thus the operating frequency, of a chip so as not to exceed some temperature threshold. In particular, 3-D chip integration will further increase power density, requiring more efficient cooling technology. While air, Fluorinert, and mineral oil have traditionally been used as coolants, in this study we propose to directly use tap or natural water, owing to its superior thermal conductivity. We have developed "in-water computer" prototypes that rely on a parylene-film insulation coating. Our prototypes support direct water-immersion cooling by taking in and draining natural water, whereas existing cooling requires a secondary coolant (e.g., outside air in cold climates) to cool the primary coolant that contacts the chips. Our prototypes reduce the temperature of commodity processor chips by 20 degrees. Our analysis shows that in-water cooling increases the acceptable power density of chips, thus enabling higher operating frequencies. Through full-system simulation, our results show that water-immersion chip multiprocessors outperform counterpart water-pipe-cooled and oil-immersion chips by up to 14% and 4.5%, respectively, in terms of execution times of the NAS Parallel Benchmarks.
Fast Recovery Techniques for Erasure-coded Clusters in Non-uniform Traffic Network
Y. Bai, Zihan Xu, Haixia Wang, Dongsheng Wang
DOI: 10.1145/3337821.3337831

Many practical systems adopt erasure codes to ensure reliability and reduce storage overhead. However, erasure codes also suffer from poor recovery performance. In practice, network links, such as those in peer-to-peer and cross-datacenter networks, often have non-uniform bandwidth for various reasons. To reduce recovery time, we propose Parallel Pipeline Tree (PPT) and Parallel Pipeline Cross-Tree (PPCT) to speed up single-node and multiple-node recovery, respectively, in non-uniform traffic network environments. By exploiting the bandwidth gaps among links, PPT constructs a tree path based on bandwidth and pipelines the data in parallel. By sharing the traffic pressure of requesters with helpers, PPCT constructs a tree-like path and pipelines the data in parallel without additional helpers. We also theoretically analyze the effect of PPT and PPCT in uniform network environments. Experiments on geo-distributed Amazon EC2 show that the recovery-time reduction reaches up to 37.2% with PPCT over the traditional technique, and up to 89.2%, 76.4%, and 21.6% with PPT over the traditional technique, Partial-Parallel-Repair, and Repair Pipelining, respectively. PPT and PPCT significantly improve the recovery performance of erasure codes.
Refactoring and Optimizing WRF Model on Sunway TaihuLight
Kai Xu, Zhenya Song, Yuandong Chan, Shida Wang, Xiangxu Meng, Weiguo Liu, Wei Xue
DOI: 10.1145/3337821.3337923

The Weather Research and Forecasting (WRF) Model is one of the most widely used mesoscale numerical weather prediction systems and is designed for both atmospheric research and operational forecasting applications. However, it is extremely time-consuming: a single simulation can take researchers days to weeks as the simulation size scales up and computing demands grow. In this paper, we port and optimize the whole WRF model to the Sunway TaihuLight supercomputer at large scale. For the dynamic core of WRF, we present a domain-specific tool, SWSLL, a directive-based compiler tool for the Sunway many-core architecture that converts stencil computations into optimized parallel code. We also apply a decomposition strategy in SWSLL to improve memory locality and decrease the number of off-chip memory accesses. For the physical parameterizations, we explore thread-level parallelization using OpenACC directives, reorganizing data layouts and loops to achieve high performance. We present the algorithms and implementations and demonstrate the optimization of a real-world, complex atmospheric model on the Sunway TaihuLight supercomputer. Evaluation results show that for the widely used benchmark with a horizontal resolution of 2.5 km, a speedup of 4.7× is achieved for the whole WRF model using the proposed algorithms and optimization strategies. In terms of strong scalability, our implementation scales well to hundreds of thousands of heterogeneous cores on Sunway TaihuLight.
FlowCon
Wenjia Zheng, Michael Tynes, Henry Gorelick, Ying Mao, Long Cheng, Yantian Hou
DOI: 10.1145/3337821.3337868

An increasing number of companies are using data analytics to improve their products, services, and business processes. However, extracting knowledge from massive data sets requires nontrivial computational resources, so most businesses migrate their hardware needs to a remote cluster computing service (e.g., AWS) or to an in-house cluster facility that often runs at its resource capacity. In such scenarios, where jobs compete for available resources, using those resources effectively is essential for high-performance data analytics. Although cluster resource management is a fruitful research area that has made many advances (e.g., YARN, Kubernetes), few projects have investigated optimizations specifically for training multiple machine learning (ML) / deep learning (DL) models. In this work, we introduce FlowCon, a system that monitors the loss functions of ML/DL jobs at runtime and elastically adjusts their resource configurations accordingly. We present a detailed design and implementation of FlowCon and conduct extensive experiments on various DL models. Our experimental results show that FlowCon significantly improves DL job completion time and resource utilization efficiency compared to existing approaches. Specifically, FlowCon reduces completion time by up to 42.06% for a specific job without sacrificing the overall makespan, in the presence of various DL job workloads.
A Network-aware and Partition-based Resource Management Scheme for Data Stream Processing
Yidan Wang, Z. Tari, Xiaoran Huang, Albert Y. Zomaya
DOI: 10.1145/3337821.3337870

With the increasing demand for data-driven decision making, there is an urgent need to process geographically distributed data streams in real time. Existing scheduling and resource management schemes optimize stream processing performance with awareness of resources, quality of service, and network traffic. However, the correlation between network delay and inter-operator communication patterns is not well understood. In this study, we propose a network-aware, partition-based resource management scheme that copes with ever-changing network conditions and data communication in stream processing. The proposed approach applies operator fusion, considering the computational demand of individual operators and the inter-operator communication patterns, and maps the fused operators to clustered hosts with a weighted shortest-processing-time heuristic. Meanwhile, we establish a three-dimensional coordinate system that promptly reflects network conditions, real-time traffic, and resource availability. We evaluated the proposed approach against two benchmarks, and the results demonstrate its efficiency in throughput and resource utilization. We also conducted a case study, implementing a prototype system on top of the proposed approach that uses the stream processing paradigm for pedestrian behavior analysis: the application estimates the walking time for a given path according to real crowd traffic. The promising evaluation results further illustrate the efficiency of the proposed approach.
Massively Parallel ANS Decoding on GPUs
André Weißenberger, B. Schmidt
DOI: 10.1145/3337821.3337888

In recent years, graphics processors have enabled significant advances in the fields of big data and streamed deep learning. To keep pace with rapidly growing amounts of data and to achieve sufficient throughput rates, compression is a key part of many applications, including popular deep learning pipelines. However, as most of the respective APIs rely on CPU-based preprocessing for decoding, data decompression frequently becomes a bottleneck in accelerated compute systems. This establishes the need for efficient GPU-based decompression. Asymmetric numeral systems (ANS) represent a modern approach to entropy coding, combining superior compression ratios with high compression and decompression speeds. Concepts for parallelizing ANS decompression on GPUs have been published recently, but they exhibit only limited scalability in practical applications. In this paper, we present the first massively parallel, arbitrarily scalable approach to ANS decoding on GPUs, based on a novel overflow pattern. Our performance evaluation on three different CUDA-enabled GPUs (V100, TITAN V, GTX 1080) demonstrates speedups of up to 17× over 64 CPU threads, up to 31× over a high-performance SIMD-based solution, and up to 39× over Zstandard's entropy codec. Our implementation is publicly available at https://github.com/weissenberger/multians.
Controlled Asynchronous GVT: Accelerating Parallel Discrete Event Simulation on Many-Core Clusters
Ali Eker, B. Williams, K. Chiu, D. Ponomarev
DOI: 10.1145/3337821.3337927

In this paper, we investigate the performance of Parallel Discrete Event Simulation (PDES) on a cluster of many-core Intel KNL processors. Specifically, we analyze the impact of different Global Virtual Time (GVT) algorithms in this environment and contribute three significant results. First, we show that it is essential to isolate the thread performing MPI communication from the task of processing simulation events; otherwise, the simulation is significantly imbalanced and performs poorly. This applies to both synchronous and asynchronous GVT algorithms. Second, we demonstrate that a synchronous GVT algorithm based on barrier synchronization is the better choice for communication-dominated models, while asynchronous GVT based on Mattern's algorithm performs better in computation-dominated scenarios. Third, we propose the Controlled Asynchronous GVT (CA-GVT) algorithm, which selectively adds synchronization to Mattern-style GVT based on simulation conditions. We demonstrate that CA-GVT outperforms both barrier-based and Mattern's GVT, achieving about an 8% performance improvement on mixed computation-communication models. This is a reasonable improvement for a simple modification to a GVT algorithm.
CostPI
Jiahao Liu, Fang Wang, D. Feng
DOI: 10.1145/3337821.3337879

NVMe SSDs have been widely adopted to provide storage services in cloud platforms, where diverse workloads (latency-sensitive, throughput-oriented, and capacity-oriented) are colocated. To achieve performance isolation, existing solutions partition the shared SSD into multiple isolated regions and assign each workload a separate region. However, these solutions can result in inefficient resource utilization and imbalanced wear; more importantly, they cannot reduce the interference caused by embedded-cache contention. In this paper, we present CostPI, which improves isolation and resource utilization by providing latency-sensitive workloads with dedicated resources (data cache, mapping-table cache, and NAND flash) and providing throughput-oriented and capacity-oriented workloads with shared resources. Specifically, at the NVMe queue level, we present an SLO-aware arbitration mechanism that fetches requests from NVMe queues at different granularities according to workload SLOs. At the embedded-cache level, we use an asymmetric allocation scheme to partition the cache (both data cache and mapping-table cache); for different data-cache partitions, we adopt different cache policies to meet diverse workload requirements while reducing imbalanced wear. At the NAND flash level, we partition the hardware resources at channel granularity to enable the strongest isolation. Our experiments show that, for latency-sensitive workloads, CostPI reduces the average response time by up to 44.2%, the 99th-percentile response time by up to 89.5%, and the 99.9th-percentile response time by up to 88.5%. Meanwhile, CostPI increases resource utilization and reduces wear imbalance on the shared NVMe SSD.
An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform
Deguang Wang, Junzhong Shen, M. Wen, Chunyuan Zhang
DOI: 10.1145/3337821.3337846

Convolutional Neural Networks (CNNs) have achieved impressive performance on various computer vision tasks. To push performance further, complicated-connected CNN models (e.g., GoogLeNet and DenseNet) have recently been proposed and have achieved state-of-the-art results in image classification and segmentation. However, CNNs are computation- and memory-intensive, so it is important to develop hardware accelerators for both CNN inference and training. Owing to the high-performance, reconfigurable, and energy-efficient nature of Field-Programmable Gate Arrays (FPGAs), many FPGA-based accelerators have been proposed for CNNs, achieving high throughput and energy efficiency. However, the large number of parameters in complicated-connected CNN models exceeds the limited hardware resources of a single FPGA board, which cannot meet the memory and computation demands of mapping entire models. Accordingly, in this paper we propose a complete design flow for accelerating the inference of complicated-connected CNNs on a multi-FPGA platform, comprising DAG abstraction, mapping-scheme generation, and design space exploration. In addition, we propose a multi-FPGA system with flexible inter-FPGA communication to efficiently support our design flow. Experimental results on representative models show that the proposed multi-FPGA system achieves a throughput acceleration of up to 145.2× and 2.5× over CPU and GPU solutions, respectively, as well as an energy-efficiency improvement of up to 139.1× and 4.8× over multi-core CPU and GPU solutions.