DCTCP is a potential TCP replacement designed to meet the requirements of data center networks, and it has attracted wide attention in both academia and industry. However, DCTCP supports only tens of concurrent flows well and suffers timeouts and throughput collapse when facing many concurrent flows. This falls far short of the requirements of data center networks, where the partition/aggregation pattern usually involves hundreds of concurrent flows. In this paper, after tracing DCTCP's dynamic behavior through experiments, we identify two root causes of DCTCP's failure under the high fan-in traffic pattern: (1) the regulation mechanism of the sending window becomes ineffective once cwnd has decreased to its minimum size, and (2) the bursts induced by synchronized flows with small cwnd cause fatal packet loss, leading to severe timeouts. We enhance DCTCP to support massive concurrent flows by regulating the sending time interval and desynchronizing the sending time under particular conditions. The new protocol, called DCTCP+, outperforms DCTCP when the number of concurrent flows grows to several hundred. DCTCP+ effectively supports the short concurrent query responses in a benchmark from real production clusters and keeps the same good performance when mixed with background traffic.
{"title":"Slowing Little Quickens More: Improving DCTCP for Massive Concurrent Flows","authors":"Mao Miao, Peng Cheng, Fengyuan Ren, Ran Shu","doi":"10.1109/ICPP.2015.78","DOIUrl":"https://doi.org/10.1109/ICPP.2015.78","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
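The first failure cause named in the DCTCP+ abstract can be illustrated with the standard DCTCP window update, in which the sender keeps a running estimate `alpha` of the fraction of ECN-marked packets and cuts cwnd by `alpha / 2`. The following is a minimal sketch, not the authors' implementation: the floor of 2 segments, the gain of 1/16, and the `jittered_interval` helper are assumptions introduced here for illustration. The helper only shows the general idea of stretching and randomizing the gap between sends once the window cannot shrink further.

```python
import random

MIN_CWND = 2.0   # assumed minimum congestion window, in segments
G = 1.0 / 16.0   # assumed gain for the moving average of the marked fraction


def update_alpha(alpha, marked_fraction, g=G):
    """Standard DCTCP estimator: alpha <- (1 - g) * alpha + g * F."""
    return (1.0 - g) * alpha + g * marked_fraction


def dctcp_reduce(cwnd, alpha, min_cwnd=MIN_CWND):
    """Standard DCTCP reduction, clamped at the minimum window."""
    return max(min_cwnd, cwnd * (1.0 - alpha / 2.0))


def jittered_interval(base_interval, slowdown, rng):
    """Hypothetical pacing rule: stretch the sending interval by `slowdown`
    and add a random offset in [0, base_interval] so flows drift apart."""
    return base_interval * slowdown + rng.uniform(0.0, base_interval)
```

With every packet marked (`alpha = 1.0`), `dctcp_reduce(10.0, 1.0)` halves the window to 5 segments, but `dctcp_reduce(2.0, 1.0)` stays at 2: once a flow sits at the floor, further congestion signals no longer lower its sending rate, which is the situation the abstract describes.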
M. Shan, Guihai Chen, Fan Wu, Xiaobing Wu, Xiaofeng Gao, Pan Wu, Haipeng Dai
Tree-based routing structures are widely used to gather data in wireless sensor networks (WSNs). Along with tree structures, in-network aggregation is adopted to reduce transmissions, save energy, and prolong the network lifetime. Most existing works on lifetime optimization for data aggregation do not take link quality into consideration. In this paper, we study the problem of Maximizing Reliability of Lifetime Constrained data aggregation trees (MRLC) in WSNs. Because the MRLC problem is NP-complete, we propose an algorithm, the Iterative Relaxation Algorithm (IRA), that iteratively relaxes the optimization program to find an aggregation tree that satisfies the lifetime bound at a sub-optimal cost. To adapt to the distributed nature of WSNs in practice, we further propose a distributed updating protocol based on the Prüfer code. Through extensive simulations, we demonstrate that IRA outperforms the best known related work in terms of reliability.
{"title":"On Maximizing Reliability of Lifetime Constrained Data Aggregation Tree in Wireless Sensor Networks","authors":"M. Shan, Guihai Chen, Fan Wu, Xiaobing Wu, Xiaofeng Gao, Pan Wu, Haipeng Dai","doi":"10.1109/ICPP.2015.17","DOIUrl":"https://doi.org/10.1109/ICPP.2015.17","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
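The MRLC abstract mentions a Prüfer code as the basis of its distributed updating protocol. A Prüfer code represents a labeled tree on n nodes as a sequence of n - 2 node labels, so a whole aggregation tree can be exchanged as a short list. The sketch below shows only the textbook encoding and decoding, assuming nodes are labeled 0..n-1 and n >= 2; it does not reproduce the paper's protocol, and the function names are chosen here for illustration.

```python
import heapq


def prufer_encode(n, edges):
    """Return the Prufer sequence of a labeled tree on nodes 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    leaves = [v for v in range(n) if len(adj[v]) == 1]
    heapq.heapify(leaves)
    seq = []
    for _ in range(n - 2):
        leaf = heapq.heappop(leaves)      # smallest-labeled leaf
        parent = next(iter(adj[leaf]))    # its only remaining neighbor
        seq.append(parent)
        adj[parent].discard(leaf)
        adj[leaf].clear()
        if len(adj[parent]) == 1:         # parent has become a leaf
            heapq.heappush(leaves, parent)
    return seq


def prufer_decode(seq):
    """Rebuild the edge list of the tree encoded by a Prufer sequence."""
    n = len(seq) + 2
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    leaves = [v for v in range(n) if degree[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for v in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, v))
        degree[v] -= 1
        if degree[v] == 1:
            heapq.heappush(leaves, v)
    # Exactly two leaves remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges
```

For the tree with edges (0,3), (1,3), (2,3), (3,4), (4,5), `prufer_encode` returns `[3, 3, 3, 4]`, and `prufer_decode` of that sequence recovers the same five edges.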
Jian Xiao, Xingyu Xu, Ce Yu, Jiawan Zhang, Shuinai Zhang, Li Ji, Ji-zhou Sun
Spectral calculation and analysis have important practical applications in astrophysics. The main part of spectral calculation is solving a large number of one-dimensional numerical integrations at each point of a large three-dimensional parameter space. However, the widely used existing solutions are still limited to process-level parallelism, which is ill-suited to handling numerous compute-intensive small integration tasks. This paper presents a GPU-optimized approach to accelerate numerical integration in massive spectral calculation. We also propose a load-balancing strategy for hybrid architectures with multiple CPUs and GPUs, implemented via shared memory, to maximize performance. The approach was prototyped and tested on the Astrophysical Plasma Emission Code (APEC), a commonly used spectral toolset. Compared with the original serial version and the parallel version on 24 CPU cores (2.5 GHz), our implementation on 3 Tesla C2075 GPUs achieves speedups of up to 300x and 22x, respectively.
{"title":"Accelerating Spectral Calculation through Hybrid GPU-Based Computing","authors":"Jian Xiao, Xingyu Xu, Ce Yu, Jiawan Zhang, Shuinai Zhang, Li Ji, Ji-zhou Sun","doi":"10.1109/ICPP.2015.13","DOIUrl":"https://doi.org/10.1109/ICPP.2015.13","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
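The workload shape described in the spectral calculation abstract, many small independent one-dimensional integrals over a three-dimensional parameter grid, can be sketched on the CPU as follows. This is only an illustration of the task structure: the integrand `d * exp(-x / t)`, the parameter names, and the use of composite Simpson's rule are assumptions made here and are not taken from APEC or the paper.

```python
import math
from itertools import product


def simpson(f, a, b, n=64):
    """Composite Simpson's rule on [a, b] with an even number n of panels."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def integrate_grid(temps, densities, upper_limits):
    """Evaluate one small integral per point of a 3-D parameter grid.
    Every grid point is independent, so each could map to its own thread."""
    results = {}
    for t, d, e in product(temps, densities, upper_limits):
        # Placeholder integrand, chosen only so the result has a closed form.
        results[(t, d, e)] = simpson(lambda x: d * math.exp(-x / t), 0.0, e)
    return results
```

Because no grid point depends on another, the loop in `integrate_grid` is the part that a fine-grained parallel implementation would distribute across threads rather than across processes.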
Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao
Sharpness is an algorithm used to sharpen images. As image size, resolution, and the demand for real-time processing increase, the performance of sharpness needs to improve greatly. Because sharpness computes each pixel independently, it is a good candidate for GPU acceleration. However, porting it to the GPU is challenging because sharpness executes in several stages, each with its own characteristics and each with or without data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpness on the GPU. Our solution includes five major techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution reaches a speedup of 10.7x to 69.3x for different image sizes on an AMD FirePro W8000 GPU.
{"title":"Optimizing Image Sharpening Algorithm on GPU","authors":"Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, Ting Cao","doi":"10.1109/ICPP.2015.32","DOIUrl":"https://doi.org/10.1109/ICPP.2015.32","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
This paper presents the first distributed triangle listing algorithm with provable CPU, I/O, memory, and network bounds. Finding all triangles (3-cliques) in a graph has numerous applications for density and connectivity metrics, but the majority of existing algorithms for massive graphs are sequential, while distributed versions do not guarantee their CPU, I/O, memory, or network requirements. Our Parallel and Distributed Triangle Listing (PDTL) framework focuses on efficient external-memory access in distributed environments instead of fitting subgraphs into memory. It works by performing efficient orientation and load-balancing steps and by replicating graphs across machines, using an extended version of Hu et al.'s Massive Graph Triangulation algorithm. PDTL suits a variety of computational environments, from single-core machines to high-end clusters, and computes the exact triangle count on graphs of over 6B edges and 1B vertices (e.g. Yahoo graphs), outperforming the state-of-the-art systems PowerGraph, OPT, and PATRIC by 2x to 4x while using fewer resources. Our approach thus highlights the importance of I/O in a distributed environment.
{"title":"PDTL: Parallel and Distributed Triangle Listing for Massive Graphs","authors":"Ilias Giechaskiel, G. Panagopoulos, Eiko Yoneki","doi":"10.1109/ICPP.2015.46","DOIUrl":"https://doi.org/10.1109/ICPP.2015.46","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
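The orientation step mentioned in the PDTL abstract is commonly realized by directing every edge from its lower-ranked endpoint to its higher-ranked one, so that each triangle is discovered exactly once. The sketch below is a small in-memory illustration of that general idea, assuming a rank of (degree, vertex id); it is not the external-memory, distributed algorithm of the paper, and `list_triangles` is a name chosen here.

```python
def list_triangles(edges):
    """List each triangle of an undirected simple graph exactly once."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def rank(v):
        # Total order on vertices: by degree, ties broken by vertex id.
        return (len(adj[v]), v)

    # Orientation: keep only neighbors that rank higher than the vertex.
    out = {v: {w for w in nbrs if rank(w) > rank(v)} for v, nbrs in adj.items()}

    triangles = []
    for u in out:
        for v in out[u]:
            # A common out-neighbor w closes the triangle u -> v -> w.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)
```

Ranking by degree keeps the out-neighbor sets of high-degree vertices small, which bounds the work of the set intersections; on the edges (0,1), (1,2), (0,2), (2,3) the function returns `[(0, 1, 2)]`.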
Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu
Data analytics is at the core of the supply chain for both products and services in modern economies and societies. Big data workloads, however, are placing unprecedented demands on computing technologies, calling for a deep understanding and characterization of these emerging workloads. In this paper, we propose the ShenZhen Transportation System (SZTS), a novel big data Hadoop benchmark suite comprising real-life transportation analysis applications with real-life input data sets from Shenzhen, China. SZTS uniquely focuses on a specific, real-life application domain, whereas other existing Hadoop benchmark suites, such as HiBench and CloudRank-D, consist of generic algorithms with synthetic inputs. We perform a cross-layer workload characterization at both the job and microarchitecture levels, revealing unique characteristics of SZTS compared to existing Hadoop benchmarks as well as the general-purpose multi-core PARSEC benchmarks. We also study the sensitivity of workload behavior to input data size and propose a methodology for identifying representative input data sets.
{"title":"SZTS: A Novel Big Data Transportation System Benchmark Suite","authors":"Wen Xiong, Zhibin Yu, L. Eeckhout, Zhengdong Bei, Fan Zhang, Chengzhong Xu","doi":"10.1109/ICPP.2015.91","DOIUrl":"https://doi.org/10.1109/ICPP.2015.91","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
In this paper, we assess the practicability of HashSieve, a recently proposed sieving algorithm for the Shortest Vector Problem (SVP) on lattices, on multi-core shared-memory systems. To this end, we devised a parallel implementation that scales well and is based on a probable lock-free system to handle concurrency. The probable lock-free system is implemented with spin-locks, which are in turn implemented with CAS operations. It behaves like a lock-free mechanism with high probability, since threads block only when strictly required and are unlikely to be required to block. With our implementation, we were able to solve the SVP on an arbitrary lattice in dimension 96 in less than 17.5 hours, using 16 physical cores. The least-squares fit of the execution times of our implementation, in seconds, lies between $2^{0.32n - 15}$ and $2^{0.33n - 16}$, where $n$ is the lattice dimension. These results are of paramount importance for the selection of parameters in lattice-based cryptography, as they indicate that sieving algorithms are far more practical for solving the SVP than previously believed.
{"title":"Parallel (Probable) Lock-Free HashSieve: A Practical Sieving Algorithm for the SVP","authors":"Artur Mariano, C. Bischof, Thijs Laarhoven","doi":"10.1109/ICPP.2015.68","DOIUrl":"https://doi.org/10.1109/ICPP.2015.68","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
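As a quick sanity check on the fitted running times quoted in the sieving abstract, the two bounds can be evaluated at the reported dimension. The sketch below reads them as 2 raised to (0.32n - 15) and 2 raised to (0.33n - 16) seconds, which is an interpretation of the flattened exponents in the source text; `fitted_seconds` is a helper name introduced here.

```python
def fitted_seconds(n, slope, offset):
    """Running time in seconds predicted by a fit of the form 2^(slope*n - offset)."""
    return 2.0 ** (slope * n - offset)


# Evaluate both fitted curves at the reported dimension n = 96.
for slope, offset in ((0.32, 15), (0.33, 16)):
    hours = fitted_seconds(96, slope, offset) / 3600.0
    print(f"2^({slope}n - {offset}) at n=96: {hours:.1f} hours")
```

Both curves give roughly 15 hours at n = 96, which agrees with the reported solve time of less than 17.5 hours on 16 cores.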
The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular, we focus on recursive algorithms operating on trees and graphs. We propose different parallelization templates aimed at increasing the GPU utilization of these codes. Specifically, we investigate mechanisms to effectively distribute irregular work to streaming multiprocessors and GPU cores. Some of our parallelization templates rely on dynamic parallelism, a feature recently introduced by Nvidia in their Kepler GPUs and announced as part of the OpenCL 2.0 standard. We propose mechanisms to maximize the work performed by nested kernels and minimize the overhead due to their invocation. Our results show that the use of our parallelization templates on applications with irregular nested loops can lead to a 2-6x speedup over baseline GPU codes that do not include load-balancing mechanisms. The use of nested parallelism-based parallelization templates on recursive tree traversal algorithms can lead to substantial speedups (up to 15-24x) over optimized CPU implementations. However, the benefits of nested parallelism are still unclear for recursive applications operating on graphs, especially when recursive code variants require expensive synchronization. In these cases, a flat parallelization of iterative versions of the considered algorithms may be preferable.
{"title":"Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations","authors":"Da Li, Hancheng Wu, M. Becchi","doi":"10.1109/ICPP.2015.107","DOIUrl":"https://doi.org/10.1109/ICPP.2015.107","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
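The underutilization problem with irregular nested loops, raised in the nested parallelism abstract, comes from inner loops of very different lengths: a thread that owns a long inner loop keeps its neighbors waiting. One common remedy, shown here as a CPU-side sketch, is to flatten the nested iteration space with a prefix sum so that every worker receives the same number of inner iterations. This flattening technique is a generic illustration and is not claimed to be one of the paper's templates; all function names are introduced here.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(inner_sizes):
    """Prefix sums: offsets[i] is the flat index of outer iteration i's first item."""
    return [0] + list(accumulate(inner_sizes))


def locate(offsets, flat_index):
    """Map a flat index back to its (outer, inner) loop indices."""
    outer = bisect_right(offsets, flat_index) - 1
    return outer, flat_index - offsets[outer]


def split_evenly(total, workers):
    """Give each worker a contiguous, near-equal range of flat indices."""
    base, extra = divmod(total, workers)
    ranges, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For inner loop lengths `[3, 0, 5]`, the offsets are `[0, 3, 3, 8]`, flat index 3 maps to `(2, 0)` (the empty outer iteration is skipped), and `split_evenly(8, 4)` gives each of four workers exactly two inner iterations regardless of which outer iteration they belong to.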
Performance and energy efficiency are major concerns in cloud computing data centers. Often, they carry conflicting requirements, making optimization a challenge. Further complications arise when heterogeneous hardware and data center management technologies are combined. For example, heterogeneous hardware such as General Purpose Graphics Processing Units (GPGPUs) improves performance at the cost of greater power consumption, while virtualization technologies improve resource management and utilization at the cost of degraded performance. In this paper, we focus on exploiting the heterogeneity introduced by GPUs to reduce power budget requirements for servers while maintaining performance. To maintain or improve overall server performance at a reduced power budget, we propose two enhancements: (a) we borrow power from co-located multi-threaded virtual machines (VMs) and reallocate it to GPU VMs; (b) to compensate multi-threaded VMs and re-boost their performance, we borrow virtual computing resources from GPU VMs and reallocate them to CPU VMs. Combining the two techniques minimizes the server power budget while maintaining overall server performance. Our results show that the server power budget can be reduced by almost 18% at an average cost of 13% performance degradation per virtual machine. In addition, reallocating virtual resources improves the performance of multi-threaded applications by 30% without affecting GPU applications. Combining both techniques reduces server energy consumption by 47% with minimal performance degradation.
{"title":"Optimization of Resource Allocation and Energy Efficiency in Heterogeneous Cloud Data Centers","authors":"Amer Qouneh, Ming Liu, Tao Li","doi":"10.1109/ICPP.2015.9","DOIUrl":"https://doi.org/10.1109/ICPP.2015.9","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}
Streaming services are gaining popularity and contribute a large fraction of today's cellular network traffic. Both playback fluency and battery endurance are significant performance metrics for mobile streaming services. However, because of unpredictable network conditions and the loose coupling between upper-layer streaming protocols and underlying network configurations, jointly optimizing rebuffering time and energy consumption for mobile streaming services remains a significant challenge. In this paper, we propose a novel framework that addresses these limitations and optimizes video transmission in cellular networks. Within this framework we design two complementary algorithms, the Rebuffering Time Minimization Algorithm (RTMA) and the Energy Minimization Algorithm (EMA), to achieve smooth playback and energy efficiency on demand in multi-user scenarios. Our algorithms integrate cross-layer parameters to schedule video delivery. Specifically, RTMA aims to achieve the minimum rebuffering time with limited energy, and EMA tries to obtain the minimum energy consumption while meeting the rebuffering time constraint. Extensive simulation demonstrates that RTMA reduces rebuffering time by at least 68% and EMA reduces energy consumption by more than 27% compared with other state-of-the-art solutions.
{"title":"Joint Media Streaming Optimization of Energy and Rebuffering Time in Cellular Networks","authors":"Zeqi Lai, Yong Cui, Yayun Bao, Jiangchuan Liu, Yingchao Zhao, Xiao Ma","doi":"10.1109/ICPP.2015.49","DOIUrl":"https://doi.org/10.1109/ICPP.2015.49","journal":{"name":"2015 44th International Conference on Parallel Processing"},"publicationDate":"2015-09-01"}