OS-Based NUMA Optimization: Tackling the Case of Truly Multi-thread Applications with Non-partitioned Virtual Page Accesses
Ilaria Di Gennaro, Alessandro Pellegrini, F. Quaglia
A common approach to improving memory access performance on NUMA machines exploits operating system (OS) page-protection mechanisms to induce faults and thereby determine which pages are accessed by which thread, so that a thread and its working set of pages can be moved to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. These proposals observe (induced) faults on a single page table shared by all the threads of a process to determine the access pattern. Hence, a fault by one thread (and the consequent re-opening of access to the corresponding page) masks the accesses by other threads to the same page. This can make the estimated working set of individual threads inaccurate. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, which explicitly targets accurate per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, together with an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs all required modifications to the original kernel image solely through dynamic patching. A motivated case study in the context of HPC is also presented to assess our proposal.
{"title":"OS-Based NUMA Optimization: Tackling the Case of Truly Multi-thread Applications with Non-partitioned Virtual Page Accesses","authors":"Ilaria Di Gennaro, Alessandro Pellegrini, F. Quaglia","doi":"10.1109/CCGrid.2016.91","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.91","url":null,"abstract":"A common approach to improve memory access in NUMA machines exploits operating system (OS) page protection mechanisms to induce faults to determine which pages are accessed by what thread, so as to move the thread and its working-set of pages to the same NUMA node. However, existing proposals do not fully fit the requirements of truly multi-thread applications with non-partitioned accesses to virtual pages. In fact, these proposals exploit (induced) faults on a same page-table for all the threads of a same process to determine the access pattern. Hence, the fault by one thread (and the consequent re-opening of the access to the corresponding page) would mask those by other threads on the same page. This may lead to inaccuracy in the estimation of the working-set of individual threads. We overcome this drawback by presenting a lightweight operating system support for Linux, referred to as multi-view address space, explicitly targeting accuracy of per-thread working-set estimation in truly multi-thread applications with non-partitioned accesses, and an associated thread/data migration policy. Our solution is fully transparent to user-space code. It is embedded in a Linux/x86_64 module that installs any required modification to the original kernel image by solely relying on dynamic patching. A motivated case study in the context of HPC is also presented for an assessment of our proposal.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122272154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors
Qiao Sun, Chao Yang, Changmao Wu, Leisheng Li, Fangfang Liu
Stream compaction, frequently found in a large variety of applications, is a general primitive that reduces an input stream to the subset containing only the wanted elements, so that follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction algorithm for IA-based multi-/many-core processors. Unlike previously studied algorithms, which depend heavily on a black-box parallel scan, we open the black box and manually tailor the scan so that both the workload and the memory footprint are significantly reduced. By further eliminating conditional statements and applying automatic code generation/optimization to the performance-critical kernels, the proposed parallel stream compaction achieves high performance across different cases, data types, and IA-based multi-/many-core platforms. Experimental results on three typical IA-based processors (a quad-core Core i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator) show that the proposed implementation outperforms the reference parallel counterpart in the state-of-the-art library Thrust. On top of that, we apply it to a random-forest-based data classifier to show its potential for boosting the performance of real-world applications.
{"title":"Fast Parallel Stream Compaction for IA-Based Multi/many-core Processors","authors":"Qiao Sun, Chao Yang, Changmao Wu, Leisheng Li, Fangfang Liu","doi":"10.1109/CCGrid.2016.112","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.112","url":null,"abstract":"Stream compaction, frequently found in a large variety of applications, serves as a general primitive that reduces an input stream to a subset containing only the wanted elements so that the follow-on computation can be done efficiently. In this paper, we propose a fast parallel stream compaction for IA-based multi-/many-core processors. Unlike the previously studied algorithms that depend heavily on a black-box parallel scan, we open the black-box in the proposed algorithm and manually tailor it so that both the workload and the memory footprint is significantly reduced. By further eliminating the conditional statements and applying automatic code generation/optimization for performance-critical kernels, the proposed parallel stream compaction achieves high performance in different cases and for various data types across different IA-based multi/manycore platforms. Experimental results on three typical IA-based processors, including a quad-core Core-i7 CPU, a dual-socket 8-core Xeon CPU, and a 61-core Xeon Phi accelerator show that the proposed implementation outperforms the referenced parallel counterpart in the state-of-art library Thrust. On top of the above, we apply it in the random forest based data classifier to show its potential to boost the performance of real-world applications.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117274171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
iGiraph: A Cost-Efficient Framework for Processing Large-Scale Graphs on Public Clouds
Safiollah Heidari, R. Calheiros, R. Buyya
Large-scale graph analytics has gained attention during the past few years. As the world becomes more connected through new technologies and applications such as social networks, Web portals, mobile devices, and the Internet of Things, a huge amount of data is created and stored every day in the form of graphs consisting of billions of vertices and edges. Many graph processing frameworks have been developed to process these large graphs since Google introduced its Pregel framework in 2010. Meanwhile, cloud computing has improved service delivery by overcoming the restrictions of traditional computing through distributed computing, elasticity, and pay-as-you-go pricing. In this paper, we present iGiraph, a cost-efficient Pregel-like graph processing framework for processing large-scale graphs on public clouds. iGiraph uses a new dynamic re-partitioning approach based on messaging patterns to minimize the cost of resource utilization on public clouds. We also present experimental results on the performance and cost effects of our method and compare them with the basic Giraph framework. Our results show that iGiraph markedly decreases cost and improves performance by dynamically scaling the number of workers.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.38
CACPPA: A Cloud-Assisted Conditional Privacy Preserving Authentication Protocol for VANET
Ubaidullah Rajput, Fizza Abbas, Jian Wang, Hasoo Eun, Heekuck Oh
A vehicular ad hoc network (VANET) is an application of intelligent transportation systems (ITS) with emphasis on improving both traffic safety and efficiency. A VANET can be thought of as a subset of a mobile ad hoc network (MANET) in which vehicles form a network by communicating with each other (V2V) or with infrastructure (V2I). Vehicles broadcast not only traffic messages but also safety-critical messages such as the electronic emergency braking light (EEBL). Misuse of this application may result in a traffic accident and, at worst, loss of life. This makes vehicle authentication a necessary requirement in VANETs. During authentication, privacy-related data, such as the vehicle's and owner's identity and location information, must be kept private to prevent an attacker from stealing it. This paper presents a cloud-assisted conditional privacy preserving authentication (CACPPA) protocol for VANETs. CACPPA is a hybrid approach that draws on both pseudonym-based and group-signature-based approaches while avoiding their inherent drawbacks: it requires a vehicle neither to manage a certificate revocation list nor to manage any groups. Instead, an efficient cloud-based certification authority assists vehicles in obtaining credentials and subsequently using them during authentication. CACPPA provides conditional anonymity: a vehicle's anonymity is preserved only as long as it honestly follows the protocol. Furthermore, we analyze CACPPA under various attack scenarios and present a computational and communication cost analysis, as well as a comparison with existing approaches, to show its feasibility and robustness.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.47
Evaluation of In-Situ Analysis Strategies at Scale for Power Efficiency and Scalability
I. Rodero, M. Parashar, Aaditya G. Landge, Sidharth Kumar, Valerio Pascucci, P. Bremer
The increasing gap between available compute power and I/O capabilities is forcing simulation pipelines running on leadership computing facilities to be reformulated. In particular, in-situ processing is complementing conventional post-process analysis; it can be performed either on the same compute resources as the simulation or on secondary dedicated resources. In this paper, we focus on three in-situ analysis strategies that use the same compute resources as the ongoing simulation but differ in their data movement. We evaluate the costs incurred by these strategies in terms of run time, scalability, and power/energy consumption. Furthermore, we extrapolate power behavior to petascale and investigate different design choices through projections. Experimental evaluation at full machine scale on Titan indicates that using fewer cores per node for in-situ analysis is the best choice in terms of scalability. Hence, further research effort should be devoted to developing in-situ analysis techniques that follow this strategy on future high-end systems.
{"title":"Evaluation of In-Situ Analysis Strategies at Scale for Power Efficiency and Scalability","authors":"I. Rodero, M. Parashar, Aaditya G. Landge, Sidharth Kumar, Valerio Pascucci, P. Bremer","doi":"10.1109/CCGrid.2016.95","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.95","url":null,"abstract":"The increasing gap between available compute power and I/O capabilities is resulting in simulation pipelines running on leadership computing facilities being reformulated. In particular, in-situ processing is complementing conventional post-process analysis, however, it can be performed by using the same compute resources as the simulation or using secondary dedicated resources. In this paper, we focus on three different in-situ analysis strategies, which use the same compute resources as the ongoing simulation but different data movement strategies. We evaluate the costs incurred by these strategies in terms of run time, scalability and power/energy consumption. Furthermore, we extrapolate power behavior to peta-scale and investigate different design choices through projections. Experimental evaluation at full machine scale on Titan supports that using fewer cores per node for in-situ analysis is the optimum choice in terms of scalability. Hence, further research effort should be devoted towards developing in-situ analysis techniques following this strategy in future high-end systems.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122469522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Billing system CPU time on individual VM
Boris Teabe, A. Tchana, D. Hagimont
In virtualized cloud hosting centers, a virtual machine (VM) is generally allocated a fixed computing capacity, and the virtualization system schedules the VMs so that each VM's capacity is provided and respected. However, a significant amount of CPU time is consumed by the underlying virtualization system itself, which generally includes device drivers (mainly network and disk drivers). In today's virtualization systems, this CPU time is difficult to monitor and is not charged to VMs. This situation has important consequences for both clients and the provider: degraded performance isolation and predictability for the former, and harder resource management (especially consolidation) for the latter. In this paper, we propose a virtualization system mechanism that estimates the CPU time used by the virtualization system on behalf of each VM. This CPU time is then charged to the VM, removing the two side effects above. The mechanism has been implemented in Xen, and its benefits have been evaluated using reference benchmarks.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.76
Demand-Aware Power Management for Power-Constrained HPC Systems
Cao Thang, Yuan He, Masaaki Kondo
As a limited power budget becomes one of the most crucial challenges in developing supercomputer systems, hardware overprovisioning, which installs more nodes than can run simultaneously at the power constraint determined by Thermal Design Power, is an attractive way to design extreme-scale supercomputers. In this design, the power consumption of each node must be controlled through hardware power knobs such as dynamic voltage and frequency scaling (DVFS) or power-capping mechanisms. Traditionally, supercomputer schedulers decide only when and where to allocate jobs; in overprovisioned systems, they must also decide how much power to allocate to each job. An easy approach is to set a fixed power cap for each job so that total consumption stays within the system's power constraint. Fixed power capping does not necessarily provide good performance, however, since a job's effective power usage changes throughout its execution; moreover, because each job has its own performance requirement, a fixed cap may not suit all jobs. In this paper, we propose a demand-aware power management framework for overprovisioned and power-constrained high-performance computing (HPC) systems. The job scheduler selects jobs to run based on the available hardware and power resources. The power manager continuously monitors power usage, predicts the performance of executing jobs, and optimizes the power cap of each CPU so that each job's required performance level is satisfied while system throughput improves through better use of the available power budget. Experiments on a real HPC system, together with simulations of a large-scale system, show that the power manager successfully controls the power consumption of executing jobs while achieving a 1.17x improvement in system throughput.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.25
HPC Job Mapping over Reconfigurable Wireless Links
Yao Hu, I. Fujiwara, M. Koibuchi
Wireless supercomputers and datacenters with 60GHz radio or free-space optics (FSO) have been proposed so that diverse application workloads can be better supported by changing the network topology, that is, by swapping the endpoints of wireless links. In this study, we propose using such wireless links to improve job mapping. We investigate the trade-offs among the number of wireless links, the time overhead of wireless link reconfiguration, topology embedding, and job sizes. Our simulation results demonstrate that wired job mapping heavily degrades the system utilization of supercomputers and datacenters under a conventional fixed network topology. By contrast, wireless interconnection networks allow an ideal job mapping by directly reconnecting non-neighboring computing nodes. This improves system utilization by up to 17.7% for user jobs on a supercomputer and can thus shorten the overall service time, especially under bursts of dozens of incoming jobs. Furthermore, we confirm that neither the workload nor the scheduling policy changes the finding that ideal job mapping on wireless supercomputers outperforms that on wired networks in terms of system utilization and overall service time. Finally, our evaluation shows that a constrained, moderate additional use of wireless links achieves shorter queue lengths and queuing times.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.17
DiBA: Distributed Power Budget Allocation for Large-Scale Computing Clusters
Masoud Badiei, Xin Zhan, R. Azimi, S. Reda, Na Li
Power management has become a central issue in large-scale computing clusters, where a considerable amount of energy is consumed and a large operational cost is incurred annually. Traditional power management techniques have a centralized design that limits the scalability of computing clusters. In this work, we develop a framework for distributed power budget allocation that maximizes the utility of computing nodes subject to a total power budget constraint. To eliminate the central coordinator of the primal-dual technique, we propose a distributed power budget allocation algorithm (DiBA) that maximizes the combined performance of a cluster subject to a power budget constraint in a distributed fashion. Specifically, DiBA is a consensus-based algorithm in which each server determines its optimal power consumption locally by communicating its state with its neighbors (connected nodes) in the cluster. We characterize a synchronous primal-dual technique to obtain a benchmark for comparison with our distributed algorithm. We demonstrate numerically that DiBA is scalable and outperforms the conventional primal-dual method on large-scale clusters in terms of convergence time, while also eliminating the communication bottleneck of the primal-dual method. We thoroughly evaluate DiBA's characteristics through simulations of large-scale clusters, and we provide results from a proof-of-concept implementation on a real experimental cluster.
CCGrid 2016, DOI: 10.1109/CCGrid.2016.101
DTStorage: Dynamic Tape-Based Storage for Cost-Effective and Highly-Available Streaming Service
Jaewon Lee, Jaehyung Ahn, Choongul Park, Jangwoo Kim
An online streaming service is a gateway providing highly accessible data to end users. However, the rapid growth of digital data and of the corresponding user accesses increases the storage costs and management burden of service providers. In particular, the accumulation of rarely accessed cold data, together with time-varying and skewed data accesses, are the two major problems degrading both the efficiency and the throughput of modern streaming services. In this paper, we propose DTStorage, a dynamic tape-based storage system for cost-effective and highly available streaming services. DTStorage significantly reduces storage costs by keeping latency-insensitive cold data on cost-effective tape storage, and achieves high throughput by adaptively balancing data availability with its contention-aware replica management policy. Our prototype, evaluated in collaboration with KT, the largest multimedia service company in Korea, reduces storage costs by up to 45% while satisfying the target performance of a real-world smart TV streaming workload.
{"title":"DTStorage: Dynamic Tape-Based Storage for Cost-Effective and Highly-Available Streaming Service","authors":"Jaewon Lee, Jaehyung Ahn, Choongul Park, Jangwoo Kim","doi":"10.1109/CCGrid.2016.43","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.43","url":null,"abstract":"An online streaming service is a gateway providing highly accessible data to end-users. However, the rapid growth of digital data and the corresponding user accesses increase the storage costs and management burden of the service providers. In particular, the accumulation of rarely accessed cold data and the time-varying and skewed data accesses are the two major problems degrading the efficiency as well as throughput of modern streaming services. In this paper, we propose DTStorage, a dynamic tape-based storage system for cost-effective and highly-available streaming services. DTStorage significantly reduces the storage costs by keeping latency-insensitive cold data in cost-effective tape storages, and achieves high throughput by adaptively balancing the data availability with its contention-aware replica management policy. Our prototype evaluated in collaboration with KT, the largest multimedia service company in Korea, reduces the storage costs by up to 45% while satisfying the target performance of a real-world smart TV streaming workload.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114765824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}