Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00074
Chuan Lin, Q. Cao, Jianzhong Huang, Jie Yao, Xiaoqian Li, C. Xie
Data deduplication has been widely introduced to reduce the storage requirements of virtual machine (VM) images running on VM servers in virtualized cloud platforms. Nevertheless, existing state-of-the-art deduplication approaches for VM images cannot fully exploit the potential of the underlying hardware while accounting for the interference of deduplication with foreground VM services, which can degrade the quality of those services. In this paper, we present HPDV, a highly parallel deduplication cluster for VM images that exploits parallelism to achieve high throughput with minimal interference with foreground VM services. The main idea behind HPDV is to exploit the idle CPU resources of VM servers to parallelize the compute-intensive chunking and fingerprinting, and to parallelize the I/O-intensive fingerprint indexing in the deduplication servers by dividing the globally shared fingerprint index into multiple independent sub-indexes according to the operating systems of the VM images. To ensure the quality of VM services, a resource-aware scheduler dynamically adjusts the number of parallel chunking and fingerprinting threads according to the CPU utilization of the VM servers. Our evaluation results demonstrate that, compared to Light, a state-of-the-art deduplication system for VM images, HPDV improves deduplication throughput by up to 67%.
Title: "HPDV: A Highly Parallel Deduplication Cluster for Virtual Machine Images" (2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID).
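The key structural idea of HPDV, splitting the globally shared fingerprint index into independent per-OS sub-indexes so lookups can proceed in parallel without contention, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class and naming scheme are hypothetical, and SHA-1 is assumed only because it is a common chunk fingerprint choice.

```python
import hashlib
from collections import defaultdict

def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1, a common choice in dedup systems)."""
    return hashlib.sha1(chunk).hexdigest()

class ShardedIndex:
    """Toy fingerprint index split into independent sub-indexes keyed by guest OS,
    so lookups for different OS families can run in parallel without shared locks."""

    def __init__(self):
        # os_family -> {fingerprint: chunk_id}
        self.sub_indexes = defaultdict(dict)

    def deduplicate(self, os_family: str, chunk: bytes):
        fp = fingerprint(chunk)
        index = self.sub_indexes[os_family]
        if fp in index:
            return index[fp], True           # duplicate: reuse the stored chunk id
        chunk_id = f"{os_family}:{len(index)}"
        index[fp] = chunk_id
        return chunk_id, False               # unique: store a new chunk
```

In this sketch a chunk seen under "linux" is not a duplicate of the same bytes seen under "windows"; that is the price of independent sub-indexes, traded for lock-free parallel lookups.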
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00077
Young Ki Kim, M. HoseinyFarahabady, Young Choon Lee, Albert Y. Zomaya
Workload surges are a serious hindrance to the performance of even high-throughput key-value data stores such as Cassandra, MongoDB, and, more recently, Aerospike. In this paper, we present a decentralized admission controller for high-throughput key-value data stores. The proposed controller dynamically regulates the release time of incoming requests while explicitly taking into account different Quality of Service (QoS) classes. In particular, an instance of the controller is assigned to each client for autonomous admission control specific to that client's QoS requirements. These controllers operate in a decentralized manner using only local performance metrics: response time and queue waiting time. Despite the use of such "minimal" run-time state information, our decentralized admission controller copes with workload surges while respecting QoS requirements. We evaluate the proposed admission controller against the default scheduling policy of Aerospike on a testbed cluster under various workload intensities. Experimental results confirm that the proposed controller improves QoS satisfaction in terms of end-to-end response time by nearly 12 times, on average, compared with Aerospike's default policy under high-rate workloads. Results also show decreases in the average and standard deviation of latency of up to 31% and 50%, respectively, during workload surges (peak load) under high-rate workloads.
Title: "Decentralized Admission Control for High-Throughput Key-Value Data Stores" (CCGRID 2018).
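A per-client controller that delays request release based only on locally observed response times might look like the sketch below. This is a hypothetical illustration of the general idea (regulate release time from local QoS feedback), not the paper's control law; the class name, the proportional step rule, and the parameters are all assumptions.

```python
class AdmissionController:
    """Toy per-client admission controller: stretch the release delay when observed
    response time exceeds the client's QoS target, and relax it when there is
    headroom, using only locally measured metrics (no global coordination)."""

    def __init__(self, target_ms: float, step: float = 0.5):
        self.target_ms = target_ms
        self.step = step
        self.delay_ms = 0.0   # extra wait imposed before releasing the next request

    def observe(self, response_ms: float):
        """Feed back one locally measured end-to-end response time."""
        if response_ms > self.target_ms:
            # overload: back off proportionally to the QoS violation
            self.delay_ms += self.step * (response_ms - self.target_ms)
        else:
            # headroom: admit faster, never below zero delay
            self.delay_ms = max(0.0, self.delay_ms - self.step * (self.target_ms - response_ms))

    def release_after(self) -> float:
        """Delay (ms) to apply before releasing the next request."""
        return self.delay_ms
```

Because each client runs its own instance against its own QoS target, the scheme stays decentralized: no controller needs another controller's state.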
Pub Date: 2018-05-01 | DOI: 10.1109/ccgrid.2018.00064
Luke Bertot, S. Genaud, J. Gossa
In the cloud computing model, cloud providers invoice clients for resource consumption. Hence, tools that help clients budget the cost of running their applications are of pre-eminent importance. However, the opaque and multi-tenant nature of clouds makes job runtimes both variable and hard to predict. In this paper, we propose an improved simulation framework that accounts for this variability using the Monte-Carlo method. We consider the execution of batch jobs on an actual platform, scheduled using typical heuristics based on user estimates of task runtimes. We model the observed variability with simple distributions that serve as inputs to the Monte-Carlo simulation. We show that our method can capture over 90% of the empirical observations of total execution times.
Title: "An Overview of Cloud Simulation Enhancement Using the Monte-Carlo Method" (CCGRID 2018).
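The Monte-Carlo approach described above (perturb nominal task runtimes with a simple distribution, simulate the schedule many times, read budgets off the resulting distribution) can be sketched as follows. The uniform noise model, the greedy earliest-free-worker scheduler, and all parameters are assumptions for illustration, not the paper's calibrated distributions or heuristics.

```python
import random

def simulate_makespan(base_runtimes, variability=0.2, workers=4, trials=1000, seed=42):
    """Monte-Carlo sketch: perturb each task's nominal runtime with uniform noise,
    schedule tasks greedily on `workers` machines, and return the sorted
    distribution of total execution times (makespans)."""
    rng = random.Random(seed)
    makespans = []
    for _ in range(trials):
        finish = [0.0] * workers
        for t in base_runtimes:
            noisy = t * rng.uniform(1 - variability, 1 + variability)
            earliest = finish.index(min(finish))   # greedy: earliest-free worker
            finish[earliest] += noisy
        makespans.append(max(finish))
    makespans.sort()
    return makespans
```

Reading, say, the 90th percentile of the returned list gives a runtime budget that covers most simulated executions, which mirrors the paper's claim of capturing over 90% of empirical observations.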
In recent years, the Data-Model-Simulation research paradigm has become one of the main methods for supporting surface-process research in high-cold environments (alpine cold areas and high-latitude cold areas). This research mode requires an e-Geoscience environment built on data, models, high-performance computing, visualization, and collaboration. In this paper, we present the High-cold environment joint Observation and Research cloud of China (HeorCloud), a highly efficient platform for Geoscience research in the high and cold regions of China built on cloud computing technologies. HeorCloud implements a unified service system named Gateway, which combines and optimally configures data, model, computing, and visualization resources. Beyond the basic services of data, model, and computing resource sharing, the platform also builds, on top of Gateway, online research communities for professional fields, each combining data, analytical tools, models, and computing resources. So far, the platform has established research communities for atmosphere, hydrology, remote sensing, and permafrost studies applicable to the high-cold environment of China, and its resources continue to expand.
Title: "High-Cold Environment Joint Observation and Research Cloud of China," by Yufang Min, Yaonan Zhang, J. Huo, Keting Feng, Jianfang Kang, Guohui Zhao (CCGRID 2018, pub date 2018-05-01, DOI: 10.1109/CCGRID.2018.00060).
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00005
J. Dakka, Kristof Farkas-Pall, Vivek Balasubramanian, M. Turilli, S. Wan, D. Wright, S. Zasada, P. Coveney, S. Jha
The efficacy of drug treatments depends on how tightly small molecules bind to their target proteins. Quantifying the strength of these interactions (the so-called 'binding affinity') is a grand challenge of computational chemistry; surmounting it could revolutionize drug design and provide the platform for patient-specific medicine. Recently, evidence from blind challenge predictions and retrospective validation studies has suggested that molecular dynamics (MD) can now achieve useful predictive accuracy (~1 kcal/mol). This accuracy is sufficient to greatly accelerate hit-to-lead and lead optimization. Translating these advances in predictive accuracy into clinical and/or industrial decision making requires that binding free energy results be turned around on reduced timescales without loss of accuracy. This demands advances in algorithms, scalable software systems, and intelligent and efficient utilization of supercomputing resources. This work is motivated by the real-world problem of providing insight from drug candidate data on as short a timescale as possible. Specifically, we reproduce results from a collaborative project between UCL and GlaxoSmithKline studying a congeneric series of drug candidates binding to the BRD4 protein, inhibitors of which have shown promising preclinical efficacy in pathologies ranging from cancer to inflammation. We demonstrate the use of a framework called HTBAC, designed to support the aforementioned requirements of accurate and rapid binding affinity calculations. HTBAC facilitates the execution of large numbers of simulations while supporting the adaptive execution of algorithms. Furthermore, HTBAC enables the selection of simulation parameters at runtime, which can, in principle, optimize the use of computational resources while producing results within a target uncertainty.
Title: "Enabling Trade-offs Between Accuracy and Computational Cost: Adaptive Algorithms to Reduce Time to Clinical Insight" (CCGRID 2018).
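The adaptive-execution idea, spend compute only until the result reaches a target uncertainty, can be sketched with a runtime stopping rule over an ensemble of independent simulations. This is a hedged illustration of the general pattern, not HTBAC's actual protocol; the function names, the batch-wise loop, and the standard-error criterion are all assumptions.

```python
import random
import statistics

def adaptive_ensemble(sample_fn, target_stderr=0.1, batch=8, max_batches=50, seed=1):
    """Keep launching batches of independent simulations until the standard error
    of the estimated quantity (e.g. a binding affinity in kcal/mol) falls below
    a target, then stop spending cores."""
    rng = random.Random(seed)
    samples = []
    for _ in range(max_batches):
        samples.extend(sample_fn(rng) for _ in range(batch))
        stderr = statistics.stdev(samples) / len(samples) ** 0.5
        if stderr < target_stderr:
            break   # uncertainty target met: release the remaining resources
    return statistics.mean(samples), stderr
```

The stopping rule is what turns a fixed computational budget into a target uncertainty: noisy systems automatically receive more simulations, well-behaved ones fewer.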
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00075
Gokcen Kestor, I. Peng, R. Gioiosa, S. Krishnamoorthy
Analyzing application fault behavior on large-scale systems is time-consuming and resource-demanding. Currently, researchers need to perform fault-injection campaigns at full scale to understand the effects of soft errors on applications and whether these faults result in silent data corruption. Both the time and the resource requirements greatly limit the scope of the resilience studies that can currently be performed. In this work, we propose a methodology for modeling application fault behavior at large scale based on a reduced set of experiments performed at small scale. We employ machine learning techniques to accurately model application fault behavior from a set of experiments that can be executed in parallel at small scale. Our methodology drastically reduces the number and scale of the fault-injection experiments to be performed and provides a validated way to study application fault behavior at large scale. We show that our methodology can accurately model application fault behavior at large scale using only small-scale experiments. In some cases, we can model the fault behavior of a parallel application running on 4,096 cores with about 90% accuracy based on experiments on a single core.
Title: "Understanding Scale-Dependent Soft-Error Behavior of Scientific Applications" (CCGRID 2018).
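The core extrapolation pattern, fit a model to cheap small-scale fault-injection results and query it at scales too expensive to measure, can be sketched as below. The linear-in-log-scale model and the example rates are hypothetical; the paper uses machine learning techniques, whereas this sketch uses a plain least-squares line purely to make the idea concrete.

```python
import math

def fit_fault_model(scales, sdc_rates):
    """Toy extrapolation: fit the silent-data-corruption (SDC) rate as a linear
    function of log2(core count) from small-scale fault-injection runs, then
    predict behavior at scales too expensive to measure directly."""
    xs = [math.log2(s) for s in scales]
    n = len(xs)
    mx, my = sum(xs) / n, sum(sdc_rates) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, sdc_rates)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda cores: intercept + slope * math.log2(cores)

# Hypothetical small-scale measurements on 1..64 cores ...
model = fit_fault_model([1, 4, 16, 64], [0.02, 0.04, 0.06, 0.08])
# ... queried at a scale (4,096 cores) never actually fault-injected.
```

The payoff is exactly the one the abstract claims: the expensive full-scale campaign is replaced by a handful of parallelizable small-scale experiments plus a model query.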
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00048
A. Ilyushkin, D. Epema
Workflow schedulers often rely on task runtime estimates when making scheduling decisions, and they usually target the scheduling of a single workflow or batches of workflows. In contrast, in this paper, we evaluate the impact of absent or inaccurate task runtime estimates on slowdown when scheduling complete workloads of workflows that arrive over time. We study a total of seven scheduling policies: four are popular existing policies for (batches of) workflows from the literature, including a simple backfilling policy that is not aware of task runtime estimates; two are novel workload-oriented policies, including one that targets fairness; and one is the well-known HEFT policy for a single workflow, adapted to the online workload scenario. We simulate homogeneous and heterogeneous distributed systems to evaluate the performance of these policies under varying accuracy of task runtime estimates. Our results show that for high utilizations, the order in which workflows are processed is more important than knowledge of correct task runtime estimates. Under low utilizations, all policies considered show good results, even the policy that does not use task runtime estimates. We also show that our Fair Workflow Prioritization (FWP) policy effectively decreases the variance of workflow slowdown and thus achieves fairness, and that the plan-based scheduling policy derived from HEFT does not show much performance improvement while adding extra complexity to the scheduling process.
Title: "The Impact of Task Runtime Estimate Accuracy on Scheduling Workloads of Workflows" (CCGRID 2018).
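A fairness-oriented prioritization rule in the spirit of FWP might pick the ready workflow whose current slowdown is largest, so no workflow's slowdown drifts far from the others. This sketch is an assumption: the slowdown formula used here, (waiting time + ideal execution time) / ideal execution time, is the standard definition, but the selection rule is illustrative, not the paper's exact policy.

```python
def pick_next(workflows, now):
    """Serve the ready workflow with the highest current slowdown, which pushes
    the slowdown distribution toward uniformity (i.e., fairness)."""
    def slowdown(wf):
        wait = now - wf["arrival"]
        return (wait + wf["ideal_runtime"]) / wf["ideal_runtime"]
    return max(workflows, key=slowdown)

# Hypothetical ready queue: a short workflow suffers a far larger slowdown
# from the same absolute wait than a long one does.
queue = [
    {"name": "short", "arrival": 0.0, "ideal_runtime": 10.0},
    {"name": "long",  "arrival": 0.0, "ideal_runtime": 100.0},
]
```

After both have waited 50 time units, the short workflow's slowdown is 6.0 versus 1.5 for the long one, so a fairness-targeting policy serves the short workflow first.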
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00030
Mohan Baruwal Chhetri, Quoc Bao Vo, R. Kowalczyk, S. Nepal
Cloud infrastructure providers offer consumers a wide range of resource and contract options to choose from, yet most elasticity management solutions are incapable of leveraging this to optimize the cost and performance of cloud-hosted applications. To address this problem, we propose a novel resource scaling approach that exploits both resource and contract heterogeneity to achieve optimal resource allocations and better cost control. We model resource allocation as an Unbounded Knapsack Problem, and resource scaling as a one-step-ahead resource allocation problem. Based on this, we present two scaling strategies, namely delta scale optimization and full scale optimization. Delta scale optimization supports the traditional notion of scaling resources horizontally, i.e., it computes an optimal allocation (or deallocation) of resources to increase (or decrease) the total compute capacity based on the current allocation and the forecast application workload. Full scale optimization, on the other hand, supports the notion of cost-optimal resource rescaling, i.e., the simultaneous allocation and deallocation of resources to meet the forecast workload irrespective of whether capacity increases, decreases, or stays the same. Both strategies give users greater flexibility in managing trade-offs between cost and performance. We motivate our work with a realistic and non-trivial scenario of resource scaling for a cloud-hosted IoT platform and use simple use cases to illustrate the benefits of our approach.
Title: "Towards Resource and Contract Heterogeneity Aware Rescaling for Cloud-Hosted Applications" (CCGRID 2018).
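The Unbounded Knapsack formulation mentioned above maps naturally onto a small dynamic program: each instance type may be rented any number of times, and we want the cheapest multiset whose total capacity covers the forecast demand. The instance data below is hypothetical; this is a sketch of the formulation, not the paper's optimizer.

```python
def min_cost_allocation(instance_types, required_capacity):
    """Unbounded knapsack over instance types (name, capacity, cost), capacities
    assumed positive integers: return the cheapest cost that covers at least
    `required_capacity`, plus one multiset of instances achieving it."""
    INF = float("inf")
    # best[c] = cheapest cost achieving at least capacity c
    best = [0.0] + [INF] * required_capacity
    choice = [None] * (required_capacity + 1)
    for c in range(1, required_capacity + 1):
        for name, cap, cost in instance_types:
            prev = max(0, c - cap)           # over-provisioning is allowed
            if best[prev] + cost < best[c]:
                best[c] = best[prev] + cost
                choice[c] = name
    # walk back to recover the chosen instances
    picks, c = [], required_capacity
    while c > 0:
        name = choice[c]
        cap = next(cp for nm, cp, _ in instance_types if nm == name)
        picks.append(name)
        c = max(0, c - cap)
    return best[required_capacity], picks

# Hypothetical catalog: (name, capacity units, $/hour)
types = [("small", 2, 1.0), ("large", 5, 2.0)]
```

Delta scaling then corresponds to solving this for the capacity gap between current allocation and forecast, while full-scale rescaling solves it for the whole forecast from scratch.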
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00082
Jad Darrous, Shadi Ibrahim, Amelie Chi Zhou, Christian Pérez
Recently, most large cloud providers, like Amazon and Microsoft, have replicated their Virtual Machine Images (VMIs) across multiple geographically distributed data centers to offer fast service provisioning. Provisioning a service may require transferring a VMI over the wide-area network (WAN), so provisioning time is dictated by the distribution of VMIs and the network bandwidth between sites. Nevertheless, existing methods to facilitate VMI management (i.e., retrieving VMIs) overlook network heterogeneity in geo-distributed clouds. In this paper, we design, implement, and evaluate Nitro, a novel VMI management system that minimizes the transfer time of VMIs over a heterogeneous WAN. To achieve this goal, Nitro incorporates two complementary features. First, it uses deduplication to reduce the amount of data to be transferred, exploiting the high similarity within and between images. Second, Nitro is equipped with a network-aware data transfer strategy that effectively exploits high-bandwidth links when acquiring data and thus expedites provisioning. Experimental results show that our network-aware data transfer strategy yields the optimal solution when acquiring VMIs while introducing minimal overhead. Moreover, Nitro outperforms state-of-the-art VMI storage systems (e.g., OpenStack Swift) by up to 77%.
Title: "Nitro: Network-Aware Virtual Machine Image Management in Geo-Distributed Clouds" (CCGRID 2018).
Pub Date: 2018-05-01 | DOI: 10.1109/CCGRID.2018.00066
Md Atiqul Mollah, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
Recent interconnect topology designs for High-Performance Computing (HPC) systems have followed two directions, one characterized by low diameter and the other by high path diversity. The low-diameter approach focuses on building large networks with small diameters, guaranteeing one short path between each pair of nodes; examples include Slim Fly and Dragonfly. The high-path-diversity approach takes into account not only topological metrics such as diameter but also the path diversity between pairs of nodes; examples include fat-tree, Random Regular Graph (RRG), and Generalized De Bruijn Graph (GDBG). Topologies from these two approaches have distinct features and require very different routing schemes to exploit their network capacity. In this work, we study the performance-related topological features of representative topologies from both design approaches, including Slim Fly, Dragonfly, RRG, and GDBG, and compare HPC application performance on these topologies under a set of routing schemes. The study uncovers new knowledge about the topologies produced by the two approaches. Findings include (1) Universal Globally Adaptive Load-balanced routing (UGAL), the load-balanced routing technique designed for low-diameter topologies, can be effectively adapted to high-path-diversity topologies, and (2) high-path-diversity topologies generally achieve higher performance than low-diameter topologies for networks built with a similar number of the same type of switches.
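Finding (1) concerns UGAL, which at packet injection chooses between a minimal path and a randomly selected non-minimal (Valiant-style) path by comparing their estimated delays. A hypothetical sketch of that decision rule, using first-hop queue occupancy times hop count as the congestion estimate (a common simplification; real implementations differ):

```python
import random


def ugal_choose(minimal_path, candidate_paths, queue_len):
    """UGAL-style decision: compare the minimal path against one randomly
    chosen non-minimal candidate, estimating each path's delay as the
    queue occupancy at its first output port times its hop count."""
    nonmin = random.choice(candidate_paths)
    cost_min = queue_len[minimal_path[0]] * len(minimal_path)
    cost_non = queue_len[nonmin[0]] * len(nonmin)
    return minimal_path if cost_min <= cost_non else nonmin


# Toy example: a congested 2-hop minimal route vs. a longer but idle detour.
random.seed(0)
qlen = {"p0": 8, "p1": 1}                  # output-port queue occupancy
minimal = ["p0", "x"]                      # 2 hops through busy port p0
detours = [["p1", "y", "z"]]               # 3 hops through idle port p1
print(ugal_choose(minimal, detours, qlen))  # picks the detour: 8*2 > 1*3
```

Because the rule needs only per-port queue state and a set of alternative paths, it ports naturally to high-path-diversity topologies, where many non-minimal candidates exist per destination.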
Title: A Comparative Study of Topology Design Approaches for HPC Interconnects
Authors: Md Atiqul Mollah, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
DOI: 10.1109/CCGRID.2018.00066
Published in: 2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)