OptCon: An Adaptable SLA-Aware Consistency Tuning Framework for Quorum-Based Stores
Subhajit Sidhanta, W. Golab, S. Mukhopadhyay, Saikat Basu
Users of distributed datastores that employ quorum-based replication are burdened with the choice of a suitable client-centric consistency setting for each storage operation. This choice is difficult to reason about, as it requires weighing the tradeoff between latency and staleness, i.e., how stale (old) the result is. The latency and staleness for a given operation depend on the client-centric consistency setting applied, as well as on dynamic parameters such as the current workload and network condition. We present OptCon, a novel machine-learning-based predictive framework that can automate the choice of client-centric consistency setting under user-specified latency and staleness thresholds given in the service level agreement (SLA). Under a given SLA, OptCon predicts a client-centric consistency setting that is matching, i.e., weak enough to satisfy the latency threshold while strong enough to satisfy the staleness threshold. While manually tuned consistency settings remain fixed unless explicitly reconfigured, OptCon tunes consistency settings on a per-operation basis with respect to the changing workload and network state. Using decision tree learning, OptCon yields 0.14 cross-validation error in predicting matching consistency settings under the latency and staleness thresholds given in the SLA. We demonstrate experimentally that OptCon is at least as effective as any manually chosen consistency setting in adapting to the SLA thresholds for different use cases. We also demonstrate that OptCon adapts to variations in workload, whereas a given manually chosen fixed consistency setting satisfies the SLA only for a characteristic workload.
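The abstract names decision tree learning as the predictor that maps an operation's context (SLA thresholds, workload, network state) to a consistency setting. The sketch below shows that shape of pipeline in scikit-learn; the feature set, training rows, and Cassandra-style consistency labels (ONE/QUORUM/ALL) are illustrative assumptions, not the paper's actual data or feature engineering.

```python
# Minimal sketch of a decision-tree consistency predictor, assuming
# hypothetical features and labels; not OptCon's actual dataset.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training rows: operation context -> consistency level
# that satisfied the SLA in past observations.
X = [
    # [latency_sla_ms, staleness_sla_ms, read_ratio, net_rtt_ms]
    [10.0, 500.0, 0.9, 1.2],
    [10.0,  50.0, 0.9, 1.2],
    [50.0,  10.0, 0.5, 8.0],
    [ 5.0, 900.0, 0.2, 0.8],
]
y = ["ONE", "QUORUM", "ALL", "ONE"]

clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# Per-operation tuning: query the model with the current SLA thresholds
# and the freshly observed workload/network state.
setting = clf.predict([[10.0, 100.0, 0.8, 2.5]])[0]
print("use consistency level:", setting)
```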
{"title":"OptCon: An Adaptable SLA-Aware Consistency Tuning Framework for Quorum-Based Stores","authors":"Subhajit Sidhanta, W. Golab, S. Mukhopadhyay, Saikat Basu","doi":"10.1109/CCGrid.2016.9","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.9","url":null,"abstract":"Users of distributed datastores that employquorum-based replication are burdened with the choice of asuitable client-centric consistency setting for each storage operation. The above matching choice is difficult to reason about asit requires deliberating about the tradeoff between the latencyand staleness, i.e., how stale (old) the result is. The latencyand staleness for a given operation depend on the client-centricconsistency setting applied, as well as dynamic parameters such asthe current workload and network condition. We present OptCon, a novel machine learning-based predictive framework, that canautomate the choice of client-centric consistency setting underuser-specified latency and staleness thresholds given in the servicelevel agreement (SLA). Under a given SLA, OptCon predictsa client-centric consistency setting that is matching, i.e., it isweak enough to satisfy the latency threshold, while being strongenough to satisfy the staleness threshold. While manually tunedconsistency settings remain fixed unless explicitly reconfigured, OptCon tunes consistency settings on a per-operation basis withrespect to changing workload and network state. Using decisiontree learning, OptCon yields 0.14 cross validation error in predictingmatching consistency settings under latency and stalenessthresholds given in the SLA. We demonstrate experimentally thatOptCon is at least as effective as any manually chosen consistencysettings in adapting to the SLA thresholds for different usecases. We also demonstrate that OptCon adapts to variationsin workload, whereas a given manually chosen fixed consistencysetting satisfies the SLA only for a characteristic workload.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134015846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OptEx: A Deadline-Aware Cost Optimization Model for Spark
Subhajit Sidhanta, W. Golab, S. Mukhopadhyay
We present OptEx, a closed-form model of job execution on Apache Spark, a popular parallel processing engine. To the best of our knowledge, OptEx is the first work that analytically models job completion time on Spark. The model can be used to estimate the completion time of a given Spark job on a cloud, with respect to the size of the input dataset, the number of iterations, and the number of nodes comprising the underlying cluster. Experimental results demonstrate that OptEx yields a mean relative error of 6% in estimating the job completion time. Furthermore, the model can be applied to estimate the cost-optimal cluster composition for running a given Spark job on a cloud under a completion deadline specified in the SLO (i.e., Service Level Objective). We show experimentally that OptEx is able to correctly estimate the cost-optimal cluster composition for running a given Spark job under an SLO deadline with an accuracy of 98%.
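To make the deadline-aware cost optimization concrete, the sketch below scans cluster sizes against a completion-time model and keeps the cheapest size that meets the SLO. The model used here is a generic Amdahl-style stand-in (fixed setup cost, parallel work that shrinks with nodes, coordination overhead that grows with nodes), not OptEx's actual closed-form equation, and all constants are assumptions.

```python
# Illustrative deadline-aware cluster sizing; the cost model is a
# placeholder, NOT the OptEx equation from the paper.
def est_completion_time(dataset_gb, iterations, nodes,
                        setup_s=60.0, per_gb_iter_s=5.0, coord_s=0.5):
    parallel = iterations * dataset_gb * per_gb_iter_s / nodes
    coordination = coord_s * nodes      # overhead grows with cluster size
    return setup_s + parallel + coordination

def cheapest_cluster(dataset_gb, iterations, deadline_s,
                     max_nodes=128, node_cost_per_s=0.0001):
    best = None
    for n in range(1, max_nodes + 1):
        t = est_completion_time(dataset_gb, iterations, n)
        if t <= deadline_s:
            cost = n * node_cost_per_s * t
            if best is None or cost < best[1]:
                best = (n, cost, t)
    return best  # (nodes, cost, est. time), or None if SLO is infeasible

print(cheapest_cluster(dataset_gb=100, iterations=10, deadline_s=1200))
```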
{"title":"OptEx: A Deadline-Aware Cost Optimization Model for Spark","authors":"Subhajit Sidhanta, W. Golab, S. Mukhopadhyay","doi":"10.1109/CCGrid.2016.10","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.10","url":null,"abstract":"We present OptEx, a closed-form model of job execution on Apache Spark, a popular parallel processing engine. To the best of our knowledge, OptEx is the first work that analytically models job completion time on Spark. The model can be used to estimate the completion time of a given Spark job on a cloud, with respect to the size of the input dataset, the number of iterations, the number of nodes comprising the underlying cluster. Experimental results demonstrate that OptEx yields a mean relative error of 6% in estimating the job completion time. Furthermore, the model can be applied for estimating the cost optimal cluster composition for running a given Spark job on a cloud under a completion deadline specified in the SLO (i.e.,Service Level Objective). We show experimentally that OptEx is able to correctly estimate the cost optimal cluster composition for running a given Spark job under an SLO deadline with an accuracy of 98%.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127749937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DocLite: A Docker-Based Lightweight Cloud Benchmarking Tool
B. Varghese, Lawan Thamsuhang Subba, Long Thai, A. Barker
Existing benchmarking methods are time-consuming processes, as they typically benchmark the entire Virtual Machine (VM) in order to generate accurate performance data, making them less suitable for real-time analytics. The research in this paper aims to surmount this challenge by presenting DocLite, a Docker container-based lightweight benchmarking tool. DocLite explores lightweight cloud benchmarking methods for rapidly executing benchmarks in near real-time. DocLite is built on Docker container technology, which allows a user-defined memory size and number of CPU cores of the VM to be benchmarked. The tool incorporates two benchmarking methods: the first, referred to as the native method, employs containers to benchmark a small portion of the VM and generate performance ranks; the second uses historic benchmark data along with the native method, as a hybrid, to generate VM ranks. The proposed methods are evaluated on three use cases and are observed to be up to 91 times faster than benchmarking the entire VM. In both methods, small containers provide the same quality of rankings as a large container. The native method generates ranks with over 90% and 86% accuracy for sequential and parallel execution of an application, respectively, compared against benchmarking the whole VM. The hybrid method did not improve the quality of the rankings significantly.
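The native method's core mechanic is running a benchmark inside a container whose memory and CPU allocation cover only a slice of the VM. A minimal sketch of that launch step is below; the --memory and --cpuset-cpus flags are standard Docker CLI options, while the image name and sysbench invocation are hypothetical stand-ins for whatever benchmark suite is used.

```python
# Sketch of launching a resource-capped benchmark container in the
# spirit of DocLite; image and benchmark command are assumptions.
import subprocess

def run_benchmark(mem_mb=512, cpu_cores="0-1", image="benchmark/sysbench"):
    cmd = [
        "docker", "run", "--rm",
        f"--memory={mem_mb}m",         # cap the container's memory
        f"--cpuset-cpus={cpu_cores}",  # pin to a subset of the VM's cores
        image,
        "sysbench", "cpu", "--time=10", "run",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

print(run_benchmark())
```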
{"title":"DocLite: A Docker-Based Lightweight Cloud Benchmarking Tool","authors":"B. Varghese, Lawan Thamsuhang Subba, Long Thai, A. Barker","doi":"10.1109/CCGrid.2016.14","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.14","url":null,"abstract":"Existing benchmarking methods are time consuming processes as they typically benchmark the entire Virtual Machine (VM) in order to generate accurate performance data, making them less suitable for real-time analytics. The research in this paper is aimed to surmount the above challenge by presenting DocLite - Docker Container-based Lightweight benchmarking tool. DocLite explores lightweight cloud benchmarking methods for rapidly executing benchmarks in near real-time. DocLite is built on the Docker container technology, which allows a user-defined memory size and number of CPU cores of the VM to be benchmarked. The tool incorporates two benchmarking methods - the first referred to as the native method employs containers to benchmark a small portion of the VM and generate performance ranks, and the second uses historic benchmark data along with the native method as a hybrid to generate VM ranks. The proposed methods are evaluated on three use-cases and are observed to be up to 91 times faster than benchmarking the entire VM. In both methods, small containers provide the same quality of rankings as a large container. The native method generates ranks with over 90% and 86% accuracy for sequential and parallel execution of an application compared against benchmarking the whole VM. The hybrid method did not improve the quality of the rankings significantly.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128437073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Checkpointing to Minimize Completion Time for Inter-Dependent Parallel Processes on Volunteer Grids
M. T. Rahman, Hien Nguyen, J. Subhlok, Gopal Pandurangan
Volunteer computing is being used successfully for large-scale scientific computations. This research is in the context of Volpex, a programming framework that supports communicating parallel processes in a volunteer environment. Redundancy and checkpointing are combined to ensure consistent forward progress with Volpex in this unique execution environment, characterized by heterogeneous, failure-prone nodes and interdependent replicated processes. An important parameter for optimizing performance with Volpex is the frequency of checkpointing. The paper presents a mathematical model that minimizes the completion time for inter-dependent parallel processes running in a volunteer environment by finding a suitable checkpoint interval. Validation is performed with a real-world application running on a pool of distributed volunteer nodes. The results indicate that the performance with our predicted checkpoint interval is fairly close to the best performance obtained empirically by varying the checkpoint interval.
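The paper derives its own model for inter-dependent replicated processes, which is not reproduced here. As a point of reference for how checkpoint-interval optimization works in the simplest setting, the classic first-order result for a single process is Young's approximation, t_opt = sqrt(2 * C * MTBF), where C is the cost of writing one checkpoint:

```python
# Young's first-order approximation for the optimal checkpoint interval
# of a single process; a baseline, not the paper's multi-process model.
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# e.g. 30 s checkpoints on nodes that fail every ~6 hours on average
print(f"{young_interval(30.0, 6 * 3600):.0f} s between checkpoints")
```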
{"title":"Checkpointing to Minimize Completion Time for Inter-Dependent Parallel Processes on Volunteer Grids","authors":"M. T. Rahman, Hien Nguyen, J. Subhlok, Gopal Pandurangan","doi":"10.1109/CCGrid.2016.78","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.78","url":null,"abstract":"Volunteer computing is being used successfully for large scale scientific computations. This research is in the context of Volpex, a programming framework that supports communicating parallel processes in a volunteer environment. Redundancy and checkpointing are combined to ensure consistent forward progress with Volpex in this unique execution environment characterized by heterogeneous failure prone nodes and interdependent replicated processes. An important parameter for optimizing performance with Volpex is the frequency of checkpointing. The paper presents a mathematical model to minimize the completion time for inter-dependent parallel processes running in a volunteer environment by finding a suitable checkpoint interval. Validation is performed with a sample real world application running on a pool of distributed volunteer nodes. The results indicate that the performance with our predicted checkpoint interval is fairly close to the best performance obtained empirically by varying the checkpoint interval.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129748246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Distributed System for Storing and Processing Data from Earth-Observing Satellites: System Design and Performance Evaluation of the Visualisation Tool
M. Szuba, P. Ameri, U. Grabowski, Jörg Meyer, A. Streit
We present a distributed system for storage, processing, three-dimensional visualisation and basic analysis of data from Earth-observing satellites. The database and the server have been designed for high performance and scalability, whereas the client is highly portable thanks to being an HTML5- and WebGL-based Web application. The system is based on the so-called MEAN stack, a modern replacement for LAMP that has steadily been gaining traction among high-performance Web applications. We demonstrate the performance of the system from the perspective of a user operating the client.
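The paper's server side is Node.js/Express over MongoDB (the M in MEAN). To stay in one language across these sketches, here is the equivalent kind of region-and-time query in Python with pymongo, illustrating what the visualisation client might request before rendering; the database, collection, and field names are assumptions.

```python
# Hedged sketch of the storage layer's query pattern; schema is assumed.
from datetime import datetime
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
coll = client["satdata"]["measurements"]
coll.create_index([("loc", pymongo.GEO2D)])  # 2D index for $geoWithin

# All measurements over Europe for one day.
cursor = coll.find({
    "ts": {"$gte": datetime(2015, 1, 1), "$lt": datetime(2015, 1, 2)},
    "loc": {"$geoWithin": {"$box": [[-10.0, 35.0], [30.0, 60.0]]}},
})
for doc in cursor.limit(5):
    print(doc["ts"], doc["loc"], doc.get("value"))
```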
{"title":"A Distributed System for Storing and Processing Data from Earth-Observing Satellites: System Design and Performance Evaluation of the Visualisation Tool","authors":"M. Szuba, P. Ameri, U. Grabowski, Jörg Meyer, A. Streit","doi":"10.1109/CCGrid.2016.19","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.19","url":null,"abstract":"We present a distributed system for storage, processing, three-dimensional visualisation and basic analysis of data from Earth-observing satellites. The database and the server have been designed for high performance and scalability, whereas the client is highly portable thanks to having been designed as a HTML5- and WebGL-based Web application. The system is based on the so-called MEAN stack, a modern replacement for LAMP which has steadily been gaining traction among high-performance Web applications. We demonstrate the performance of the system from the perspective of an user operating the client.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114741543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Medusa: An Efficient Cloud Fault-Tolerant MapReduce
Pedro Costa, Xiao Bai, Fernando M. V. Ramos, M. Correia
Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. Notwithstanding, the ability to deal with the new types of faults introduced by such a setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault tolerance offered by MapReduce, tolerating arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience.
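A standard way to tolerate arbitrary and malicious faults is to replicate the same job across independent clouds and accept an output only when enough copies agree. The sketch below shows that voting step over output digests (f+1 matching results out of 2f+1 runs outvote f faulty clouds); it is a textbook construction offered for intuition, not necessarily Medusa's exact protocol.

```python
# Hedged sketch of digest voting over replicated job outputs.
import hashlib
from collections import Counter

def digest(output: bytes) -> str:
    return hashlib.sha256(output).hexdigest()

def vote(outputs_by_cloud: dict, f: int):
    """outputs_by_cloud: cloud name -> raw job output bytes."""
    counts = Counter(digest(out) for out in outputs_by_cloud.values())
    best, n = counts.most_common(1)[0]
    if n >= f + 1:   # f+1 identical results outvote f faulty clouds
        return best
    raise RuntimeError("no quorum of matching outputs; re-execute the job")

runs = {"cloud-A": b"result", "cloud-B": b"result", "cloud-C": b"corrupt"}
print(vote(runs, f=1))
```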
{"title":"Medusa: An Efficient Cloud Fault-Tolerant MapReduce","authors":"Pedro Costa, Xiao Bai, Fernando M. V. Ramos, M. Correia","doi":"10.1109/CCGrid.2016.20","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.20","url":null,"abstract":"Applications such as web search and social networking have been moving from centralized to decentralized cloud architectures to improve their scalability. MapReduce, a programming framework for processing large amounts of data using thousands of machines in a single cloud, also needs to be scaled out to multiple clouds to adapt to this evolution. The challenge of building a multi-cloud distributed architecture is substantial. Notwithstanding, the ability to deal with the new types of faults introduced by such setting, such as the outage of a whole datacenter or an arbitrary fault caused by a malicious cloud insider, increases the endeavor considerably. In this paper we propose Medusa, a platform that allows MapReduce computations to scale out to multiple clouds and tolerate several types of faults. Our solution fulfills four objectives. First, it is transparent to the user, who writes her typical MapReduce application without modification. Second, it does not require any modification to the widely used Hadoop framework. Third, the proposed system goes well beyond the fault-tolerance offered by MapReduce to tolerate arbitrary faults, cloud outages, and even malicious faults caused by corrupt cloud insiders. Fourth, it achieves this increased level of fault tolerance at reasonable cost. We performed an extensive experimental evaluation in the ExoGENI testbed, demonstrating that our solution significantly reduces execution time when compared to traditional methods that achieve the same level of resilience.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129421055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
De-Fragmenting the Cloud
Mayank Mishra, U. Bellur
Existing Virtual Machine (VM) placement schemes have looked to conserve either CPU and memory on the physical machine (PM) or network resources (bandwidth), but not both. However, real applications use all resource types to varying degrees. The result of applying existing placement schemes to VMs running real applications is a fragmented data center, where resources along one dimension become unusable even though they are available, because of the unavailability of resources along other dimensions. An example of this fragmentation is unusable CPU on a PM whose network link is bottlenecked even though the PM has available CPU. To date, evaluations of the efficacy of VM placement schemes have not recognized this fragmentation and its ill effects, let alone tried to measure and avoid it. In this paper, we first define the notion of what we term "relative resource fragmentation" and illustrate how it can be measured in a data center. The metric we put forth for capturing the degree of fragmentation is comprehensive and includes all key data center resource types. We then propose a VM placement scheme that minimizes this fragmentation and therefore maximizes the utility of data center resources. Results of empirical evaluations of our placement scheme compared to existing placement schemes show a reduction of fragmentation by as much as 15% and an increase in the number of successfully placed applications by as much as 20%.
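To make the intuition behind a multi-dimensional fragmentation metric concrete: free capacity on a PM is only usable up to its scarcest dimension, so leftover CPU stranded behind an exhausted network link counts as fragmented. The sketch below computes such a score; it is an illustrative construction, not the paper's exact "relative resource fragmentation" definition.

```python
# Illustrative multi-dimensional fragmentation score; an assumption-laden
# stand-in for the paper's metric.
def fragmentation(pms):
    """pms: list of dicts of free-capacity fractions per dimension,
    e.g. {"cpu": 0.6, "mem": 0.5, "bw": 0.05}."""
    total_free = usable = 0.0
    for pm in pms:
        free = pm.values()
        total_free += sum(free)
        # capacity usable by a balanced VM is capped by the tightest dimension
        usable += min(free) * len(pm)
    return 1.0 - usable / total_free if total_free else 0.0

pms = [{"cpu": 0.6, "mem": 0.5, "bw": 0.05},   # CPU stranded behind the NIC
       {"cpu": 0.2, "mem": 0.2, "bw": 0.25}]   # well balanced
print(f"fragmentation = {fragmentation(pms):.2f}")
```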
{"title":"De-Fragmenting the Cloud","authors":"Mayank Mishra, U. Bellur","doi":"10.1109/CCGrid.2016.21","DOIUrl":"https://doi.org/10.1109/CCGrid.2016.21","url":null,"abstract":"Existing Virtual Machine (VM) placement schemes have looked to conserve either CPU and Memory on the physical machine (PM) OR network resources (bandwidth) but not both. However, real applications use all resource types to varying degrees. The result of applying existing placement schemes to VMs running real applications is a fragmented data center where resources along one dimension become unusable even though they are available because of the unavailability of resources along other dimensions. An example of this fragmentation is unusable CPU because of a bottlenecked network link from the PM which has available CPU. To date, evaluations of the efficacy of VM placement schemes has not recognized this fragmentation and it's ill effects, let alone try to measure it and avoid it. In this paper, we first define the notion of what we term \"relative resource fragmentation\" and illustrate how it can be measured in a data center. The metric we put forth for capturing the degree of fragmentation is comprehensive and includes all key data center resource types. We then propose a VM placement scheme that minimizes this fragmentation and therefore maximizes the utility of data center resources. Results of empirical evaluations of our placement scheme compared to existing placement schemes show a reduction of fragmentation by as much as 15% and an increase in the number of successfully placed applications by as much as 20%.","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124068507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}