Title: Impact of CPU Utilization Thresholds and Scaling Size on Autoscaling Cloud Resources
Authors: F. Al-Haidari, M. Sqalli, K. Salah
Published in: 2013 IEEE 5th International Conference on Cloud Computing Technology and Science
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.142
Abstract: Cloud computing is currently one of the most hyped information technology fields and has become one of the fastest growing segments of IT. A cloud introduces a resource-rich computing model with features such as flexibility, pay-per-use, elasticity, and scalability. In the context of cloud computing, auto-scaling and elasticity are methods used to assure Service Level Objectives (SLOs) for cloud services as well as the efficient usage of resources. Many factors related to the auto-scaling mechanism can affect the performance of cloud services. One such important factor is the setting of the CPU thresholds that trigger the auto-scaling policies, which add resources to or remove them from the auto-scaling group. Another important factor is the scaling size: the number of instances added each time the provisioning process takes place to cope with workload spikes. In this paper, we simulate and study the impact of the upper CPU utilization threshold and the scaling size on the performance of cloud services. A further contribution of this paper is the formulation and solution of optimization problems for tuning these parameters based on input loads, considering both cost and SLO response time. The study helps in deciding on the optimal settings that satisfy QoS or SLO requirements with the least number of cloud resources.
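The scale-out policy the abstract describes can be illustrated with a toy simulation. This is a minimal sketch, not the paper's simulator: the workload trace, the per-instance capacity model, and all parameter values are illustrative assumptions.

```python
# Toy sketch: how an upper CPU-utilization threshold and a scaling size
# drive a scale-out policy. All numbers here are made-up assumptions.

def simulate_autoscaling(load_trace, upper_threshold=0.8, scaling_size=2,
                         capacity_per_instance=100.0, initial_instances=1):
    """Return the instance count over time for a simple scale-out policy."""
    instances = initial_instances
    history = []
    for load in load_trace:
        utilization = load / (instances * capacity_per_instance)
        if utilization > upper_threshold:
            instances += scaling_size  # scale out by the configured step
        history.append(instances)
    return history

# A workload spike: a larger scaling size absorbs it in fewer scaling actions,
# at the cost of provisioning more instances than strictly needed.
trace = [50, 90, 150, 300, 300, 120]
print(simulate_autoscaling(trace, upper_threshold=0.8, scaling_size=1))  # [1, 2, 2, 3, 4, 4]
print(simulate_autoscaling(trace, upper_threshold=0.8, scaling_size=3))  # [1, 4, 4, 4, 4, 4]
```

The trade-off the paper optimizes is visible even in this sketch: the small scaling size reacts four times and lags the spike, while the large one over-provisions after a single trigger.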
Title: An Exact Placement Approach for Optimizing Cost and Recovery Time under Faulty Multi-cloud Environments
Authors: Felipe Díaz Sánchez, S. A. Zahr, M. Gagnaire
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.116
Abstract: Currently, Cloud brokers bring interoperability and portability of applications across multiple Clouds. In the future, Cloud brokers will offer services based on their knowledge of Cloud providers' infrastructure to automatically and cost-effectively overcome performance degradation. In this paper, we present a Mixed-Integer Linear Program (MILP) that provides a cost-effective placement across multiple Clouds. Our MILP formulation considers Cloud provider parameters such as price, VM configuration, network latency, and provisioning time. We evaluate the cost-effectiveness of deploying a Cloud infrastructure into a single Cloud provider or across multiple ones using real prices and VM configurations. The results show that in some cases it may be cost-effective to distribute the infrastructure across multiple Cloud providers. We also propose three placement policies for faulty multi-Cloud scenarios. The best of these policies minimizes the cost of the Cloud infrastructure under fixed provisioning time values.
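The placement decision the MILP formalizes can be sketched by brute-force enumeration instead of a solver. The provider names, prices, latencies, and the latency bound below are made-up example values, and only two of the paper's parameters (price and latency) are modeled.

```python
# Illustrative sketch of cost-minimizing VM placement under a latency
# constraint, solved by enumeration rather than a MILP solver. The
# provider table is a made-up assumption, not real pricing data.

from itertools import product

providers = {  # hypothetical per-VM hourly price ($) and latency (ms)
    "A": {"price": 0.10, "latency": 40},
    "B": {"price": 0.08, "latency": 90},
    "C": {"price": 0.12, "latency": 25},
}

def cheapest_placement(num_vms, max_latency):
    """Assign each VM to a provider, minimizing total price subject to a
    per-VM latency bound (a stand-in for the MILP's constraints)."""
    best = None
    for assignment in product(providers, repeat=num_vms):
        if any(providers[p]["latency"] > max_latency for p in assignment):
            continue  # infeasible: violates the latency constraint
        cost = sum(providers[p]["price"] for p in assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best

print(cheapest_placement(3, max_latency=50))
```

With this toy table the cheapest provider "B" is excluded by the latency bound, so all VMs land on "A"; richer models (per-provider capacity, provisioning time, failure scenarios) are what make multi-provider splits win in the paper's results.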
Title: Identity Federation with VOMS in Cloud Infrastructures
Authors: Á. García, E. Fernández-del-Castillo, Mattieu Puel
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.13
Abstract: The cloud computing model is gaining interest in the scientific computing field after being well established and promoted in the non-academic world. Scientific data centers are starting to promote and deploy cloud services for their users, creating a heterogeneous ecosystem with different resource providers and different software stacks that are neither designed nor adapted to interoperate or cooperate. We propose the use of the Virtual Organization Membership Service (VOMS), a well-proven technology in the Grid area, to provide identity federation across different providers. In this work we also present an implementation of VOMS authentication in OpenStack.
Title: Mutant Apples: A Critical Examination of Cloud SLA Availability Definitions
Authors: G. Hogben, Alain Pannetrat
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.56
Abstract: The paper examines the challenges of defining and measuring availability to support real-world service comparison and dispute resolution through SLAs. We propose a rigorous and unambiguous definition of availability in cloud services. In light of this, we show that what appear to be apples-for-apples comparisons between real-world SLAs are often based on ambiguous definitions, and even where SLAs are well defined, they differ significantly in their interpretation of availability. We show how two example real-world SLAs would lead one service provider to report 0% availability while another would report 100% for the same system state history. On this basis, the paper concludes by arguing for the importance of standardising availability definitions and examines which elements need to be standardised and, just as importantly, which do not. Many of the results of this paper can be generalised to service level attributes other than availability: in general, such standard service definitions are a key element of a true commodity market in cloud resources, allowing service comparability before purchase, redress in the case of failure to deliver expected value, and enhanced accountability in the supply chain.
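The 0%-versus-100% divergence the abstract mentions is easy to reproduce with two plausible availability definitions. The two definitions below are illustrative assumptions, not the actual SLA clauses the paper analyzes.

```python
# Minimal illustration: two plausible SLA availability definitions report
# 0% and 100% for the same system state history. Each minute records
# (successful_requests, failed_requests); the trace is a made-up example.

history = [(99, 1)] * 60  # every minute: mostly up, but one request fails

def availability_all_or_nothing(history):
    """Definition 1: a minute is 'available' only if no request failed."""
    up_minutes = sum(1 for ok, fail in history if fail == 0)
    return up_minutes / len(history)

def availability_any_success(history):
    """Definition 2: a minute is 'available' if any request succeeded."""
    up_minutes = sum(1 for ok, fail in history if ok > 0)
    return up_minutes / len(history)

print(availability_all_or_nothing(history))  # 0.0 -> this SLA reports 0%
print(availability_any_success(history))     # 1.0 -> this SLA reports 100%
```

Both providers measure the identical history honestly; only the definition differs, which is exactly the comparability problem the paper argues standardisation should fix.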
Title: An Analysis of Performance Interference Effects on Energy-Efficiency of Virtualized Cloud Environments
Authors: Renyu Yang, Ismael Solís Moreno, Jie Xu, Tianyu Wo
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.22
Abstract: Co-allocated workloads in a virtualized computing environment often have to compete for resources, thereby suffering from performance interference. While this phenomenon has a direct impact on the Quality of Service provided to customers, it also changes the patterns of resource utilization and reduces the amount of work performed per Watt consumed. Unfortunately, there has been only limited research into how performance interference affects the energy-efficiency of servers in such environments. In reality, there is a highly dynamic and complicated correlation among resource utilization, performance interference, and energy-efficiency. This paper presents a comprehensive analysis that quantifies the negative impact of performance interference on the energy-efficiency of virtualized servers. Our analysis methodology takes into account the heterogeneous workload characteristics identified from a real Cloud environment. In particular, we investigate the impact of different workload type combinations and develop a method for approximating the levels of performance interference and energy-efficiency degradation. The proposed method is based on profiles of pair combinations of existing workload types and the patterns derived from the analysis. Our experimental results reveal a non-linear relationship between the increase in interference and the reduction in energy-efficiency, as well as an average precision within a ±5% error margin for the estimation of both parameters. These findings provide vital information for research into dynamic trade-offs between resource utilization, performance, and energy-efficiency of a data center.
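The "work per Watt" metric the abstract refers to can be sketched directly. Note the hedges: the numbers are invented, and the simple linear degradation factor below is an assumption for illustration only; the paper's own finding is that the relationship is non-linear.

```python
# Hedged sketch of an energy-efficiency metric: useful work per Watt,
# degraded by an interference factor when workloads are co-located.
# The linear degradation model and all numbers are illustrative
# assumptions, not the paper's (non-linear) measured relationship.

def work_per_watt(throughput, power_watts, interference=0.0):
    """Effective work per Watt after interference slows the workload down."""
    effective_throughput = throughput * (1.0 - interference)
    return effective_throughput / power_watts

solo = work_per_watt(1000.0, 200.0)                      # isolated VM
shared = work_per_watt(1000.0, 200.0, interference=0.3)  # co-located VMs
print(solo, shared)  # 5.0 vs. 3.5: same power draw, less useful work
```

The point of the sketch is the mechanism: interference cuts throughput while the server keeps drawing roughly the same power, so work per Watt falls even though utilization may look high.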
Title: Towards Efficient Software Deployment in the Cloud Using Requirements Decomposition
Authors: A. Alkhalid, Chung-Horng Lung, S. Ajila
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.159
Abstract: The major advancement in distributed and High Performance Computing (HPC) systems is the development and evolution of clouds, the applications that operate on them, and the services they provide. Cloud computing applications are expected to facilitate running complex systems on data centers containing storage and computing units in the range of tens to hundreds of thousands of devices. Meeting the needs of cloud computing systems makes the software deployment process a challenging task. The challenge comes from the difficulty of managing trade-offs across various dimensions, such as interaction, performance, and security, while making deployment decisions. Making deployment decisions exceeds human capability in light of the huge increase in computation/storage units in clouds and the software systems running on them. Therefore, autonomic approaches that assist software designers in making software deployment decisions are important. In this paper, we propose an approach based on clustering techniques for deploying software components on the cloud using requirements decomposition. The paper also demonstrates a validation of the proposed approach with a case study.
Title: E-Id Authentication and Uniform Access to Cloud Storage Service Providers
Authors: J. Gouveia, P. Crocker, S. Sousa, Ricardo Azevedo
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.71
Abstract: This article describes an architecture for authentication and uniform access to protected data stored on popular Cloud Storage Service Providers. The architecture takes advantage of the OAuth authentication mechanism and the strong authentication mechanism of National Electronic Identity (E-Id) Cards, in our case the Portuguese E-Id card or Cartão de Cidadão (CC). We present a comparison of authentication mechanisms and access to popular cloud storage providers, comparing OAuth 1.0, OAuth 1.0a, and OAuth 2.0. Using the proposed architecture, we have developed an implementation that provides uniform web-based access to popular Cloud Storage Service Providers such as Dropbox, SkyDrive, CloudPT, and Google Drive, using the authentication mechanism of the E-Id card as a unique access token. In order to provide uniform access to these services, we describe the differences in the REST APIs of the targeted providers. Finally, we present the web application that gives E-Id card holders a single point of access to their various cloud storage services.
Title: A Self-Organizing Architecture for Cloud by Means of Infrastructure Performance and Event Data
Authors: M. Serrano, M. Hauswirth, Nikos Kefalakis, J. Soldatos
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.70
Abstract: The management performance of cloud systems is measured by the capacity of the cloud to control virtual infrastructures and its capability to run parallel-computing applications and distributed-processing services independently. How this management can be made more dynamic (self-organizing) by means of distributed user data and application data demands is still an open area. This paper introduces, first, a functional architecture design following the principles of cloud-based service lifecycle control and service composition in the cloud, and second, an in-house approach enabling self-organization for cloud services that controls the installation of virtual machines through event-driven management operations, serving as a proof-of-concept implementation. From a cloud management point of view, enabling control of virtual infrastructures in response to performance protocols by means of event data processing is fundamental. Likewise, managing the cloud service lifecycle by enabling scalable applications and using distributed information systems and linked data processing guarantees the self-organizing feature of cloud systems. Finally, multiple advantages arise when infrastructure performance and end-user data are used in cloud service management, as discussed in this paper.
Title: Privacy Risk, Security, Accountability in the Cloud
Authors: M. Theoharidou, N. Papanikolaou, Siani Pearson, D. Gritzalis
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.31
Abstract: Migrating data, applications, or services to the cloud exposes a business to a number of new threats and vulnerabilities, which need to be properly assessed. Assessing privacy risk in cloud environments remains a complex challenge; mitigating this risk requires trusting a cloud service provider to implement suitable privacy controls. Furthermore, auditors and authorities need to be able to hold service providers accountable for their actions, enforcing rules and regulations through penalties and other mechanisms, and ensuring that any problems are remedied promptly and adequately. This paper examines privacy risk assessment for the cloud, and identifies threats, vulnerabilities, and countermeasures that clients and providers should implement in order to achieve privacy compliance and accountability.
Title: Competitive K-Means, a New Accurate and Distributed K-Means Algorithm for Large Datasets
Authors: R. Esteves, T. Hacker, Chunming Rong
Pub Date: 2013-12-02 | DOI: 10.1109/CloudCom.2013.89
Abstract: The tremendous growth in data volumes has created a need for new tools and algorithms to quickly analyze large datasets. Cluster analysis techniques, such as K-means, can be used for large datasets distributed across several machines. The accuracy of K-means depends on the selection of seed centroids during initialization. K-means++ improves on the K-means seeder, but suffers from two problems when applied to large datasets: (a) the random algorithm it employs can produce inconsistent results across several analysis runs under the same initial conditions; and (b) it scales poorly for large datasets. In this paper we describe a new Competitive K-means algorithm that addresses both of these problems. We describe an efficient MapReduce implementation of our Competitive K-means algorithm, which we found scales well with large datasets. We compared the performance of our new algorithm with three existing cluster analysis algorithms and found that it improves cluster analysis accuracy and decreases variance. Our results show that our new algorithm produced a speedup of 76 ± 9 times compared with the serial K-means++ and is as fast as Streaming K-means. Our work provides a method to select a good initial seeding in less time, facilitating accurate cluster analysis over large datasets in a shorter time.
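The K-means++ seeding step the abstract builds on can be sketched for the single-machine, one-dimensional case: each new centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. This is a sketch of the standard K-means++ seeder only, not the paper's distributed Competitive K-means; the dataset is invented, and the fixed random seed stands in for the run-to-run consistency problem the paper tackles.

```python
# Sketch of K-means++ D^2-weighted seeding on 1-D points. Illustrative
# only: the paper's contribution is a distributed MapReduce variant
# that also removes the seeder's run-to-run inconsistency.

import random

def kmeanspp_seeds(points, k, rng=None):
    """Pick k seed centroids using squared-distance weighting."""
    rng = rng or random.Random(0)   # fixed seed for reproducibility
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance from each point to its nearest chosen seed.
        d2 = [min((p - s) ** 2 for s in seeds) for p in points]
        # Sample the next seed with probability proportional to d2.
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, weight in zip(points, d2):
            acc += weight
            if acc >= r:
                seeds.append(p)
                break
    return seeds

points = [1.0, 1.1, 1.2, 10.0, 10.1, 10.2, 20.0, 20.1]
print(kmeanspp_seeds(points, 3))  # tends to pick one seed per cluster
```

The D² weighting is what makes the seeder "good": points far from every existing seed (i.e., in uncovered clusters) are far more likely to be chosen than points near an existing seed.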