Categorizing hardware failure in large scale cloud computing environment
Authors: Moataz H. Khalil, W. Sheta, Adel Said Elmaghraby
Venue: 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT)
Published: 2016-12-01
DOI: 10.1109/ISSPIT.2016.7886058
Citations: 2
Abstract
Cloud computing environments are growing in complexity, creating more challenges for improving resilience and availability. Cloud computing research can benefit from machine learning and data mining applied to data from actual operational cloud systems. One aspect that needs in-depth analysis is the failure characteristics of cloud environments: failure is the main contributor to reduced resiliency of applications and services in cloud computing. This work presents a categorization method that determines whether machines were removed from the system because of failure or for maintenance. Our experiments target large-scale cloud computing environments; the experimental data consist of 25 million tasks submitted to 12,500 servers over a 29-day period. The categorization is based on CPU and memory utilization. This work also develops a support vector machine (SVM) model to learn and predict machine failure. The developed model achieved 99.04% accuracy. Precision and recall curves show that the model remains consistent as the data size varies. The model is also highly consistent with the theoretical data, differing from it by at most 0.008%.
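The abstract does not specify the SVM configuration. What follows is a minimal sketch, assuming a linear SVM over the two categorizing parameters (CPU and memory utilization) and assuming each removed machine already carries a label: failure (+1) or maintenance (-1). The training routine is the Pegasos stochastic subgradient method, chosen here only for illustration. The toy data, the function names (`make_toy_data`, `train_linear_svm`, `predict`, `precision_recall_accuracy`), and the hyperparameters are all hypothetical. None of them come from the paper or its 29-day dataset.

```python
import random


# HYPOTHETICAL toy data: (cpu_utilization, memory_utilization) per removed machine,
# as fractions in [0, 1]. Label +1 = removed due to failure, -1 = maintenance.
# The cluster locations are invented for illustration, not taken from the paper.
def make_toy_data(n_per_class=20, seed=42):
    rng = random.Random(seed)
    samples, labels = [], []
    for _ in range(n_per_class):
        samples.append((rng.uniform(0.80, 0.90), rng.uniform(0.75, 0.85)))
        labels.append(1)
        samples.append((rng.uniform(0.10, 0.20), rng.uniform(0.15, 0.25)))
        labels.append(-1)
    return samples, labels


def feature_means(samples):
    n = len(samples)
    return tuple(sum(s[j] for s in samples) / n for j in range(2))


def center(x, means):
    return tuple(xj - mj for xj, mj in zip(x, means))


def train_linear_svm(samples, labels, lam=0.01, epochs=100, seed=0):
    """Pegasos: stochastic subgradient descent on the L2-regularized hinge loss.

    Features are centered by their training means, so no separate bias is learned.
    """
    rng = random.Random(seed)
    means = feature_means(samples)
    data = [(center(x, means), y) for x, y in zip(samples, labels)]
    w = [0.0, 0.0]
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)  # Pegasos step size
            margin = y * (w[0] * x[0] + w[1] * x[1])
            # Regularizer gradient step: shrink w toward zero.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1.0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w, means


def predict(model, x):
    w, means = model
    xc = center(x, means)
    return 1 if w[0] * xc[0] + w[1] * xc[1] >= 0 else -1


def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Precision and recall for the 'failure' class, plus overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(y_true)


if __name__ == "__main__":
    X, y = make_toy_data()
    model = train_linear_svm(X, y)
    preds = [predict(model, x) for x in X]
    print(precision_recall_accuracy(y, preds))
```

The invented clusters are cleanly separable, so this sketch scores perfectly on them. That result has no bearing on the 99.04% accuracy reported in the abstract, which the authors measured on real operational data. To approximate a precision/recall curve over varying data size, you could retrain on growing subsets and re-evaluate on each one.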