Smart Server Crash Prediction in Cloud Service Data Center

2020 19th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm). Pub Date: 2020-07-01. DOI: 10.1109/ITherm45881.2020.9190321

Xingxing Liu, Yongzhan He, Hongmei Liu, Jiajun Zhang, B. Liu, Xiangyu Peng, Jialiang Xu, Jun Zhang, Alex Zhou, Paul Sun, Kunye Zhu, Ahuja Nishi, Dayi Zhu, Ken Zhang


Citations: 3

Abstract


In recent years, Cloud Service has been adopted by more and more end customers, and large numbers of applications from various businesses have been migrated to the Cloud. Availability is one of the key considerations for end customers when adopting Cloud Service, so CSPs (Cloud Service Providers) are pursuing ever higher SLA (Service-Level Agreement) standards to meet this need. This matters especially for VM (Virtual Machine) based Cloud Service, where the resources of one physical server are virtualized and shared among multiple tenants, so a server crash has a large impact on the tenants' business. One solution is to establish an effective and accurate method to predict a server crash in advance, so that workloads can be migrated to a healthy server before the service is affected. Delivering accurate predictions is extremely challenging, since server crashes are caused by all kinds of failures, most of which occur randomly and suddenly. This paper proposes a smart server crash prediction method for triggering early warning and migration in a Cloud Service data center. The proposed server crash prediction is based on hardware, firmware and software system information, ranging from low-level hardware indicators and kernel status to upper-level system logs in the OS (Operating System). Machine learning algorithms are adopted for log analysis and failure prediction. Among the algorithms evaluated, Random Forests is chosen because it provides the best precision. The final method is deployed and evaluated in Baidu's data center, where it achieved 93.33% and 87.33% precision in providing minutes-level and hours-level ahead-of-time warnings of server crashes, respectively.
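The abstract reports its results as precision, which for a warning system is the share of issued warnings that were followed by an actual crash. The sketch below illustrates that standard definition only. The function name `precision` and the two sample lists are hypothetical and chosen for illustration; they are not data or code from the paper.

```python
def precision(predicted, actual):
    """Return TP / (TP + FP): the fraction of issued warnings that were correct.

    predicted: list of bools, True where a crash warning was issued.
    actual:    list of bools, True where the server really crashed.
    """
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    false_pos = sum(1 for p, a in zip(predicted, actual) if p and not a)
    issued = true_pos + false_pos
    if issued == 0:
        # No warnings issued: precision is undefined; 0.0 is a convention
        # assumed for this sketch.
        return 0.0
    return true_pos / issued


# Hypothetical outcomes for eight servers (illustrative, not from the paper).
warnings = [True, True, True, False, True, False, True, False]
crashed = [True, True, False, False, True, True, True, False]

print(f"precision = {precision(warnings, crashed):.2%}")
```

Note that the missed crash (a server that crashed without a warning) does not lower precision; it would lower recall, which the abstract does not report.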

