Smart Server Crash Prediction in Cloud Service Data Center

Xingxing Liu, Yongzhan He, Hongmei Liu, Jiajun Zhang, B. Liu, Xiangyu Peng, Jialiang Xu, Jun Zhang, Alex Zhou, Paul Sun, Kunye Zhu, Ahuja Nishi, Dayi Zhu, Ken Zhang
DOI: 10.1109/ITherm45881.2020.9190321
Published in: 2020 19th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm)
Publication date: 2020-07-01
URL: https://doi.org/10.1109/ITherm45881.2020.9190321
Citations: 3

Abstract

In recent years, cloud services have been adopted by a growing number of end customers, and a large number of applications from various businesses have been migrated to the cloud. Availability is one of the key considerations for end customers when adopting cloud services, so CSPs (Cloud Service Providers) are pursuing ever higher SLA (Service-Level Agreement) standards to meet that need. This matters especially for VM (Virtual Machine) based cloud services, where the resources of one physical server are virtualized and shared among multiple tenants, so a server crash has a large impact on tenants' business. One solution is to establish an effective and accurate method for predicting server crashes in advance, so that workloads can be migrated to a healthy server before the service is affected. Delivering accurate predictions is extremely challenging, because server crashes result from many kinds of failures, most of which occur randomly and suddenly.

This paper proposes a smart server crash prediction method for triggering early warning and migration in cloud service data centers. The proposed prediction is based on hardware, firmware, and software system information, ranging from low-level hardware indicators and kernel status to upper-level system logs in the OS (Operating System). Machine learning algorithms are used for log analysis and failure prediction; among them, the Random Forests algorithm is chosen because it provides the best precision. The final method is deployed and evaluated in Baidu's data center, where it achieved 93.33% and 87.33% precision for minutes-level and hours-level ahead-of-time warnings, respectively.
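The abstract names two technical ingredients: a Random Forests classifier and precision as the evaluation metric. The following is a minimal illustrative sketch of both, not the authors' implementation. The abstract does not list the actual features, hyperparameters, or data, so everything here is an assumption: the three numeric features are hypothetical stand-ins for signals such as hardware indicators or kernel warnings, the data is synthetic, and each tree is simplified to a one-split decision stump to keep the example short.

```python
import random


def precision(y_true, y_pred):
    """Precision = TP / (TP + FP): the share of crash warnings that were real."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) pair with the fewest training errors.

    Simplifying assumption: a stump predicts "crash" (1) only when the
    feature value exceeds the threshold.
    """
    best = None
    for f in feature_ids:
        for thr in sorted({row[f] for row in X}):
            errors = sum((row[f] > thr) != bool(label)
                         for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, thr)
    return best[1], best[2]


def fit_forest(X, y, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each seeing a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # number of features considered per tree
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(d), k)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all stumps; 1 means "crash expected"."""
    votes = sum(1 for f, thr in forest if row[f] > thr)
    return 1 if votes * 2 > len(forest) else 0


def make_synthetic(n_per_class, rng):
    """Hypothetical data: 3 features per server snapshot; crashes skew higher."""
    X, y = [], []
    for label, mean in ((0, 2.0), (1, 6.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(mean, 1.0) for _ in range(3)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(42)
    X_train, y_train = make_synthetic(100, rng)
    X_test, y_test = make_synthetic(50, rng)
    forest = fit_forest(X_train, y_train)
    y_pred = [predict(forest, row) for row in X_test]
    print(f"precision on synthetic data: {precision(y_test, y_pred):.4f}")
```

Precision is a natural metric for this use case because every false warning triggers an unnecessary workload migration. In practice one would more likely use an established implementation such as scikit-learn's `RandomForestClassifier` with full decision trees rather than the stumps used here.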