Predicting Uncorrectable Memory Errors for Proactive Replacement: An Empirical Study on Large-Scale Field Data

Xiaoming Du, Cong Li, Shen Zhou, Mao Ye, Jing Li
Published in: 2020 16th European Dependable Computing Conference (EDCC)
Publication date: 2020-09-01
DOI: 10.1109/EDCC51268.2020.00016 (https://doi.org/10.1109/EDCC51268.2020.00016)
Citations: 9

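Abstract

Uncorrectable memory errors are the leading cause of server failures in datacenters. Predicting uncorrectable errors (UEs) from historical correctable error (CE) information enables proactive replacement of memory hardware before catastrophic events occur. In this paper, we perform an empirical study of UE prediction on large-scale field data from more than 30,000 contemporary servers in Tencent datacenters over an 8-month period. We demonstrate that the traditional approach based on CE rate works poorly, with low precision. We then leverage detailed micro-level CE information to design several new predictors. The comparative study shows that the new predictor based on column fault identification improves the baseline precision by more than 300% and, at the same time, substantially improves the baseline recall.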
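To make the evaluation terms concrete, here is a minimal sketch of how a CE-rate baseline could be scored for precision and recall. It is not the authors' implementation: the `DimmRecord` fields, the threshold value, and the toy data are all assumptions made for illustration, and the abstract does not specify how the baseline or the column-fault predictor is actually defined.

```python
from dataclasses import dataclass


@dataclass
class DimmRecord:
    """Hypothetical per-DIMM summary; field names are illustrative only."""
    dimm_id: str
    ce_count: int   # correctable errors seen in the observation window
    had_ue: bool    # whether an uncorrectable error followed


def predict_by_ce_rate(records, threshold):
    """Flag a DIMM for replacement when its CE count reaches the threshold."""
    return {r.dimm_id for r in records if r.ce_count >= threshold}


def precision_recall(records, flagged):
    """Precision = TP / flagged DIMMs; recall = TP / DIMMs that had a UE."""
    actual = {r.dimm_id for r in records if r.had_ue}
    true_pos = len(flagged & actual)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


# Made-up toy data, not taken from the study.
records = [
    DimmRecord("d0", 500, False),
    DimmRecord("d1", 120, True),
    DimmRecord("d2", 3, True),
    DimmRecord("d3", 250, False),
    DimmRecord("d4", 0, False),
]

flagged = predict_by_ce_rate(records, threshold=100)  # threshold is arbitrary
precision, recall = precision_recall(records, flagged)
print(f"flagged={sorted(flagged)} precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting, low precision means that many of the flagged DIMMs never go on to produce a UE, so hardware is replaced unnecessarily; low recall means that UEs occur on DIMMs the predictor never flagged.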