An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers

Zhinan Cheng, Shujie Han, P. Lee, X. Li, Jiongzhou Liu, Zhan Li
Published in: 2022 41st International Symposium on Reliable Distributed Systems (SRDS), 2022-09-01
DOI: 10.1109/SRDS55811.2022.00032 (https://doi.org/10.1109/SRDS55811.2022.00032)
Citations: 5

Abstract

Dynamic Random Access Memory (DRAM) errors are prevalent and lead to server failures in production data centers. However, state-of-the-art field studies on DRAM error measurement reveal little about the correlation between DRAM errors and server failures. To fill this void, we present an in-depth, data-driven correlative analysis between DRAM errors and server failures, with the primary goal of predicting server failures based on DRAM error characterization and hence enabling proactive reliability maintenance for production data centers. Our analysis is based on an eight-month dataset collected from over three million memory modules in the production data centers at Alibaba. We find that, for most server failures, correctable DRAM errors manifest only within a short time before the failure happens, implying that server failure prediction should be run regularly at short time intervals to be accurate. We also study how various factors (including component failures in the memory subsystem, DRAM configurations, and types of correctable DRAM errors) affect server failures. Furthermore, we design a machine-learning-based server failure prediction workflow and demonstrate the feasibility of server failure prediction based on DRAM error characterization. In total, we report 14 findings from our measurement and prediction studies.
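
The short-interval observation can be illustrated with a minimal sketch. This is not the authors' prediction workflow: the event format, the function name `count_recent_errors`, the server IDs, and the three-hour window are all assumptions chosen for illustration. It only shows how correctable-error events might be counted per server over a short lookback window at each prediction time, so that such counts could serve as input features to a predictor.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def count_recent_errors(events, now, window):
    """Count correctable-error events per server within [now - window, now].

    events: iterable of (server_id, timestamp) pairs (hypothetical log format).
    """
    counts = defaultdict(int)
    start = now - window
    for server_id, timestamp in events:
        if start <= timestamp <= now:
            counts[server_id] += 1
    return dict(counts)


# Hypothetical error log: (server_id, time of a correctable DRAM error).
events = [
    ("srv-a", datetime(2022, 1, 1, 10, 0)),
    ("srv-a", datetime(2022, 1, 1, 11, 30)),
    ("srv-b", datetime(2022, 1, 1, 2, 0)),
]

# Assumed prediction time and lookback window; both are illustrative values.
now = datetime(2022, 1, 1, 12, 0)
recent = count_recent_errors(events, now, timedelta(hours=3))
```

Re-running the count at each prediction time with a short window keeps the features focused on recent errors, which is the property the abstract's finding suggests matters. The appropriate window length and prediction interval would have to come from the data.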