Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems

Song Huang, Shuwen Liang, Song Fu, Weisong Shi, Devesh Tiwari, Hsing-bung Chen
Published in: 2019 IEEE International Conference on Autonomic Computing (ICAC), 2019-06-16
DOI: 10.1109/ICAC.2019.00027 (https://doi.org/10.1109/ICAC.2019.00027)
Citations: 10

Abstract

The boom in cloud computing, online services, and big data applications has resulted in a dramatic expansion of storage systems. Meanwhile, disk drives are reported to be the most commonly replaced hardware component. Disk failures cause service downtime and even data loss, costing enterprises multi-trillion dollars per year. Existing disk failure management approaches are mostly reactive and incur high overheads. To overcome these problems, we present a proactive, cost-effective solution for managing large-scale production storage systems. We aim to characterize the entire process by which a disk's health deteriorates and to forecast when disk drives will fail. Because diagnostic information about disk failures is commonly lacking, we rely on Self-Monitoring, Analysis and Reporting Technology (SMART) data and explore statistical analysis techniques to identify the onset of disk degradation. We then model the disk degradation processes as functions of SMART attributes, which removes the dependence on time and thus on I/O workload. Experimental results from over 23,000 enterprise-class disk drives in a production data center show that our derived models can accurately quantify the degradation of disk health, which enables us to proactively protect data against disk failures. We also investigate several types of disk failures and propose remediation mechanisms to prolong disk lifetime.
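
The abstract names two steps: identifying the onset of degradation from SMART data, and modeling degradation as a function of SMART attributes rather than of time. The abstract does not say which statistical techniques or model forms the authors use, so the following is only a minimal sketch under stated assumptions: a simple threshold rule (baseline mean plus k standard deviations) stands in for onset detection, and an ordinary least-squares line stands in for the degradation model. The function names, the sample readings, and the health score are all hypothetical.

```python
from statistics import mean, stdev


def degradation_onset(values, baseline_len=5, k=3.0):
    """Return the index of the first sample that exceeds the baseline mean
    by more than k baseline standard deviations, or None if none does.

    This threshold rule is an illustrative stand-in, not the paper's method.
    """
    if baseline_len < 2 or len(values) <= baseline_len:
        return None
    baseline = values[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline)
    # Guard against a perfectly flat baseline (sigma == 0).
    threshold = mu + k * max(sigma, 1e-9)
    for i in range(baseline_len, len(values)):
        if values[i] > threshold:
            return i
    return None


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


# Hypothetical periodic readings of one SMART counter for a single drive
# (for example, a reallocated-sector count). Not data from the paper.
reallocated = [0, 0, 1, 0, 1, 0, 1, 8, 16, 24, 32]
# Hypothetical health score in [0, 1] for the same samples (1.0 = healthy).
health = [1.0] * 7 + [0.8, 0.6, 0.4, 0.2]

onset = degradation_onset(reallocated)

# Model health as a function of the SMART attribute, not of time, using
# only the samples from the detected onset onward.
a, b = fit_line(reallocated[onset:], health[onset:])
```

Fitting against the attribute value instead of the sample index is what makes the sketch independent of how quickly the workload drives the counter up: two drives that reach the same counter value at different times map to the same modeled health.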