A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

ACM Transactions on Storage (TOS) · Pub Date: 2022-03-29 · DOI: 10.1145/3483447

Runzhou Han, Om Rameshwar Gatla, Mai Zheng, Jinrui Cao, Di Zhang, Dong Dai, Yong Chen, J. Cook


Citations: 11

Abstract

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). Despite their importance, their reliability is much less studied and understood than that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address this challenge, we study the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs handle failures imperfectly. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
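
The abstract describes PFault only at a high level: it emulates the failure state of individual storage nodes according to pre-defined fault models. The following Python sketch illustrates that general idea and is not PFault's actual implementation. Every name in it (StorageNode, wipe_device, corrupt_blocks, inject, make_cluster), the two fault models, and the 4096-byte block size are assumptions made for illustration; each node's backing device is simulated as an in-memory byte array.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict

BLOCK_SIZE = 4096  # assumed block size, for illustration only


@dataclass
class StorageNode:
    """Hypothetical stand-in for the backing device of one PFS storage node."""
    name: str
    device: bytearray


def wipe_device(node: StorageNode, rng: random.Random) -> None:
    """Assumed fault model: the whole device loses its contents (all zeros)."""
    node.device[:] = bytes(len(node.device))


def corrupt_blocks(node: StorageNode, rng: random.Random,
                   fraction: float = 0.1) -> None:
    """Assumed fault model: overwrite a fraction of blocks with random bytes."""
    n_blocks = len(node.device) // BLOCK_SIZE
    n_bad = max(1, int(n_blocks * fraction))
    for block in rng.sample(range(n_blocks), n_bad):
        start = block * BLOCK_SIZE
        garbage = bytes(rng.getrandbits(8) for _ in range(BLOCK_SIZE))
        node.device[start:start + BLOCK_SIZE] = garbage


# Registry of pre-defined (here: hypothetical) fault models.
FAULT_MODELS: Dict[str, Callable[[StorageNode, random.Random], None]] = {
    "wipe_device": wipe_device,
    "corrupt_blocks": corrupt_blocks,
}


def inject(nodes: Dict[str, StorageNode], target: str, model: str,
           seed: int = 0) -> None:
    """Apply one fault model to one node; all other nodes stay untouched."""
    rng = random.Random(seed)  # seeded so an experiment can be repeated
    FAULT_MODELS[model](nodes[target], rng)


def make_cluster(n_nodes: int = 3,
                 blocks_per_node: int = 16) -> Dict[str, StorageNode]:
    """Build a toy cluster whose devices are filled with a known pattern."""
    size = BLOCK_SIZE * blocks_per_node
    return {
        f"node{i}": StorageNode(f"node{i}", bytearray(b"\xAA" * size))
        for i in range(n_nodes)
    }


if __name__ == "__main__":
    cluster = make_cluster()
    inject(cluster, "node1", "corrupt_blocks", seed=42)
    clean = bytearray(b"\xAA" * BLOCK_SIZE * 16)
    for name, node in cluster.items():
        print(name, "corrupted" if node.device != clean else "intact")
```

In a real study the faulted state would then be handed to the file system's own recovery path (such as LFSCK or BeeGFS-FSCK) and the resulting behavior and logs observed; the sketch stops at the injection step because the abstract gives no further detail.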




Source journal

ACM Transactions on Storage (TOS)

Self-citation rate

0.00%

Latest articles in this journal

WebAssembly-based Delta Sync for Cloud Storage Services
DEFUSE: An Interface for Fast and Correct User Space File System Access
Donag: Generating Efficient Patches and Diffs for Compressed Archives
Building GC-free Key-value Store on HM-SMR Drives with ZoneFS
Kangaroo: Theory and Practice of Caching Billions of Tiny Objects on Flash