FaultyRank: A Graph-based Parallel File System Checker

2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Pub Date: 2023-05-01, DOI: 10.1109/IPDPS54959.2023.00029

Saisha Kamat, Abdullah Al Raqibul Islam, Mai Zheng, Dong Dai


Citations: 1

Abstract

Similar to local file system checkers such as e2fsck for Ext4, a parallel file system (PFS) checker ensures the file system's correctness. The basic idea of file system checkers is straightforward: important metadata are stored redundantly in separate places for cross-checking; inconsistent metadata will be repaired or overwritten by its 'more correct' counterpart, which is defined by the developers. Unfortunately, implementing this idea for PFSes is non-trivial due to the system complexity. Although many popular parallel file systems already contain dedicated checkers (e.g., LFSCK for Lustre, BeeGFS-FSCK for BeeGFS, mmfsck for GPFS), the existing checkers often cannot detect or repair inconsistencies accurately due to one fundamental limitation: they rely on a fixed set of consistency rules predefined by developers, which cannot cover the various failure scenarios that may occur in practice.

In this study, we propose a new graph-based method to build PFS checkers. Specifically, we model important PFS metadata as graphs, then generalize the logic of cross-checking and repairing into graph analytic tasks. We design a new graph algorithm, FaultyRank, to quantitatively calculate the correctness of each metadata object. By leveraging the calculated correctness, we are able to recommend the most promising repairs to users. Based on this idea, we implement a prototype of FaultyRank on Lustre, one of the most widely used parallel file systems, and compare it with Lustre's default file system checker, LFSCK. Our experiments show that FaultyRank can achieve the same checking and repairing logic as LFSCK. Moreover, it is capable of detecting and repairing complicated PFS consistency issues that LFSCK cannot handle. We also show the performance advantage of FaultyRank compared with LFSCK. Through this study, we believe FaultyRank opens a new opportunity for building PFS checkers effectively and efficiently.
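The abstract does not spell out the FaultyRank algorithm itself, so the following is only a rough sketch of the general idea of treating metadata cross-checking as a graph task. It assumes that metadata objects are nodes, that each redundant pointer is a directed edge, and that a consistent relation appears as a pair of opposite edges. The scoring step uses plain PageRank as a stand-in, not the authors' method, and every name in it (`unpaired_edges`, `pagerank_scores`, the node labels) is hypothetical.

```python
from collections import defaultdict


def unpaired_edges(edges):
    """Return directed edges (u, v) whose reverse edge (v, u) is missing."""
    edge_set = set(edges)
    return sorted((u, v) for (u, v) in edge_set if (v, u) not in edge_set)


def pagerank_scores(nodes, edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration (a stand-in, NOT FaultyRank)."""
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    n = len(nodes)
    score = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(score[u] for u in nodes if not out[u])
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {node: base for node in nodes}
        for u in nodes:
            for v in out[u]:
                nxt[v] += damping * score[u] / len(out[u])
        score = nxt
    return score


# Hypothetical toy metadata graph: a directory, two files, one data object.
nodes = ["dir", "fileA", "fileB", "obj1"]
edges = [
    ("dir", "fileA"), ("fileA", "dir"),    # paired: consistent
    ("dir", "fileB"), ("fileB", "dir"),    # paired: consistent
    ("fileA", "obj1"), ("obj1", "fileA"),  # paired: consistent
    ("fileB", "obj1"),                     # obj1 does not point back to fileB
]

suspects = unpaired_edges(edges)
scores = pagerank_scores(nodes, edges)
```

Here `suspects` contains only the edge from `fileB` to `obj1`, the one relation without a matching back pointer. A checker built along these lines could then consult the per-node scores to decide which endpoint of a suspect edge is more likely to be wrong, which is the role the abstract assigns to the calculated correctness.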


