A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset

Sisi Xiong, Feiyi Wang, Qing Cao
{"title":"A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset","authors":"Sisi Xiong, Feiyi Wang, Qing Cao","doi":"10.1109/PDSW-DISCS.2016.13","DOIUrl":null,"url":null,"abstract":"Large scale HPC applications are becoming increasingly data intensive. At Oak Ridge Leadership Computing Facility (OLCF), we are observing the number of files curated under individual project are reaching as high as 200 millions and project data size is exceeding petabytes. These simulation datasets, once validated, often needs to be transferred to archival system for long term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is paramount important but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use and/or can't scale to meet user's demand.To tackle this particular challenge, this paper presents the design, implementation and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principle of parallel tree walk and work-stealing pattern to maximize parallelism and is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel bloom-filter based technique in aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of bloom filter, we provided a detailed error and trade-off analysis. Using multiple datasets from production environment, we demonstrated that our tool can efficiently handle both very large files as well as many small-file based datasets. Our preliminary test showed that on the same hardware, it outperforms conventional tool by as much as 4×. 
It also exhibited near-linear scaling properties when provisioned with more compute resources.","PeriodicalId":375550,"journal":{"name":"2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDSW-DISCS.2016.13","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we are observing the number of files curated under individual projects reaching as high as 200 million, and project data sizes exceeding petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is of paramount importance but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use, and/or unable to scale to meet users' demands. To tackle this particular challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and the work-stealing pattern to maximize parallelism, and is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel Bloom-filter-based technique for aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of Bloom filters, we provide a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrate that our tool can efficiently handle both very large files and datasets composed of many small files. Our preliminary tests showed that, on the same hardware, it outperforms a conventional tool by as much as 4×. It also exhibited near-linear scaling properties when provisioned with more compute resources.
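The key property the abstract relies on, i.e. that Bloom-filter insertion is a bitwise OR and therefore commutative, is what lets independent workers checksum files in any order and still arrive at one consistent dataset signature. The sketch below illustrates that general idea; the `BloomFilter` class, `file_checksum` helper, and parameter choices are illustrative assumptions of ours, not the paper's actual fsum implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array.

    Insertion sets bits via bitwise OR, which is commutative, so the
    final bit array does not depend on the order in which per-file
    checksums are added -- the property used to avoid sorting signatures.
    """

    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def merge(self, other: "BloomFilter") -> None:
        # OR-merge a filter built by another worker; order-independent.
        for i, b in enumerate(other.bits):
            self.bits[i] |= b

    def signature(self) -> str:
        # A single, consistent signature for the whole dataset.
        return hashlib.sha256(bytes(self.bits)).hexdigest()


def file_checksum(name: str, data: bytes) -> bytes:
    # Bind the file name to its content so that swapping the contents
    # of two files still changes the aggregate signature.
    return hashlib.sha256(name.encode() + b"\x00" + data).digest()
```

For example, one worker processing files a, b, c in order and two workers splitting the same files in a different order produce identical signatures after an OR-merge, which is exactly why no global ordering step is needed. Because membership is probabilistic, a false positive (an undetected change) occurs only if every bit of a modified file's checksum happens to already be set, which motivates the error analysis in the paper.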