A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset

Sisi Xiong, Feiyi Wang, Qing Cao
{"title":"A Bloom Filter Based Scalable Data Integrity Check Tool for Large-Scale Dataset","authors":"Sisi Xiong, Feiyi Wang, Qing Cao","doi":"10.1109/PDSW-DISCS.2016.13","DOIUrl":null,"url":null,"abstract":"Large scale HPC applications are becoming increasingly data intensive. At Oak Ridge Leadership Computing Facility (OLCF), we are observing the number of files curated under individual project are reaching as high as 200 millions and project data size is exceeding petabytes. These simulation datasets, once validated, often needs to be transferred to archival system for long term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is paramount important but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use and/or can't scale to meet user's demand.To tackle this particular challenge, this paper presents the design, implementation and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principle of parallel tree walk and work-stealing pattern to maximize parallelism and is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel bloom-filter based technique in aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of bloom filter, we provided a detailed error and trade-off analysis. Using multiple datasets from production environment, we demonstrated that our tool can efficiently handle both very large files as well as many small-file based datasets. Our preliminary test showed that on the same hardware, it outperforms conventional tool by as much as 4×. 
It also exhibited near-linear scaling properties when provisioned with more compute resources.","PeriodicalId":375550,"journal":{"name":"2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 1st Joint International Workshop on Parallel Data Storage and data Intensive Scalable Computing Systems (PDSW-DISCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDSW-DISCS.2016.13","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

Large-scale HPC applications are becoming increasingly data intensive. At the Oak Ridge Leadership Computing Facility (OLCF), we are observing the number of files curated under individual projects reaching as high as 200 million, and project data sizes exceeding petabytes. These simulation datasets, once validated, often need to be transferred to an archival system for long-term storage or shared with the rest of the research community. Ensuring the data integrity of the full dataset at this scale is of paramount importance but also a daunting task. This is especially true considering that most conventional tools are serial and file-based, unwieldy to use, and/or unable to scale to meet users' demands. To tackle this particular challenge, this paper presents the design, implementation, and evaluation of a scalable parallel checksumming tool, fsum, which we developed at OLCF. It is built upon the principles of parallel tree walk and the work-stealing pattern to maximize parallelism, and is capable of generating a single, consistent signature for the entire dataset at extreme scale. We also applied a novel Bloom-filter-based technique for aggregating signatures to overcome the signature ordering requirement. Given the probabilistic nature of Bloom filters, we provide a detailed error and trade-off analysis. Using multiple datasets from a production environment, we demonstrate that our tool can efficiently handle both very large files and datasets composed of many small files. Our preliminary tests showed that, on the same hardware, it outperforms a conventional tool by as much as 4×. It also exhibited near-linear scaling properties when provisioned with more compute resources.
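The key property the abstract relies on, i.e. that Bloom-filter insertion is a bitwise OR and therefore commutative, is what lets independent workers checksum files in any order and still arrive at one consistent dataset signature. The sketch below illustrates that general idea; the `BloomFilter` class, `file_checksum` helper, and parameter choices are illustrative assumptions of ours, not the paper's actual fsum implementation.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over a fixed-size bit array.

    Insertion sets bits via bitwise OR, which is commutative, so the
    final bit array does not depend on the order in which per-file
    checksums are added -- the property used to avoid sorting signatures.
    """

    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def merge(self, other: "BloomFilter") -> None:
        # OR-merge a filter built by another worker; order-independent.
        for i, b in enumerate(other.bits):
            self.bits[i] |= b

    def signature(self) -> str:
        # A single, consistent signature for the whole dataset.
        return hashlib.sha256(bytes(self.bits)).hexdigest()


def file_checksum(name: str, data: bytes) -> bytes:
    # Bind the file name to its content so that swapping the contents
    # of two files still changes the aggregate signature.
    return hashlib.sha256(name.encode() + b"\x00" + data).digest()
```

For example, one worker processing files a, b, c in order and two workers splitting the same files in a different order produce identical signatures after an OR-merge, which is exactly why no global ordering step is needed. Because membership is probabilistic, a false positive (an undetected change) occurs only if every bit of a modified file's checksum happens to already be set, which motivates the error analysis in the paper.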