An Algorithmic and Software Pipeline for Very Large Scale Scientific Data Compression with Error Guarantees

Tania Banerjee, J. Choi, Jaemoon Lee, Qian Gong, Ruonan Wang, S. Klasky, A. Rangarajan, Sanjay Ranka
DOI: 10.1109/HiPC56025.2022.00039
Published in: 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC), December 2022
Citations: 5

Abstract

Efficient data compression is becoming increasingly critical for storing scientific data because many scientific applications produce vast amounts of data. This paper presents an end-to-end algorithmic and software pipeline for data compression that guarantees error bounds on both the primary data (PD) and derived quantities, known as Quantities of Interest (QoI). We demonstrate the effectiveness of the pipeline by compressing fusion data generated by a large-scale fusion code, XGC, which produces tens of petabytes of data in a single day. Compression is performed on dedicated computational resources known as staging nodes and therefore does not impact simulation performance. For efficient parallel I/O, the pipeline uses ADIOS2, which many codes, including XGC, already use for their parallel I/O. We show that our approach can compress the data by two orders of magnitude while guaranteeing high accuracy on both the PD and the QoIs. Further, compression requires only a few percent of the resources consumed by the simulation, and the compression time for each stage is less than the corresponding simulation time. The pipeline consists of three main steps. The first step uses domain decomposition to partition the data into small subdomains; each subdomain is then compressed independently to achieve a high level of parallelism. The second step applies existing techniques that guarantee error bounds on the primary data of each subdomain. The third step uses a post-processing optimization technique based on Lagrange multipliers to reduce the QoI errors of the data in each subdomain. The generated Lagrange multipliers can be further quantized or truncated to increase the compression ratio. Together, these characteristics make it highly practical to apply on-the-fly compression while guaranteeing error bounds on the QoIs that are critical to scientists.
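The Lagrange-multiplier post-processing step can be sketched for its simplest setting. When the QoIs are linear functionals of the data (written as a matrix A, with target values b computed from the primary data), the problem "minimize the change to the decompressed field subject to exact QoI recovery" has a closed-form solution via Lagrange multipliers. The NumPy sketch below is illustrative only — the function and variable names are hypothetical, and a toy "total mass" QoI stands in for the physics moments of the actual XGC distribution function:

```python
import numpy as np

def qoi_correction(u_hat, A, b):
    """Project decompressed data u_hat onto the affine set {u : A u = b}.

    Solves min ||u - u_hat||^2 subject to A u = b. Stationarity gives
    u = u_hat + A^T lam, and substituting into the constraint yields
    lam = (A A^T)^{-1} (b - A u_hat), the vector of Lagrange multipliers.
    """
    lam = np.linalg.solve(A @ A.T, b - A @ u_hat)  # Lagrange multipliers
    return u_hat + A.T @ lam

# Toy example: preserve a single linear QoI (the sum of a 1-D field).
rng = np.random.default_rng(0)
u = rng.standard_normal(8)                 # "primary data"
u_hat = u + 0.1 * rng.standard_normal(8)   # stand-in for a lossy reconstruction
A = np.ones((1, 8))                        # hypothetical QoI operator: total mass
b = A @ u                                  # QoI value computed from the primary data
u_star = qoi_correction(u_hat, A, b)
assert np.allclose(A @ u_star, b)          # QoI restored exactly
```

Because the correction is an orthogonal projection onto a set that contains the true data, it can never increase the distance to the primary data; the multipliers `lam` are the quantities that, per the abstract, can be quantized or truncated for additional compression.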