ZFP-V: Hardware-Optimized Lossy Floating Point Compression

2019 International Conference on Field-Programmable Technology (ICFPT)
Pub Date: 2019-12-01
DOI: 10.1109/ICFPT47387.2019.00022

Gongjin Sun, S. Jun

Citations: 5

Abstract

Lossy floating point compression algorithms are critical to reducing the cost and improving the performance of many modern applications, including machine learning and scientific computing. Data compression is widely used to reduce data storage requirements and transfer overhead, but traditional data-oblivious lossless compression schemes are very inefficient for floating point data. On the other hand, recently proposed lossy compression algorithms like ZFP and SZ achieve very high compression ratios while keeping the error within a tolerable margin. To the best of our knowledge, no efficient hardware implementation of ZFP exists yet, partially due to the inherently serial nature of the algorithm. In this paper, we present the design and implementation of ZFP-V, which identifies the serial portion of the ZFP algorithm and modifies it for more efficient hardware implementation. ZFP-V replaces the "group testing" part of ZFP with a variable-length header, which allows our hardware implementation to achieve up to 2x performance improvement over our best-effort hardware implementation of the original algorithm while using fewer on-chip resources, at the cost of a marginally lower compression ratio. We evaluate an OpenCL implementation of ZFP-V on an Intel Arria 10 FPGA using a variety of real-world scientific datasets, and show a single-pipeline throughput of 1 GB/s to 4 GB/s for compression and 2 GB/s to 10 GB/s for decompression. Our implementation often outperforms a 32-thread software implementation on a high-end Intel Xeon CPU, and significantly outperforms a state-of-the-art FPGA implementation of SZ.
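To illustrate why group testing is hard to pipeline, the sketch below encodes one bit plane in two ways. The first function follows the general shape of ZFP's bit-plane coder in simplified form (no bit budget); each emitted bit depends on the previous one, so the loop is inherently serial. The second uses a hypothetical length-prefixed layout: a small header states how many bits follow, and those bits are then written verbatim. This header layout is an assumption made for illustration only; the abstract does not specify the actual ZFP-V header format, and all function names here are invented.

```python
def encode_plane_group_test(plane, size, n):
    """Simplified ZFP-style coder for one bit plane (assumes no bit budget).

    plane: int whose bit i belongs to coefficient i of the block.
    size:  number of coefficients in the block (e.g. 16 for a 4x4 block).
    n:     number of coefficients already known to be significant.
    Returns (list of emitted bits, updated n).
    """
    out = []
    x = plane
    for _ in range(n):            # first n bits are written verbatim
        out.append(x & 1)
        x >>= 1
    while n < size:               # group tests: data-dependent and serial
        any_set = 1 if x else 0
        out.append(any_set)       # "is any remaining bit set?"
        if not any_set:
            break
        while n < size - 1:       # scan forward to the next set bit
            bit = x & 1
            out.append(bit)
            if bit:
                break
            x >>= 1
            n += 1
        x >>= 1                   # consume the set bit (implied if it is last)
        n += 1
    return out, n


def decode_plane_group_test(bits, size, n):
    """Inverse of encode_plane_group_test."""
    it = iter(bits)
    x = 0
    for i in range(n):
        x |= next(it) << i
    while n < size:
        if not next(it):
            break
        while n < size - 1:
            if next(it):
                break
            n += 1
        x |= 1 << n
        n += 1
    return x, n


def encode_plane_with_header(plane, size, n):
    """Hypothetical alternative: length header, then verbatim bits.

    The header holds the updated n, so a decoder knows the payload
    length before reading any payload bit.
    """
    new_n = max(n, plane.bit_length())
    width = size.bit_length()     # enough bits to represent 0..size
    out = [(new_n >> i) & 1 for i in range(width)]
    out += [(plane >> i) & 1 for i in range(new_n)]
    return out, new_n


def decode_plane_with_header(bits, size):
    """Inverse of encode_plane_with_header."""
    width = size.bit_length()
    new_n = sum(b << i for i, b in enumerate(bits[:width]))
    plane = sum(b << i for i, b in enumerate(bits[width:width + new_n]))
    return plane, new_n


if __name__ == "__main__":
    plane, size = 0b100100, 16
    gt_bits, _ = encode_plane_group_test(plane, size, 0)
    hd_bits, _ = encode_plane_with_header(plane, size, 0)
    print(len(gt_bits), len(hd_bits))
```

For this example plane the header form is slightly longer than the group-tested form, which mirrors the trade-off the abstract describes: a small loss in compression ratio in exchange for a layout whose length is known up front.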
