ZFP-V: Hardware-Optimized Lossy Floating Point Compression
Authors: Gongjin Sun, S. Jun
Published in: 2019 International Conference on Field-Programmable Technology (ICFPT)
Publication date: 2019-12-01
DOI: https://doi.org/10.1109/ICFPT47387.2019.00022
Citations: 5
Abstract
Lossy floating point compression algorithms are critical to reducing the cost and improving the performance of many modern applications, including machine learning and scientific computing. Data compression is widely used to reduce data storage requirements and transfer overhead, but traditional data-oblivious lossless compression schemes are very inefficient for floating point data. In contrast, recently proposed lossy compression algorithms such as ZFP and SZ achieve very high compression ratios while keeping the error within a tolerable margin. To the best of our knowledge, no efficient hardware implementation of ZFP exists yet, partly because of the inherently serial nature of the algorithm. In this paper, we present the design and implementation of ZFP-V, which identifies the serial portion of the ZFP algorithm and modifies it for more efficient hardware implementation. ZFP-V replaces the "group testing" part of ZFP with a variable-length header. This allows our hardware implementation to achieve up to 2x higher performance than our best-effort hardware implementation of the original algorithm while using fewer on-chip resources, at the cost of a marginal reduction in compression ratio. We evaluate an OpenCL implementation of ZFP-V on an Intel Arria 10 FPGA using a variety of real-world scientific datasets, and show single-pipeline throughput of 1 GB/s – 4 GB/s for compression and 2 GB/s – 10 GB/s for decompression. Our implementation often outperforms a 32-thread software implementation on a high-end Intel Xeon CPU, and significantly outperforms a state-of-the-art FPGA implementation of SZ.
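The abstract's contrast between data-oblivious lossless compression and error-bounded lossy compression can be illustrated with a small sketch. This is not ZFP, SZ, or ZFP-V: it uses plain uniform quantization followed by zlib, and the signal, the tolerance `tol`, and the helper names `quantize` and `dequantize` are all assumptions chosen for illustration. It only shows the general idea that accepting a bounded error can make floating point data far more compressible.

```python
import math
import random
import struct
import zlib


def quantize(values, tol):
    """Map each value to the index of the nearest multiple of 2*tol (hypothetical helper)."""
    step = 2.0 * tol
    return [round(v / step) for v in values]


def dequantize(codes, tol):
    """Reconstruct approximate values from quantization indices (hypothetical helper)."""
    step = 2.0 * tol
    return [c * step for c in codes]


random.seed(0)
n = 4096

# Assumed test signal: a smooth sine wave with tiny noise in the low-order bits,
# loosely mimicking the noisy mantissas of scientific floating point data.
data = [math.sin(2 * math.pi * i / 512) + 1e-6 * random.random() for i in range(n)]

tol = 1e-3  # assumed absolute error tolerance
codes = quantize(data, tol)
restored = dequantize(codes, tol)

# Lossless path: compress the raw 64-bit doubles directly.
raw_bytes = struct.pack(f"<{n}d", *data)
lossless_size = len(zlib.compress(raw_bytes, 9))

# Lossy path: compress the 32-bit quantization indices instead.
code_bytes = struct.pack(f"<{n}i", *codes)
lossy_size = len(zlib.compress(code_bytes, 9))

# Largest absolute reconstruction error; bounded by tol up to floating point rounding.
max_error = max(abs(a - b) for a, b in zip(data, restored))

print("raw size (bytes):      ", len(raw_bytes))
print("lossless zlib (bytes): ", lossless_size)
print("quantized zlib (bytes):", lossy_size)
print("max abs error:         ", max_error)
```

Rounding to the nearest multiple of `2*tol` keeps every reconstructed value within `tol` of the original, apart from floating point rounding. Real compressors such as ZFP and SZ use considerably more sophisticated transforms and encodings than this quantize-then-zlib sketch.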