GSC: efficient lossless compression of VCF files with fast query.

IF 11.8 2区 生物学 Q1 MULTIDISCIPLINARY SCIENCES GigaScience Pub Date : 2024-01-02 DOI:10.1093/gigascience/giae046
Xiaolong Luo, Yuxin Chen, Ling Liu, Lulu Ding, Yuxiang Li, Shengkang Li, Yong Zhang, Zexuan Zhu
{"title":"GSC: efficient lossless compression of VCF files with fast query.","authors":"Xiaolong Luo, Yuxin Chen, Ling Liu, Lulu Ding, Yuxiang Li, Shengkang Li, Yong Zhang, Zexuan Zhu","doi":"10.1093/gigascience/giae046","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.</p><p><strong>Findings: </strong>To address the aforementioned issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showcased exceptional performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were higher by 26.9% to 82.4% over GBC and GTC on the datasets, respectively. In lossless compression scenarios, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than general-purpose tools like gzip, zstd, and BCFtools-a mode not supported by either GBC or GTC. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times, with GSC being 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds that were equivalent to its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies.</p><p><strong>Conclusion: </strong>GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":null,"pages":null},"PeriodicalIF":11.8000,"publicationDate":"2024-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11258903/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giae046","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.

Findings: To address the aforementioned issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showcased exceptional performance in genotype data compression. Compared with the industry's most advanced tools (namely, GBC and GTC), GSC achieved compression ratios that were higher by 26.9% to 82.4% over GBC and GTC on the datasets, respectively. In lossless compression scenarios, GSC also demonstrated robust performance, with compression ratios 1.5× to 6.5× greater than general-purpose tools like gzip, zstd, and BCFtools-a mode not supported by either GBC or GTC. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times, with GSC being 1.2× to 2× slower than GBC, yet 1.1× to 1.4× faster than GTC. Moreover, GSC maintained decompression query speeds that were equivalent to its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced technologies.

Conclusion: GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
GSC:高效无损压缩 VCF 文件,查询速度快。
背景:随着大规模基因组测序项目的兴起,成千上万样本的基因分型产生了大量的变异调用格式(VCF)文件。存储、传输和分析这些庞大的文件变得越来越具有挑战性。为了解决这些问题,人们采用了压缩方法,旨在实现高压缩比和快速随机访问。然而,现有方法尚未在这两个目标之间取得令人满意的折衷:为了解决上述问题,我们引入了 GSC(基因型稀疏压缩),这是一种专门针对 VCF 文件的精制无损压缩工具。在对各种开源数据集进行的基准测试中,GSC 展示了基因型数据压缩的卓越性能。与业界最先进的工具(即 GBC 和 GTC)相比,GSC 在数据集上的压缩率分别比 GBC 和 GTC 高出 26.9% 至 82.4%。在无损压缩情况下,GSC 也表现出了强劲的性能,其压缩率比 gzip、zstd 和 BCFtools 等通用工具高出 1.5 倍到 6.5 倍--这是 GBC 和 GTC 都不支持的模式。要达到如此高的压缩比,确实需要一些合理的权衡,包括更长的解压缩时间,GSC 比 GBC 慢 1.2 倍到 2 倍,但比 GTC 快 1.1 倍到 1.4 倍。此外,GSC 的解压缩查询速度与其竞争对手相当。在内存使用方面,GSC 的表现优于两个竞争对手。总之,GSC 的综合性能超过了最先进的技术:结论:GSC 兼顾了高压缩率和快速数据访问,加强了基因组数据管理。它支持 PLINK 二进制格式的无缝转换,简化了下游分析。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
GigaScience
GigaScience MULTIDISCIPLINARY SCIENCES-
CiteScore
15.50
自引率
1.10%
发文量
119
审稿时长
1 weeks
期刊介绍: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.
期刊最新文献
IPEV: identification of prokaryotic and eukaryotic virus-derived sequences in virome using deep learning Large-scale genomic survey with deep learning-based method reveals strain-level phage specificity determinants An effective strategy for assembling the sex-limited chromosome Enhanced bovine genome annotation through integration of transcriptomics and epi-transcriptomics datasets facilitates genomic biology Korea4K: whole genome sequences of 4,157 Koreans with 107 phenotypes derived from extensive health check-ups
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1