基于聚类的质量分数压缩方法。

Proceedings. Data Compression Conference Pub Date : 2016-03-01 Epub Date: 2016-12-19 DOI:10.1109/DCC.2016.49
Mikel Hernaez, Idoia Ochoa, Tsachy Weissman
{"title":"基于聚类的质量分数压缩方法。","authors":"Mikel Hernaez,&nbsp;Idoia Ochoa,&nbsp;Tsachy Weissman","doi":"10.1109/DCC.2016.49","DOIUrl":null,"url":null,"abstract":"<p><p>Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.</p>","PeriodicalId":91161,"journal":{"name":"Proceedings. Data Compression Conference","volume":"2016 ","pages":"261-270"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/DCC.2016.49","citationCount":"10","resultStr":"{\"title\":\"A cluster-based approach to compression of Quality Scores.\",\"authors\":\"Mikel Hernaez,&nbsp;Idoia Ochoa,&nbsp;Tsachy Weissman\",\"doi\":\"10.1109/DCC.2016.49\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.</p>\",\"PeriodicalId\":91161,\"journal\":{\"name\":\"Proceedings. Data Compression Conference\",\"volume\":\"2016 \",\"pages\":\"261-270\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/DCC.2016.49\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. Data Compression Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DCC.2016.49\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2016/12/19 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2016.49","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2016/12/19 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

摘要

由于测序技术的进步和测序成本的急剧下降,大量的测序数据正在产生。存储和共享这些大数据已经成为发现和分析用于医学推断的遗传变异的主要瓶颈。因此,提出了对这些数据进行无损压缩的方法。在压缩的数据中,70%以上对应于质量分数,这表明测序机在调用特定碱基对时的可靠性。因此,为了进一步提高压缩性能,质量分数的有损压缩自然成为候选。由于数据用于遗传变异发现,因此根据其率失真性能以及对变体调用者的影响,对质量分数的有损压缩器进行了分析。以前提出的算法在所有性能指标下都表现不佳,因此不适合某些应用。在这项工作中,我们提出了一种新的有损压缩器,它首先通过假设所有质量分数序列来自马尔可夫模型的混合物来执行聚类步骤。然后,基于马尔可夫模型对质量分数进行量化。每个量化器针对一个特定的失真,以优化整体的率失真性能。最后,通过熵编码器对量化值进行压缩。在所有分析的失真指标下,我们证明了所提出的有损压缩器优于先前提出的方法。这表明,与先前提出的有损压缩器相比,所提出的算法对任何下游应用程序的影响可能不那么明显。此外,我们分析了所提出的有损压缩器如何影响单核苷酸多态性(SNP)调用,并表明调用中引入的可变性远远小于不同SNP调用方法之间存在的可变性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。

摘要图片

摘要图片

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
A cluster-based approach to compression of Quality Scores.

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% correspond to quality scores, which indicate the sequencing machine reliability when calling a particular basepair. Thus, to further improve the compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variants discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming all the quality scores sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize for the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Faster Maximal Exact Matches with Lazy LCP Evaluation. Recursive Prefix-Free Parsing for Building Big BWTs. PHONI: Streamed Matching Statistics with Multi-Genome References. Client-Driven Transmission of JPEG2000 Image Sequences Using Motion Compensated Conditional Replenishment GeneComp, a new reference-based compressor for SAM files.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1