Proceedings. Data Compression Conference: Latest Articles

Optimized VM memory allocation based on monitored cache hit ratio

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955200

Saneyasu Yamaguchi, Eita Fujishima

Cloud computing based on virtual machines has become increasingly important. In such an environment, multiple virtual machines run on a single physical machine. Many hypervisor implementations provide a ballooning function, which allows the memory size of a virtual machine to be changed dynamically at runtime. Hence, dynamic memory management that takes application load into account is expected to improve performance. In particular, I/O performance is expected to improve significantly because it strongly depends on the size of the HDD cache in the operating system, e.g., the page cache in Linux. The Xen hypervisor has xenballoon, which dynamically changes the virtual machine memory size based on the memory consumed by the processes in the virtual machines and does not consider the size of the page cache. Therefore, I/O performance is not improved. A previous study proposed a method for adjusting virtual machine memory size, but it assumes that applications are homogeneous. In this paper, we propose a method for dynamic management of virtual machine memory size that makes no assumptions about the applications. The method takes the page cache hit ratio into account. We evaluate the method with various read/write applications. The experimental results demonstrate that our method can improve the performance of I/O-intensive applications in virtual machines.
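The abstract does not give the allocation rule, so the following is only a minimal sketch of what a hit-ratio-driven policy could look like. The function `rebalance`, its parameters `step_mb` and `floor_mb`, and the "move memory from the highest hit ratio to the lowest" rule are all assumptions made for illustration, not the authors' method.

```python
def rebalance(alloc_mb, hit_ratio, step_mb=256, floor_mb=512):
    """Return a new allocation after moving one step of memory.

    Hypothetical policy (an assumption, not the paper's algorithm): the VM
    with the highest monitored page cache hit ratio donates `step_mb` to the
    VM with the lowest one, unless that would push the donor below `floor_mb`.
    """
    donor = max(hit_ratio, key=hit_ratio.get)
    taker = min(hit_ratio, key=hit_ratio.get)
    new_alloc = dict(alloc_mb)  # leave the caller's dict untouched
    if donor == taker or alloc_mb[donor] - step_mb < floor_mb:
        return new_alloc        # nothing to move, or donor is at its floor
    new_alloc[donor] -= step_mb
    new_alloc[taker] += step_mb
    return new_alloc


alloc = {"vm1": 2048, "vm2": 2048}       # current memory sizes in MB
hits = {"vm1": 0.98, "vm2": 0.61}        # monitored page cache hit ratios
print(rebalance(alloc, hits))
```

In a real deployment the returned sizes would be applied through the hypervisor's ballooning interface, and the step would be repeated periodically as new hit ratios are measured. The total memory handed out stays constant, which is the property a single physical host needs.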

Citations: 4

The misbelief in delay scheduling

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955203

Derek Schatzlein, Srivatsan Ravi, Youngtae Noh, Masoud Saeida Ardekani, P. Eugster

Big-data processing frameworks like Hadoop and Spark, often used in multi-user environments, have struggled to achieve a balance between the full utilization of cluster resources and fairness between users. In particular, data locality becomes a concern, as enforcing fairness policies may cause poor placement of tasks in relation to the data on which they operate. To combat this, the schedulers in many frameworks use a heuristic called delay scheduling, which involves waiting for a short, constant interval for data-local task slots to become free if none are available; however, a fixed delay interval is inefficient, as the ideal time to delay varies depending on input data size, network conditions, and other factors. We propose an adaptive solution (Dynamic Delay Scheduling), which uses a simple feedback metric from finished tasks to adapt the delay scheduling interval for subsequent tasks at runtime. We present a dynamic delay implementation in Spark, and show that it outperforms a fixed delay in TPC-H benchmarks. Our preliminary experiments confirm our intuition that job latency in batch-processing scheduling can be improved using simple adaptive techniques with almost no extra state overhead.
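The feedback loop described above can be sketched as follows. The abstract does not specify the feedback metric, so the rule used here (wait no longer than the runtime that data locality has recently saved) is an assumption, and `AdaptiveDelay` with all of its fields is a hypothetical name rather than anything from Spark or the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveDelay:
    """Hypothetical controller; not the paper's actual feedback rule."""
    delay_s: float = 3.0               # current wait for a data-local slot
    min_s: float = 0.0
    max_s: float = 10.0
    alpha: float = 0.5                 # smoothing factor for running averages
    local_avg: Optional[float] = None  # smoothed runtime of data-local tasks
    remote_avg: Optional[float] = None # smoothed runtime of non-local tasks

    def record(self, runtime_s: float, was_local: bool) -> float:
        """Feed back one finished task and return the updated delay."""
        if was_local:
            self.local_avg = self._smooth(self.local_avg, runtime_s)
        else:
            self.remote_avg = self._smooth(self.remote_avg, runtime_s)
        if self.local_avg is not None and self.remote_avg is not None:
            # Waiting is only worthwhile up to the time locality saves.
            saving = self.remote_avg - self.local_avg
            self.delay_s = min(self.max_s, max(self.min_s, saving))
        return self.delay_s

    def _smooth(self, old: Optional[float], new: float) -> float:
        return new if old is None else (1 - self.alpha) * old + self.alpha * new


ctl = AdaptiveDelay()
ctl.record(10.0, was_local=True)    # only local samples so far: delay unchanged
ctl.record(14.0, was_local=False)   # locality saves about 4 s, so wait up to 4 s
print(ctl.delay_s)
```

The controller keeps only two running averages and the current delay, which is in the spirit of the "almost no extra state overhead" claim. When non-local tasks stop being slower than local ones, the delay collapses to `min_s` and the scheduler stops waiting.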

Citations: 0

DSB-SEIS: a deduplicating secure backup system with encryption intensity selection

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955208

Mortada A. Aman, Egemen K. Çetinkaya

Cloud computing is an emerging service that enables users to store and manage their data easily at a low cost. We propose a Deduplicating Secure Backup System with Encryption Intensity Selection (DSB-SEIS) that combines several features to improve the security and performance of cloud-based backup services. Our scheme introduces the concept of encryption intensity selection to cloud backup systems, which allows users to select the encryption intensity of their files. We also combine features such as deduplication, assured deletion, and multi-aspect awareness to further enhance our scheme. We measure DSB-SEIS performance on an OpenStack cloud installed on CloudLab resources, demonstrating that DSB-SEIS can improve the backup service.

Citations: 1

The CAT theorem and performance of transactional distributed systems

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955205

S. Ahsan, Indranil Gupta

We argue that transactional distributed database/storage systems need to view the impossibility theorem in terms of the contention, abort rate, and throughput, rather than via the traditional CAP theorem. Motivated by Jim Gray, we state a new impossibility theorem, which we call the CAT theorem (Contention-Abort-Throughput). We present experimental results from the performance of several transactional systems w.r.t. the CAT impossibility spectrum.

Citations: 2

Adaptive resilient routing via preorders in SDN

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955204

Eman Ramadan, Hesham Mekky, Braulio Dumba, Zhi-Li Zhang

In this paper, we propose and advocate a new routing paradigm -- dubbed routing via preorders -- which circumvents the limitations of conventional path-based routing schemes to effectively take advantage of topological diversity inherent in a network with rich topology for adaptive resilient routing, while at the same time meeting the quality-of-service requirements (e.g., latency) of applications or flows. We show how routing via preorders can be realized in SDN networks using the "match-action" data plane abstraction, with a preliminary implementation and evaluation of it in Mininet.

Citations: 8

vMCN: virtual mobile cloud network for realizing scalable, real-time cyber physical systems

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955201

K. Nakauchi, F. Bronzino, Y. Shoji, I. Seskar, D. Raychaudhuri

This paper presents virtual Mobile Cloud Network (vMCN), an architecture for scalable, real-time Cyber Physical Systems (CPS) based on virtualization-capable network infrastructure and highly distributed edge clouds. Emerging CPS applications running on mobile devices, such as Augmented Reality (AR) based navigation and self-driving cars, have fundamental limitations: (1) the response time over the networks is unmanageable and CPS applications suffer from large response times, especially over shared wireless links; (2) the number of real-virtual object mappings cannot be scaled to the approaching trillion order of magnitude in a unified manner. vMCN addresses these issues by introducing novel network virtualization techniques that exploit the "named object" abstraction provided by a fast and scalable global name service. Coordinating virtualized resources across multiple domains, vMCN supports the distributed edge cloud model by deploying application-aware anycast services that meet the strict requirements of CPS while still scaling to the expected order of magnitude of devices. An initial vMCN prototype system has been developed using the ORBIT testbed resources and NICT's virtualization-capable WiFi base station. Experimental results reveal that vMCN can support up to about 94% of CPS cycles under the set goal of 100 ms, outperforming the baseline system by almost a factor of two.

Citations: 9

Towards migrating computation to distributed memory caches

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955202

Adam Schaub, Michael F. Spear

Memcached and other in-memory distributed key-value stores play a critical role in large-scale web applications, by reducing traffic to persistent storage and providing an easy-to-access look-aside cache in which programmers can store arbitrary data. These caches typically have a narrow interface, consisting only of gets, sets, and compare-and-set. In the worst case, this interface can cause significant inefficiencies as clients get large data items, perform small changes, and then set the updated items back into the cache. We extend memcached to allow clients to execute code directly in the cache. An idealized evaluation on micro-benchmarks based on workload traces from a Cable/Internet service provider shows compelling performance, leading to a recommendation that further research be conducted to make in-cache fetch-and-phi safe and programmer-friendly, and that researchers consider whether a truly distributed cloud platform should make it easier for programmers to execute custom code at all levels of the software stack.
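The inefficiency of the narrow interface can be illustrated with a toy model. `ToyCache` below is an in-process stand-in that only counts the bytes that would cross the network; its `apply` method is a hypothetical fetch-and-phi operation invented for this sketch and is not part of memcached or of the authors' extension.

```python
class ToyCache:
    """In-process stand-in for a remote cache; counts bytes 'on the wire'."""

    def __init__(self):
        self._data = {}
        self.bytes_moved = 0

    def set(self, key: str, value: bytes) -> None:
        self.bytes_moved += len(value)   # value travels client -> cache
        self._data[key] = value

    def get(self, key: str) -> bytes:
        value = self._data[key]
        self.bytes_moved += len(value)   # value travels cache -> client
        return value

    def apply(self, key: str, fn, arg: bytes) -> None:
        """Hypothetical fetch-and-phi: run fn next to the data; only arg travels."""
        self.bytes_moved += len(arg)
        self._data[key] = fn(self._data[key], arg)


def overwrite_prefix(old: bytes, patch: bytes) -> bytes:
    """A small change to a large item: replace its first len(patch) bytes."""
    return patch + old[len(patch):]


cache = ToyCache()
cache.set("k", b"x" * 1_000_000)
cache.bytes_moved = 0

# Narrow interface: get the whole item, change one byte, set it back.
cache.set("k", overwrite_prefix(cache.get("k"), b"!"))
round_trip = cache.bytes_moved

cache.bytes_moved = 0
cache.apply("k", overwrite_prefix, b"!")   # ship the change, not the data
in_cache = cache.bytes_moved

print(round_trip, in_cache)
```

The model ignores serialization of the function itself, concurrency, and safety, which are exactly the open issues the abstract flags for further research.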

Citations: 0

A cluster-based approach to compression of Quality Scores.

Proceedings. Data Compression Conference

Pub Date : 2016-03-01 Epub Date: 2016-12-19 DOI: 10.1109/DCC.2016.49

Mikel Hernaez, Idoia Ochoa, Tsachy Weissman

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Storing and sharing this large volume of data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. Of the compressed data, more than 70% corresponds to quality scores, which indicate the reliability of the sequencing machine when calling a particular base pair. Thus, to further improve compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variant discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on the variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, by assuming that all the quality score sequences come from a mixture of Markov models. Then, it performs quantization of the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect that the proposed algorithm will have on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced on the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
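The rate-distortion trade-off behind the quantization step can be shown with a small sketch. It deliberately swaps in plain uniform binning for the paper's Markov-model-based quantizers, omits the clustering step and the entropy coder, and uses empirical zeroth-order entropy as a stand-in for the achievable rate. The function names and the sample scores are made up for illustration.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical zeroth-order entropy in bits per symbol (a proxy for rate)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def quantize(scores, step):
    """Uniform binning: map each score to the midpoint of its bin of width `step`."""
    return [(q // step) * step + step // 2 for q in scores]


def mse(a, b):
    """Mean squared error between two equal-length sequences (the distortion)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up quality scores for one read, on a Phred-like 0-41 scale.
scores = [2, 11, 12, 18, 25, 27, 33, 34, 37, 38, 39, 40, 40, 40, 38, 36]
coarse = quantize(scores, step=8)

print("rate before:", entropy_bits(scores), "bits/score")
print("rate after: ", entropy_bits(coarse), "bits/score")
print("distortion: ", mse(scores, coarse))
```

A larger `step` lowers the entropy and raises the distortion. A quantizer that targets a specific distortion, as the abstract describes, amounts to choosing the operating point on this curve per cluster.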

Citations: 10

An Evaluation Framework for Lossy Compression of Genome Sequencing Quality Values.

Proceedings. Data Compression Conference

Pub Date : 2016-03-01 Epub Date: 2016-12-19 DOI: 10.1109/DCC.2016.39

Claudio Alberti, Noah Daniels, Mikel Hernaez, Jan Voges, Rachel L Goldfeder, Ana A Hernandez-Lopez, Marco Mattavelli, Bonnie Berger

This paper provides the specification and an initial validation of an evaluation framework for the comparison of lossy compressors of genome sequencing quality values. The goal is to define reference data, test sets, tools and metrics that shall be used to evaluate the impact of lossy compression of quality values on human genome variant calling. The functionality of the framework is validated referring to two state-of-the-art genomic compressors. This work has been spurred by the current activity within the ISO/IEC SC29/WG11 technical committee (a.k.a. MPEG), which is investigating the possibility of starting a standardization activity for genomic information representation.

Citations: 0

Denoising of Quality Scores for Boosted Inference and Reduced Storage.

Proceedings. Data Compression Conference

Pub Date : 2016-03-01 Epub Date: 2016-12-19 DOI: 10.1109/DCC.2016.92

Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, Euan Ashley

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in the sequencing cost. Much of the raw data consists of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. Lossless and lossy compression of the quality scores has recently been proposed to alleviate the storage costs, but reducing the noise in the quality scores has remained largely unexplored. This raw data is processed in order to identify variants; these genetic variants are used in important applications, such as medical decision making. Thus, improving the performance of variant calling by reducing the noise contained in the quality scores is important. We propose a denoising scheme that reduces the noise of the quality scores and we demonstrate improved inference with this denoised data. Specifically, we show that replacing the quality scores with those generated by the proposed denoiser results in more accurate variant calling in general. Moreover, a consequence of the denoising is that the entropy of the produced quality scores is smaller, and thus significant compression can be achieved with respect to lossless compression of the original quality scores. We expect our results to provide a baseline for future research in denoising of quality scores. The code used in this work as well as a Supplement with all the results are available at http://web.stanford.edu/~iochoa/DCCdenoiser_CodeAndSupplement.zip.

Citations: 0
