Proceedings. Data Compression Conference最新文献

英文中文

Faster Maximal Exact Matches with Lazy LCP Evaluation. 通过懒惰 LCP 评估实现更快的最大精确匹配

Proceedings. Data Compression Conference

Pub Date : 2024-03-01 Epub Date: 2024-05-21 DOI: 10.1109/dcc58796.2024.00020

Adrián Goga, Lore Depuydt, Nathaniel K Brown, Jan Fostier, Travis Gagie, Gonzalo Navarro

MONI (Rossi et al., JCB 2022) is a BWT-based compressed index for computing the matching statistics and maximal exact matches (MEMs) of a pattern (usually a DNA read) with respect to a highly repetitive text (usually a database of genomes) using two operations: LF-steps and longest common extension (LCE) queries on a grammar-compressed representation of the text. In practice, most of the operations are constant-time LF-steps but most of the time is spent evaluating LCE queries. In this paper we show how (a variant of) the latter can be evaluated lazily, so as to bound the total time MONI needs to process the pattern in terms of the number of MEMs between the pattern and the text, while maintaining logarithmic latency.

MONI（Rossi 等人，JCB 2022）是一种基于 BWT 的压缩索引，用于计算模式（通常是 DNA 读取）与高度重复文本（通常是基因组数据库）的匹配统计和最大精确匹配 (MEM)，使用了两种操作：对文本的语法压缩表示进行 LF 步骤和最长公共扩展（LCE）查询。在实践中，大部分操作都是恒时 LF 步，但大部分时间都花在评估 LCE 查询上。在本文中，我们展示了如何对后者（的一种变体）进行懒散评估，从而在保持对数延迟的情况下，用模式和文本之间的 MEM 数量来约束 MONI 处理模式所需的总时间。

引用次数: 0

Recursive Prefix-Free Parsing for Building Big BWTs. 用于构建大型 BWT 的无递归前缀解析。

Proceedings. Data Compression Conference

Pub Date : 2023-03-01 Epub Date: 2023-05-19

Marco Oliva, Travis Gagie, Christina Boucher

Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings' suffixes (of more than a certain length) is a proper prefix of any of the other substrings' suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary. The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing, which accomplishes this by computing the prefix-free parse of the parse produced by prefix-free parsing an input string. Although conceptually simple, building the BWT from the parse-of-the-parse and the dictionaries is significantly more challenging. We solve and implement this problem. Our experimental results show that recursive prefix-free parsing is extremely effective in reducing the memory needed to build the run-length encoded BWT of the input. Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt.

无前缀解析有多种用途，包括构建 BWT、构建后缀数组和支持压缩后缀树操作。这种线性时间算法使用滚动散列将输入字符串分解为子串，由此产生的唯一子串集合具有这样的特性：没有一个子串的后缀（超过一定长度）是任何其他子串后缀的适当前缀。因此，我们称之为无前缀解析。这组唯一的子串被称为字典。解析是定义输入字符串的字典字符串的有序列表。先前的经验结果表明，对于大型重复输入而言，解析的大小比字典的大小更为繁琐。因此，问题在于解析的大小如何才能令人满意地随着输入的增加而增加。在此，我们将介绍我们的算法--递归无前缀解析，该算法通过计算对输入字符串进行无前缀解析后产生的无前缀解析来实现这一目标。虽然概念上很简单，但从解析的解析和字典中构建 BWT 却具有很大的挑战性。我们解决并实现了这一问题。我们的实验结果表明，递归无前缀解析在减少构建输入的运行长度编码 BWT 所需的内存方面非常有效。我们的实现是开源的，可在 https://github.com/marco-oliva/r-pfbwt 上获取。

{"title":"Recursive Prefix-Free Parsing for Building Big BWTs.","authors":"Marco Oliva, Travis Gagie, Christina Boucher","doi":"","DOIUrl":"","url":null,"abstract":"Prefix-free parsing is useful for a wide variety of purposes including building the BWT, constructing the suffix array, and supporting compressed suffix tree operations. This linear-time algorithm uses a rolling hash to break an input string into substrings, where the resulting set of unique substrings has the property that none of the substrings' suffixes (of more than a certain length) is a proper prefix of any of the other substrings' suffixes. Hence, the name prefix-free parsing. This set of unique substrings is referred to as the dictionary. The parse is the ordered list of dictionary strings that defines the input string. Prior empirical results demonstrated the size of the parse is more burdensome than the size of the dictionary for large, repetitive inputs. Hence, the question arises as to how the size of the parse can scale satisfactorily with the input. Here, we describe our algorithm, recursive prefix-free parsing, which accomplishes this by computing the prefix-free parse of the parse produced by prefix-free parsing an input string. Although conceptually simple, building the BWT from the parse-of-the-parse and the dictionaries is significantly more challenging. We solve and implement this problem. Our experimental results show that recursive prefix-free parsing is extremely effective in reducing the memory needed to build the run-length encoded BWT of the input. Our implementation is open source and available at https://github.com/marco-oliva/r-pfbwt.","PeriodicalId":91161,"journal":{"name":"Proceedings. Data Compression Conference","volume":"2023 ","pages":"62-70"},"PeriodicalIF":0.0,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11328891/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142001555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

PHONI: Streamed Matching Statistics with Multi-Genome References. PHONI:流式匹配统计与多基因组参考。

Proceedings. Data Compression Conference

Pub Date : 2021-03-01 Epub Date: 2021-05-10 DOI: 10.1109/dcc50243.2021.00027

Christina Boucher, Travis Gagie, I Tomohiro, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, Massimiliano Rossi

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly. This means that, first, we can compute the matching statistics of several long patterns (such as whole human chromosomes) in parallel while still using a reasonable amount of RAM; second, we can compute matching statistics online with low latency and thus quickly recognize when a pattern becomes incompressible relative to the database. Our code is available at https://github.com/koeppl/phoni.

计算模式相对于文本的匹配统计数据是生物信息学中的一项基本任务，但当文本是一个高度压缩的基因组数据库时，这是一项艰巨的任务。Bannai等人针对这种情况给出了一个有效的解决方案，Rossi等人最近实现了该解决方案，但它在模式上使用了两次传递，并在第一次传递期间为每个字符缓冲一个指针。在本文中，我们简化了他们的解决方案，并使其流式传输，代价是稍微放慢速度。这意味着，首先，我们可以并行计算几个长模式（如整个人类染色体）的匹配统计数据，同时仍然使用合理数量的RAM；其次，我们可以以低延迟在线计算匹配统计信息，从而快速识别模式何时相对于数据库变得不可压缩。我们的代码可在https://github.com/koeppl/phoni.

引用次数: 14

Client-Driven Transmission of JPEG2000 Image Sequences Using Motion Compensated Conditional Replenishment 使用运动补偿条件补充的客户端驱动的JPEG2000图像序列传输

Proceedings. Data Compression Conference

Pub Date : 2019-03-26 DOI: 10.1109/DCC.2019.00114

J. J. Sánchez-Hernández, V. Ruiz, J. Ortiz, D. Muller

This is a work focused on remote browsing of JPEG2000 image sequences which takes advantage of the spatial scalability of JPEG2000 to determine which precincts of a subsequent image should be transmitted, and which precincts should be reused from a previously reconstructed image. The results of our experiments demonstrate that the quality of the reconstructed images can be significantly increased by using motion compensation and conditional replenishment on the client side. The proposed algorithm is compatible with standard JPIP servers.

这是一项专注于远程浏览JPEG2000图像序列的工作，它利用JPEG2000的空间可扩展性来确定后续图像的哪些区域应该被传输，哪些区域应该从先前重建的图像中重用。实验结果表明，在客户端使用运动补偿和条件补充可以显著提高重建图像的质量。该算法与标准jip服务器兼容。

引用次数: 0

Compressing Tabular Data via Pairwise Dependencies. 通过成对依赖压缩表格数据。

Proceedings. Data Compression Conference

Pub Date : 2017-04-01 Epub Date: 2017-05-11 DOI: 10.1109/DCC.2017.82

Dmitri S Pavlichin, Amir Ingber, Tsachy Weissman

We propose a method and algorithm for lossless compression of tabular data – including, for example, machine learning datasets, server logs and genomic datasets. Superior compression ratios are achieved by exploiting dependencies between the fields (or "features") in the dataset. The algorithm compresses the records w.r.t. a probabilistic graphical model – specifically an optimized forest, where each feature is a node. The work extends a method known as a Chow-Liu tree by incorporating a more accurate correction term to the cost function, which corresponds to the size required to describe the model itself. Additional features of the algorithm are efficient coding of the metadata (such as probability distributions), as well as data relabeling in order to cope with large datasets and alphabets. We test the algorithm on several datasets, and demonstrate an improvement in the compression rates of between 2X and 5X compared to gzip. The larger improvements are observed for very large datasets, such as the Criteo click prediction dataset which was published as part of a recent Kaggle competition.

引用次数: 1

GeneComp, a new reference-based compressor for SAM files. GeneComp，一个新的基于引用的SAM文件压缩器。

Proceedings. Data Compression Conference

Pub Date : 2017-04-01 Epub Date: 2017-05-11 DOI: 10.1109/DCC.2017.76

Reggy Long, Mikel Hernaez, Idoia Ochoa, Tsachy Weissman

The affordability of DNA sequencing has led to unprecedented volumes of genomic data. These data must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignment, quality values, etc. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files, which can accommodate different levels of compression, based on the specific needs of the user. In particular, the proposed compressor GeneComp allows the user to perform lossy compression of the quality scores, which have been proven to occupy more than half of the compressed file (when losslessly compressed). We show that the proposed compressor GeneComp overall achieves better compression ratios than previously proposed algorithms when working on lossless mode.

DNA测序的可负担性带来了前所未有的基因组数据量。这些数据必须被存储、处理和分析。基因组数据最流行的格式是SAM格式，它包含诸如对齐、质量值等信息。这些文件很大(以tb为数量级)，因此需要压缩。在这项工作中，我们为SAM文件提出了一种新的基于参考的压缩器，它可以根据用户的具体需求适应不同级别的压缩。特别是，建议的压缩器GeneComp允许用户对质量分数执行有损压缩，这已被证明占用压缩文件的一半以上(在无损压缩时)。我们表明，在无损模式下，所提出的压缩器GeneComp总体上比以前提出的算法实现了更好的压缩比。

引用次数: 5

Slicing in locavore infrastructures 在本地基础设施中切片

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955207

Glenn Ricart

A Locavore Infrastructure is one which has all of its elements in high-bandwidth and low-latency proximity. It typically combines edge computing elements with an adjacent access network. The growing number of communicating devices and things creates a large and often steady demand for collecting and integrating local information in a Locavore Infrastructure. Slices of this infrastructure can provide architectural advantages in security, meeting performance expectations, and billing. Dynamic slices can provide some of the same kinds of surge capabilities for which traditional cloud computing is prized. Slices can be implemented using a variety of orchestration techniques.

本地基础设施是一种具有高带宽和低延迟邻近的所有元素的基础设施。它通常将边缘计算元素与相邻的接入网相结合。越来越多的通信设备和事物产生了在本地基础设施中收集和集成本地信息的巨大且经常稳定的需求。这种基础设施的片段可以在安全性、满足性能期望和计费方面提供体系结构优势。动态切片可以提供一些与传统云计算所看重的相同类型的激增能力。可以使用各种编排技术来实现切片。

引用次数: 0

Software-defined consistency group abstractions for virtual machines 软件定义的虚拟机一致性组抽象

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955198

Muntasir Raihan Rahman, Sudarsan Piduri, Ilya Languev, Rean Griffith, Indranil Gupta

In this paper we propose a practical scalable software-level mechanism for taking crash-consistent snapshots of a group of virtual machines. The group is dynamically defined at the software virtualization layer allowing us to move the consistency group abstraction from the hardware array layer into the hypervisor with very low overhead (~ 50 msecs VM freeze time). This low overhead allows us to take crash-consistent snapshots of large software-defined consistency groups at a reasonable frequency, guaranteeing low data loss for disaster recovery. To demonstrate practicality, we use our mechanism to take crash-consistent snapshots of multi-disk virtual machines running two database applications: PostgreSQL, and Apache Cassandra. Deployment experiments confirm that our mechanism scales well with number of VMs, and snapshot times remain invariant of virtual disk size and usage.

在本文中，我们提出了一种实用的可扩展软件级机制，用于获取一组虚拟机的崩溃一致快照。组是在软件虚拟化层动态定义的，允许我们以非常低的开销(~ 50 msecs VM冻结时间)将一致性组抽象从硬件阵列层移动到管理程序中。这种低开销使我们能够以合理的频率对大型软件定义一致性组进行崩溃一致性快照，从而保证灾难恢复时的低数据丢失。为了演示实用性，我们使用我们的机制对运行两个数据库应用程序(PostgreSQL和Apache Cassandra)的多磁盘虚拟机进行崩溃一致性快照。部署实验证实，我们的机制可以很好地扩展虚拟机数量，并且快照时间与虚拟磁盘大小和使用情况保持不变。

引用次数: 2

Next generation virtual network architecture for multi-tenant distributed clouds: challenges and emerging techniques 多租户分布式云的下一代虚拟网络架构:挑战和新兴技术

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955194

J. Mambretti, J. Chen, F. Yeh

Providing services for multiple tenants within a single or federated distributed cloud environment requires a variety of special considerations related to network design, provisioning, and operations. Especially important are multiple topics concerning the implementation of multiple parallel programmable virtual networks for large numbers of tenants, who require autonomous management, control, and data planes. This paper provides an overview of some of the challenges that arise from developing and implementing parallel programmable virtual networks, describes experiences with several experimental techniques for addressing those challenges based on large scale distributed testbeds, and presents the results of the experiments that were conducted. Distributed environments used include a distributed cloud testbed, the Chameleon Cloud, sponsored by the National Science Foundation's NSFCloud program, the NSF's Global Environment for Network Innovations (GENI), an international distributed OpenFlow testbed, and the Open Science Data Cloud.

在单个或联合分布式云环境中为多个租户提供服务，需要考虑与网络设计、供应和操作相关的各种特殊问题。特别重要的是涉及为大量租户实现多个并行可编程虚拟网络的多个主题，这些租户需要自治的管理、控制和数据平面。本文概述了开发和实现并行可编程虚拟网络所面临的一些挑战，描述了基于大规模分布式测试平台解决这些挑战的几种实验技术的经验，并介绍了所进行的实验结果。使用的分布式环境包括一个分布式云测试平台，变色龙云，由国家科学基金会的NSFCloud项目赞助，NSF的全球网络创新环境(GENI)，一个国际分布式OpenFlow测试平台，以及开放科学数据云。

引用次数: 4

New techniques to curtail the tail latency in stream processing systems 减少流处理系统尾部延迟的新技术

Proceedings. Data Compression Conference

Pub Date : 2016-07-25 DOI: 10.1145/2955193.2955206

Guangxiang Du, Indranil Gupta

This paper presents a series of novel techniques for reducing the tail latency in stream processing systems like Apache Storm. Concretely, we present three mechanisms: (1) adaptive timeout coupled with selective replay to catch straggler tuples; (2) shared queues among different tasks of the same operator to reduce overall queueing delay; (3) latency feedback-based load balancing, intended to mitigate heterogenous scenarios. We have implemented these techniques in Apache Storm, and present experimental results using sets of micro-benchmarks as well as two topologies from Yahoo! Inc. Our results show improvement in tail latency up to 72.9%.

本文提出了一系列减少流处理系统(如Apache Storm)尾部延迟的新技术。具体来说，我们提出了三种机制:(1)自适应超时与选择性重放相结合来捕获离散元组;(2)在同一运营商的不同任务之间共享队列，降低整体排队延迟;(3)基于延迟反馈的负载均衡，旨在缓解异构场景。我们已经在Apache Storm中实现了这些技术，并展示了使用微基准测试集和雅虎的两种拓扑的实验结果。公司。我们的结果表明，尾部延迟提高了72.9%。

引用次数: 5

下一页尾页

类型

全部化学•材料生命科学医学物理工程技术环境•农林材料科学地球科学法学管理学化学环境科学与生态学计算机科学教育学经济学农林科学人文科学生物学数学物理与天体物理心理学综合性期刊其他工业工程理学历史学农学文学信息工程

数据库

全部 ACS Publications Elsevier ieeexplore Springer The Royal Society of Chemistry Wiley

期刊

Proceedings. Data Compression Conference

全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.

﹀