
Latest Publications in ACM Transactions on Storage

Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-19 DOI: 10.1145/3568424
Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, J. Shu
Object-based storage systems have been widely used for various scenarios such as file storage, block storage, blob (e.g., large videos) storage, and so on, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and will cause significant performance degradation when the expansion is nontrivial. This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store (called Oasis). Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
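
To make the time-dimension mapping concrete, here is a minimal Python sketch of the layer-selection idea: an object's creation time selects the expansion layer that was newest when the object was created, and a hash then places it among that layer's OSDs, so existing objects never move when a new layer is added. The layer table, OSD names, and hash-based placement are illustrative stand-ins; the actual MapX manipulates placement-group timestamps and runs CRUSH within each layer.

```python
import bisect
import hashlib

# Hypothetical layer table: cluster creation plus two expansions, and the
# OSDs introduced by each layer (illustrative names, not a real cluster).
LAYER_TIMES = [0.0, 1_000.0, 2_000.0]
LAYER_OSDS = {
    0: ["osd.0", "osd.1", "osd.2", "osd.3"],
    1: ["osd.4", "osd.5"],
    2: ["osd.6", "osd.7"],
}

def place(obj_id: str, obj_ctime: float, replicas: int = 2) -> list[str]:
    # Time-dimension mapping: pick the last expansion layer created
    # before the object, so later expansions cannot remap this object.
    layer = bisect.bisect_right(LAYER_TIMES, obj_ctime) - 1
    osds = LAYER_OSDS[layer]
    # Deterministic pseudo-random placement within the layer
    # (a stand-in for running CRUSH on the layer's subtree).
    h = int(hashlib.sha256(obj_id.encode()).hexdigest(), 16)
    return [osds[(h + i) % len(osds)] for i in range(replicas)]

print(place("rbd_obj_42", obj_ctime=500.0))    # served by layer-0 OSDs
print(place("rbd_obj_43", obj_ctime=1_500.0))  # served by layer-1 OSDs
```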
{"title":"Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems","authors":"Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, J. Shu","doi":"10.1145/3568424","DOIUrl":"https://doi.org/10.1145/3568424","url":null,"abstract":"Object-based storage systems have been widely used for various scenarios such as file storage, block storage, blob (e.g., large videos) storage, and so on, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and will cause significant performance degradation when the expansion is nontrivial. This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store (called Oasis). Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy in migrating objects after expansions) by 3.17× ∼ 4.31× in tail latency, and 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 22"},"PeriodicalIF":1.7,"publicationDate":"2022-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44156164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The what, The from, and The to: The Migration Games in Deduplicated Systems
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-15 DOI: https://dl.acm.org/doi/10.1145/3565025
Roei Kisous, Ariel Kolikant, Abhinav Duggal, Sarai Sheinvald, Gala Yadgar

Deduplication reduces the size of the data stored in large-scale storage systems by replacing duplicate data blocks with references to their unique copies. This creates dependencies between files that contain similar content and complicates the management of data in the system. In this article, we address the problem of data migration, in which files are remapped between different volumes as a result of system expansion or maintenance. The challenge of determining which files and blocks to migrate has been studied extensively for systems without deduplication. In the context of deduplicated storage, however, only simplified migration scenarios have been considered.

In this article, we formulate the general migration problem for deduplicated systems as an optimization problem whose objective is to minimize the system’s size while ensuring that the storage load is evenly distributed between the system’s volumes and that the network traffic required for the migration does not exceed its allocation.

We then present three algorithms for generating effective migration plans, each based on a different approach and representing a different trade-off between computation time and migration efficiency. Our greedy algorithm provides modest space savings but is appealing thanks to its exceptionally short runtime. Its results can be improved by using larger system representations. Our theoretically optimal algorithm formulates the migration problem as an integer linear programming (ILP) instance. Its migration plans consistently result in smaller and more balanced systems than those of the greedy approach, although its runtime is long and, as a result, the theoretical optimum is not always found. Our clustering algorithm enjoys the best of both worlds: its migration plans are comparable to those generated by the ILP-based algorithm, but its runtime is shorter, sometimes by an order of magnitude. It can be further accelerated at a modest cost in the quality of its results.
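
As a rough illustration of the greedy approach, the sketch below repeatedly applies the single file move that most shrinks the total deduplicated size while staying within a traffic budget. The {file: block-fingerprint set} layout and the cost model (traffic = blocks shipped) are simplifying assumptions, and the load-balancing constraint and the paper's optimizations are omitted.

```python
def vol_blocks(file_ids, files):
    """Union of the block fingerprints stored by a set of files."""
    blocks = set()
    for f in file_ids:
        blocks |= files[f]
    return blocks

def greedy_plan(files, volumes, traffic_budget):
    """files: {file_id: frozenset(block_fp)}; volumes: {vol: set(file_id)}.
    Greedily applies the most size-reducing single file move each round."""
    plan, traffic = [], 0
    while True:
        blocks = {v: vol_blocks(fs, files) for v, fs in volumes.items()}
        best = None                                   # (delta, cost, f, src, dst)
        for src, fs in volumes.items():
            for f in fs:
                rest = vol_blocks(fs - {f}, files)
                freed = len(blocks[src]) - len(rest)  # blocks only f holds on src
                for dst in volumes:
                    if dst == src:
                        continue
                    added = len(files[f] - blocks[dst])  # blocks new to dst
                    delta, cost = added - freed, added
                    if traffic + cost <= traffic_budget and \
                            (best is None or delta < best[0]):
                        best = (delta, cost, f, src, dst)
        if best is None or best[0] >= 0:              # no size-reducing move left
            return plan
        _, cost, f, src, dst = best
        volumes[src].remove(f)
        volumes[dst].add(f)
        plan.append((f, src, dst))
        traffic += cost
```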

{"title":"The what, The from, and The to: The Migration Games in Deduplicated Systems","authors":"Roei Kisous, Ariel Kolikant, Abhinav Duggal, Sarai Sheinvald, Gala Yadgar","doi":"https://dl.acm.org/doi/10.1145/3565025","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3565025","url":null,"abstract":"<p>Deduplication reduces the size of the data stored in large-scale storage systems by replacing duplicate data blocks with references to their unique copies. This creates dependencies between files that contain similar content and complicates the management of data in the system. In this article, we address the problem of data migration, in which files are remapped between different volumes as a result of system expansion or maintenance. The challenge of determining which files and blocks to migrate has been studied extensively for systems without deduplication. In the context of deduplicated storage, however, only simplified migration scenarios have been considered.</p><p>In this article, we formulate the general migration problem for deduplicated systems as an optimization problem whose objective is to minimize the system’s size while ensuring that the storage load is evenly distributed between the system’s volumes and that the network traffic required for the migration does not exceed its allocation.</p><p>We then present three algorithms for generating effective migration plans, each based on a different approach and representing a different trade-off between computation time and migration efficiency. Our <i>greedy algorithm</i> provides modest space savings but is appealing thanks to its exceptionally short runtime. Its results can be improved by using larger system representations. Our <i>theoretically optimal algorithm</i> formulates the migration problem as an integer linear programming (ILP) instance. Its migration plans consistently result in smaller and more balanced systems than those of the greedy approach, although its runtime is long and, as a result, the theoretical optimum is not always found. Our <i>clustering algorithm</i> enjoys the best of both worlds: its migration plans are comparable to those generated by the ILP-based algorithm, but its runtime is shorter, sometimes by an order of magnitude. It can be further accelerated at a modest cost in the quality of its results.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"44 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving the Endurance of Next Generation SSD's using WOM-v Codes
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-14 DOI: 10.1145/3565027
Shehbaz Jaffer, K. Mahdaviani, Bianca Schroeder
High density Solid State Drives, such as QLC drives, offer increased storage capacity, but an order of magnitude fewer Program and Erase (P/E) cycles, limiting their endurance and hence usability. We present the design and implementation of non-binary, Voltage-Based Write-Once-Memory (WOM-v) Codes to improve the lifetime of QLC drives. First, we develop a FEMU based simulator test-bed to evaluate the gains of WOM-v codes on real world workloads. Second, we propose and implement two optimizations, an efficient garbage collection mechanism and an encoding optimization, to drastically improve WOM-v code endurance without compromising performance. Third, we propose analytical approaches to obtain estimates of the endurance gains under WOM-v codes. We analyze the Greedy garbage collection technique with uniform page access distribution and the Least Recently Written (LRW) garbage collection technique with skewed page access distribution in the context of WOM-v codes. We find that although both approaches overestimate the number of required erase operations, the model based on greedy garbage collection with uniform page access distribution provides tighter bounds. A careful evaluation, including microbenchmarks and trace-driven evaluation, demonstrates that WOM-v codes can reduce erase cycles for QLC drives by 4.4×–11.1× for real world workloads with minimal performance overheads, resulting in improved QLC SSD lifetime.
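
For intuition about write-once-memory coding, the sketch below implements the classic binary two-write WOM code of Rivest and Shamir, which stores 2 bits in 3 single-level cells and supports a second write between erases using only 0-to-1 transitions. It is a binary analogue offered for illustration; the paper's WOM-v codes generalize the idea to the multi-level voltage states of QLC cells.

```python
# First-write codewords: data value -> 3-cell pattern of weight <= 1.
FIRST = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}
# Second-write codewords are the bitwise complements (weight >= 2), so any
# second-generation codeword is reachable from any first-generation one.
SECOND = {v: 0b111 ^ c for v, c in FIRST.items()}

def wom_decode(cells: int) -> int:
    # Weight <= 1 means the cells still hold a first-generation codeword.
    table = FIRST if bin(cells).count("1") <= 1 else SECOND
    return next(v for v, c in table.items() if c == cells)

def wom_write(cells: int, value: int) -> int:
    """Program a 2-bit value into 3 cells; only 0->1 flips are allowed.
    Guarantees two writes per erase (a third write may need an erase)."""
    if wom_decode(cells) == value:
        return cells                       # same value: nothing to program
    target = FIRST[value] if cells == 0 else SECOND[value]
    assert target & cells == cells, "would need an erase"
    return target

cells = wom_write(0b000, 0b10)   # first write  -> 010
cells = wom_write(cells, 0b11)   # second write -> 110, no erase needed
assert wom_decode(cells) == 0b11
```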
{"title":"Improving the Endurance of Next Generation SSD’s using WOM-v Codes","authors":"Shehbaz Jaffer, K. Mahdaviani, Bianca Schroeder","doi":"10.1145/3565027","DOIUrl":"https://doi.org/10.1145/3565027","url":null,"abstract":"High density Solid State Drives, such as QLC drives, offer increased storage capacity, but a magnitude lower Program and Erase (P/E) cycles, limiting their endurance and hence usability. We present the design and implementation of non-binary, Voltage-Based Write-Once-Memory (WOM-v) Codes to improve the lifetime of QLC drives. First, we develop a FEMU based simulator test-bed to evaluate the gains of WOM-v codes on real world workloads. Second, we propose and implement two optimizations, an efficient garbage collection mechanism and an encoding optimization to drastically improve WOM-v code endurance without compromising performance. Third, we propose analytical approaches to obtain estimates of the endurance gains under WOM-v codes. We analyze the Greedy garbage collection technique with uniform page access distribution and the Least Recently Written (LRW) garbage collection technique with skewed page access distribution in the context of WOM-v codes. We find that although both approaches overestimate the number of required erase operations, the model based on greedy garbage collection with uniform page access distribution provides tighter bounds. A careful evaluation, including microbenchmarks and trace-driven evaluation, demonstrates that WOM-v codes can reduce Erase cycles for QLC drives by 4.4×–11.1× for real world workloads with minimal performance overheads resulting in improved QLC SSD lifetime.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"18 1","pages":"1 - 32"},"PeriodicalIF":1.7,"publicationDate":"2022-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43627797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A Disk Failure Prediction Method Based on Active Semi-supervised Learning
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-12 DOI: https://dl.acm.org/doi/10.1145/3523699
Yang Zhou, Fang Wang, Dan Feng

Disk failure has always been a major problem for data centers, leading to data loss. Current disk failure prediction approaches are mostly offline and assume that the disk labels required for training learning models are available and accurate. However, these offline methods are no longer suitable for disk failure prediction tasks in large-scale data centers: behind this explosive amount of data, most methods do not consider that label values may be hard to obtain during training, or that the obtained label values may not be completely accurate. These problems further restrict the development of supervised learning and offline modeling in disk failure prediction. In this article, Active Semi-supervised Learning Disk-failure Prediction (ASLDP), a novel disk failure prediction method, is proposed, which uses active learning and semi-supervised learning. According to the characteristics of data in the disk lifecycle, ASLDP carries out active learning on clearly labeled samples, selecting the valuable samples with the greatest probability uncertainty and eliminating redundancy. For samples that are unclearly labeled or unlabeled, ASLDP uses semi-supervised learning to pre-label them by calculating their conditional values, and it enhances generalization ability through active learning. Compared with several state-of-the-art offline and online learning approaches, the results on four realistic datasets from Backblaze and Baidu demonstrate that ASLDP achieves stable failure detection rates of 80–85% with low false alarm rates. In addition, we use a dataset from Alibaba to evaluate the generality of ASLDP. Furthermore, ASLDP can overcome the problems of missing sample labels and data redundancy in large data centers, which, to the best of our knowledge, are not considered in existing offline learning methods for disk failure prediction. Finally, ASLDP can predict disk failures 4.9 days in advance with lower overhead and latency.
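
The sketch below shows one generic round combining the two ingredients named above: confidence-based pseudo-labeling (the semi-supervised step) and uncertainty sampling (the active step). The classifier choice, thresholds, and budget are illustrative assumptions; ASLDP's actual conditional-value computation and lifecycle-aware sample selection are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def active_semi_supervised_round(X_lab, y_lab, X_pool,
                                 label_budget=50, pseudo_threshold=0.95):
    """Train on labeled disk samples, pseudo-label confident pool samples,
    and pick the most uncertain pool samples for manual labeling."""
    model = RandomForestClassifier(n_estimators=100).fit(X_lab, y_lab)
    proba = model.predict_proba(X_pool)
    conf = proba.max(axis=1)
    # Semi-supervised step: adopt high-confidence predictions as labels.
    pseudo_idx = np.where(conf >= pseudo_threshold)[0]
    pseudo_y = proba[pseudo_idx].argmax(axis=1)
    # Active step: query an expert on the least confident samples.
    query_idx = np.argsort(1.0 - conf)[-label_budget:]
    return model, pseudo_idx, pseudo_y, query_idx
```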

{"title":"A Disk Failure Prediction Method Based on Active Semi-supervised Learning","authors":"Yang Zhou, Fang Wang, Dan Feng","doi":"https://dl.acm.org/doi/10.1145/3523699","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3523699","url":null,"abstract":"<p>Disk failure has always been a major problem for data centers, leading to data loss. Current disk failure prediction approaches are mostly offline and assume that the disk labels required for training learning models are available and accurate. However, these offline methods are no longer suitable for disk failure prediction tasks in large-scale data centers. Behind this explosive amount of data, most methods do not consider whether it is not easy to get the label values during the training or the obtained label values are not completely accurate. These problems further restrict the development of supervised learning and offline modeling in disk failure prediction. In this article, Active Semi-supervised Learning Disk-failure Prediction (<i>ASLDP</i>), a novel disk failure prediction method is proposed, which uses active learning and semi-supervised learning. According to the characteristics of data in the disk lifecycle, <i>ASLDP</i> carries out active learning for those clear labeled samples, which selects valuable samples with the most significant probability uncertainty and eliminates redundancy. For those samples that are unclearly labeled or unlabeled, <i>ASLDP</i> uses semi-supervised learning for pre-labeled by calculating the conditional values of the samples and enhances the generalization ability by active learning. Compared with several state-of-the-art offline and online learning approaches, the results on four realistic datasets from Backblaze and Baidu demonstrate that <i>ASLDP</i> achieves stable failure detection rates of 80–85% with low false alarm rates. In addition, we use a dataset from Alibaba to evaluate the generality of <i>ASLDP</i>. Furthermore, <i>ASLDP</i> can overcome the problem of missing sample labels and data redundancy in large data centers, which are not considered and implemented in all offline learning methods for disk failure prediction to the best of our knowledge. Finally, <i>ASLDP</i> can predict the disk failure 4.9 days in advance with lower overhead and latency.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"69 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Ares: Adaptive, Reconfigurable, Erasure coded, Atomic Storage
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-12 DOI: https://dl.acm.org/doi/10.1145/3510613
Nicolas Nicolaou, Viveck Cadambe, N. Prakash, Andria Trigeorgi, Kishori Konwar, Muriel Medard, Nancy Lynch

Emulating a shared atomic, read/write storage system is a fundamental problem in distributed computing. Replicating atomic objects among a set of data hosts was the norm for traditional implementations (e.g., [11]) in order to guarantee the availability and accessibility of the data despite host failures. As replication is highly storage demanding, recent approaches suggested the use of erasure codes to offer the same fault tolerance while optimizing storage usage at the hosts. Initial works focused on a fixed set of data hosts. To guarantee longevity and scalability, a storage service should be able to dynamically mask host failures by allowing new hosts to join and failed hosts to be removed without service interruptions. This work presents the first erasure-code-based atomic algorithm, called Ares, which allows the set of hosts to be modified in the course of an execution. Ares is composed of three main components: (i) a reconfiguration protocol, (ii) a read/write protocol, and (iii) a set of data access primitives (DAPs). The design of Ares is modular, so as to accommodate the usage of various erasure-code parameters on a per-configuration basis. We provide bounds on the latency of read/write operations and analyze the storage and communication costs of the Ares algorithm.
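
For context, here is a minimal sketch of the classic ABD-style two-phase quorum emulation that read/write protocols like that of Ares build on: a query phase discovers the highest tag, and a propagate (write-back) phase makes the result durable at a majority. This replication-only toy omits everything that distinguishes Ares: the DAP abstraction, erasure-coded fragments in place of full replicas, and reconfiguration.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class Tag:            # (counter, writer id) gives a total order on writes
    ts: int = 0
    wid: str = ""

class Server:
    def __init__(self):
        self.tag, self.val = Tag(), None
    def get(self):
        return self.tag, self.val
    def put(self, tag, val):          # keep only the newest (tag, value) pair
        if tag > self.tag:
            self.tag, self.val = tag, val

def quorum(servers):                  # any majority; a minority may crash
    return random.sample(servers, len(servers) // 2 + 1)

def write(servers, wid, val):
    max_tag = max(s.get()[0] for s in quorum(servers))   # phase 1: query
    for s in quorum(servers):                            # phase 2: propagate
        s.put(Tag(max_tag.ts + 1, wid), val)

def read(servers):
    tag, val = max((s.get() for s in quorum(servers)), key=lambda p: p[0])
    for s in quorum(servers):         # write-back phase makes reads atomic
        s.put(tag, val)
    return val

servers = [Server() for _ in range(5)]
write(servers, "w1", b"hello")
print(read(servers))                  # b'hello'
```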

{"title":"Ares: Adaptive, Reconfigurable, Erasure coded, Atomic Storage","authors":"Nicolas Nicolaou, Viveck Cadambe, N. Prakash, Andria Trigeorgi, Kishori Konwar, Muriel Medard, Nancy Lynch","doi":"https://dl.acm.org/doi/10.1145/3510613","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3510613","url":null,"abstract":"<p>Emulating a shared <i>atomic</i>, read/write storage system is a fundamental problem in distributed computing. Replicating atomic objects among a set of data hosts was the norm for traditional implementations (e.g., [11]) in order to guarantee the availability and accessibility of the data despite host failures. As replication is highly storage demanding, recent approaches suggested the use of erasure-codes to offer the same fault-tolerance while optimizing storage usage at the hosts. Initial works focused on a fixed set of data hosts. To guarantee longevity and scalability, a storage service should be able to dynamically mask hosts failures by allowing new hosts to join, and failed host to be removed without service interruptions. This work presents the first erasure-code -based atomic algorithm, called <span>Ares</span>, which allows the set of hosts to be modified in the course of an execution. <span>Ares</span> is composed of three main components: (i) a <i>reconfiguration protocol</i>, (ii) a <i>read/write protocol</i>, and (iii) a set of <i>data access primitives</i> <i>(DAPs)</i>. The design of <span>Ares</span> is modular and is such to accommodate the usage of various erasure-code parameters on a per-configuration basis. We provide bounds on the latency of read/write operations and analyze the storage and communication costs of the <span>Ares</span> algorithm.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"4320 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138542298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Tunable Encrypted Deduplication with Attack-resilient Key Management
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-11 DOI: https://dl.acm.org/doi/10.1145/3510614
Zuoru Yang, Jingwei Li, Yanjing Ren, Patrick P. C. Lee

Conventional encrypted deduplication approaches retain the deduplication capability on duplicate chunks after encryption by always deriving the key for encryption/decryption from the chunk content, but such a deterministic nature causes information leakage due to frequency analysis. We present TED, a tunable encrypted deduplication primitive that provides a tunable mechanism for balancing the tradeoff between storage efficiency and data confidentiality. The core idea of TED is that its key derivation is based on not only the chunk content but also the number of duplicate chunk copies, such that duplicate chunks are encrypted by distinct keys in a controlled manner. In particular, TED allows users to configure a storage blowup factor, under which the information leakage quantified by an information-theoretic measure is minimized for any input workload. In addition, we extend TED with a distributed key management architecture and propose two attack-resilient key generation schemes that trade off performance against fault tolerance. We implement an encrypted deduplication prototype TEDStore to realize TED in networked environments. Evaluation on real-world file system snapshots shows that TED effectively balances the tradeoff between storage efficiency and data confidentiality, with small performance overhead.
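
A minimal sketch of the tunable idea, under loudly-labeled assumptions: duplicates of a chunk share a key only within groups of t copies, so a chunk with c duplicates is stored as roughly c/t ciphertexts instead of one, weakening frequency analysis as t shrinks. The grouping rule, HMAC construction, and parameter names are illustrative; TED's actual key derivation and its distributed key-manager protocol differ.

```python
import hashlib
import hmac

def ted_like_key(server_secret: bytes, chunk: bytes,
                 dup_count: int, t: int) -> bytes:
    """Derive a chunk key from its content and a bucketed duplicate count.
    Large t approaches convergent encryption (max dedup, max leakage);
    t = 1 yields a fresh key per copy (no dedup, no frequency leakage)."""
    fp = hashlib.sha256(chunk).digest()   # content fingerprint
    group = dup_count // t                # copies in one group share a key
    msg = fp + group.to_bytes(8, "big")
    return hmac.new(server_secret, msg, hashlib.sha256).digest()
```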

{"title":"Tunable Encrypted Deduplication with Attack-resilient Key Management","authors":"Zuoru Yang, Jingwei Li, Yanjing Ren, Patrick P. C. Lee","doi":"https://dl.acm.org/doi/10.1145/3510614","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3510614","url":null,"abstract":"<p>Conventional encrypted deduplication approaches retain the deduplication capability on duplicate chunks after encryption by always deriving the key for encryption/decryption from the chunk content, but such a deterministic nature causes information leakage due to frequency analysis. We present <sans-serif>TED</sans-serif>, a tunable encrypted deduplication primitive that provides a tunable mechanism for balancing the tradeoff between storage efficiency and data confidentiality. The core idea of <sans-serif>TED</sans-serif> is that its key derivation is based on not only the chunk content but also the number of duplicate chunk copies, such that duplicate chunks are encrypted by distinct keys in a controlled manner. In particular, <sans-serif>TED</sans-serif> allows users to configure a storage blowup factor, under which the information leakage quantified by an information-theoretic measure is minimized for any input workload. In addition, we extend <sans-serif>TED</sans-serif> with a distributed key management architecture and propose two attack-resilient key generation schemes that trade between performance and fault tolerance. We implement an encrypted deduplication prototype <sans-serif>TEDStore</sans-serif> to realize <sans-serif>TED</sans-serif> in networked environments. Evaluation on real-world file system snapshots shows that <sans-serif>TED</sans-serif> effectively balances the tradeoff between storage efficiency and data confidentiality, with small performance overhead.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"68 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Toward Fast and Scalable Random Walks over Disk-Resident Graphs via Efficient I/O Management
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-11 DOI: https://dl.acm.org/doi/10.1145/3533579
Rui Wang, Yongkun Li, Yinlong Xu, Hong Xie, John C. S. Lui, Shuibing He

Traditional graph systems mainly use the iteration-based model, which iteratively loads graph blocks into memory for analysis so as to reduce random I/Os. However, this iteration-based model limits the efficiency and scalability of running random walks, a fundamental technique for analyzing large graphs. In this article, we first propose a state-aware I/O model to improve the I/O efficiency of running random walks, then we develop a block-centric indexing and buffering scheme for managing walk data, and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, GraphWalker, which efficiently handles very large disk-resident graphs and scales to tens of billions of random walks on a single commodity machine. Experiments show that GraphWalker can achieve more than an order of magnitude speedup when compared with DrunkardMob, which is tailored for random walks based on the classical graph system GraphChi, as well as two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, when compared with the most recent distributed system KnightKing, GraphWalker still achieves comparable performance with only a single machine, thereby making it a more cost-effective alternative.
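
The sketch below illustrates the state-aware scheduling idea in isolation: pending walks are buffered per graph block, and each round loads the block holding the most walks and advances them until they either finish or cross into another block, amortizing one block read over many walk steps. The in-memory dict-of-dicts stands in for on-disk subgraphs; GraphWalker's indexing, buffering, and asynchronous update machinery are omitted.

```python
import random
from collections import defaultdict

def run_walks(blocks, index, num_walks, walk_len):
    """blocks: {block_id: {vertex: [neighbors]}}; index: {vertex: block_id}."""
    pending = defaultdict(list)           # block -> [(vertex, steps_left)]
    vertices = list(index)
    for _ in range(num_walks):
        v = random.choice(vertices)
        pending[index[v]].append((v, walk_len))
    while pending:
        b = max(pending, key=lambda k: len(pending[k]))  # state-aware pick
        graph = blocks[b]                 # one sequential "block load"
        for v, steps in pending.pop(b):
            while steps > 0 and index[v] == b:
                nbrs = graph[v]
                if not nbrs:              # dead end: walk terminates early
                    steps = 0
                    break
                v = random.choice(nbrs)
                steps -= 1
            if steps > 0:                 # walk crossed into another block
                pending[index[v]].append((v, steps))

blocks = {0: {"a": ["b"], "b": ["a", "c"]}, 1: {"c": ["a"]}}
index = {"a": 0, "b": 0, "c": 1}
run_walks(blocks, index, num_walks=100, walk_len=10)
```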

{"title":"Toward Fast and Scalable Random Walks over Disk-Resident Graphs via Efficient I/O Management","authors":"Rui Wang, Yongkun Li, Yinlong Xu, Hong Xie, John C. S. Lui, Shuibing He","doi":"https://dl.acm.org/doi/10.1145/3533579","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3533579","url":null,"abstract":"<p>Traditional graph systems mainly use the iteration-based model, which iteratively loads graph blocks into memory for analysis so as to reduce random I/Os. However, this iteration-based model limits the efficiency and scalability of running random walk, which is a fundamental technique to analyze large graphs. In this article, we first propose a state-aware I/O model to improve the I/O efficiency of running random walk, then we develop a block-centric indexing and buffering scheme for managing walk data, and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, <sans-serif>GraphWalker</sans-serif>, which is efficient to handle very large disk-resident graphs and also scalable to run tens of billions of random walks with only a single commodity machine. Experiments show that <sans-serif>GraphWalker</sans-serif> can achieve more than an order of magnitude speedup when compared with DrunkardMob, which is tailored for random walks based on the classical graph system GraphChi, as well as two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, when compared with the most recent distributed system KnightKing, <sans-serif>GraphWalker</sans-serif> still achieves comparable performance with only a single machine, thereby making it a more cost-effective alternative.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"68 7","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Introduction to the Special Section on USENIX FAST 2022
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-11 DOI: 10.1145/3564770
Hildebrand Dean, Donald Porter
This paper presents a detailed investigation and innovative solution for reducing metadata overheads in persistent memory file systems. The authors show that by representing each file as a contiguous region of virtual memory, lookup performance in the file system's indexes can improve real-world workloads such as LevelDB by up to 3.6× over existing solutions.
{"title":"Introduction to the Special Section on USENIX FAST 2022","authors":"Hildebrand Dean, Donald Porter","doi":"10.1145/3564770","DOIUrl":"https://doi.org/10.1145/3564770","url":null,"abstract":". This paper presents a detailed investigation and innovative solu-tion for reducing metadata overheads in persistent memory file systems. The authors show that by representing each file as a contiguous region of virtual memory, lookup performance in the file system’s indexes can improve real world workloads such as LevelDB by up to 3.6 × over existing solutions.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"18 1","pages":"1 - 2"},"PeriodicalIF":1.7,"publicationDate":"2022-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46108895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory
IF 1.7 CAS Tier 3 (Computer Science) Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2022-11-09 DOI: 10.1145/3565026
Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, M. Stumm, Ding Yuan
Persistent byte-addressable memory (PM) is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the Memory Management Unit (MMU) just as accesses to volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup: for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory. We propose a novel contiguous file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from a file offset to an address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.
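
To illustrate the shift from index lookup to arithmetic, the toy sketch below contrasts extent-based translation (a sorted search per access, in the spirit of ext4-DAX) with contiguous-file translation, where an offset lookup is a single addition. Structures and numbers are illustrative; in the real ctFS the remaining virtual-to-physical step is performed by the MMU's page tables in hardware rather than by software.

```python
import bisect

class ExtentFile:
    """Extent-tree-style translation: search a sorted extent list per access."""
    def __init__(self, extents):          # [(file_off, phys_addr, length)]
        self.extents = sorted(extents)
        self.starts = [e[0] for e in self.extents]
    def addr(self, off):
        i = bisect.bisect_right(self.starts, off) - 1
        start, phys, length = self.extents[i]
        assert start <= off < start + length, "offset not mapped"
        return phys + (off - start)       # O(log #extents) software lookup

class ContiguousFile:
    """ctFS-style translation: the whole file is one contiguous virtual range."""
    def __init__(self, base_vaddr, length):
        self.base, self.length = base_vaddr, length
    def addr(self, off):
        assert 0 <= off < self.length
        return self.base + off            # O(1); the MMU does the rest

f1 = ExtentFile([(0, 0x1000, 4096), (4096, 0x9000, 4096)])
f2 = ContiguousFile(base_vaddr=0x7F0000000000, length=8192)
assert f1.addr(4100) == 0x9004 and f2.addr(4100) == 0x7F0000000000 + 4100
```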
{"title":"ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory","authors":"Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, M. Stumm, Ding Yuan","doi":"10.1145/3565026","DOIUrl":"https://doi.org/10.1145/3565026","url":null,"abstract":"Persistent byte-addressable memory (PM) is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the Memory Management Unit (MMU) just as accesses with volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup—for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory. We propose a novel contiguous file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from the file offset to the address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"18 1","pages":"1 - 24"},"PeriodicalIF":1.7,"publicationDate":"2022-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46859821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11