Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems
Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, Jiwu Shu
Pub Date: 2023-01-19 | DOI: 10.1145/3568424
Object-based storage systems are widely used in scenarios such as file, block, and blob (e.g., large video) storage, where data is placed across a large number of object storage devices (OSDs). Data placement is critical to the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when the cluster is expanded (i.e., when new OSDs are added); this migration is inherent to CRUSH and causes significant performance degradation when the expansion is nontrivial.
This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to control data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store called Oasis. Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
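To make the time-dimension mapping concrete, here is a minimal sketch (Python; the timestamps, layer/OSD names, and the hash-based OSD choice are illustrative assumptions standing in for CRUSH, not the MapX implementation): each object is mapped to the expansion layer that was newest at its creation time, so data created before an expansion never moves onto the newly added layer.

```python
import bisect
import hashlib

# Hypothetical model: each cluster expansion adds a "layer" with its own OSDs.
# expansion_times[i] is the time at which layer i was added (layer 0 = initial cluster).
expansion_times = [0, 1_000_000, 2_000_000]                     # assumed epoch timestamps
layer_osds = [["osd.0", "osd.1"], ["osd.2", "osd.3"], ["osd.4", "osd.5"]]

def layer_for(object_create_time: int) -> int:
    """Map an object's creation time to the expansion layer current at that time."""
    return bisect.bisect_right(expansion_times, object_create_time) - 1

def place(object_id: str, object_create_time: int) -> str:
    """Pick an OSD within the object's layer with a stable hash (stand-in for CRUSH)."""
    osds = layer_osds[layer_for(object_create_time)]
    h = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
    return osds[h % len(osds)]

# Objects created before an expansion keep mapping to the old layer's OSDs,
# so adding layer 2 does not migrate them.
print(place("rbd_data.1", 500_000))    # served by a layer-0 OSD
print(place("rbd_data.9", 2_500_000))  # served by a layer-2 OSD
```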
{"title":"Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems","authors":"Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, Jiwu Shu","doi":"https://dl.acm.org/doi/10.1145/3568424","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568424","url":null,"abstract":"<p>Object-based storage systems have been widely used for various scenarios such as file storage, block storage, blob (e.g., large videos) storage, and so on, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from <i>uncontrolled</i> data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and will cause significant performance degradation when the expansion is nontrivial.</p><p>This article presents <span>MapX</span>, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. <span>MapX</span> controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). <span>MapX</span> is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied <span>MapX</span> to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store (called <span>Oasis</span>). <span>Oasis</span> extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the <span>MapX</span>-based <span>Oasis</span> block store outperforms the CRUSH-based Ceph-RBD (which is busy in migrating objects after expansions) by 3.17× ∼ 4.31× in tail latency, and 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"42 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance Bug Analysis and Detection for Distributed Storage and Computing Systems
Jiaxin Li, Yiming Zhang, Shan Lu, Haryadi S. Gunawi, Xiaohui Gu, Feng Huang, Dongsheng Li
Pub Date: 2023-01-18 | DOI: 10.1145/3580281
This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce, and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf classifies the bugs into six categories (and 18 subcategories) by their root causes: resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and for the design of debugging tools. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category, blocking bugs, can be effectively addressed by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationships; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs in representative distributed storage and computing systems by observing system executions under small-scale workloads.
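PCatch finds such regions with static program analysis and an adapted happens-before model; as a much cruder, purely empirical stand-in (not the paper's technique), one could time a suspect region at a few workload sizes and flag it when its runtime grows clearly faster than linearly. The threshold below is an assumption.

```python
import math

def looks_superlinear(sizes, times, tolerance=1.2):
    """Flag a code region whose measured runtime grows faster than linearly in workload size.

    Fits the slope of log(time) vs. log(size); a slope well above 1 suggests the region's
    cost may blow up under large workloads. Purely illustrative; thresholds are assumptions.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope > tolerance

# A region whose time quadruples when the workload doubles (~quadratic) gets flagged.
print(looks_superlinear([100, 200, 400], [0.01, 0.04, 0.16]))   # True
print(looks_superlinear([100, 200, 400], [0.01, 0.02, 0.04]))   # False (linear growth)
```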
{"title":"Performance Bug Analysis and Detection for Distributed Storage and Computing Systems","authors":"Jiaxin Li, Yiming Zhang, Shan Lu, Haryadi S. Gunawi, Xiaohui Gu, Feng Huang, Dongsheng Li","doi":"10.1145/3580281","DOIUrl":"https://doi.org/10.1145/3580281","url":null,"abstract":"This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf is classified into six bug categories (and 18 bug subcategories) by their root causes; resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and debug tool designs. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category of blocking bugs can be effectively solved by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationship; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs of representative distributed storage and computing systems by observing system executions under small-scale workloads.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 33"},"PeriodicalIF":1.7,"publicationDate":"2023-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43597467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Boosting Cache Performance by Access Time Measurements
Gil Einziger, Omri Himelbrand, Erez Waisbard
Pub Date: 2023-01-13 | DOI: 10.1145/3572778
Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution. Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access-time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average across all experiments in our study.
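The abstract does not spell out the policies themselves; as a generic illustration of access time as a caching signal, the sketch below weights a GreedyDual-style eviction score by each item's measured miss latency (the class, fields, and eviction rule are assumptions, not the authors' algorithm).

```python
import time

class LatencyAwareCache:
    """Toy GreedyDual-style cache: items that are slow to fetch are kept longer.

    Illustrative only; the eviction rule and field names are assumptions.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.values = {}    # key -> cached value
        self.costs = {}     # key -> measured fetch latency (seconds)
        self.scores = {}    # key -> eviction score
        self.floor = 0.0    # score "inflation" carried over from the last eviction

    def get(self, key, fetch):
        if key in self.values:                       # hit: refresh score with known cost
            self.scores[key] = self.floor + self.costs[key]
            return self.values[key]
        start = time.perf_counter()                  # miss: measure the item's access time
        value = fetch(key)
        cost = time.perf_counter() - start
        if len(self.values) >= self.capacity:        # evict the lowest-score item
            victim = min(self.scores, key=self.scores.get)
            self.floor = self.scores[victim]
            for d in (self.values, self.costs, self.scores):
                d.pop(victim)
        self.values[key], self.costs[key] = value, cost
        self.scores[key] = self.floor + cost         # expensive-to-fetch items score higher
        return value
```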
{"title":"Boosting Cache Performance by Access Time Measurements","authors":"Gil Einziger, Omri Himelbrand, Erez Waisbard","doi":"10.1145/3572778","DOIUrl":"https://doi.org/10.1145/3572778","url":null,"abstract":"Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution. Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average when considering all experiments in our study.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":" ","pages":"1 - 29"},"PeriodicalIF":1.7,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46400680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

TPFS: A High-Performance Tiered File System for Persistent Memories and Disks
Shengan Zheng, Morteza Hoseinzadeh, S. Swanson, Linpeng Huang
Pub Date: 2023-01-13 | DOI: 10.1145/3580280
Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.
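The thresholds and profiling logic are not given in the abstract; a minimal sketch of the steering idea, with all constants and names assumed, might look like this: synchronous or small writes go to PM, read-hot data stays in PM, and large asynchronous, write-cold data is buffered in DRAM and coalesced to disk.

```python
from dataclasses import dataclass

# Assumed thresholds; TPFS's real policy is profile-driven and more elaborate.
SMALL_WRITE = 4 * 1024        # bytes
READ_HOT = 8                  # reads observed in the current profiling window

@dataclass
class FileStats:
    read_count: int = 0       # how often the file range was read recently

def steer(is_sync: bool, write_size: int, stats: FileStats) -> str:
    """Pick a tier for an incoming write, loosely following the description above."""
    if is_sync or write_size <= SMALL_WRITE:
        return "PM"           # synchronous / small writes need PM's low latency
    if stats.read_count >= READ_HOT:
        return "PM"           # read-hot data stays in the fast tier
    return "DRAM->disk"       # large, asynchronous, write-cold data is coalesced to disk

print(steer(is_sync=True, write_size=64 * 1024, stats=FileStats(read_count=0)))   # PM
print(steer(is_sync=False, write_size=1 << 20, stats=FileStats(read_count=1)))    # DRAM->disk
```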
{"title":"TPFS: A High-Performance Tiered File System for Persistent Memories and Disks","authors":"Shengan Zheng, Morteza Hoseinzadeh, S. Swanson, Linpeng Huang","doi":"10.1145/3580280","DOIUrl":"https://doi.org/10.1145/3580280","url":null,"abstract":"Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 28"},"PeriodicalIF":1.7,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47734810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization
Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu
Pub Date: 2023-01-11 | DOI: 10.1145/3568426
Inline deduplication removes redundant data in real time as data is being written to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after deduplication. Many rewrite algorithms aim to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance in data characteristics between data segments. As a result, they often fail to select an appropriate set of old containers for rewriting, so restoring a backup retrieves containers holding a substantial number of invalid chunks.
To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization and rewrites low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of the above features, we further propose another rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting trends in VCRC changes in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
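As a simplified sketch of the VCRC idea (the fixed cutoff below is an assumption; F-greedy and F-greedy+ adapt it from the VCRC distribution), a segment keeps references only to old containers that many of its duplicate chunks actually point to, and rewrites the rest.

```python
from collections import Counter

def plan_segment(chunks, min_vcrc=3):
    """Decide which duplicate chunks of a segment to rewrite.

    chunks: list of (fingerprint, container_id or None); None means the chunk is new.
    Containers referenced by fewer than min_vcrc chunks of this segment (low VCRC)
    are considered poorly utilized, so their chunks are rewritten into new containers.
    """
    vcrc = Counter(c for _, c in chunks if c is not None)   # valid container referenced counts
    dedup, rewrite = [], []
    for fp, container in chunks:
        if container is not None and vcrc[container] >= min_vcrc:
            dedup.append((fp, container))      # share with a well-utilized old container
        else:
            rewrite.append(fp)                 # new chunk, or fragmented duplicate: rewrite
    return dedup, rewrite

segment = [("a", "c1"), ("b", "c1"), ("c", "c1"), ("d", "c2"), ("e", None)]
print(plan_segment(segment))   # chunks in c1 are deduplicated; d and e are rewritten
```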
{"title":"InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization","authors":"Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu","doi":"https://dl.acm.org/doi/10.1145/3568426","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568426","url":null,"abstract":"<p>Inline deduplication removes redundant data in real-time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, when backups are restored, they often fail to select an appropriate set of old containers for rewrite, generating a substantial number of invalid chunks in retrieved containers. </p><p>To address this issue, we propose an inline deduplication approach for storage systems, called <i>InDe</i>, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, <i>valid container referenced counts (VCRC)</i>, to identify appropriate containers for the rewrite. We design a rewrite algorithm <i>F-greedy</i> that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called <i>F-greedy+</i> based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends of VCRC’s change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"58 8","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient Crash Consistency for NVMe over PCIe and RDMA
Xiaojian Liao, Youyou Lu, Zhe Yang, Jiwu Shu
Pub Date: 2023-01-11 | DOI: 10.1145/3568428
This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of NVMe that defines how host software communicates with non-volatile memory (e.g., a solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax for crash consistency and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling crash consistency to data dissemination. This new idea allows the storage system to achieve crash consistency by riding for free on NVMe's data dissemination mechanism, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocols and synchronized block I/Os. ccNVMe introduces a series of techniques, including transaction-aware MMIO/doorbell and I/O command coalescing, to reduce PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named MQFS atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and to Ext4 without journaling, respectively.
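The key point is that a transaction becomes durable and consistent through the same lightweight doorbell MMIOs that already hand the commands to the device, rather than through a separate journal with synchronous block I/O. The toy model below only illustrates that coupling by counting MMIOs per transaction; it is not the NVMe specification or the ccNVMe protocol, and all names are assumptions.

```python
class ToyQueuePair:
    """Toy model of committing a transaction via doorbell MMIOs (not the NVMe spec)."""

    def __init__(self):
        self.sq = []            # persistent submission-queue entries (data dissemination)
        self.mmio_writes = 0    # lightweight MMIOs issued by the host

    def submit_transaction(self, commands):
        for cmd in commands:
            self.sq.append(cmd)        # fill SQ entries with the transaction's commands
        self.mmio_writes += 1          # MMIO #1: ring the SQ doorbell; serves as the commit point
        self.mmio_writes += 1          # MMIO #2: advance the completion-queue doorbell
        return self.mmio_writes        # two MMIOs, no separate journal block I/O

qp = ToyQueuePair()
print(qp.submit_transaction(["write A", "write B"]))   # 2
```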
{"title":"Efficient Crash Consistency for NVMe over PCIe and RDMA","authors":"Xiaojian Liao, Youyou Lu, Zhe Yang, Jiwu Shu","doi":"https://dl.acm.org/doi/10.1145/3568428","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568428","url":null,"abstract":"<p>This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of the NVMe that defines how host software communicates with the non-volatile memory (e.g., solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling the crash consistency to the data dissemination. This new idea allows the storage system to achieve crash consistency by taking the free rides of the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocol and synchronized block I/Os. ccNVMe introduces a series of techniques including transaction-aware MMIO/doorbell and I/O command coalescing to reduce the PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named <span>MQFS</span> atop ccNVMe. We experimentally show that <span>MQFS</span> increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"96 8","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Extending and Programming the NVMe I/O Determinism Interface for Flash Arrays
Huaicheng Li, Martin L. Putra, Ronald Shi, Fadhil I. Kurnia, Xing Lin, Jaeyoung Do, Achmad Imam Kistijantoro, Gregory R. Ganger, Haryadi S. Gunawi
Pub Date: 2023-01-11 | DOI: 10.1145/3568427
Predictable latency on flash storage is a long-pursued goal, yet unpredictability persists due to unavoidable disturbance from many well-known SSD-internal activities. To combat this issue, the recent NVMe I/O Determinism (IOD) interface advocates host-level control over SSD-internal management tasks. Although promising, challenges remain in exploiting it for truly predictable performance.
We present IODA, an I/O deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. IODA exploits data redundancy in the context of IOD for a strong latency-predictability contract. In IODA, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, IODA introduces busy remaining time exposure and a predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, IODA adds only five new fields to the NVMe interface and a small modification to the flash firmware, while keeping most of the complexity in the host OS. Our evaluation shows that IODA improves the 95th–99.99th percentile latencies by up to 75×. IODA is also the closest to the ideal, no-disturbance case compared to seven state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.
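As a sketch of the predictable-read idea (a hypothetical RAID-style layout; the fields IODA adds to NVMe and its window formulation are not modeled here): if the target SSD fast-fails the read because it is busy with internal work, the unit is reconstructed proactively from the remaining drives plus parity.

```python
from functools import reduce

class BusyError(Exception):
    """Raised by a drive that fast-fails an I/O while busy with internal work (e.g., GC)."""

def read_unit(drives, idx, offset):
    """Read one stripe unit; on a fast-fail, rebuild it from the other units plus parity.

    drives: list of callables drive(offset) -> bytes, with XOR parity on the last drive.
    Hypothetical layout for illustration; IODA itself works at the NVMe interface level.
    """
    try:
        return drives[idx](offset)                       # fast path: drive is not busy
    except BusyError:                                    # proactive reconstruction
        others = [d(offset) for i, d in enumerate(drives) if i != idx]
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), others)

def busy_drive(offset):
    raise BusyError()                                    # drive 0 fast-fails on purpose

d1 = lambda offset: b"\x0f\x0f"
parity = lambda offset: bytes(a ^ b for a, b in zip(b"\xaa\xbb", d1(offset)))  # d0 XOR d1

print(read_unit([busy_drive, d1, parity], 0, 0))         # b'\xaa\xbb', reconstructed
```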
{"title":"Extending and Programming the NVMe I/O Determinism Interface for Flash Arrays","authors":"Huaicheng Li, Martin L. Putra, Ronald Shi, Fadhil I. Kurnia, Xing Lin, Jaeyoung Do, Achmad Imam Kistijantoro, Gregory R. Ganger, Haryadi S. Gunawi","doi":"https://dl.acm.org/doi/10.1145/3568427","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568427","url":null,"abstract":"<p>Predictable latency on flash storage is a long-pursuit goal, yet unpredictability stays due to the unavoidable disturbance from many well-known SSD internal activities. To combat this issue, the recent NVMe IO Determinism (IOD) interface advocates host-level controls to SSD internal management tasks. Although promising, challenges remain on how to exploit it for truly predictable performance.</p><p>We present <span>IODA</span>,<sup>1</sup> an I/O deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. <span>IODA</span> exploits data redundancy in the context of IOD for a strong latency predictability contract. In <span>IODA</span>, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, <span>IODA</span> introduces busy remaining time exposure and predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, <span>IODA</span> only adds five new fields to the NVMe interface and a small modification in the flash firmware while keeping most of the complexity in the host OS. Our evaluation shows that <span>IODA</span> improves the 95–99.99<sup>th</sup> latencies by up to 75×. <span>IODA</span> is also the nearest to the ideal, no disturbance case compared to seven state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"1 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

End-to-end I/O Monitoring on Leading Supercomputers
Bin Yang, Wei Xue, Tianyu Zhang, Shichao Liu, Xiaosong Ma, Xiyang Wang, Weiguo Liu
Pub Date: 2023-01-11 | DOI: 10.1145/3568425
This paper offers a solution to the complexities of I/O performance monitoring on production systems. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Drawing on Beacon's deployment on TaihuLight for more than three years, we demonstrate its effectiveness with real-world use cases of I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomalies, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are being addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both the Beacon code and part of the collected monitoring data have been released.
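Beacon's actual compression pipeline is not described in the abstract; as a generic illustration of online trace compression for per-node I/O records, the sketch below delta-encodes timestamps and collapses runs of identical requests (the record format and field names are assumptions, not Beacon's scheme).

```python
def compress_trace(records):
    """Delta-encode timestamps and run-length-encode repeated (op, size) pairs.

    records: list of (timestamp, op, size) tuples in timestamp order.
    Returns a compact list of (delta_ts, op, size, repeat) tuples.
    """
    out, prev_ts = [], 0
    for ts, op, size in records:
        delta = ts - prev_ts
        prev_ts = ts
        if out and out[-1][1:3] == (op, size) and delta == out[-1][0]:
            out[-1] = (delta, op, size, out[-1][3] + 1)   # extend the current run
        else:
            out.append((delta, op, size, 1))
    return out

trace = [(100, "write", 4096), (102, "write", 4096), (104, "write", 4096), (110, "read", 65536)]
print(compress_trace(trace))
# [(100, 'write', 4096, 1), (2, 'write', 4096, 2), (6, 'read', 65536, 1)]
```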
{"title":"End-to-end I/O Monitoring on Leading Supercomputers","authors":"Bin Yang, Wei Xue, Tianyu Zhang, Shichao Liu, Xiaosong Ma, Xiyang Wang, Weiguo Liu","doi":"https://dl.acm.org/doi/10.1145/3568425","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568425","url":null,"abstract":"<p>This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. With Beacon’s deployment on TaihuLight for more than three years, we demonstrate Beacon’s effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others being currently addressed. Encouraged by Beacon’s success in I/O monitoring, we extend it to monitor interconnection networks, which is another contention point on supercomputers. In addition, we demonstrate Beacon’s generality by extending it to other supercomputers. Both Beacon codes and part of collected monitoring data are released.<sup>1</sup></p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"29 9","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors
Ilias Iliadis
Pub Date: 2023-01-11 | DOI: 10.1145/3568313
Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that degrade system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.
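For context, a classical first-order approximation of MTTDL for an array of n devices tolerating a single device failure, which ignores latent sector errors and assumes exponentially distributed failure and rebuild times, is shown below; it is a textbook baseline, not one of the closed-form expressions derived in this article.

```latex
\mathrm{MTTDL}_{\text{1-fault-tolerant}} \;\approx\; \frac{\mathrm{MTTF}^{2}}{n\,(n-1)\,\mathrm{MTTR}}
```

Here MTTF is a device's mean time to failure and MTTR its mean rebuild time; latent sector errors reduce reliability further because an unreadable sector on a surviving device can defeat the rebuild, which is the regime the article quantifies.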
{"title":"Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors","authors":"Ilias Iliadis","doi":"https://dl.acm.org/doi/10.1145/3568313","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568313","url":null,"abstract":"<p>Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that deteriorate system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"30 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory
Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, Ding Yuan
Pub Date: 2022-12-16 | DOI: 10.1145/3565026
Persistent byte-addressable memory (PM) is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the Memory Management Unit (MMU) just as accesses to volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup: for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory.
We propose a novel contiguous file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from the file offset to the address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.
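To illustrate why a contiguous layout makes the lookup trivial, here is a minimal user-space sketch (Python; the file name and sizes are assumptions, and a plain mmap of a regular file stands in for ctFS's persistent region): once a file occupies one contiguous region of virtual memory, translating an offset to an address is just base-plus-offset arithmetic performed by the MMU, with no extent tree to search.

```python
import mmap
import os

FILE_SIZE = 1 << 20   # assumed 1 MiB "file"

fd = os.open("demo.ctfs", os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, FILE_SIZE)
region = mmap.mmap(fd, FILE_SIZE)   # one contiguous virtual-memory region backing the file

def write_at(offset: int, data: bytes) -> None:
    # The "index lookup" is just base + offset; no per-block tree traversal.
    region[offset:offset + len(data)] = data

def read_at(offset: int, length: int) -> bytes:
    return bytes(region[offset:offset + length])

write_at(4096, b"hello")
assert read_at(4096, 5) == b"hello"
```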
{"title":"ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory","authors":"Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, Ding Yuan","doi":"https://dl.acm.org/doi/10.1145/3565026","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3565026","url":null,"abstract":"<p><b>Persistent byte-addressable memory (PM)</b> is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the <b>Memory Management Unit (MMU)</b> just as accesses with volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup—for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory.</p><p>We propose a novel <i>contiguous</i> file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from the file offset to the address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"41 11","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}