Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems
Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, Jiwu Shu
Pub Date: 2023-01-19 | DOI: 10.1145/3568424
Object-based storage systems are widely used in scenarios such as file, block, and blob (e.g., large video) storage, where data is placed across a large number of object storage devices (OSDs). Data placement is critical to the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when the cluster is expanded (i.e., when new OSDs are added); this migration is inherent to CRUSH and causes significant performance degradation when the expansion is nontrivial.
This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to control data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store called Oasis. Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
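To make the time-dimension mapping concrete, here is a minimal sketch (Python; the timestamps, layer/OSD names, and the hash-based OSD choice are illustrative assumptions standing in for CRUSH, not the MapX implementation): each object is mapped to the expansion layer that was newest at its creation time, so data created before an expansion never moves onto the newly added layer.

```python
import bisect
import hashlib

# Hypothetical model: each cluster expansion adds a "layer" with its own OSDs.
# expansion_times[i] is the time at which layer i was added (layer 0 = initial cluster).
expansion_times = [0, 1_000_000, 2_000_000]                     # assumed epoch timestamps
layer_osds = [["osd.0", "osd.1"], ["osd.2", "osd.3"], ["osd.4", "osd.5"]]

def layer_for(object_create_time: int) -> int:
    """Map an object's creation time to the expansion layer current at that time."""
    return bisect.bisect_right(expansion_times, object_create_time) - 1

def place(object_id: str, object_create_time: int) -> str:
    """Pick an OSD within the object's layer with a stable hash (stand-in for CRUSH)."""
    osds = layer_osds[layer_for(object_create_time)]
    h = int(hashlib.sha1(object_id.encode()).hexdigest(), 16)
    return osds[h % len(osds)]

# Objects created before an expansion keep mapping to the old layer's OSDs,
# so adding layer 2 does not migrate them.
print(place("rbd_data.1", 500_000))    # served by a layer-0 OSD
print(place("rbd_data.9", 2_500_000))  # served by a layer-2 OSD
```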
{"title":"Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems","authors":"Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, Jiwu Shu","doi":"https://dl.acm.org/doi/10.1145/3568424","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568424","url":null,"abstract":"<p>Object-based storage systems have been widely used for various scenarios such as file storage, block storage, blob (e.g., large videos) storage, and so on, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from <i>uncontrolled</i> data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and will cause significant performance degradation when the expansion is nontrivial.</p><p>This article presents <span>MapX</span>, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. <span>MapX</span> controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). <span>MapX</span> is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied <span>MapX</span> to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store (called <span>Oasis</span>). <span>Oasis</span> extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the <span>MapX</span>-based <span>Oasis</span> block store outperforms the CRUSH-based Ceph-RBD (which is busy in migrating objects after expansions) by 3.17× ∼ 4.31× in tail latency, and 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"42 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance Bug Analysis and Detection for Distributed Storage and Computing Systems
Jiaxin Li, Yiming Zhang, Shan Lu, Haryadi S. Gunawi, Xiaohui Gu, Feng Huang, Dongsheng Li
Pub Date: 2023-01-18 | DOI: 10.1145/3580281
This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce, and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf classifies the bugs into six categories (and 18 subcategories) by their root causes: resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and for the design of debugging tools. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category, blocking bugs, can be effectively addressed by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationships; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs in representative distributed storage and computing systems by observing system executions under small-scale workloads.
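PCatch finds such regions with static program analysis and an adapted happens-before model; as a much cruder, purely empirical stand-in (not the paper's technique), one could time a suspect region at a few workload sizes and flag it when its runtime grows clearly faster than linearly. The threshold below is an assumption.

```python
import math

def looks_superlinear(sizes, times, tolerance=1.2):
    """Flag a code region whose measured runtime grows faster than linearly in workload size.

    Fits the slope of log(time) vs. log(size); a slope well above 1 suggests the region's
    cost may blow up under large workloads. Purely illustrative; thresholds are assumptions.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope > tolerance

# A region whose time quadruples when the workload doubles (~quadratic) gets flagged.
print(looks_superlinear([100, 200, 400], [0.01, 0.04, 0.16]))   # True
print(looks_superlinear([100, 200, 400], [0.01, 0.02, 0.04]))   # False (linear growth)
```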
{"title":"Performance Bug Analysis and Detection for Distributed Storage and Computing Systems","authors":"Jiaxin Li, Yiming Zhang, Shan Lu, Haryadi S. Gunawi, Xiaohui Gu, Feng Huang, Dongsheng Li","doi":"10.1145/3580281","DOIUrl":"https://doi.org/10.1145/3580281","url":null,"abstract":"This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. TaxPerf is classified into six bug categories (and 18 bug subcategories) by their root causes; resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and debug tool designs. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category of blocking bugs can be effectively solved by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationship; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs of representative distributed storage and computing systems by observing system executions under small-scale workloads.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 33"},"PeriodicalIF":1.7,"publicationDate":"2023-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43597467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Boosting Cache Performance by Access Time Measurements
Gil Einziger, Omri Himelbrand, Erez Waisbard
Pub Date: 2023-01-13 | DOI: 10.1145/3572778
Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution. Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access-time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average across all experiments in our study.
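The abstract does not spell out the policies themselves; as a generic illustration of access time as a caching signal, the sketch below weights a GreedyDual-style eviction score by each item's measured miss latency (the class, fields, and eviction rule are assumptions, not the authors' algorithm).

```python
import time

class LatencyAwareCache:
    """Toy GreedyDual-style cache: items that are slow to fetch are kept longer.

    Illustrative only; the eviction rule and field names are assumptions.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.values = {}    # key -> cached value
        self.costs = {}     # key -> measured fetch latency (seconds)
        self.scores = {}    # key -> eviction score
        self.floor = 0.0    # score "inflation" carried over from the last eviction

    def get(self, key, fetch):
        if key in self.values:                       # hit: refresh score with known cost
            self.scores[key] = self.floor + self.costs[key]
            return self.values[key]
        start = time.perf_counter()                  # miss: measure the item's access time
        value = fetch(key)
        cost = time.perf_counter() - start
        if len(self.values) >= self.capacity:        # evict the lowest-score item
            victim = min(self.scores, key=self.scores.get)
            self.floor = self.scores[victim]
            for d in (self.values, self.costs, self.scores):
                d.pop(victim)
        self.values[key], self.costs[key] = value, cost
        self.scores[key] = self.floor + cost         # expensive-to-fetch items score higher
        return value
```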
{"title":"Boosting Cache Performance by Access Time Measurements","authors":"Gil Einziger, Omri Himelbrand, Erez Waisbard","doi":"10.1145/3572778","DOIUrl":"https://doi.org/10.1145/3572778","url":null,"abstract":"Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution. Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average when considering all experiments in our study.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":" ","pages":"1 - 29"},"PeriodicalIF":1.7,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46400680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

TPFS: A High-Performance Tiered File System for Persistent Memories and Disks
Shengan Zheng, Morteza Hoseinzadeh, S. Swanson, Linpeng Huang
Pub Date: 2023-01-13 | DOI: 10.1145/3580280
Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.
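The thresholds and profiling logic are not given in the abstract; a minimal sketch of the steering idea, with all constants and names assumed, might look like this: synchronous or small writes go to PM, read-hot data stays in PM, and large asynchronous, write-cold data is buffered in DRAM and coalesced to disk.

```python
from dataclasses import dataclass

# Assumed thresholds; TPFS's real policy is profile-driven and more elaborate.
SMALL_WRITE = 4 * 1024        # bytes
READ_HOT = 8                  # reads observed in the current profiling window

@dataclass
class FileStats:
    read_count: int = 0       # how often the file range was read recently

def steer(is_sync: bool, write_size: int, stats: FileStats) -> str:
    """Pick a tier for an incoming write, loosely following the description above."""
    if is_sync or write_size <= SMALL_WRITE:
        return "PM"           # synchronous / small writes need PM's low latency
    if stats.read_count >= READ_HOT:
        return "PM"           # read-hot data stays in the fast tier
    return "DRAM->disk"       # large, asynchronous, write-cold data is coalesced to disk

print(steer(is_sync=True, write_size=64 * 1024, stats=FileStats(read_count=0)))   # PM
print(steer(is_sync=False, write_size=1 << 20, stats=FileStats(read_count=1)))    # DRAM->disk
```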
{"title":"TPFS: A High-Performance Tiered File System for Persistent Memories and Disks","authors":"Shengan Zheng, Morteza Hoseinzadeh, S. Swanson, Linpeng Huang","doi":"10.1145/3580280","DOIUrl":"https://doi.org/10.1145/3580280","url":null,"abstract":"Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 28"},"PeriodicalIF":1.7,"publicationDate":"2023-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47734810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization
Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu
Pub Date: 2023-01-11 | DOI: 10.1145/3568426
Inline deduplication removes redundant data in real time as data is being written to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after deduplication. Many rewrite algorithms aim to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance in data characteristics between data segments. As a result, they often fail to select an appropriate set of old containers for rewriting, so restoring a backup retrieves containers holding a substantial number of invalid chunks.
To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for rewriting. We design a rewrite algorithm, F-greedy, that detects valid container utilization and rewrites low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references so that each segment shares duplicate chunks only with high-utilization containers, thereby improving restore speed. To take full advantage of the above features, we further propose another rewrite algorithm, F-greedy+, based on adaptive interval detection of valid container utilization. F-greedy+ estimates the valid utilization of old containers more accurately by detecting trends in VCRC changes in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that, compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves restore speed by 1.3×–2.4× while achieving almost the same backup performance.
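As a simplified sketch of the VCRC idea (the fixed cutoff below is an assumption; F-greedy and F-greedy+ adapt it from the VCRC distribution), a segment keeps references only to old containers that many of its duplicate chunks actually point to, and rewrites the rest.

```python
from collections import Counter

def plan_segment(chunks, min_vcrc=3):
    """Decide which duplicate chunks of a segment to rewrite.

    chunks: list of (fingerprint, container_id or None); None means the chunk is new.
    Containers referenced by fewer than min_vcrc chunks of this segment (low VCRC)
    are considered poorly utilized, so their chunks are rewritten into new containers.
    """
    vcrc = Counter(c for _, c in chunks if c is not None)   # valid container referenced counts
    dedup, rewrite = [], []
    for fp, container in chunks:
        if container is not None and vcrc[container] >= min_vcrc:
            dedup.append((fp, container))      # share with a well-utilized old container
        else:
            rewrite.append(fp)                 # new chunk, or fragmented duplicate: rewrite
    return dedup, rewrite

segment = [("a", "c1"), ("b", "c1"), ("c", "c1"), ("d", "c2"), ("e", None)]
print(plan_segment(segment))   # chunks in c1 are deduplicated; d and e are rewritten
```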
{"title":"InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization","authors":"Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu","doi":"https://dl.acm.org/doi/10.1145/3568426","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568426","url":null,"abstract":"<p>Inline deduplication removes redundant data in real-time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after data deduplication. Many rewrite algorithms aim to alleviate the performance degradation due to fragmentation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms determine whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. Accordingly, when backups are restored, they often fail to select an appropriate set of old containers for rewrite, generating a substantial number of invalid chunks in retrieved containers. </p><p>To address this issue, we propose an inline deduplication approach for storage systems, called <i>InDe</i>, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, <i>valid container referenced counts (VCRC)</i>, to identify appropriate containers for the rewrite. We design a rewrite algorithm <i>F-greedy</i> that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called <i>F-greedy+</i> based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends of VCRC’s change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"58 8","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Efficient Crash Consistency for NVMe over PCIe and RDMA
Xiaojian Liao, Youyou Lu, Zhe Yang, Jiwu Shu
Pub Date: 2023-01-11 | DOI: 10.1145/3568428
This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of NVMe that defines how host software communicates with non-volatile memory (e.g., a solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax for crash consistency and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling crash consistency to data dissemination. This new idea allows the storage system to achieve crash consistency by riding for free on NVMe's data dissemination mechanism, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocols and synchronized block I/Os. ccNVMe introduces a series of techniques, including transaction-aware MMIO/doorbell and I/O command coalescing, to reduce PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named MQFS atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and to Ext4 without journaling, respectively.
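The key point is that a transaction becomes durable and consistent through the same lightweight doorbell MMIOs that already hand the commands to the device, rather than through a separate journal with synchronous block I/O. The toy model below only illustrates that coupling by counting MMIOs per transaction; it is not the NVMe specification or the ccNVMe protocol, and all names are assumptions.

```python
class ToyQueuePair:
    """Toy model of committing a transaction via doorbell MMIOs (not the NVMe spec)."""

    def __init__(self):
        self.sq = []            # persistent submission-queue entries (data dissemination)
        self.mmio_writes = 0    # lightweight MMIOs issued by the host

    def submit_transaction(self, commands):
        for cmd in commands:
            self.sq.append(cmd)        # fill SQ entries with the transaction's commands
        self.mmio_writes += 1          # MMIO #1: ring the SQ doorbell; serves as the commit point
        self.mmio_writes += 1          # MMIO #2: advance the completion-queue doorbell
        return self.mmio_writes        # two MMIOs, no separate journal block I/O

qp = ToyQueuePair()
print(qp.submit_transaction(["write A", "write B"]))   # 2
```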
{"title":"Efficient Crash Consistency for NVMe over PCIe and RDMA","authors":"Xiaojian Liao, Youyou Lu, Zhe Yang, Jiwu Shu","doi":"https://dl.acm.org/doi/10.1145/3568428","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568428","url":null,"abstract":"<p>This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of the NVMe that defines how host software communicates with the non-volatile memory (e.g., solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling the crash consistency to the data dissemination. This new idea allows the storage system to achieve crash consistency by taking the free rides of the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocol and synchronized block I/Os. ccNVMe introduces a series of techniques including transaction-aware MMIO/doorbell and I/O command coalescing to reduce the PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named <span>MQFS</span> atop ccNVMe. We experimentally show that <span>MQFS</span> increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"96 8","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Extending and Programming the NVMe I/O Determinism Interface for Flash Arrays
Huaicheng Li, Martin L. Putra, Ronald Shi, Fadhil I. Kurnia, Xing Lin, Jaeyoung Do, Achmad Imam Kistijantoro, Gregory R. Ganger, Haryadi S. Gunawi
Pub Date: 2023-01-11 | DOI: 10.1145/3568427
Predictable latency on flash storage is a long-pursued goal, yet unpredictability persists due to unavoidable disturbance from many well-known SSD-internal activities. To combat this issue, the recent NVMe I/O Determinism (IOD) interface advocates host-level control over SSD-internal management tasks. Although promising, challenges remain in exploiting it for truly predictable performance.
We present IODA, an I/O deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. IODA exploits data redundancy in the context of IOD for a strong latency-predictability contract. In IODA, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, IODA introduces busy remaining time exposure and a predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, IODA adds only five new fields to the NVMe interface and a small modification to the flash firmware, while keeping most of the complexity in the host OS. Our evaluation shows that IODA improves the 95th–99.99th percentile latencies by up to 75×. IODA is also the closest to the ideal, no-disturbance case compared to seven state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.
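As a sketch of the predictable-read idea (a hypothetical RAID-style layout; the fields IODA adds to NVMe and its window formulation are not modeled here): if the target SSD fast-fails the read because it is busy with internal work, the unit is reconstructed proactively from the remaining drives plus parity.

```python
from functools import reduce

class BusyError(Exception):
    """Raised by a drive that fast-fails an I/O while busy with internal work (e.g., GC)."""

def read_unit(drives, idx, offset):
    """Read one stripe unit; on a fast-fail, rebuild it from the other units plus parity.

    drives: list of callables drive(offset) -> bytes, with XOR parity on the last drive.
    Hypothetical layout for illustration; IODA itself works at the NVMe interface level.
    """
    try:
        return drives[idx](offset)                       # fast path: drive is not busy
    except BusyError:                                    # proactive reconstruction
        others = [d(offset) for i, d in enumerate(drives) if i != idx]
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), others)

def busy_drive(offset):
    raise BusyError()                                    # drive 0 fast-fails on purpose

d1 = lambda offset: b"\x0f\x0f"
parity = lambda offset: bytes(a ^ b for a, b in zip(b"\xaa\xbb", d1(offset)))  # d0 XOR d1

print(read_unit([busy_drive, d1, parity], 0, 0))         # b'\xaa\xbb', reconstructed
```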
{"title":"Extending and Programming the NVMe I/O Determinism Interface for Flash Arrays","authors":"Huaicheng Li, Martin L. Putra, Ronald Shi, Fadhil I. Kurnia, Xing Lin, Jaeyoung Do, Achmad Imam Kistijantoro, Gregory R. Ganger, Haryadi S. Gunawi","doi":"https://dl.acm.org/doi/10.1145/3568427","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568427","url":null,"abstract":"<p>Predictable latency on flash storage is a long-pursuit goal, yet unpredictability stays due to the unavoidable disturbance from many well-known SSD internal activities. To combat this issue, the recent NVMe IO Determinism (IOD) interface advocates host-level controls to SSD internal management tasks. Although promising, challenges remain on how to exploit it for truly predictable performance.</p><p>We present <span>IODA</span>,<sup>1</sup> an I/O deterministic flash array design built on top of small but powerful extensions to the IOD interface for easy deployment. <span>IODA</span> exploits data redundancy in the context of IOD for a strong latency predictability contract. In <span>IODA</span>, SSDs are expected to quickly fail an I/O on purpose to allow predictable I/Os through proactive data reconstruction. In the case of concurrent internal operations, <span>IODA</span> introduces busy remaining time exposure and predictable-latency-window formulation to guarantee predictable data reconstructions. Overall, <span>IODA</span> only adds five new fields to the NVMe interface and a small modification in the flash firmware while keeping most of the complexity in the host OS. Our evaluation shows that <span>IODA</span> improves the 95–99.99<sup>th</sup> latencies by up to 75×. <span>IODA</span> is also the nearest to the ideal, no disturbance case compared to seven state-of-the-art preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"1 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

End-to-end I/O Monitoring on Leading Supercomputers
Bin Yang, Wei Xue, Tianyu Zhang, Shichao Liu, Xiaosong Ma, Xiyang Wang, Weiguo Liu
Pub Date: 2023-01-11 | DOI: 10.1145/3568425
This paper offers a solution to the complexities of I/O performance monitoring on production systems. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Drawing on Beacon's deployment on TaihuLight for more than three years, we demonstrate its effectiveness with real-world use cases of I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomalies, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are being addressed. Encouraged by Beacon's success in I/O monitoring, we extend it to monitor interconnection networks, another contention point on supercomputers. In addition, we demonstrate Beacon's generality by extending it to other supercomputers. Both the Beacon code and part of the collected monitoring data have been released.
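Beacon's actual compression pipeline is not described in the abstract; as a generic illustration of online trace compression for per-node I/O records, the sketch below delta-encodes timestamps and collapses runs of identical requests (the record format and field names are assumptions, not Beacon's scheme).

```python
def compress_trace(records):
    """Delta-encode timestamps and run-length-encode repeated (op, size) pairs.

    records: list of (timestamp, op, size) tuples in timestamp order.
    Returns a compact list of (delta_ts, op, size, repeat) tuples.
    """
    out, prev_ts = [], 0
    for ts, op, size in records:
        delta = ts - prev_ts
        prev_ts = ts
        if out and out[-1][1:3] == (op, size) and delta == out[-1][0]:
            out[-1] = (delta, op, size, out[-1][3] + 1)   # extend the current run
        else:
            out.append((delta, op, size, 1))
    return out

trace = [(100, "write", 4096), (102, "write", 4096), (104, "write", 4096), (110, "read", 65536)]
print(compress_trace(trace))
# [(100, 'write', 4096, 1), (2, 'write', 4096, 2), (6, 'read', 65536, 1)]
```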
{"title":"End-to-end I/O Monitoring on Leading Supercomputers","authors":"Bin Yang, Wei Xue, Tianyu Zhang, Shichao Liu, Xiaosong Ma, Xiyang Wang, Weiguo Liu","doi":"https://dl.acm.org/doi/10.1145/3568425","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568425","url":null,"abstract":"<p>This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. With Beacon’s deployment on TaihuLight for more than three years, we demonstrate Beacon’s effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already successfully helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, with others being currently addressed. Encouraged by Beacon’s success in I/O monitoring, we extend it to monitor interconnection networks, which is another contention point on supercomputers. In addition, we demonstrate Beacon’s generality by extending it to other supercomputers. Both Beacon codes and part of collected monitoring data are released.<sup>1</sup></p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"29 9","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors
Ilias Iliadis
Pub Date: 2023-01-11 | DOI: 10.1145/3568313
Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that degrade system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.
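For context, a classical first-order approximation of MTTDL for an array of n devices tolerating a single device failure, which ignores latent sector errors and assumes exponentially distributed failure and rebuild times, is shown below; it is a textbook baseline, not one of the closed-form expressions derived in this article.

```latex
\mathrm{MTTDL}_{\text{1-fault-tolerant}} \;\approx\; \frac{\mathrm{MTTF}^{2}}{n\,(n-1)\,\mathrm{MTTR}}
```

Here MTTF is a device's mean time to failure and MTTR its mean rebuild time; latent sector errors reduce reliability further because an unreadable sector on a surviving device can defeat the rebuild, which is the regime the article quantifies.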
{"title":"Reliability Evaluation of Erasure-coded Storage Systems with Latent Errors","authors":"Ilias Iliadis","doi":"https://dl.acm.org/doi/10.1145/3568313","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568313","url":null,"abstract":"<p>Large-scale storage systems employ erasure-coding redundancy schemes to protect against device failures. The adverse effect of latent sector errors on the Mean Time to Data Loss (MTTDL) and the Expected Annual Fraction of Data Loss (EAFDL) reliability metrics is evaluated. A theoretical model capturing the effect of latent errors and device failures is developed, and closed-form expressions for the metrics of interest are derived. The MTTDL and EAFDL of erasure-coded systems are obtained analytically for (i) the entire range of bit error rates; (ii) the symmetric, clustered, and declustered data placement schemes; and (iii) arbitrary device failure and rebuild time distributions under network rebuild bandwidth constraints. The range of error rates that deteriorate system reliability is derived analytically. For realistic values of sector error rates, the results obtained demonstrate that MTTDL degrades, whereas, for moderate erasure codes, EAFDL remains practically unaffected. It is demonstrated that, in the range of typical sector error rates and for very powerful erasure codes, EAFDL degrades as well. It is also shown that the declustered data placement scheme offers superior reliability.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"30 1","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory
Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, Ding Yuan
Pub Date: 2022-12-16 | DOI: 10.1145/3565026
Persistent byte-addressable memory (PM) is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the Memory Management Unit (MMU) just as accesses to volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup: for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory.
We propose a novel contiguous file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from the file offset to the address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.
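To illustrate why a contiguous layout makes the lookup trivial, here is a minimal user-space sketch (Python; the file name and sizes are assumptions, and a plain mmap of a regular file stands in for ctFS's persistent region): once a file occupies one contiguous region of virtual memory, translating an offset to an address is just base-plus-offset arithmetic performed by the MMU, with no extent tree to search.

```python
import mmap
import os

FILE_SIZE = 1 << 20   # assumed 1 MiB "file"

fd = os.open("demo.ctfs", os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, FILE_SIZE)
region = mmap.mmap(fd, FILE_SIZE)   # one contiguous virtual-memory region backing the file

def write_at(offset: int, data: bytes) -> None:
    # The "index lookup" is just base + offset; no per-block tree traversal.
    region[offset:offset + len(data)] = data

def read_at(offset: int, length: int) -> bytes:
    return bytes(region[offset:offset + length])

write_at(4096, b"hello")
assert read_at(4096, 5) == b"hello"
```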
{"title":"ctFS: Replacing File Indexing with Hardware Memory Translation through Contiguous File Allocation for Persistent Memory","authors":"Ruibin Li, Xiang Ren, Xu Zhao, Siwei He, Michael Stumm, Ding Yuan","doi":"https://dl.acm.org/doi/10.1145/3565026","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3565026","url":null,"abstract":"<p><b>Persistent byte-addressable memory (PM)</b> is poised to become prevalent in future computer systems. PMs are significantly faster than disk storage, and accesses to PMs are governed by the <b>Memory Management Unit (MMU)</b> just as accesses with volatile RAM. These unique characteristics shift the bottleneck from I/O to operations such as block address lookup—for example, in write workloads, up to 45% of the overhead in ext4-DAX is due to building and searching extent trees to translate file offsets to addresses on persistent memory.</p><p>We propose a novel <i>contiguous</i> file system, ctFS, that eliminates most of the overhead associated with indexing structures such as extent trees in the file system. ctFS represents each file as a contiguous region of virtual memory, hence a lookup from the file offset to the address is simply an offset operation, which can be efficiently performed by the hardware MMU at a fraction of the cost of software-maintained indexes. Evaluating ctFS on real-world workloads such as LevelDB shows it outperforms ext4-DAX and SplitFS by 3.6× and 1.8×, respectively.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"41 11","pages":""},"PeriodicalIF":1.7,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}