Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.
Garbage collection (GC) plays a pivotal role in the performance of 3D NAND flash memory, where Copyback has been widely used to accelerate valid page migration during GC. Unfortunately, copyback is constrained by the parity symmetry issue: data read from an odd/even page must be written to an odd/even page. After migrating two odd/even consecutive pages, a free page between the two migrated pages will be wasted. Such wasted pages noticeably lower free space on flash memory and cause extra GCs, thereby degrading solid-state-disk (SSD) performance. To address this problem, we propose a page-state-aware cache scheme called PSA-Cache, which prevents page waste to boost the performance of NAND Flash-based SSDs. To facilitate making write-back scheduling decisions, PSA-Cache regulates write-back priorities for cached pages according to the state of pages in victim blocks. With high write-back-priority pages written back to flash chips, PSA-Cache effectively fends off page waste by breaking odd/even consecutive pages in subsequent garbage collections. We quantitatively evaluate the performance of PSA-Cache in terms of the number of wasted pages, the number of GCs, and response time. We compare PSA-Cache with two state-of-the-art schemes, GCaR and TTflash, in addition to a baseline scheme LRU. The experimental results unveil that PSA-Cache outperforms the existing schemes. In particular, PSA-Cache curtails the number of wasted pages of GCaR and TTflash by 25.7% and 62.1%, respectively. PSA-Cache immensely cuts back the number of GC counts by up to 78.7% with an average of 49.6%. Furthermore, PSA-Cache slashes the average write response time by up to 85.4% with an average of 30.05%.
No abstract available.
The Log-Structured Merge Tree (LSM-Tree) is widely used in key-value (KV) stores because of its excwrite performance. But LSM-Tree-based KV stores still have the overhead of write-ahead log and write stall caused by slow L0 flush and L0-L1 compaction. New byte-addressable, persistent memory (PM) devices bring an opportunity to improve the write performance of LSM-Tree. Previous studies on PM-based LSM-Tree have not fully exploited PM’s “dual role” of main memory and external storage. In this article, we analyze two strategies of memtables based on PM and the reasons write stall problems occur in the first place. Inspired by the analysis result, we propose FlatLSM, a specially designed flat LSM-Tree for non-volatile memory based KV stores. First, we propose PMTable with separated index and data. The PM Log utilizes the Buffer Log to store KVs of size less than 256B. Second, to solve the write stall problem, FlatLSM merges the volatile memtables and the persistent L0 into large PMTables, which can reduce the depth of LSM-Tree and concentrate I/O bandwidth on L0-L1 compaction. To mitigate write stall caused by flushing large PMTables to SSD, we propose a parallel flush/compaction algorithm based on KV separation. We implemented FlatLSM based on RocksDB and evaluated its performance on Intel’s latest PM device, the Intel Optane DC PMM with the state-of-the-art PM-based LSM-Tree KV stores, FlatLSM improves the throughput 5.2× on random write workload and 2.55× on YCSB-A.
This article describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google’s datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies for a 7.7% improvement of the total cost of ownership, as well as significant improvements in disk reads (9.5% reduction) and flash wearout (17.8% reduction).
We introduce ZNSwap , a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS’s explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10× lower 99th percentile latency and 5× higher throughput for
Cloud block storage systems support diverse types of applications in modern cloud services. Characterizing their input/output (I/O) activities is critical for guiding better system designs and optimizations. In this article, we present an in-depth comparative analysis of production cloud block storage workloads through the block-level I/O traces of billions of I/O requests collected from two production systems, Alibaba Cloud and Tencent Cloud Block Storage. We study their characteristics of load intensities, spatial patterns, and temporal patterns. We also compare the cloud block storage workloads with the notable public block-level I/O workloads from the enterprise data centers at Microsoft Research Cambridge, and we identify the commonalities and differences of the three sources of traces. To this end, we provide 6 findings through the high-level analysis and 16 findings through the detailed analysis on load intensity, spatial patterns, and temporal patterns. We discuss the implications of our findings on load balancing, cache efficiency, and storage cluster management in cloud block storage systems.
In this article, we present an approach to systematically examine the schedulability of distributed storage systems, identify their scheduling problems, and enable effective scheduling in these systems. We use Thread Architecture Models (TAMs) to describe the behavior and interactions of different threads in a system, and show both how to construct TAMs for existing systems and utilize TAMs to identify critical scheduling problems. We specify three schedulability conditions that a schedulable TAM should satisfy: completeness, local enforceability, and independence; meeting these conditions enables a system to easily support different scheduling policies. We identify five common problems that prevent a system from satisfying the schedulability conditions, and show that these problems arise in existing systems such as HBase, Cassandra, MongoDB, and Riak, making it difficult or impossible to realize various scheduling disciplines. We demonstrate how to address these schedulability problems using both direct and indirect solutions, with different trade-offs. To show how to apply our approach to enable scheduling in realistic systems, we develop Tamed-HBase and Muzzled-HBase, sets of modifications to HBase that can realize the desired scheduling disciplines, including fairness and priority scheduling, even when presented with challenging workloads.