The Concurrent Learned Indexes for Multicore Data Storage
Zhaoguo Wang, Haibo Chen, Youyun Wang, Chuzhe Tang, Huan Wang
ACM Transactions on Storage, 2022-01-29. DOI: https://doi.org/10.1145/3478289
We present XIndex, a concurrent index library designed for fast queries. It includes a concurrent ordered index (XIndex-R) and a concurrent hash index (XIndex-H). Like the recently proposed learned index, the indexes in XIndex use learned models to optimize index efficiency. In contrast to the learned index, XIndex-R handles concurrent writes effectively and adapts its structure according to runtime workload characteristics, while XIndex-H avoids blocking concurrent writes during resize operations. Furthermore, the indexes in XIndex can index string keys much more efficiently than the learned index. We demonstrate the advantages of XIndex with YCSB, TPC-C (KV), a TPC-C-inspired benchmark for key-value stores, and micro-benchmarks. Compared with the ordered indexes Masstree and Wormhole, XIndex-R achieves up to 3.2× and 4.4× performance improvement, respectively, on a 24-core machine. Compared with the Intel TBB HashMap hash index, XIndex-H achieves up to 3.1× speedup. The performance further improves by 91% after adding the optimizations for indexing string keys. The library is open-sourced.
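To make the learned-index idea above concrete, here is a minimal sketch of the lookup path XIndex builds on: a model predicts a key's position in a sorted array, and a bounded local search corrects the prediction. The class name, the single linear model, and the error-bound handling are illustrative simplifications, not XIndex's actual group structure or concurrency scheme.

```python
# Minimal learned-index lookup sketch (illustrative; not XIndex's implementation).
import bisect

class LearnedSortedArray:
    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        # Fit position ~= slope * key + intercept with simple least squares.
        xs, ys = self.keys, range(n)
        mean_x = sum(xs) / n
        mean_y = (n - 1) / 2
        var_x = sum((x - mean_x) ** 2 for x in xs) or 1.0
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope = cov / var_x
        self.intercept = mean_y - self.slope * mean_x
        # The maximum prediction error defines the local search window.
        self.err = max(abs(self._predict(x) - y) for x, y in zip(xs, ys))

    def _predict(self, key):
        return int(self.slope * key + self.intercept)

    def lookup(self, key):
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(len(self.keys), pos + self.err + 1)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < len(self.keys) and self.keys[i] == key else None

index = LearnedSortedArray([2, 3, 5, 7, 11, 13, 17, 19, 23])
assert index.lookup(13) is not None and index.lookup(4) is None
```

XIndex's contribution is making this style of index safe and adaptive under concurrent writes, which the sketch does not attempt to show.
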
Reprogramming 3D TLC Flash Memory based Solid State Drives
Congming Gao, Min Ye, C. Xue, Youtao Zhang, Liang Shi, J. Shu, Jun Yang
ACM Transactions on Storage, 2022-01-29. DOI: https://doi.org/10.1145/3487064
NAND flash memory-based SSDs have been widely adopted. SSD scaling has evolved from planar (2D) to 3D stacking. For reliability and other reasons, the technology node in 3D NAND SSDs is larger than in 2D, but data density can be increased by raising the number of bits per cell. In this work, we develop a novel reprogramming scheme for TLCs in 3D NAND SSDs, such that a cell can be programmed and reprogrammed several times before it is erased. Such reprogramming can improve the endurance of a cell and the speed of programming, and increase the number of bits written in a cell per program/erase cycle, i.e., the effective capacity. Our work is the first to test a real 3D NAND SSD to validate the feasibility of the reprogram operation. From the collected data, we derive the restrictions on performing reprogramming that stem from reliability challenges. Furthermore, we design a reprogrammable SSD (ReSSD) to structure reprogram operations. ReSSD is evaluated in a case study in a RAID 5 system (RSS-RAID). Experimental results show that RSS-RAID can improve endurance by 35.7%, boost write performance by 15.9%, and increase effective capacity by 7.71%, with negligible overhead compared with a conventional 3D SSD-based RAID 5 system.
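As a rough, purely illustrative aid to the notion of effective capacity, the snippet below computes how many extra bits a block could accept between erases if some of its pages tolerate additional program operations. The page count, eligible fraction, and reprogram count are made-up placeholders, not the restrictions the authors derive from their real-device tests.

```python
# Back-of-the-envelope sketch of "effective capacity" under reprogramming.
# All parameters are illustrative placeholders.

def effective_bits_per_pe_cycle(pages_per_block, bits_per_page,
                                reprogrammable_fraction, extra_programs):
    """Bits that can be written into one block between two erases."""
    baseline = pages_per_block * bits_per_page
    # Eligible pages accept `extra_programs` additional writes before erase.
    extra = int(pages_per_block * reprogrammable_fraction) * bits_per_page * extra_programs
    return baseline, baseline + extra

base, with_reprogram = effective_bits_per_pe_cycle(
    pages_per_block=576, bits_per_page=16 * 1024 * 8,
    reprogrammable_fraction=0.05, extra_programs=1)
print(f"effective-capacity gain: {(with_reprogram / base - 1) * 100:.2f}%")
```
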
RAIL: Predictable, Low Tail Latency for NVMe Flash
Heiner Litz, Javier González, Ana Klimovic, C. Kozyrakis
ACM Transactions on Storage, 2022-01-29. DOI: https://doi.org/10.1145/3465406
Flash-based storage is replacing disk for an increasing number of data center applications, providing orders of magnitude higher throughput and lower average latency. However, applications also require predictable storage latency. Existing Flash devices fail to provide low tail read latency in the presence of write operations. We propose two novel techniques to address SSD read tail latency, including Redundant Array of Independent LUNs (RAIL) which avoids serialization of reads behind user writes as well as latency-aware hot-cold separation (HC) which improves write throughput while maintaining low tail latency. RAIL leverages the internal parallelism of modern Flash devices and allocates data and parity pages to avoid reads getting stuck behind writes. We implement RAIL in the Linux Kernel as part of the LightNVM Flash translation layer and show that it can reduce read tail latency by 7× at the 99.99th percentile, while reducing relative bandwidth by only 33%.
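The sketch below illustrates the core RAIL mechanism described above: data and parity are striped across LUNs so that a read whose target LUN is busy serving a write can be reconstructed by XOR-ing the remaining chunks of the stripe. The stripe layout and the busy-LUN bookkeeping are simplifications; the real implementation lives in the LightNVM flash translation layer.

```python
# Sketch of the RAIL read path: reconstruct a chunk from parity when its LUN is busy.
from functools import reduce

def xor(blocks):
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

class Stripe:
    def __init__(self, chunks):
        self.chunks = list(chunks)           # one chunk per data LUN
        self.parity = xor(self.chunks)       # stored on a dedicated parity LUN

    def read(self, lun, busy_luns):
        if lun not in busy_luns:
            return self.chunks[lun]          # fast path: the LUN is idle
        # Avoid waiting behind a write: rebuild from the other LUNs plus parity.
        others = [c for i, c in enumerate(self.chunks) if i != lun]
        return xor(others + [self.parity])

stripe = Stripe([b"aaaa", b"bbbb", b"cccc"])
assert stripe.read(1, busy_luns={1}) == b"bbbb"
```
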
Exploration and Exploitation for Buffer-Controlled HDD-Writes for SSD-HDD Hybrid Storage Server
Shucheng Wang, Ziyi Lu, Q. Cao, Hong Jiang, Jie Yao, Yuanyuan Dong, Puyuan Yang, Changsheng Xie
ACM Transactions on Storage, 2022-01-29. DOI: https://doi.org/10.1145/3465410
Hybrid storage servers combining solid-state drives (SSDs) and hard disk drives (HDDs) provide cost-effectiveness and μs-level responsiveness for applications. However, observations from the cloud storage system Pangu show that HDDs are often underutilized while SSDs are overused, especially under intensive writes, which leads to fast wear-out and high tail latency for SSDs. On the other hand, our experimental study reveals that a series of sequential and continuous writes to HDDs exhibits a periodic, staircase-shaped pattern of write latency, i.e., low (e.g., 35 μs), middle (e.g., 55 μs), and high latency (e.g., 12 ms), resulting from buffered writes within the HDD's controller. This inspires us to explore and exploit the potential μs-level IO delay of HDDs to absorb excessive SSD writes without performance degradation. We first build an HDD writing model that describes the staircase behavior and design a profiling process to initialize and dynamically recalibrate the model parameters. Then, we propose a Buffer-Controlled Write approach (BCW) to proactively control buffered writes so that low- and mid-latency periods are scheduled with application data and high-latency periods are filled with padded data. Leveraging BCW, we design a mixed IO scheduler (MIOS) to adaptively steer incoming data to SSDs and HDDs. A multi-HDD scheduling scheme is further designed to minimize HDD-write latency. We perform extensive evaluations under production workloads and benchmarks. The results show that MIOS removes up to 93% of the data written to SSDs and reduces the average and 99th-percentile latencies of the hybrid server by 65% and 85%, respectively.
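The following sketch captures the scheduling idea under the assumption of a fixed staircase cycle: user writes go to the HDD during predicted low- and mid-latency phases and are redirected to the SSD during high-latency phases (the real system also issues padded writes to push the HDD past its slow phase). The phase lengths are invented parameters; BCW calibrates them per drive through profiling.

```python
# Illustrative buffer-controlled write (BCW) dispatcher with made-up phase lengths.
LOW, MID, HIGH = "low", "mid", "high"

class BCWScheduler:
    def __init__(self, low_writes=64, mid_writes=16, high_writes=2):
        # One staircase cycle of buffered HDD writes, as a sequence of phases.
        self.cycle = [LOW] * low_writes + [MID] * mid_writes + [HIGH] * high_writes
        self.pos = 0

    def next_phase(self):
        phase = self.cycle[self.pos]
        self.pos = (self.pos + 1) % len(self.cycle)
        return phase

    def dispatch(self, request):
        phase = self.next_phase()
        if phase in (LOW, MID):
            return ("hdd", request)   # fast buffered write: absorb it on the HDD
        return ("ssd", request)       # predicted slow write: serve from the SSD

sched = BCWScheduler()
targets = [sched.dispatch(f"req-{i}")[0] for i in range(100)]
print(targets.count("hdd"), "HDD writes,", targets.count("ssd"), "SSD writes")
```
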
Pattern-Based Prefetching with Adaptive Cache Management Inside of Solid-State Drives
Jun Li, Xiaofei Xu, Zhigang Cai, Jianwei Liao, Kenli Li, Balazs Gerofi, Y. Ishikawa
ACM Transactions on Storage, 2022-01-29. DOI: https://doi.org/10.1145/3474393
This article proposes a pattern-based prefetching scheme with the support of adaptive cache management at the flash translation layer of solid-state drives (SSDs). It works inside the SSD, independent of the operating system and transparent to its users. Specifically, it first mines frequent block access patterns that reflect the correlation among I/O requests. Then, it compares the requests in the current time window with the identified patterns to direct the prefetching of data into the SSD cache. More importantly, to maximize cache use efficiency, we build a mathematical model that adaptively determines the cache partition on the basis of I/O workload characteristics, so as to separately buffer the prefetched data and the written data. Experimental results show that our proposal can improve average read latency by 1.8%–36.5% without noticeably increasing write latency, compared with conventional SSD-internal prefetching schemes.
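A minimal sketch of the pattern-matching step follows: the tail of the current request window is compared with previously mined patterns, and the continuation of a matching pattern becomes the prefetch candidate set. The pattern mining itself and the adaptive cache partitioning are omitted, and the patterns here are hand-written placeholders.

```python
# Sketch of matching the recent access window against mined block patterns.
from collections import deque

class PatternPrefetcher:
    def __init__(self, patterns, window=4):
        self.patterns = [tuple(p) for p in patterns]   # frequent block sequences
        self.window = deque(maxlen=window)             # recent block accesses

    def on_access(self, block):
        self.window.append(block)
        recent = tuple(self.window)
        to_prefetch = []
        for pattern in self.patterns:
            # Try the longest suffix of the window that matches a pattern prefix.
            for k in range(min(len(recent), len(pattern) - 1), 0, -1):
                if recent[-k:] == pattern[:k]:
                    to_prefetch.extend(pattern[k:])    # the rest is the candidate set
                    break
        return to_prefetch

pf = PatternPrefetcher(patterns=[[10, 11, 12, 13], [7, 20, 21]])
pf.on_access(10)
print(pf.on_access(11))   # -> [12, 13]
```
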
Can Applications Recover from fsync Failures?
Anthony Rebello, Yuvraj Patel, R. Alagappan, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
ACM Transactions on Storage, 2021-06-15. DOI: https://doi.org/10.1145/3450338
We analyze how file systems and modern data-intensive applications react to fsync failures. First, we characterize how three Linux file systems (ext4, XFS, Btrfs) behave in the presence of failures. We find commonalities across file systems (pages are always marked clean, certain block writes always lead to unavailability) as well as differences (page content and failure reporting is varied). Next, we study how five widely used applications (PostgreSQL, LMDB, LevelDB, SQLite, Redis) handle fsync failures. Our findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption. Our findings have strong implications for the design of file systems and applications that intend to provide strong durability guarantees.
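To illustrate the practical takeaway, here is a hedged sketch of a conservative application-side reaction to an fsync error: do not silently retry, because the kernel may already have marked the affected pages clean and may report the error only once; instead, surface the failure so higher-level recovery (e.g., from a write-ahead log) can take over. This is one of the strategies the study discusses, not a complete durability recipe.

```python
# Sketch: treat a failed fsync() as "on-disk state unknown" rather than retrying.
import os

def durable_append(path, payload: bytes):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)
        try:
            os.fsync(fd)
        except OSError as e:
            # Do NOT loop on fsync(): the write may be silently lost even if a
            # later fsync() returns success. Surface the error so the caller can
            # rebuild state from its own log or fail stop.
            raise RuntimeError(f"fsync failed ({e}); on-disk state is unknown") from e
    finally:
        os.close(fd)

durable_append("fsync-demo.log", b"record-1\n")
```
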
Performance Modeling and Practical Use Cases for Black-Box SSDs
Joonsung Kim, Kanghyun Choi, Wonsik Lee, Jangwoo Kim
ACM Transactions on Storage, 2021-06-08. DOI: https://doi.org/10.1145/3440022
Modern servers are actively deploying Solid-State Drives (SSDs) thanks to their high throughput and low latency. However, current server architects cannot achieve the full performance potential of commodity SSDs, as SSDs are complex devices designed for specific goals (e.g., latency, throughput, endurance, cost) with their internal mechanisms undisclosed to users. In this article, we propose SSDcheck, a novel SSD performance model to extract various internal mechanisms and predict the latency of the next access to commodity black-box SSDs. We identify key performance-critical features (e.g., garbage collection, write buffering) and find their parameters (i.e., size, threshold) for each SSD by using our novel diagnosis code snippets. Then, SSDcheck constructs a performance model for a target SSD and dynamically manages the model to predict the latency of the next access. In addition, SSDcheck extracts and exposes other useful internal mechanisms (e.g., the fetch unit in multi-queue SSDs, the idle-time interval that triggers background tasks) so that the storage system can fully exploit SSDs. Using those features and the performance model, we propose multiple practical use cases. Our evaluations show that SSDcheck's performance model is highly accurate, and the proposed use cases achieve significant performance improvements in various scenarios.
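The toy model below suggests the flavor of such a performance model: inferred internal state (write-buffer occupancy, a garbage-collection trigger) is tracked to predict whether the next write will be fast or slow. All parameters are placeholders standing in for the values SSDcheck would extract with its diagnosis code snippets.

```python
# Toy next-write latency predictor for a black-box SSD (placeholder parameters).
class BlackBoxSSDModel:
    def __init__(self, buffer_slots=128, gc_every_n_writes=4096,
                 fast_us=25, buffered_flush_us=400, gc_us=3000):
        self.buffer_slots = buffer_slots
        self.gc_every_n_writes = gc_every_n_writes
        self.fast_us, self.flush_us, self.gc_us = fast_us, buffered_flush_us, gc_us
        self.buffered = 0
        self.total_writes = 0

    def predict_next_write_us(self):
        if (self.total_writes + 1) % self.gc_every_n_writes == 0:
            return self.gc_us        # model predicts garbage collection will kick in
        if self.buffered + 1 > self.buffer_slots:
            return self.flush_us     # write buffer must be drained first
        return self.fast_us          # absorbed by the write buffer

    def observe_write(self):
        self.total_writes += 1
        self.buffered = 0 if self.buffered + 1 > self.buffer_slots else self.buffered + 1

model = BlackBoxSSDModel()
latencies = []
for _ in range(5000):
    latencies.append(model.predict_next_write_us())
    model.observe_write()
print(max(latencies), "us worst predicted write latency")
```

A host-side scheduler could use such predictions, for example, to defer non-urgent writes when a slow access is expected, which is the spirit of the paper's use cases.
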
Twizzler: A Data-centric OS for Non-volatile Memory
Daniel Bittman, P. Alvaro, P. Mehra, D. Long, E. Miller
ACM Transactions on Storage, 2021-06-07. DOI: https://doi.org/10.1145/3454129
Byte-addressable, non-volatile memory (NVM) presents an opportunity to rethink the entire system stack. We present Twizzler, an operating system redesign for this near future. Twizzler removes the kernel from the I/O path, provides programs with memory-style access to persistent data using small (64-bit), object-relative cross-object pointers, and enables simple and efficient long-term sharing of data both between applications and between runs of an application. Twizzler provides a clean-slate programming model for persistent data, realizing the vision of Unix in a world of persistent RAM. We show that Twizzler is simpler, more extensible, and more secure than existing I/O models and implementations by building software for Twizzler and evaluating it on NVM DIMMs. Most persistent pointer operations in Twizzler impose less than 0.5 ns added latency. Twizzler operations are up to faster than Unix, and SQLite queries are up to faster than on PMDK. YCSB workloads ran 1.1– faster on Twizzler than on native and NVM-optimized SQLite backends.
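The sketch below illustrates the cross-object pointer encoding described above: the 64-bit value splits into an index into a per-object foreign object table (FOT) and an offset within the target object, so pointers remain valid regardless of where objects are mapped. The 20/44-bit split and the table representation are illustrative choices, not Twizzler's exact layout.

```python
# Sketch of object-relative cross-object pointers (illustrative bit split).
OFFSET_BITS = 44
OFFSET_MASK = (1 << OFFSET_BITS) - 1

def encode_ptr(fot_index: int, offset: int) -> int:
    assert 0 <= offset <= OFFSET_MASK and fot_index < (1 << (64 - OFFSET_BITS))
    return (fot_index << OFFSET_BITS) | offset

def decode_ptr(ptr: int, foreign_object_table):
    fot_index, offset = ptr >> OFFSET_BITS, ptr & OFFSET_MASK
    target_object_id = foreign_object_table[fot_index]   # resolved per source object
    return target_object_id, offset

fot = {1: "object-6f3a", 2: "object-91c0"}                # hypothetical per-object FOT
p = encode_ptr(2, 0x1000)
print(decode_ptr(p, fot))    # -> ('object-91c0', 4096)
```
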
Design of LSM-tree-based Key-value SSDs with Bounded Tails
Junsu Im, Jinwook Bae, Chanwoo Chung, Arvind, Sungjin Lee
ACM Transactions on Storage, 2021-05-28. DOI: https://doi.org/10.1145/3452846
A key-value store based on a log-structured merge-tree (LSM-tree) is preferable to a hash-based key-value store, because an LSM-tree can support a wider variety of operations and shows better performance, especially for writes. However, an LSM-tree is difficult to implement in the resource-constrained environment of a key-value SSD (KV-SSD), and, consequently, KV-SSDs typically use hash-based schemes. We present PinK, a design and implementation of an LSM-tree-based KV-SSD, which, compared to a hash-based KV-SSD, reduces 99th-percentile tail latency by 73%, improves average read latency by 42%, and shows 37% higher throughput. The key idea for improving the performance of an LSM-tree in a resource-constrained environment is to avoid the use of Bloom filters and, instead, use a small amount of DRAM to keep/pin the top levels of the LSM-tree. We also find that PinK provides a flexible design space for a wide range of KV workloads by leveraging the read-write tradeoff in LSM-trees.
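Here is a simplified sketch of the lookup path implied by the pinning idea: the small top levels live entirely in DRAM, so only the lower levels can cost flash reads, and each of those costs at most one simulated page read located through in-memory metadata rather than Bloom filters. The data layout is an abstraction, not the KV-SSD's on-flash format.

```python
# Sketch of an LSM-tree lookup with the top levels pinned in DRAM.
import bisect

class PinKLikeTree:
    def __init__(self, pinned_levels, flash_levels):
        self.pinned = pinned_levels          # small top levels, kept as dicts in DRAM
        self.flash = flash_levels            # lower levels: sorted (key, value) runs

    def get(self, key):
        for level in self.pinned:            # DRAM only: no flash access
            if key in level:
                return level[key], 0
        reads = 0
        for run in self.flash:               # at most one "flash read" per level
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            reads += 1                        # simulated page read
            if i < len(run) and run[i][0] == key:
                return run[i][1], reads
        return None, reads

tree = PinKLikeTree(
    pinned_levels=[{"k42": "v-new"}],
    flash_levels=[[("k10", "v10"), ("k42", "v-old")], [("k07", "v07")]])
print(tree.get("k42"))    # -> ('v-new', 0): served from the pinned level
print(tree.get("k07"))    # -> ('v07', 2)
```
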
Penalty- and Locality-aware Memory Allocation in Redis Using Enhanced AET
Cheng Pan, Xiaolin Wang, Yingwei Luo, Zhenlin Wang
ACM Transactions on Storage, 2021-05-28. DOI: https://doi.org/10.1145/3447573
Due to the large data volumes and low latency requirements of modern web services, the use of an in-memory key-value (KV) cache often becomes an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used (LRU) or its approximations. However, the diversity of miss penalties distinguishes a KV cache from a hardware cache: inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also exhibit locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance an existing cache model, the Average Eviction Time (AET) model, so that it can model a KV cache. We then apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which synthesizes data locality and miss penalty in a quantitative manner to guide memory allocation and replacement in Redis. At the same time, we also explore the diurnal behavior of a KV store and exploit long-term reuse. We replace the original passive eviction mechanism with an automatic dump/load mechanism to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitative predictability of performance. Moreover, we can obtain even lower average latency (1.1%∼5.5%) when dynamically switching policies between pRedis and HC.
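As a final illustration, the sketch below shows one way to make eviction penalty-aware: the victim is the entry with the smallest expected cost of a future miss, i.e., its re-fetch penalty weighted by a crude reuse estimate. The recency/frequency proxy stands in for the AET-based locality model that pRedis actually uses, and the class and field names are invented for the example.

```python
# Sketch of penalty-aware eviction: evict the entry whose future miss would cost least.
import time

class PenaltyAwareCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}        # key -> (value, miss_penalty_ms, hits, last_access)

    def _evict_one(self):
        now = time.monotonic()
        def expected_miss_cost(item):
            _, (_, penalty, hits, last) = item
            reuse_likelihood = hits / (1.0 + (now - last))   # crude locality proxy
            return penalty * reuse_likelihood
        victim = min(self.data.items(), key=expected_miss_cost)[0]
        del self.data[victim]

    def put(self, key, value, miss_penalty_ms):
        if key not in self.data and len(self.data) >= self.capacity:
            self._evict_one()
        self.data[key] = (value, miss_penalty_ms, 1, time.monotonic())

    def get(self, key):
        if key not in self.data:
            return None
        value, penalty, hits, _ = self.data[key]
        self.data[key] = (value, penalty, hits + 1, time.monotonic())
        return value

cache = PenaltyAwareCache(capacity=2)
cache.put("cheap", "a", miss_penalty_ms=1)     # fast to re-fetch on a miss
cache.put("costly", "b", miss_penalty_ms=50)   # expensive backend query
cache.put("new", "c", miss_penalty_ms=5)       # forces an eviction: "cheap" goes
print(sorted(cache.data))                       # -> ['costly', 'new']
```
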