ACM Transactions on Storage最新文献_第7页

Realizing Strong Determinism Contract on Log-Structured Merge Key-Value Stores 日志结构合并键值库中强确定性契约的实现

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-02-24 DOI: 10.1145/3582695

Miryeong Kwon, Seungjun Lee, Hyunkyu Choi, Jooyoung Hwang, Myoungsoo Jung

We propose Vigil-KV, a hardware and software co-designed framework that eliminates long-tail latency almost perfectly by introducing strong latency determinism. To make Get latency deterministic, Vigil-KV first enables a predictable latency mode (PLM) interface on a real datacenter-scale NVMe SSD, having knowledge about the nature of the underlying flash technologies. Vigil-KV at the system-level then hides the non-deterministic time window (associated with SSD’s internal tasks and/or write services) by internally scheduling the different device states of PLM across multiple physical functions. Vigil-KV further schedules compaction/flush operations and client requests being aware of PLM’s restrictions thereby integrating strong latency determinism into LSM KVs. We implement Vigil-KV upon a 1.92TB NVMe SSD prototype and Linux 4.19.91, but other LSM KVs can adopt its concept. We evaluate diverse Facebook and Yahoo scenarios with Vigil-KV, and the results show that Vigil-KV can reducethe tail latency of a baseline KV system by 3.19× while reducing the average latency by 34%, on average.

我们提出了Vigil KV，这是一个硬件和软件共同设计的框架，通过引入强延迟确定性，几乎完美地消除了长尾延迟。为了使Get latency具有确定性，Vigil KV首先在真正的数据中心规模的NVMe SSD上启用了可预测延迟模式（PLM）接口，了解底层闪存技术的性质。然后，系统级的Vigil KV通过在多个物理功能上对PLM的不同设备状态进行内部调度来隐藏不确定的时间窗口（与SSD的内部任务和/或写入服务相关）。Vigil KV在意识到PLM限制的情况下，进一步安排压缩/刷新操作和客户端请求，从而将强大的延迟确定性集成到LSM KV中。我们在1.92TB NVMe SSD原型和Linux 4.19.91上实现了Vigil KV，但其他LSM KV可以采用它的概念。我们用Vigil KV评估了不同的Facebook和Yahoo场景，结果表明，Vigil KV可以将基线KV系统的尾部延迟减少3.19倍，同时平均将平均延迟减少34%。

引用次数: 0

Boosting Cache Performance by Access Time Measurements 通过访问时间测量提高缓存性能

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-02-17 DOI: https://dl.acm.org/doi/10.1145/3572778

Gil Einziger, Omri Himelbrand, Erez Waisbard

Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution.

Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average when considering all experiments in our study.

大多数现代系统利用缓存来减少平均数据访问时间并优化其性能。最近提出的策略隐式地假设统一的访问时间，但是可变的访问时间自然会出现在存储、web搜索和DNS解析等领域中。我们的工作测量各种项目的访问时间，并利用访问时间的变化作为缓存算法的附加信号。使用这样的信号，我们引入了自适应的访问时间感知缓存策略，与不同工作负载下的最佳替代方案相比，这些策略持续提高了平均访问时间。我们的自适应算法在存储工作负载中平均减少了高达46%的访问时间，在网络搜索中减少了高达16%的访问时间，在考虑我们研究中的所有实验时平均减少了8.4%。

引用次数: 0

The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression 用于高效重复数据删除后增量压缩的快速轻量级相似性检测设计

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-02-16 DOI: 10.1145/3584663

Wen Xia, Lifeng Pu, Xiangyu Zou, Philip Shilane, Shiyi Li, Haijun Zhang, Xuan Wang

Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values. In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.

重复数据消除后增量压缩是一种数据缩减技术，它计算并存储存储系统中非常相似但不重复的块的差异，从而能够实现非常高的压缩率。然而，由于引入了高计算开销，广泛使用的相似性检测方法（例如，N变换）的低吞吐量通常成为delta压缩系统的瓶颈。通常，这种开销主要由两部分组成：①跨数据块逐字节计算滚动哈希；②对所有计算出的滚动哈希值进行多次转换。在本文中，我们提出了一种快速、轻量级的相似性检测方法Odess，它大大减少了相似性检测的计算开销，同时实现了高检测精度和高压缩比。Odess首先利用了一种新颖的基于子窗口的并行滚动（SWPR）哈希方法，该方法使用单指令多数据[1]（SIMD）来加速滚动哈希的计算（对应于开销的第一部分）。然后，Odess使用一种新颖的内容定义采样方法从整个滚动哈希集生成一个小得多的代理哈希集，并在这个小哈希集上快速应用变换进行相似性检测（对应于开销的第二部分）。评估结果表明，在相似性检测阶段，Odess方法分别比最先进的N-变换和Finesse（N-变换的最新变体[39]）快31.4倍和7.9倍。当考虑端到端数据缩减存储系统时，基于Odess的系统的吞吐量分别比基于N-变换和Finesse的系统的吞吐率高3.20倍和1.41倍，同时保持N-变换的高压缩比，并实现比Finesse高约1.22倍的压缩比。

{"title":"The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression","authors":"Wen Xia, Lifeng Pu, Xiangyu Zou, Philip Shilane, Shiyi Li, Haijun Zhang, Xuan Wang","doi":"10.1145/3584663","DOIUrl":"https://doi.org/10.1145/3584663","url":null,"abstract":"Post-deduplication delta compression is a data reduction technique that calculates and stores the differences of very similar but non-duplicate chunks in storage systems, which is able to achieve a very high compression ratio. However, the low throughput of widely used resemblance detection approaches (e.g., N-Transform) usually becomes the bottleneck of delta compression systems due to introducing high computational overhead. Generally, this overhead mainly consists of two parts: ① calculating the rolling hash byte by byte across data chunks and ② applying multiple transforms on all of the calculated rolling hash values. In this article, we propose Odess, a fast and lightweight resemblance detection approach, that greatly reduces the computational overhead for resemblance detection while achieving high detection accuracy and a high compression ratio. Odess first utilizes a novel Subwindow-based Parallel Rolling (SWPR) hash method using Single Instruction Multiple Data [1] (SIMD) to accelerate calculation of rolling hashes (corresponding to the first part of the overhead). Odess then uses a novel Content-Defined Sampling method to generate a much smaller proxy hash set from the whole rolling hash set and quickly applies transforms on this small hash set for resemblance detection (corresponding to the second part of the overhead). Evaluation results show that during the stage of resemblance detection, the Odess approach is ∼31.4× and ∼7.9× faster than the state-of-the-art N-Transform and Finesse (a recent variant of N-Transform [39]), respectively. When considering an end-to-end data reduction storage system, the Odess-based system’s throughput is about 3.20× and 1.41× higher than the N-Transform- and Finesse-based systems’ throughput, respectively, while maintaining the high compression ratio of N-Transform and achieving ∼1.22× higher compression ratio over Finesse.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"109 3","pages":"1 - 30"},"PeriodicalIF":1.7,"publicationDate":"2023-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41294923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs TriCache：一种用户透明的块缓存，使用内存程序实现高性能的核外处理

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-02-13 DOI: 10.1145/3583139

Guan Feng, Huanqi Cao, Xiaowei Zhu, Bowen Yu, Yuanwei Wang, Zixuan Ma, Shengqi Chen, Wenguang Chen

Out-of-core systems rely on high-performance cache sub-systems to reduce the number of I/O operations. Although the page cache in modern operating systems enables transparent access to memory and storage devices, it suffers from efficiency and scalability issues on cache misses, forcing out-of-core systems to design and implement their own cache components, which is a non-trivial task. This study proposes TriCache, a cache mechanism that enables in-memory programs to efficiently process out-of-core datasets without requiring any code rewrite. It provides a virtual memory interface on top of the conventional block interface to simultaneously achieve user transparency and sufficient out-of-core performance. A multi-level block cache design is proposed to address the challenge of per-access address translations required by a memory interface. It can exploit spatial and temporal localities in memory or storage accesses to render storage-to-memory address translation and page-level concurrency control adequately efficient for the virtual memory interface. Our evaluation shows that in-memory systems operating on top of TriCache can outperform Linux OS page cache by more than one order of magnitude, and can deliver performance comparable to or even better than that of corresponding counterparts designed specifically for out-of-core scenarios.

外核系统依赖于高性能缓存子系统来减少I/O操作的数量。尽管现代操作系统中的页面缓存支持对内存和存储设备的透明访问，但它在缓存丢失时存在效率和可伸缩性问题，迫使核心外系统设计和实现自己的缓存组件，这是一项非常重要的任务。本研究提出了TriCache，这是一种缓存机制，使内存程序能够有效地处理核心外数据集，而无需重写任何代码。它在传统块接口之上提供了一个虚拟内存接口，以同时实现用户透明性和足够的核外性能。提出了一种多级块缓存设计，以解决存储器接口要求的每次访问地址转换的挑战。它可以利用内存中的空间和时间位置或存储访问，为虚拟内存接口提供足够有效的存储到内存地址转换和页面级并发控制。我们的评估表明，在TriCache之上运行的内存系统可以比Linux操作系统的页面缓存性能高出一个数量级以上，并且可以提供与专门为非核心场景设计的相应系统相当甚至更好的性能。

{"title":"TriCache: A User-Transparent Block Cache Enabling High-Performance Out-of-Core Processing with In-Memory Programs","authors":"Guan Feng, Huanqi Cao, Xiaowei Zhu, Bowen Yu, Yuanwei Wang, Zixuan Ma, Shengqi Chen, Wenguang Chen","doi":"10.1145/3583139","DOIUrl":"https://doi.org/10.1145/3583139","url":null,"abstract":"Out-of-core systems rely on high-performance cache sub-systems to reduce the number of I/O operations. Although the page cache in modern operating systems enables transparent access to memory and storage devices, it suffers from efficiency and scalability issues on cache misses, forcing out-of-core systems to design and implement their own cache components, which is a non-trivial task. This study proposes TriCache, a cache mechanism that enables in-memory programs to efficiently process out-of-core datasets without requiring any code rewrite. It provides a virtual memory interface on top of the conventional block interface to simultaneously achieve user transparency and sufficient out-of-core performance. A multi-level block cache design is proposed to address the challenge of per-access address translations required by a memory interface. It can exploit spatial and temporal localities in memory or storage accesses to render storage-to-memory address translation and page-level concurrency control adequately efficient for the virtual memory interface. Our evaluation shows that in-memory systems operating on top of TriCache can outperform Linux OS page cache by more than one order of magnitude, and can deliver performance comparable to or even better than that of corresponding counterparts designed specifically for out-of-core scenarios.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":" ","pages":"1 - 30"},"PeriodicalIF":1.7,"publicationDate":"2023-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45651206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

ZNSwap: un-Block your Swap ZNSwap：取消阻止您的Swap

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-02-01 DOI: 10.1145/3582434

Shai Bergman, Niklas Cassel, Matias Bjørling, M. Silberstein

We introduce ZNSwap , a novel swap subsystem optimized for the recent Zoned Namespace (ZNS) SSDs. ZNSwap leverages ZNS’s explicit control over data management on the drive and introduces a space-efficient host-side Garbage Collector (GC) for swap storage co-designed with the OS swap logic. ZNSwap enables cross-layer optimizations, such as direct access to the in-kernel swap usage statistics by the GC to enable fine-grain swap storage management, and correct accounting of the GC bandwidth usage in the OS resource isolation mechanisms to improve performance isolation in multi-tenant environments. We evaluate ZNSwap using standard Linux swap benchmarks and two production key-value stores. ZNSwap shows significant performance benefits over the Linux swap on traditional SSDs, such as stable throughput for different memory access patterns, and 10× lower 99th percentile latency and 5× higher throughput for memcached key-value store under realistic usage scenarios.

我们介绍了ZNSwap，这是一种针对最近的分区命名空间（ZNS）SSD进行优化的新型交换子系统。ZNSwap利用ZNS对驱动器上数据管理的明确控制，并引入了一个与操作系统交换逻辑共同设计的用于交换存储的节省空间的主机端垃圾回收器（GC）。ZNSwap支持跨层优化，例如GC直接访问内核内交换使用情况统计信息以实现细粒度交换存储管理，以及在操作系统资源隔离机制中正确核算GC带宽使用情况以提高多租户环境中的性能隔离。我们使用标准的Linux交换基准和两个生产关键价值存储来评估ZNSwap。与传统SSD上的Linux交换相比，ZNSwap显示出显著的性能优势，例如不同内存访问模式的稳定吞吐量，以及在实际使用场景下memcached键值存储的10倍低的99%延迟和5倍高的吞吐量。

引用次数: 4

CacheSack: Theory and Experience of Google’s Admission Optimization for Datacenter Flash Caches CacheSack：谷歌数据中心Flash缓存准入优化的理论与经验

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-01-21 DOI: 10.1145/3582014

Tzu-Wei Yang, Seth Pollen, Mustafa Uysal, A. Merchant, H. Wolfmeister, Junaid Khalid

This article describes the algorithm, implementation, and deployment experience of CacheSack, the admission algorithm for Google datacenter flash caches. CacheSack minimizes the dominant costs of Google’s datacenter flash caches: disk IO and flash footprint. CacheSack partitions cache traffic into disjoint categories, analyzes the observed cache benefit of each subset, and formulates a knapsack problem to assign the optimal admission policy to each subset. Prior to this work, Google datacenter flash cache admission policies were optimized manually, with most caches using the Lazy Adaptive Replacement Cache algorithm. Production experiments showed that CacheSack significantly outperforms the prior static admission policies for a 7.7% improvement of the total cost of ownership, as well as significant improvements in disk reads (9.5% reduction) and flash wearout (17.8% reduction).

本文描述了CacheSack的算法、实现和部署经验，CacheSack是谷歌数据中心闪存缓存的接纳算法。CacheSack最大限度地减少b谷歌数据中心闪存缓存的主要成本:磁盘IO和闪存占用。CacheSack将缓存流量划分为不相关的类别，分析观察到的每个子集的缓存效益，并制定一个背包问题，为每个子集分配最优的允许策略。在此之前，谷歌数据中心的闪存缓存准入策略是手动优化的，大多数缓存使用Lazy Adaptive Replacement cache算法。生产实验表明，CacheSack显著优于之前的静态准入策略，总拥有成本提高了7.7%，磁盘读取(减少9.5%)和闪存磨损(减少17.8%)方面也有显著改善。

引用次数: 0

KVRangeDB: Range Queries for a Hash-based Key–Value Device KVRangeDB：基于哈希的键值设备的范围查询

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-01-21 DOI: 10.1145/3582013

Mian Qin, Qing Zheng, Jason Lee, B. Settlemyer, Fei Wen, N. Reddy, Paul V. Gratz

Key–value (KV) software has proven useful to a wide variety of applications including analytics, time-series databases, and distributed file systems. To satisfy the requirements of diverse workloads, KV stores have been carefully tailored to best match the performance characteristics of underlying solid-state block devices. Emerging KV storage device is a promising technology for both simplifying the KV software stack and improving the performance of persistent storage-based applications. However, while providing fast, predictable put and get operations, existing KV storage devices do not natively support range queries that are critical to all three types of applications described above. In this article, we present KVRangeDB, a software layer that enables processing range queries for existing hash-based KV solid-state disks (KVSSDs). As an effort to adapt to the performance characteristics of emerging KVSSDs, KVRangeDB implements log-structured merge tree key index that reduces compaction I/O, merges keys when possible, and provides separate caches for indexes and values. We evaluated the KVRangeDB under a set of representative workloads, and compared its performance with two existing database solutions: a Rocksdb variant ported to work with the KVSSD, and Wisckey, a key–value database that is carefully tuned for conventional block devices. On filesystem aging workloads, KVRangeDB outperforms Wisckey by 23.7× in terms of throughput and reduce CPU usage and external write amplifications by 14.3× and 9.8×, respectively.

关键价值（KV）软件已被证明可用于各种应用程序，包括分析、时间序列数据库和分布式文件系统。为了满足不同工作负载的要求，KV存储经过精心定制，以最佳匹配底层固态块设备的性能特征。新兴的KV存储设备是一种很有前途的技术，既可以简化KV软件堆栈，又可以提高基于持久存储的应用程序的性能。然而，在提供快速、可预测的输入和输出操作的同时，现有的KV存储设备本身并不支持对上述所有三种类型的应用程序都至关重要的范围查询。在本文中，我们介绍了KVRangeDB，这是一个软件层，可以处理现有基于哈希的KV固态磁盘（KVSSD）的范围查询。为了适应新兴KVSSD的性能特征，KVRangeDB实现了日志结构的合并树键索引，该索引可减少压缩I/O，在可能的情况下合并键，并为索引和值提供单独的缓存。我们在一组具有代表性的工作负载下评估了KVRangeDB，并将其性能与两种现有的数据库解决方案进行了比较：一种是Rocksdb变体，用于KVSSD，另一种是Wisckey，一种是针对传统块设备精心调整的关键值数据库。在文件系统老化的工作负载上，KVRangeDB的吞吐量比Wisckey高23.7倍，CPU使用率和外部写入放大率分别降低了14.3倍和9.8倍。

{"title":"KVRangeDB: Range Queries for a Hash-based Key–Value Device","authors":"Mian Qin, Qing Zheng, Jason Lee, B. Settlemyer, Fei Wen, N. Reddy, Paul V. Gratz","doi":"10.1145/3582013","DOIUrl":"https://doi.org/10.1145/3582013","url":null,"abstract":"Key–value (KV) software has proven useful to a wide variety of applications including analytics, time-series databases, and distributed file systems. To satisfy the requirements of diverse workloads, KV stores have been carefully tailored to best match the performance characteristics of underlying solid-state block devices. Emerging KV storage device is a promising technology for both simplifying the KV software stack and improving the performance of persistent storage-based applications. However, while providing fast, predictable put and get operations, existing KV storage devices do not natively support range queries that are critical to all three types of applications described above. In this article, we present KVRangeDB, a software layer that enables processing range queries for existing hash-based KV solid-state disks (KVSSDs). As an effort to adapt to the performance characteristics of emerging KVSSDs, KVRangeDB implements log-structured merge tree key index that reduces compaction I/O, merges keys when possible, and provides separate caches for indexes and values. We evaluated the KVRangeDB under a set of representative workloads, and compared its performance with two existing database solutions: a Rocksdb variant ported to work with the KVSSD, and Wisckey, a key–value database that is carefully tuned for conventional block devices. On filesystem aging workloads, KVRangeDB outperforms Wisckey by 23.7× in terms of throughput and reduce CPU usage and external write amplifications by 14.3× and 9.8×, respectively.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 21"},"PeriodicalIF":1.7,"publicationDate":"2023-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42130474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 2

Localized Validation Accelerates Distributed Transactions on Disaggregated Persistent Memory 本地化验证加速了分解持久内存上的分布式事务

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-01-21 DOI: 10.1145/3582012

Ming Zhang, Yu Hua, Pengfei Zuo, Lurong Liu

Persistent memory (PM) disaggregation significantly improves the resource utilization and failure isolation to build a scalable and cost-effective remote memory pool in modern data centers. However, due to offering limited computing power and overlooking the bandwidth and persistence properties of real PMs, existing distributed transaction schemes, which are designed for legacy DRAM-based monolithic servers, fail to efficiently work on the disaggregated PM. In this article, we propose FORD, a Fast One-sided RDMA-based Distributed transaction system for the new disaggregated PM architecture. FORD thoroughly leverages one-sided remote direct memory access to handle transactions for bypassing the remote CPU in the PM pool. To reduce the round trips, FORD batches the read and lock operations into one request to eliminate extra locking and validations for the read-write data. To accelerate the transaction commit, FORD updates all remote replicas in a single round trip with parallel undo logging and data visibility control. Moreover, considering the limited PM bandwidth, FORD enables the backup replicas to be read to alleviate the load on the primary replicas, thus improving the throughput. To efficiently guarantee the remote data persistency in the PM pool, FORD selectively flushes data to the backup replicas to mitigate the network overheads. Nevertheless, the original FORD wastes some validation round trips if the read-only data are not modified by other transactions. Hence, we further propose a localized validation scheme to transfer the validation operations for the read-only data from remote to local as much as possible to reduce the round trips. Experimental results demonstrate that FORD significantly improves the transaction throughput by up to 3× and decreases the latency by up to 87.4% compared with state-of-the-art systems.

持久内存（PM）分解显著提高了资源利用率和故障隔离，从而在现代数据中心构建了一个可扩展且经济高效的远程内存池。然而，由于提供的计算能力有限，并且忽略了真实PM的带宽和持久性特性，现有的分布式事务方案是为传统的基于DRAM的单片服务器设计的，无法有效地处理分解的PM。在本文中，我们提出了FORD，用于新的分解PM架构的基于快速单侧RDMA的分布式事务系统。FORD完全利用单侧远程直接内存访问来处理事务，从而绕过PM池中的远程CPU。为了减少往返，FORD将读取和锁定操作批处理到一个请求中，以消除读写数据的额外锁定和验证。为了加快事务提交，FORD通过并行撤消日志记录和数据可见性控制，在一次往返中更新所有远程副本。此外，考虑到PM带宽有限，FORD允许读取备份副本，以减轻主副本的负载，从而提高吞吐量。为了有效地保证PM池中的远程数据持久性，FORD选择性地将数据刷新到备份副本，以减轻网络开销。然而，如果只读数据没有被其他事务修改，原始FORD会浪费一些验证往返行程。因此，我们进一步提出了一种本地化验证方案，尽可能地将只读数据的验证操作从远程转移到本地，以减少往返次数。实验结果表明，与最先进的系统相比，FORD显著提高了高达3倍的事务吞吐量，并将延迟降低了87.4%。

{"title":"Localized Validation Accelerates Distributed Transactions on Disaggregated Persistent Memory","authors":"Ming Zhang, Yu Hua, Pengfei Zuo, Lurong Liu","doi":"10.1145/3582012","DOIUrl":"https://doi.org/10.1145/3582012","url":null,"abstract":"Persistent memory (PM) disaggregation significantly improves the resource utilization and failure isolation to build a scalable and cost-effective remote memory pool in modern data centers. However, due to offering limited computing power and overlooking the bandwidth and persistence properties of real PMs, existing distributed transaction schemes, which are designed for legacy DRAM-based monolithic servers, fail to efficiently work on the disaggregated PM. In this article, we propose FORD, a Fast One-sided RDMA-based Distributed transaction system for the new disaggregated PM architecture. FORD thoroughly leverages one-sided remote direct memory access to handle transactions for bypassing the remote CPU in the PM pool. To reduce the round trips, FORD batches the read and lock operations into one request to eliminate extra locking and validations for the read-write data. To accelerate the transaction commit, FORD updates all remote replicas in a single round trip with parallel undo logging and data visibility control. Moreover, considering the limited PM bandwidth, FORD enables the backup replicas to be read to alleviate the load on the primary replicas, thus improving the throughput. To efficiently guarantee the remote data persistency in the PM pool, FORD selectively flushes data to the backup replicas to mitigate the network overheads. Nevertheless, the original FORD wastes some validation round trips if the read-only data are not modified by other transactions. Hence, we further propose a localized validation scheme to transfer the validation operations for the read-only data from remote to local as much as possible to reduce the round trips. Experimental results demonstrate that FORD significantly improves the transaction throughput by up to 3× and decreases the latency by up to 87.4% compared with state-of-the-art systems.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":" ","pages":"1 - 35"},"PeriodicalIF":1.7,"publicationDate":"2023-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49399837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

FlatLSM: Write-Optimized LSM-Tree for PM-Based KV Stores FlatLSM：用于基于PM的KV存储的写优化LSM树

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-01-21 DOI: 10.1145/3579855

Kewen He, Yujie An, Yijing Luo, Xiaoguang Liu, Gang Wang

The Log-Structured Merge Tree (LSM-Tree) is widely used in key-value (KV) stores because of its excwrite performance. But LSM-Tree-based KV stores still have the overhead of write-ahead log and write stall caused by slow L0 flush and L0-L1 compaction. New byte-addressable, persistent memory (PM) devices bring an opportunity to improve the write performance of LSM-Tree. Previous studies on PM-based LSM-Tree have not fully exploited PM’s “dual role” of main memory and external storage. In this article, we analyze two strategies of memtables based on PM and the reasons write stall problems occur in the first place. Inspired by the analysis result, we propose FlatLSM, a specially designed flat LSM-Tree for non-volatile memory based KV stores. First, we propose PMTable with separated index and data. The PM Log utilizes the Buffer Log to store KVs of size less than 256B. Second, to solve the write stall problem, FlatLSM merges the volatile memtables and the persistent L0 into large PMTables, which can reduce the depth of LSM-Tree and concentrate I/O bandwidth on L0-L1 compaction. To mitigate write stall caused by flushing large PMTables to SSD, we propose a parallel flush/compaction algorithm based on KV separation. We implemented FlatLSM based on RocksDB and evaluated its performance on Intel’s latest PM device, the Intel Optane DC PMM with the state-of-the-art PM-based LSM-Tree KV stores, FlatLSM improves the throughput 5.2× on random write workload and 2.55× on YCSB-A.

日志结构合并树（LSM-Tree）由于其优异的执行性能，在密钥值（KV）存储中得到了广泛的应用。但是，基于LSM树的KV存储仍然存在由于L0刷新和L0-L1压缩缓慢而导致的提前写入日志和写入停滞的开销。新的字节可寻址持久存储器（PM）设备为提高LSM树的写入性能提供了机会。以前对基于PM的LSM树的研究并没有充分利用PM的主存储器和外部存储器的“双重作用”。在本文中，我们首先分析了基于PM的内存表的两种策略，以及出现写停滞问题的原因。受分析结果的启发，我们提出了FlatLSM，这是一种专门设计的用于基于非易失性存储器的KV存储的平坦LSM树。首先，我们提出了索引和数据分离的PMTable。PM日志利用缓冲日志来存储大小小于256B的KV。其次，为了解决写停滞问题，FlatLSM将易失性memtables和持久性L0合并为大型PMTables，这可以减少LSM树的深度，并将I/O带宽集中在L0-L1压缩上。为了减轻由于将大型PMTables刷新到SSD而导致的写入停滞，我们提出了一种基于KV分离的并行刷新/压缩算法。我们基于RocksDB实现了FlatLSM，并在Intel最新的PM设备Intel Optane DC PMM上评估了其性能，该设备具有最先进的基于PM的LSM Tree KV存储，FlatLSM在随机写入工作负载上提高了5.2倍的吞吐量，在YCSB-A上提高了2.55倍的吞吐量。

{"title":"FlatLSM: Write-Optimized LSM-Tree for PM-Based KV Stores","authors":"Kewen He, Yujie An, Yijing Luo, Xiaoguang Liu, Gang Wang","doi":"10.1145/3579855","DOIUrl":"https://doi.org/10.1145/3579855","url":null,"abstract":"The Log-Structured Merge Tree (LSM-Tree) is widely used in key-value (KV) stores because of its excwrite performance. But LSM-Tree-based KV stores still have the overhead of write-ahead log and write stall caused by slow L0 flush and L0-L1 compaction. New byte-addressable, persistent memory (PM) devices bring an opportunity to improve the write performance of LSM-Tree. Previous studies on PM-based LSM-Tree have not fully exploited PM’s “dual role” of main memory and external storage. In this article, we analyze two strategies of memtables based on PM and the reasons write stall problems occur in the first place. Inspired by the analysis result, we propose FlatLSM, a specially designed flat LSM-Tree for non-volatile memory based KV stores. First, we propose PMTable with separated index and data. The PM Log utilizes the Buffer Log to store KVs of size less than 256B. Second, to solve the write stall problem, FlatLSM merges the volatile memtables and the persistent L0 into large PMTables, which can reduce the depth of LSM-Tree and concentrate I/O bandwidth on L0-L1 compaction. To mitigate write stall caused by flushing large PMTables to SSD, we propose a parallel flush/compaction algorithm based on KV separation. We implemented FlatLSM based on RocksDB and evaluated its performance on Intel’s latest PM device, the Intel Optane DC PMM with the state-of-the-art PM-based LSM-Tree KV stores, FlatLSM improves the throughput 5.2× on random write workload and 2.55× on YCSB-A.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"19 1","pages":"1 - 26"},"PeriodicalIF":1.7,"publicationDate":"2023-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49366031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Improving Storage Systems Using Machine Learning 利用机器学习改进存储系统

IF 1.7 3区计算机科学 Q3 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

ACM Transactions on Storage

Pub Date : 2023-01-19 DOI: https://dl.acm.org/doi/10.1145/3568429

Ibrahim Umit Akgun, Ali Selman Aydin, Andrew Burford, Michael McNeill, Michael Arkhangelskiy, Erez Zadok

Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users—thus burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O-heavy applications, so even a small latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. We propose that ML solutions become a first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. In this article, we describe our proposed ML architecture, called KML. We developed a prototype KML architecture and applied it to two case studies: optimizing readahead and NFS read-size values. Our experiments show that KML consumes less than 4 KB of dynamic kernel memory, has a CPU overhead smaller than 0.2%, and yet can learn patterns and improve I/O throughput by as much as 2.3× and 15× for two case studies—even for complex, never-seen-before, concurrently running mixed workloads on different storage devices.

操作系统包括许多旨在提高整体存储性能和吞吐量的启发式算法。由于这种启发式方法不能很好地适用于所有条件和工作负载，因此系统设计人员向用户公开了许多可调参数，从而增加了用户不断优化自己的存储系统和应用程序的负担。在I/ o密集型应用程序中，存储系统通常是造成大部分延迟的原因，因此即使是很小的延迟改进也可能是显著的。机器学习(ML)技术有望学习模式，从中进行推广，并实现适应不断变化的工作负载的最佳解决方案。我们建议机器学习解决方案成为操作系统中的一流组件，并取代手动启发式来动态优化存储系统。在本文中，我们描述了我们提出的ML体系结构，称为KML。我们开发了一个原型KML架构，并将其应用于两个案例研究:优化预读和NFS读大小值。我们的实验表明，KML消耗的动态内核内存少于4 KB, CPU开销小于0.2%，但在两个案例研究中，KML可以学习模式并将I/O吞吐量提高2.3倍和15倍——甚至对于复杂的、从未见过的、在不同存储设备上并发运行的混合工作负载也是如此。

{"title":"Improving Storage Systems Using Machine Learning","authors":"Ibrahim Umit Akgun, Ali Selman Aydin, Andrew Burford, Michael McNeill, Michael Arkhangelskiy, Erez Zadok","doi":"https://dl.acm.org/doi/10.1145/3568429","DOIUrl":"https://doi.org/https://dl.acm.org/doi/10.1145/3568429","url":null,"abstract":"<p>Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users—thus burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O-heavy applications, so even a small latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. We propose that ML solutions become a first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. In this article, we describe our proposed ML architecture, called KML. We developed a prototype KML architecture and applied it to two case studies: optimizing readahead and NFS read-size values. Our experiments show that KML consumes less than 4 KB of dynamic kernel memory, has a CPU overhead smaller than 0.2%, and yet can learn patterns and improve I/O throughput by as much as 2.3× and 15× for two case studies—even for complex, never-seen-before, concurrently running mixed workloads on different storage devices.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"59 4","pages":""},"PeriodicalIF":1.7,"publicationDate":"2023-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138526272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0