
Latest Publications in ACM Transactions on Storage

KVRangeDB: Range Queries for a Hash-based Key–Value Device
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-21 | DOI: 10.1145/3582013
Mian Qin, Qing Zheng, Jason Lee, B. Settlemyer, Fei Wen, N. Reddy, Paul V. Gratz
Key–value (KV) software has proven useful to a wide variety of applications including analytics, time-series databases, and distributed file systems. To satisfy the requirements of diverse workloads, KV stores have been carefully tailored to best match the performance characteristics of underlying solid-state block devices. Emerging KV storage devices are a promising technology for both simplifying the KV software stack and improving the performance of persistent storage-based applications. However, while providing fast, predictable put and get operations, existing KV storage devices do not natively support range queries, which are critical to all three types of applications described above. In this article, we present KVRangeDB, a software layer that enables processing range queries for existing hash-based KV solid-state disks (KVSSDs). To adapt to the performance characteristics of emerging KVSSDs, KVRangeDB implements a log-structured merge tree key index that reduces compaction I/O, merges keys when possible, and provides separate caches for indexes and values. We evaluated KVRangeDB under a set of representative workloads and compared its performance with two existing database solutions: a RocksDB variant ported to work with the KVSSD, and Wisckey, a key–value database that is carefully tuned for conventional block devices. On filesystem aging workloads, KVRangeDB outperforms Wisckey by 23.7× in terms of throughput and reduces CPU usage and external write amplification by 14.3× and 9.8×, respectively.
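
As a rough sketch of the core idea (not the authors' implementation), the Python below models a hash-based KV device that supports only point put/get and layers an ordered key index on top of it, so a range query becomes an index scan followed by point lookups on the device. The KVDevice and KVRangeStore names are hypothetical; the real KVRangeDB maintains an LSM-tree key index with separate index and value caches on an actual KVSSD.

    # Minimal sketch: range queries over a point-lookup-only (hash-based) KV device.
    # KVDevice and KVRangeStore are illustrative names, not KVRangeDB's API.
    import bisect

    class KVDevice:
        """Models a hash-based KVSSD: fast put/get, no ordered iteration."""
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data[key]

    class KVRangeStore:
        """Keeps a sorted key index beside the device so scans become possible."""
        def __init__(self, device):
            self.device = device
            self.index = []  # sorted key list, standing in for an LSM-tree index

        def put(self, key, value):
            self.device.put(key, value)
            pos = bisect.bisect_left(self.index, key)
            if pos == len(self.index) or self.index[pos] != key:
                self.index.insert(pos, key)

        def range_query(self, lo, hi):
            # Scan the ordered index, then fetch each value with a point get.
            start = bisect.bisect_left(self.index, lo)
            for key in self.index[start:]:
                if key >= hi:
                    break
                yield key, self.device.get(key)

    store = KVRangeStore(KVDevice())
    for i in range(10):
        store.put(f"file{i:03d}", f"inode-{i}")
    print(list(store.range_query("file002", "file005")))
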
Citations: 2
Localized Validation Accelerates Distributed Transactions on Disaggregated Persistent Memory
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-21 | DOI: 10.1145/3582012
Ming Zhang, Yu Hua, Pengfei Zuo, Lurong Liu
Persistent memory (PM) disaggregation significantly improves resource utilization and failure isolation, making it possible to build a scalable and cost-effective remote memory pool in modern data centers. However, existing distributed transaction schemes, designed for legacy DRAM-based monolithic servers, fail to work efficiently on disaggregated PM: the PM pool offers limited computing power, and these schemes overlook the bandwidth and persistence properties of real PM devices. In this article, we propose FORD, a Fast One-sided RDMA-based Distributed transaction system for the new disaggregated PM architecture. FORD thoroughly leverages one-sided remote direct memory access to handle transactions while bypassing the remote CPU in the PM pool. To reduce round trips, FORD batches the read and lock operations into one request to eliminate extra locking and validations for the read-write data. To accelerate the transaction commit, FORD updates all remote replicas in a single round trip with parallel undo logging and data visibility control. Moreover, considering the limited PM bandwidth, FORD enables the backup replicas to be read to alleviate the load on the primary replicas, thus improving the throughput. To efficiently guarantee remote data persistency in the PM pool, FORD selectively flushes data to the backup replicas to mitigate the network overheads. Nevertheless, the original FORD wastes some validation round trips if the read-only data are not modified by other transactions. Hence, we further propose a localized validation scheme that moves the validation operations for read-only data from the remote side to the local side as much as possible to reduce round trips. Experimental results demonstrate that FORD improves the transaction throughput by up to 3× and decreases the latency by up to 87.4% compared with state-of-the-art systems.
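
To make the "batch the read and lock into one request" idea concrete, here is a toy single-node sketch in Python (illustrative only; it omits one-sided RDMA, replication, undo logging, and the localized validation for read-only keys, and all names are hypothetical). The point is that a read-write key fetched through a combined read-and-lock call needs no separate validation round trip at commit time.

    # Toy optimistic-transaction sketch; not FORD's code or API.
    class RemoteStore:
        def __init__(self):
            self.data = {}  # key -> [value, version, locked]

        def put(self, key, value):
            self.data[key] = [value, 0, False]

        def read_and_lock(self, key):
            # One combined request: return the value and take the lock,
            # so read-write keys need no later validation round trip.
            entry = self.data[key]
            if entry[2]:
                raise RuntimeError("conflict: key already locked")
            entry[2] = True
            return entry[0], entry[1]

        def commit(self, writes):
            # Install new values, bump versions, and release the locks.
            for key, value in writes.items():
                entry = self.data[key]
                entry[0], entry[1], entry[2] = value, entry[1] + 1, False

    def transfer(store, src, dst, amount):
        src_val, _ = store.read_and_lock(src)
        dst_val, _ = store.read_and_lock(dst)
        store.commit({src: src_val - amount, dst: dst_val + amount})

    store = RemoteStore()
    store.put("a", 100)
    store.put("b", 0)
    transfer(store, "a", "b", 30)
    print(store.data["a"][0], store.data["b"][0])  # 70 30
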
Citations: 0
FlatLSM: Write-Optimized LSM-Tree for PM-Based KV Stores
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-21 | DOI: 10.1145/3579855
Kewen He, Yujie An, Yijing Luo, Xiaoguang Liu, Gang Wang
The Log-Structured Merge Tree (LSM-Tree) is widely used in key-value (KV) stores because of its excellent write performance. But LSM-Tree-based KV stores still suffer from the overhead of write-ahead logging and from write stalls caused by slow L0 flushes and L0-L1 compaction. New byte-addressable, persistent memory (PM) devices bring an opportunity to improve the write performance of LSM-Tree. Previous studies on PM-based LSM-Tree have not fully exploited PM’s “dual role” as both main memory and external storage. In this article, we analyze two PM-based memtable strategies and the reasons write stalls occur in the first place. Inspired by the analysis results, we propose FlatLSM, a specially designed flat LSM-Tree for non-volatile memory based KV stores. First, we propose PMTable with separated index and data, where the PM Log utilizes the Buffer Log to store KVs smaller than 256B. Second, to solve the write stall problem, FlatLSM merges the volatile memtables and the persistent L0 into large PMTables, which reduces the depth of the LSM-Tree and concentrates I/O bandwidth on L0-L1 compaction. To mitigate write stalls caused by flushing large PMTables to SSD, we propose a parallel flush/compaction algorithm based on KV separation. We implemented FlatLSM based on RocksDB and evaluated its performance on Intel’s latest PM device, the Intel Optane DC PMM. Compared with state-of-the-art PM-based LSM-Tree KV stores, FlatLSM improves throughput by 5.2× on a random write workload and by 2.55× on YCSB-A.
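
The 256B figure above suggests a simple size-based placement rule; the sketch below shows only that routing decision (hypothetical names and structure, not FlatLSM's PMTable layout): small KV pairs are absorbed by a PM-resident buffer log, while larger ones go to the main PM log.

    # Sketch of size-based KV placement on PM; not FlatLSM's actual data structures.
    SMALL_KV_THRESHOLD = 256  # bytes, matching the 256B figure in the abstract

    class PMLog:
        def __init__(self):
            self.buffer_log = []  # small KVs, absorbed cheaply on PM
            self.main_log = []    # larger KVs

        def append(self, key, value):
            record = (key, value)
            if len(key) + len(value) < SMALL_KV_THRESHOLD:
                self.buffer_log.append(record)
            else:
                self.main_log.append(record)

    log = PMLog()
    log.append(b"k1", b"x" * 32)    # -> buffer_log
    log.append(b"k2", b"y" * 4096)  # -> main_log
    print(len(log.buffer_log), len(log.main_log))  # 1 1
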
Citations: 1
Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-19 | DOI: https://dl.acm.org/doi/10.1145/3568424
Yiming Zhang, Li Wang, Shun Gai, Qiwen Ke, Wenhao Li, Zhenlong Song, Guangtao Xue, Jiwu Shu

Object-based storage systems have been widely used for various scenarios such as file storage, block storage, blob (e.g., large videos) storage, and so on, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and will cause significant performance degradation when the expansion is nontrivial.

This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store (called Oasis). Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
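
A minimal sketch of the time-dimension mapping idea follows (hypothetical class and helper names; the real MapX layers sit inside the CRUSH map and place replicas across failure domains). Each expansion registers a new layer, and an object is placed using only the layer that was current when the object was created, so existing objects keep their old locations when new OSDs are added.

    # Sketch of creation-time-to-expansion-layer mapping; not Ceph or MapX code.
    import hashlib

    class TimeLayeredMap:
        def __init__(self):
            self.layers = []  # list of (expansion_time, osd_list), in time order

        def expand(self, time, osds):
            self.layers.append((time, osds))

        def place(self, obj_name, created_at):
            # Use the newest layer whose expansion time is <= the object's creation time.
            osds = [o for t, o in self.layers if t <= created_at][-1]
            h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
            return osds[h % len(osds)]

    cluster = TimeLayeredMap()
    cluster.expand(time=0, osds=["osd0", "osd1", "osd2"])
    old_loc = cluster.place("vm-disk-42", created_at=50)
    cluster.expand(time=100, osds=["osd0", "osd1", "osd2", "osd3", "osd4"])
    # The old object still maps to the same OSD; only newer objects see the new layer.
    assert cluster.place("vm-disk-42", created_at=50) == old_loc
    print(old_loc, cluster.place("vm-disk-99", created_at=150))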

Citations: 0
Improving Storage Systems Using Machine Learning
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-19 | DOI: https://dl.acm.org/doi/10.1145/3568429
Ibrahim Umit Akgun, Ali Selman Aydin, Andrew Burford, Michael McNeill, Michael Arkhangelskiy, Erez Zadok

Operating systems include many heuristic algorithms designed to improve overall storage performance and throughput. Because such heuristics cannot work well for all conditions and workloads, system designers resorted to exposing numerous tunable parameters to users—thus burdening users with continually optimizing their own storage systems and applications. Storage systems are usually responsible for most latency in I/O-heavy applications, so even a small latency improvement can be significant. Machine learning (ML) techniques promise to learn patterns, generalize from them, and enable optimal solutions that adapt to changing workloads. We propose that ML solutions become a first-class component in OSs and replace manual heuristics to optimize storage systems dynamically. In this article, we describe our proposed ML architecture, called KML. We developed a prototype KML architecture and applied it to two case studies: optimizing readahead and NFS read-size values. Our experiments show that KML consumes less than 4 KB of dynamic kernel memory, has a CPU overhead smaller than 0.2%, and yet can learn patterns and improve I/O throughput by as much as 2.3× and 15× for two case studies—even for complex, never-seen-before, concurrently running mixed workloads on different storage devices.
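
KML learns these decisions with an in-kernel ML model; the toy below is not KML, it only illustrates the kind of decision being automated in the readahead case study: picking a readahead size from the observed sequentiality of recent requests, here with a hard-coded rule where KML would use a learned model. All names and thresholds are made up for illustration.

    # Toy readahead tuner driven by recent access-pattern sequentiality.
    from collections import deque

    class ReadaheadTuner:
        def __init__(self, window=32):
            self.offsets = deque(maxlen=window)

        def observe(self, offset):
            self.offsets.append(offset)

        def readahead_kb(self):
            if len(self.offsets) < 2:
                return 128  # default readahead
            offs = list(self.offsets)
            sequential = sum(1 for a, b in zip(offs, offs[1:]) if 0 < b - a <= 4096)
            ratio = sequential / (len(offs) - 1)
            # Mostly sequential -> aggressive readahead; mostly random -> minimal.
            if ratio > 0.8:
                return 1024
            if ratio > 0.4:
                return 256
            return 16

    tuner = ReadaheadTuner()
    for off in range(0, 32 * 4096, 4096):  # a sequential scan
        tuner.observe(off)
    print(tuner.readahead_kb())  # 1024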

Citations: 0
Performance Bug Analysis and Detection for Distributed Storage and Computing Systems
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-18 | DOI: 10.1145/3580281
Jiaxin Li, Yiming Zhang, Shan Lu, Haryadi S. Gunawi, Xiaohui Gu, Feng Huang, Dongsheng Li
This article systematically studies 99 distributed performance bugs from five widely deployed distributed storage and computing systems (Cassandra, HBase, HDFS, Hadoop MapReduce, and ZooKeeper). We present the TaxPerf database, which collectively organizes the analysis results as over 400 classification labels and over 2,500 lines of bug re-description. The bugs in TaxPerf are classified into six categories (and 18 subcategories) by their root causes: resource, blocking, synchronization, optimization, configuration, and logic. TaxPerf can be used as a benchmark for performance bug studies and debugging tool designs. Although it is impractical to automatically detect all categories of performance bugs in TaxPerf, we find that an important category of blocking bugs can be effectively addressed by analysis tools. We analyze the cascading nature of blocking bugs and design an automatic detection tool called PCatch, which (i) performs program analysis to identify code regions whose execution time can potentially increase dramatically with the workload size; (ii) adapts the traditional happens-before model to reason about software resource contention and performance dependency relationships; and (iii) uses dynamic tracking to identify whether the slowdown propagation is contained in one job. Evaluation shows that PCatch can accurately detect blocking bugs of representative distributed storage and computing systems by observing system executions under small-scale workloads.
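
For readers unfamiliar with the blocking-bug pattern PCatch targets, here is a hypothetical example (not taken from the studied systems): a background job holds a lock across work whose duration grows with the workload size, so an unrelated latency-sensitive operation ends up waiting behind it; the fix keeps the slow, workload-dependent part outside the critical section.

    # Hypothetical illustration of a cascading blocking bug and its fix.
    import threading
    import time

    state_lock = threading.Lock()

    def bad_background_job(items):
        with state_lock:           # bug: lock held for the whole workload-sized scan
            for _ in items:
                time.sleep(0.001)  # stands in for per-item I/O

    def good_background_job(items):
        for _ in items:
            time.sleep(0.001)      # slow, workload-dependent part runs outside the lock
            with state_lock:       # lock held only for the brief shared-state update
                pass

    def latency_sensitive_op():
        start = time.time()
        with state_lock:
            pass
        return time.time() - start

    t = threading.Thread(target=bad_background_job, args=(range(200),))
    t.start()
    time.sleep(0.01)
    print(f"foreground waited {latency_sensitive_op() * 1000:.0f} ms behind the bad job")
    t.join()
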
Citations: 2
Boosting Cache Performance by Access Time Measurements
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.1145/3572778
Gil Einziger, Omri Himelbrand, Erez Waisbard
Most modern systems utilize caches to reduce the average data access time and optimize their performance. Recently proposed policies implicitly assume uniform access times, but variable access times naturally appear in domains such as storage, web search, and DNS resolution. Our work measures the access times for various items and exploits variations in access times as an additional signal for caching algorithms. Using such a signal, we introduce adaptive access time-aware cache policies that consistently improve the average access time compared with the best alternative in diverse workloads. Our adaptive algorithm attains an average access time reduction of up to 46% in storage workloads, up to 16% in web searches, and 8.4% on average when considering all experiments in our study.
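
The adaptive policies themselves are not reproduced here, but the sketch below shows the general shape of a cost-aware cache that uses measured access time as an extra signal, in the style of the classic GreedyDual heuristic (an assumption of ours, not the paper's algorithm): items that are expensive to fetch get higher retention priority than cheap ones.

    # GreedyDual-style cost-aware cache sketch; illustrative, not the paper's policy.
    class CostAwareCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.entries = {}     # key -> [value, priority]
            self.inflation = 0.0  # aging term, raised on every eviction

        def get(self, key, fetch, access_time):
            # fetch: callable producing the value on a miss
            # access_time: measured cost of fetching this item (the extra signal)
            if key in self.entries:
                self.entries[key][1] = self.inflation + access_time  # refresh on hit
                return self.entries[key][0]
            if len(self.entries) >= self.capacity:
                victim = min(self.entries, key=lambda k: self.entries[k][1])
                self.inflation = self.entries[victim][1]  # implicitly age the survivors
                del self.entries[victim]
            value = fetch()
            self.entries[key] = [value, self.inflation + access_time]
            return value

    cache = CostAwareCache(capacity=2)
    cache.get("slow_disk_block", lambda: "A", access_time=10.0)
    cache.get("fast_dns_entry", lambda: "B", access_time=0.1)
    cache.get("another_fast_entry", lambda: "C", access_time=0.1)
    print(sorted(cache.entries))  # the slow, expensive item is retained
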
Citations: 1
TPFS: A High-Performance Tiered File System for Persistent Memories and Disks
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-13 | DOI: 10.1145/3580280
Shengan Zheng, Morteza Hoseinzadeh, S. Swanson, Linpeng Huang
Emerging fast, byte-addressable persistent memory (PM) promises substantial storage performance gains compared with traditional disks. We present TPFS, a tiered file system that combines PM and slow disks to create a storage system with near-PM performance and large capacity. TPFS steers incoming file input/output (I/O) to PM, dynamic random access memory (DRAM), or disk depending on the synchronicity, write size, and read frequency. TPFS profiles the application’s access stream online to predict the behavior of file access. In the background, TPFS estimates the “temperature” of file data and migrates the write-cold and read-hot file data from PM to disks. To fully utilize disk bandwidth, TPFS coalesces data blocks into large, sequential writes. Experimental results show that with a small amount of PM and a large solid-state drive (SSD), TPFS achieves up to 7.3× and 7.9× throughput improvement compared with EXT4 and XFS running on an SSD alone, respectively. As the amount of PM grows, TPFS’s performance improves until it matches the performance of a PM-only file system.
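
As a loose sketch of the per-write steering decision (the thresholds and rules below are assumptions for illustration; the real TPFS profiles access streams and also migrates data in the background), synchronous writes go to PM, large writes with no expected reuse can go straight to disk, and the rest are buffered in DRAM.

    # Hypothetical tier-steering rule in the spirit of TPFS; not the actual policy.
    def choose_tier(is_sync, write_size, read_freq):
        LARGE_WRITE = 1 << 20  # 1 MiB, assumed threshold
        if is_sync:
            return "PM"        # synchronous writes need immediate, fast persistence
        if write_size >= LARGE_WRITE and read_freq == 0.0:
            return "DISK"      # large writes with no expected re-reads go to disk
        return "DRAM"          # other asynchronous writes are buffered in DRAM first

    print(choose_tier(is_sync=True, write_size=4096, read_freq=0.0))      # PM
    print(choose_tier(is_sync=False, write_size=1 << 20, read_freq=2.0))  # DRAM
    print(choose_tier(is_sync=False, write_size=1 << 20, read_freq=0.0))  # DISK
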
Citations: 0
InDe: An Inline Data Deduplication Approach via Adaptive Detection of Valid Container Utilization
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-11 | DOI: https://dl.acm.org/doi/10.1145/3568426
Lifang Lin, Yuhui Deng, Yi Zhou, Yifeng Zhu

Inline deduplication removes redundant data in real time as data is being sent to the storage system. However, it causes data fragmentation: logically consecutive chunks are physically scattered across various containers after deduplication. Many rewrite algorithms aim to alleviate the resulting performance degradation by rewriting fragmented duplicate chunks as unique chunks into new containers. Unfortunately, these algorithms decide whether a chunk is fragmented based on a simple pre-set fixed value, ignoring the variance of data characteristics between data segments. As a result, they often fail to select an appropriate set of old containers for rewrite, so that when backups are restored, the retrieved containers contain a substantial number of invalid chunks.

To address this issue, we propose an inline deduplication approach for storage systems, called InDe, which uses a greedy algorithm to detect valid container utilization and dynamically adjusts the number of old container references in each segment. InDe fully leverages the distribution of duplicated chunks to improve the restore performance while maintaining high backup performance. We define an effectiveness metric, valid container referenced counts (VCRC), to identify appropriate containers for the rewrite. We design a rewrite algorithm F-greedy that detects valid container utilization to rewrite low-VCRC containers. According to the VCRC distribution of containers, F-greedy dynamically adjusts the number of old container references to only share duplicate chunks with high-utilization containers for each segment, thereby improving the restore speed. To take full advantage of the above features, we further propose another rewrite algorithm called F-greedy+ based on adaptive interval detection of valid container utilization. F-greedy+ makes a more accurate estimation of the valid utilization of old containers by detecting trends of VCRC’s change in two directions and selecting referenced containers in the global scope. We quantitatively evaluate InDe using three real-world backup workloads. The experimental results show that compared with two state-of-the-art algorithms (Capping and SMR), our scheme improves the restore speed by 1.3×–2.4× while achieving almost the same backup performance.
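
The sketch below (a simplified stand-in, not F-greedy or F-greedy+ themselves, which adapt the decision per segment rather than using a fixed floor) shows how a utilization signal like VCRC can drive the rewrite decision during backup: duplicate chunks referencing a sparsely used old container are rewritten, while those referencing well-utilized containers are deduplicated.

    # Utilization-driven rewrite decision sketch; hypothetical helper, not InDe's code.
    def plan_rewrites(segment_chunks, container_of, container_size, utilization_floor=0.5):
        # segment_chunks: chunk ids in the segment being backed up
        # container_of:   chunk id -> container id for already-stored duplicates
        # container_size: container id -> total number of chunks it holds
        refs = {}
        for c in segment_chunks:
            if c in container_of:
                refs[container_of[c]] = refs.get(container_of[c], 0) + 1
        rewrite, dedup = [], []
        for c in segment_chunks:
            cid = container_of.get(c)
            if cid is None:
                rewrite.append(c)  # brand-new chunk, must be written anyway
                continue
            utilization = refs[cid] / container_size[cid]
            (dedup if utilization >= utilization_floor else rewrite).append(c)
        return rewrite, dedup

    container_of = {"c1": "old1", "c2": "old1", "c3": "old2"}
    container_size = {"old1": 4, "old2": 100}
    print(plan_rewrites(["c1", "c2", "c3", "c4"], container_of, container_size))
    # old1 is well utilized, so c1/c2 are deduplicated; old2 is sparse, so c3 is rewritten.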

Citations: 0
Efficient Crash Consistency for NVMe over PCIe and RDMA
IF 1.7 | CAS Tier 3, Computer Science | Q3 Computer Science | Pub Date: 2023-01-11 | DOI: https://dl.acm.org/doi/10.1145/3568428
Xiaojian Liao, Youyou Lu, Zhe Yang, Jiwu Shu

This article presents crash-consistent Non-Volatile Memory Express (ccNVMe), a novel extension of the NVMe that defines how host software communicates with the non-volatile memory (e.g., solid-state drive) across a PCI Express bus and RDMA-capable networks with both crash consistency and performance efficiency. Existing storage systems pay a huge tax on crash consistency, and thus cannot fully exploit the multi-queue parallelism and low latency of the NVMe and RDMA interfaces. ccNVMe alleviates this major bottleneck by coupling the crash consistency to the data dissemination. This new idea allows the storage system to achieve crash consistency by taking the free rides of the data dissemination mechanism of NVMe, using only two lightweight memory-mapped I/Os (MMIOs), unlike traditional systems that use complex update protocol and synchronized block I/Os. ccNVMe introduces a series of techniques including transaction-aware MMIO/doorbell and I/O command coalescing to reduce the PCIe traffic as well as to provide atomicity. We present how to build a high-performance and crash-consistent file system named MQFS atop ccNVMe. We experimentally show that MQFS increases the IOPS of RocksDB by 36% and 28% compared to a state-of-the-art file system and Ext4 without journaling, respectively.
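
The two-MMIO commit cannot be reproduced meaningfully in user-space Python, but the ordering idea it relies on can be: a transaction's data records become durable first, and a single final commit marker makes the whole transaction visible, so recovery discards any transaction without a marker. The model below is a toy redo-log analogy of that guarantee, not ccNVMe's actual NVMe MMIO/doorbell protocol.

    # Toy commit-marker model of the ordering guarantee; not ccNVMe or MQFS code.
    durable_log = []  # stands in for persistent media

    def write_transaction(txn_id, writes, crash_before_commit=False):
        for key, value in writes:
            durable_log.append(("data", txn_id, key, value))  # dissemination step
        if crash_before_commit:
            return  # simulated crash: no commit marker reaches the media
        durable_log.append(("commit", txn_id))                # single commit marker

    def recover(log):
        committed = {entry[1] for entry in log if entry[0] == "commit"}
        state = {}
        for entry in log:
            if entry[0] == "data" and entry[1] in committed:
                state[entry[2]] = entry[3]
        return state

    write_transaction(1, [("a", 1), ("b", 2)])
    write_transaction(2, [("a", 99)], crash_before_commit=True)  # torn transaction
    print(recover(durable_log))  # {'a': 1, 'b': 2}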

Citations: 0