gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores
Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin
The log-structured merge tree (LSM-tree) is a technological underpinning of key-value (KV) stores that support a wide range of performance-critical applications. By reorganizing data in the background through compaction operations, these KV stores can service write requests swiftly with sequential, batched disk writes, and serve read requests over KV items that compaction keeps sorted. Compaction demands high I/O bandwidth and CPU speed to sustain quality of service for user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming the performance bottleneck. To mitigate this bottleneck, which limits the performance of the KV store and of the applications it supports, we propose gLSM, a system that leverages GPGPUs to markedly accelerate compaction operations. gLSM fully utilizes the parallelism and computational capability inside GPGPUs to improve compaction performance. We design a driver framework to parallelize compaction operations handled between a pair of CPU and GPGPU. We exploit data independence and a GPGPU-oriented radix-sorting algorithm to conduct compactions concurrently. A key-value separation method is devised to reduce the volume of data transferred from CPU-side memory to its GPGPU counterpart. The results reveal that gLSM improves throughput and compaction bandwidth by up to factors of 2.9 and 26.0, respectively, compared with four state-of-the-art KV stores. gLSM also reduces write latency by 73.3%, and it exhibits a performance improvement of up to 45% over a variant that lacks the KV-separation and collaboration-sort modules.
{"title":"gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores","authors":"Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin","doi":"10.1145/3633782","DOIUrl":"https://doi.org/10.1145/3633782","url":null,"abstract":"<p>Log-structured-merge tree or LSM-tree is a technological underpinning in key-value (KV) stores to support a wide range of performance-critical applications. By conducting data re-organization in the background by virtue of compaction operations, the KV stores have the potential to swiftly service write requests with sequential batched disk writes and read requests for KV items constantly sorted by the compaction. Compaction demands high I/O bandwidth and CPU speed to facilitate quality service to user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming a performance bottleneck. To mitigate the bottleneck limiting the KV-store’s performance and that of the applications supported by the store, we propose a system - <i>gLSM</i> - to leverage GPGPU to remarkably accelerate the compaction operations. gLSM fully utilizes the parallelism and computational capability inside GPGPUs to improve the compaction performance. We design a driver framework to parallelize compaction operations handled between a pair of CPU and GPGPU. We employ data independence and GPGPU-orient radix-sorting algorithm to concurrently conduct compaction. A key-value separation method is devised to slash the transfer of data volume from CPU-side memory to the GPGPU counterpart. The results reveal that gLSM improves the throughput and compaction bandwidth by up to a factor of 2.9 and 26.0, respectively, compared with the four state-of-the-art KV stores. gLSM also reduces the write latency by 73.3%. gLSM exhibits a performance improvement by up to 45% compared against its variant where there are no KV separation and collaboration sort modules.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems
Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu
LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to support secondary indexing efficiently, since a secondary index query usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly suited to efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems that accounts for the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM, hash-based validation approach that filters out obsolete values with little overhead, and (3) two adapted optimizations for primary-table searches issued from secondary indexes that accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3-7x and achieves roughly two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when those are run on PM instead of disks.
{"title":"Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems","authors":"Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu","doi":"10.1145/3633285","DOIUrl":"https://doi.org/10.1145/3633285","url":null,"abstract":"<p>LSM-based storage systems are widely used for superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values, which scatter in multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly competent for efficient secondary indexing. We propose <span>Perseid</span>, an efficient PM-based secondary indexing mechanism for LSM-based storage systems, which takes into account both characteristics of PM and secondary indexing. <span>Perseid</span> consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach to filter out obsolete values with subtle overhead, and (3) two adapted optimizations on primary table searching issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that <span>Perseid</span> outperforms existing PM-based indexes by 3-7 × and achieves about two orders of magnitude performance of state-of-the-art LSM-based secondary indexing techniques even if on PM instead of disks.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Section on USENIX FAST 2023
Ashvin Goel, Dalit Naor
{"title":"Introduction to the Special Section on USENIX FAST 2023","authors":"Ashvin Goel, Dalit Naor","doi":"10.1145/3612820","DOIUrl":"https://doi.org/10.1145/3612820","url":null,"abstract":"This special section of the <italic>IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG)</italic> presents the five most highly rated papers from the 2023 IEEE Pacific Visualization Symposium (IEEE PacificVis), hosted in Seoul, Korea from ...","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134954270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field because they can balance many of these requirements. In our work, we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners seeking to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
{"title":"Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)","authors":"Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant","doi":"10.1145/3626198","DOIUrl":"https://doi.org/10.1145/3626198","url":null,"abstract":"<p>Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. <i>Locally recoverable codes (LRCs)</i> have deservedly gained central importance in this field, because they can balance many of these requirements. In our work, we study wide LRCs; LRCs with large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their <i>reliability</i>, since wider stripes are prone to more simultaneous failures.</p><p>We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called <i>Uniform Cauchy LRCs</i>, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Missteps to Milestones: A Journey to Practical Fail-Slow Detection
Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu
Newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. During a 10-month close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root-cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow studies.
{"title":"From Missteps to Milestones: A Journey to Practical Fail-Slow Detection","authors":"Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu","doi":"10.1145/3617690","DOIUrl":"https://doi.org/10.1145/3617690","url":null,"abstract":"The newly emerging “fail-slow” failures plague both software and hardware where the victim components are still functioning yet with degraded performance. To address this problem, this article presents Perseus , a practical fail-slow detection framework for storage devices. Perseus leverages a light regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Within a 10-month close monitoring on 248K drives, Perseus managed to find 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis on fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134957500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable Wear Leveling Technique for Phase Change Memory
Wang Xu, Israel Koren
Phase Change Memory (PCM), one of the recently proposed non-volatile memory technologies, suffers from low write endurance. For example, a single-layer PCM cell can only be written approximately 10^8 times. This limits the lifetime of a PCM-based memory to a few days, rather than years, when memory-intensive applications are running. Wear-leveling techniques have been proposed to improve the write endurance of a PCM. Among those techniques, the region-based start-gap (RBSG) scheme is widely cited as achieving the highest lifetime. Based on our experiments, RBSG can achieve 97% of the ideal lifetime, but only for relatively small memory sizes (e.g., 8GB-32GB). As the memory size grows, RBSG becomes less effective, and its expected percentage of the ideal lifetime drops below 57% for a 2TB PCM. In this paper, we propose a table-based wear-leveling scheme called block grouping to enhance the write endurance of a PCM with negligible overhead. Our results show that, with a proper configuration and the adoption of partial writes (writing back only 64B subblocks instead of a whole row to the PCM arrays) and internal row shift (shifting the subblocks in a row periodically so that no subblock in a row is written repeatedly), the proposed block-grouping scheme achieves 95% of the ideal lifetime on average for the Rodinia, NPB, and SPEC benchmarks, with less than 1.74% performance overhead and up to 0.18% hardware overhead. Moreover, our scheme is scalable and achieves the same percentage of the ideal lifetime for PCMs ranging in size from 8GB to 2TB. We also show that the proposed scheme tolerates memory write attacks better than WoLFRAM (Wear Leveling and Fault Tolerance for Resistive Memories) and RBSG for a PCM of size 32GB or larger. Finally, we integrate an error-correcting pointer technique into our block-grouping scheme to make the PCM tolerant of hard errors.
{"title":"A Scalable Wear Leveling Technique for Phase Change Memory","authors":"Wang Xu, Israel Koren","doi":"10.1145/3631146","DOIUrl":"https://doi.org/10.1145/3631146","url":null,"abstract":"Phase Change Memory (PCM), one of the recently proposed non-volatile memory technologies, has been suffering from low write endurance. For example, a single layer PCM cell could only be written approximately 10 8 times. This limits the lifetime of a PCM-based memory to a few days rather than years when memory intensive applications are running. Wear leveling techniques have been proposed to improve the write endurance of a PCM. Among those techniques, the region based start-gap (RBSG) scheme is widely cited as achieving the highest lifetime. Based on our experiments, RBSG can achieve 97% of the ideal lifetime but only for relatively small memory sizes (e.g. 8GB-32GB). As the memory size goes up, RBSG becomes less effective and its expected percentage of the ideal lifetime reduces to less than 57% for a 2TB PCM. In this paper, we propose a table-based wear leveling scheme called block grouping to enhance the write endurance of a PCM with a negligible overhead. Our research results show that with a proper configuration and adoption of partial writes (writing back only 64B subblocks instead of a whole row to the PCM arrays) and internal row shift (shifting the subblocks in a row periodically so no subblock in a row will be written repeatedly), the proposed block grouping scheme could achieve 95% of the ideal lifetime on average for the Rodinia, NPB, and SPEC benchmarks with less than 1.74% performance overhead and up to 0.18% hardware overhead. Moreover, our scheme is scalable and achieves the same percentage of ideal lifetime for PCM of size from 8GB to 2TB. We also show that the proposed scheme can better tolerate memory write attacks than WoLFRAM (Wear Leveling and Fault Tolerance for Resistive Memories) and RBSG for a PCM of size 32GB or higher. Finally, we integrate an error-correcting pointer technique into our proposed block grouping scheme to make the PCM fault tolerant against hard errors.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136069911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs
Shucheng Wang, Qiang Cao, Hong Jiang, Ziyi Lu, Jie Yao, Yuxing Chen, Anqun Pan
Following the conventional design principle of paying more fast CPU cycles for fewer slow I/Os, the popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to process all user requests efficiently, based on multi-stage asynchronous control and global data structures, successfully exploiting the characteristics of slow devices such as Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only limited performance gains because of severe lock contention under intensive write workloads. In this paper, we propose StRAID, a novel stripe-threaded RAID architecture that assigns a dedicated worker thread to each stripe-write (a one-for-one model) to fully exploit the high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes, which induce extra read and write I/Os, StRAID presents a two-stage stripe-write mechanism and a two-dimensional multi-log SSD buffer. All writes are first opportunistically batched in memory, and then written into the primary RAID as aggregated full-stripe writes or conditionally redirected to the buffer as partial-stripe writes. The buffered data are strategically reclaimed to the primary RAID. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to outperform MD by up to 5.8 times in write throughput.
{"title":"Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs","authors":"Shucheng Wang, Qiang Cao, Hong Jiang, Ziyi Lu, Jie Yao, Yuxing Chen, Anqun Pan","doi":"10.1145/3627992","DOIUrl":"https://doi.org/10.1145/3627992","url":null,"abstract":"Following a conventional design principle that pays more fast-CPU-cycles for fewer slow-I/Os, popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to efficiently process all user requests based on multi-stage asynchronous control and global data structures, successfully exploiting characteristics of slow devices, e.g., Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only limited performance gain because of the severe lock contentions under intensive write workloads. In this paper, we propose a novel stripe-threaded RAID architecture, StRAID, assigning a dedicated worker thread for each stripe-write (one-for-one model) to sufficiently exploit high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes that induce extra read and write I/Os, StRAID presents a two-stage stripe write mechanism and a two-dimensional multi-log SSD buffer. All writes first are opportunistically batched in memory, and then are written into the primary RAID for aggregated full-stripe writes or conditionally redirected to the buffer for partial-stripe writes. These buffered data are strategically reclaimed to the primary RAID. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to outperform MD by up to 5.8 times in write throughput.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136078654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding Persistent-memory-related Issues in the Linux Kernel
Om Rameshwar Gatla, Duo Zhang, Wei Xu, Mai Zheng
Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges, as well as the opportunities to address them, this article presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in depth and conducting experiments on reproducibility and tool extension, we derive multiple insights concerning PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results can contribute to the development of robust PM-based storage systems.
{"title":"Understanding Persistent-memory-related Issues in the Linux Kernel","authors":"Om Rameshwar Gatla, Duo Zhang, Wei Xu, Mai Zheng","doi":"10.1145/3605946","DOIUrl":"https://doi.org/10.1145/3605946","url":null,"abstract":"Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges as well as the opportunities to address them, this article presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in depth and conducting experiments on reproducibility and tool extension, we derive multiple insights in terms of PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results could contribute to the development of robust PM-based storage systems.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135696134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering Storage Systems Research with NVMeVirt: A Comprehensive NVMe Device Emulator
Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, Jin-Soo Kim
There have been drastic changes in the storage device landscape recently. At the center of this diverse landscape lies the NVMe interface, which enables the high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new, revolutionary storage devices. Furthermore, existing emulators lack the capability to support the advanced storage configurations that are currently in the spotlight. In this paper, we present NVMeVirt, a novel approach to facilitating software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned-namespace SSDs, and key-value SSDs, with support for PCI peer-to-peer DMA and NVMe-oF target offloading. We also make the case for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.
{"title":"Empowering Storage Systems Research with NVMeVirt: A Comprehensive NVMe Device Emulator","authors":"Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, Jin-Soo Kim","doi":"10.1145/3625006","DOIUrl":"https://doi.org/10.1145/3625006","url":null,"abstract":"There have been drastic changes in the storage device landscape recently. At the center of the diverse storage landscape lies the NVMe interface, which allows high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new revolutionary storage devices. Furthermore, existing emulators lack the capability to support the advanced storage configurations that are currently in the spotlight. In this paper, we present NVMeVirt, a novel approach to facilitate software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt allows it to bridge the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace SSDs, and key-value SSDs with the support of PCI peer-to-peer DMA and NVMe-oF target offloading. We also make cases for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for the improved key-value SSD performance.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136152933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-fast, Byte-addressable NVMs
Miao Cai, Junru Shen, Bin Tang, Hao Huang, Baoliu Ye
Conventional file systems provide a hierarchical namespace by structuring it as a directory tree. This tree-based namespace structure leads to inefficient file path walks and expensive namespace tree traversals, underutilizing the ultra-low access latency and superior sequential performance provided by non-volatile memories (NVMs). This paper proposes FlatFS+, an NVM file system that features a flat namespace architecture while providing a compatible hierarchical namespace view. FlatFS+ incorporates three novel techniques: a direct file path walk model, a range-optimized Br tree, and a compressed index key design with dual scan and write optimization, to fully exploit the flat namespace and improve file system metadata performance on ultra-fast, byte-addressable NVMs. Evaluation results demonstrate that FlatFS+ achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.
{"title":"Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-fast, Byte-addressable NVMs","authors":"Miao Cai, Junru Shen, Bin Tang, Hao Huang, Baoliu Ye","doi":"10.1145/3620673","DOIUrl":"https://doi.org/10.1145/3620673","url":null,"abstract":"The conventional file system provides a hierarchical namespace by structuring it as a directory tree. Tree-based namespace structure leads to inefficient file path walk and expensive namespace tree traversal, underutilizing ultra-low access latency and superior sequential performance provided by non-volatile memories (NVMs). This paper proposes FlatFS+, an NVM file system that features a flat namespace architecture while providing a compatible hierarchical namespace view. FlatFS+ incorporates three novel techniques: direct file path walk model, range-optimized Br tree, and compressed index key design with scan and write dual optimization, to fully exploit flat namespace to improve file system metadata performance on ultra-fast, byte-addressable NVMs. Evaluation results demonstrate that FlatFS+ achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43687749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}