gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores
Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin
The log-structured merge tree (LSM-tree) is a technological underpinning of key-value (KV) stores that support a wide range of performance-critical applications. By reorganizing data in the background through compaction operations, these KV stores can service write requests swiftly with sequential, batched disk writes, and serve read requests over KV items that compaction keeps sorted. Compaction demands high I/O bandwidth and CPU speed to sustain quality of service for user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming the performance bottleneck. To mitigate this bottleneck, which limits the performance of the KV store and of the applications it supports, we propose gLSM, a system that leverages GPGPUs to markedly accelerate compaction operations. gLSM fully utilizes the parallelism and computational capability inside GPGPUs to improve compaction performance. We design a driver framework to parallelize compaction operations handled between a pair of CPU and GPGPU. We exploit data independence and a GPGPU-oriented radix-sorting algorithm to conduct compactions concurrently. A key-value separation method is devised to reduce the volume of data transferred from CPU-side memory to its GPGPU counterpart. The results reveal that gLSM improves throughput and compaction bandwidth by up to factors of 2.9 and 26.0, respectively, compared with four state-of-the-art KV stores. gLSM also reduces write latency by 73.3%, and it exhibits a performance improvement of up to 45% over a variant that lacks the KV-separation and collaboration-sort modules.
{"title":"gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores","authors":"Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin","doi":"10.1145/3633782","DOIUrl":"https://doi.org/10.1145/3633782","url":null,"abstract":"<p>Log-structured-merge tree or LSM-tree is a technological underpinning in key-value (KV) stores to support a wide range of performance-critical applications. By conducting data re-organization in the background by virtue of compaction operations, the KV stores have the potential to swiftly service write requests with sequential batched disk writes and read requests for KV items constantly sorted by the compaction. Compaction demands high I/O bandwidth and CPU speed to facilitate quality service to user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming a performance bottleneck. To mitigate the bottleneck limiting the KV-store’s performance and that of the applications supported by the store, we propose a system - <i>gLSM</i> - to leverage GPGPU to remarkably accelerate the compaction operations. gLSM fully utilizes the parallelism and computational capability inside GPGPUs to improve the compaction performance. We design a driver framework to parallelize compaction operations handled between a pair of CPU and GPGPU. We employ data independence and GPGPU-orient radix-sorting algorithm to concurrently conduct compaction. A key-value separation method is devised to slash the transfer of data volume from CPU-side memory to the GPGPU counterpart. The results reveal that gLSM improves the throughput and compaction bandwidth by up to a factor of 2.9 and 26.0, respectively, compared with the four state-of-the-art KV stores. gLSM also reduces the write latency by 73.3%. gLSM exhibits a performance improvement by up to 45% compared against its variant where there are no KV separation and collaboration sort modules.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems
Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu
LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to support secondary indexing efficiently, since a secondary index query usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly suited to efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems that accounts for the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM, hash-based validation approach that filters out obsolete values with little overhead, and (3) two adapted optimizations for primary-table searches issued from secondary indexes that accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3-7x and achieves roughly two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when those are run on PM instead of disks.
{"title":"Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems","authors":"Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu","doi":"10.1145/3633285","DOIUrl":"https://doi.org/10.1145/3633285","url":null,"abstract":"<p>LSM-based storage systems are widely used for superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values, which scatter in multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly competent for efficient secondary indexing. We propose <span>Perseid</span>, an efficient PM-based secondary indexing mechanism for LSM-based storage systems, which takes into account both characteristics of PM and secondary indexing. <span>Perseid</span> consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach to filter out obsolete values with subtle overhead, and (3) two adapted optimizations on primary table searching issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that <span>Perseid</span> outperforms existing PM-based indexes by 3-7 × and achieves about two orders of magnitude performance of state-of-the-art LSM-based secondary indexing techniques even if on PM instead of disks.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the Special Section on USENIX FAST 2023
Ashvin Goel, Dalit Naor
{"title":"Introduction to the Special Section on USENIX FAST 2023","authors":"Ashvin Goel, Dalit Naor","doi":"10.1145/3612820","DOIUrl":"https://doi.org/10.1145/3612820","url":null,"abstract":"This special section of the <italic>IEEE Transactions on Visualization and Computer Graphics (IEEE TVCG)</italic> presents the five most highly rated papers from the 2023 IEEE Pacific Visualization Symposium (IEEE PacificVis), hosted in Seoul, Korea from ...","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134954270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant
Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field because they can balance many of these requirements. In our work, we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners seeking to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.
We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
{"title":"Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)","authors":"Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant","doi":"10.1145/3626198","DOIUrl":"https://doi.org/10.1145/3626198","url":null,"abstract":"<p>Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. <i>Locally recoverable codes (LRCs)</i> have deservedly gained central importance in this field, because they can balance many of these requirements. In our work, we study wide LRCs; LRCs with large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their <i>reliability</i>, since wider stripes are prone to more simultaneous failures.</p><p>We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called <i>Uniform Cauchy LRCs</i>, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138512863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Missteps to Milestones: A Journey to Practical Fail-Slow Detection
Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu
Newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. During a 10-month close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root-cause analysis of fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow studies.
{"title":"From Missteps to Milestones: A Journey to Practical Fail-Slow Detection","authors":"Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu","doi":"10.1145/3617690","DOIUrl":"https://doi.org/10.1145/3617690","url":null,"abstract":"The newly emerging “fail-slow” failures plague both software and hardware where the victim components are still functioning yet with degraded performance. To address this problem, this article presents Perseus , a practical fail-slow detection framework for storage devices. Perseus leverages a light regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. Within a 10-month close monitoring on 248K drives, Perseus managed to find 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis on fail-slow drives covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134957500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable Wear Leveling Technique for Phase Change Memory
Wang Xu, Israel Koren
Phase Change Memory (PCM), one of the recently proposed non-volatile memory technologies, suffers from low write endurance. For example, a single-layer PCM cell can only be written approximately 10^8 times. This limits the lifetime of a PCM-based memory to a few days, rather than years, when memory-intensive applications are running. Wear-leveling techniques have been proposed to improve the write endurance of a PCM. Among those techniques, the region-based start-gap (RBSG) scheme is widely cited as achieving the highest lifetime. Based on our experiments, RBSG can achieve 97% of the ideal lifetime, but only for relatively small memory sizes (e.g., 8GB-32GB). As the memory size grows, RBSG becomes less effective, and its expected percentage of the ideal lifetime drops below 57% for a 2TB PCM. In this paper, we propose a table-based wear-leveling scheme called block grouping to enhance the write endurance of a PCM with negligible overhead. Our results show that, with a proper configuration and the adoption of partial writes (writing back only 64B subblocks instead of a whole row to the PCM arrays) and internal row shift (shifting the subblocks in a row periodically so that no subblock in a row is written repeatedly), the proposed block-grouping scheme achieves 95% of the ideal lifetime on average for the Rodinia, NPB, and SPEC benchmarks, with less than 1.74% performance overhead and up to 0.18% hardware overhead. Moreover, our scheme is scalable and achieves the same percentage of the ideal lifetime for PCMs ranging in size from 8GB to 2TB. We also show that the proposed scheme tolerates memory write attacks better than WoLFRAM (Wear Leveling and Fault Tolerance for Resistive Memories) and RBSG for a PCM of size 32GB or larger. Finally, we integrate an error-correcting pointer technique into our block-grouping scheme to make the PCM tolerant of hard errors.
{"title":"A Scalable Wear Leveling Technique for Phase Change Memory","authors":"Wang Xu, Israel Koren","doi":"10.1145/3631146","DOIUrl":"https://doi.org/10.1145/3631146","url":null,"abstract":"Phase Change Memory (PCM), one of the recently proposed non-volatile memory technologies, has been suffering from low write endurance. For example, a single layer PCM cell could only be written approximately 10 8 times. This limits the lifetime of a PCM-based memory to a few days rather than years when memory intensive applications are running. Wear leveling techniques have been proposed to improve the write endurance of a PCM. Among those techniques, the region based start-gap (RBSG) scheme is widely cited as achieving the highest lifetime. Based on our experiments, RBSG can achieve 97% of the ideal lifetime but only for relatively small memory sizes (e.g. 8GB-32GB). As the memory size goes up, RBSG becomes less effective and its expected percentage of the ideal lifetime reduces to less than 57% for a 2TB PCM. In this paper, we propose a table-based wear leveling scheme called block grouping to enhance the write endurance of a PCM with a negligible overhead. Our research results show that with a proper configuration and adoption of partial writes (writing back only 64B subblocks instead of a whole row to the PCM arrays) and internal row shift (shifting the subblocks in a row periodically so no subblock in a row will be written repeatedly), the proposed block grouping scheme could achieve 95% of the ideal lifetime on average for the Rodinia, NPB, and SPEC benchmarks with less than 1.74% performance overhead and up to 0.18% hardware overhead. Moreover, our scheme is scalable and achieves the same percentage of ideal lifetime for PCM of size from 8GB to 2TB. We also show that the proposed scheme can better tolerate memory write attacks than WoLFRAM (Wear Leveling and Fault Tolerance for Resistive Memories) and RBSG for a PCM of size 32GB or higher. Finally, we integrate an error-correcting pointer technique into our proposed block grouping scheme to make the PCM fault tolerant against hard errors.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136069911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs
Shucheng Wang, Qiang Cao, Hong Jiang, Ziyi Lu, Jie Yao, Yuxing Chen, Anqun Pan
Following the conventional design principle of paying more fast CPU cycles for fewer slow I/Os, the popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to process all user requests efficiently, based on multi-stage asynchronous control and global data structures, successfully exploiting the characteristics of slow devices such as Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only limited performance gains because of severe lock contention under intensive write workloads. In this paper, we propose StRAID, a novel stripe-threaded RAID architecture that assigns a dedicated worker thread to each stripe-write (a one-for-one model) to fully exploit the high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes, which induce extra read and write I/Os, StRAID presents a two-stage stripe-write mechanism and a two-dimensional multi-log SSD buffer. All writes are first opportunistically batched in memory, and then written into the primary RAID as aggregated full-stripe writes or conditionally redirected to the buffer as partial-stripe writes. The buffered data are strategically reclaimed to the primary RAID. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to outperform MD by up to 5.8 times in write throughput.
{"title":"Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs","authors":"Shucheng Wang, Qiang Cao, Hong Jiang, Ziyi Lu, Jie Yao, Yuxing Chen, Anqun Pan","doi":"10.1145/3627992","DOIUrl":"https://doi.org/10.1145/3627992","url":null,"abstract":"Following a conventional design principle that pays more fast-CPU-cycles for fewer slow-I/Os, popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to efficiently process all user requests based on multi-stage asynchronous control and global data structures, successfully exploiting characteristics of slow devices, e.g., Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only limited performance gain because of the severe lock contentions under intensive write workloads. In this paper, we propose a novel stripe-threaded RAID architecture, StRAID, assigning a dedicated worker thread for each stripe-write (one-for-one model) to sufficiently exploit high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes that induce extra read and write I/Os, StRAID presents a two-stage stripe write mechanism and a two-dimensional multi-log SSD buffer. All writes first are opportunistically batched in memory, and then are written into the primary RAID for aggregated full-stripe writes or conditionally redirected to the buffer for partial-stripe writes. These buffered data are strategically reclaimed to the primary RAID. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to outperform MD by up to 5.8 times in write throughput.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136078654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Understanding Persistent-memory-related Issues in the Linux Kernel
Om Rameshwar Gatla, Duo Zhang, Wei Xu, Mai Zheng
Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges, as well as the opportunities to address them, this article presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in depth and conducting experiments on reproducibility and tool extension, we derive multiple insights concerning PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results can contribute to the development of robust PM-based storage systems.
{"title":"Understanding Persistent-memory-related Issues in the Linux Kernel","authors":"Om Rameshwar Gatla, Duo Zhang, Wei Xu, Mai Zheng","doi":"10.1145/3605946","DOIUrl":"https://doi.org/10.1145/3605946","url":null,"abstract":"Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges as well as the opportunities to address them, this article presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in depth and conducting experiments on reproducibility and tool extension, we derive multiple insights in terms of PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results could contribute to the development of robust PM-based storage systems.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135696134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empowering Storage Systems Research with NVMeVirt: A Comprehensive NVMe Device Emulator
Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, Jin-Soo Kim
There have been drastic changes in the storage device landscape recently. At the center of this diverse landscape lies the NVMe interface, which enables the high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new, revolutionary storage devices. Furthermore, existing emulators lack the capability to support the advanced storage configurations that are currently in the spotlight. In this paper, we present NVMeVirt, a novel approach to facilitating software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned-namespace SSDs, and key-value SSDs, with support for PCI peer-to-peer DMA and NVMe-oF target offloading. We also make the case for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.
{"title":"Empowering Storage Systems Research with NVMeVirt: A Comprehensive NVMe Device Emulator","authors":"Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, Jin-Soo Kim","doi":"10.1145/3625006","DOIUrl":"https://doi.org/10.1145/3625006","url":null,"abstract":"There have been drastic changes in the storage device landscape recently. At the center of the diverse storage landscape lies the NVMe interface, which allows high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new revolutionary storage devices. Furthermore, existing emulators lack the capability to support the advanced storage configurations that are currently in the spotlight. In this paper, we present NVMeVirt, a novel approach to facilitate software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt allows it to bridge the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace SSDs, and key-value SSDs with the support of PCI peer-to-peer DMA and NVMe-oF target offloading. We also make cases for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for the improved key-value SSD performance.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"136152933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-fast, Byte-addressable NVMs
Miao Cai, Junru Shen, Bin Tang, Hao Huang, Baoliu Ye
Conventional file systems provide a hierarchical namespace by structuring it as a directory tree. This tree-based namespace structure leads to inefficient file path walks and expensive namespace tree traversals, underutilizing the ultra-low access latency and superior sequential performance provided by non-volatile memories (NVMs). This paper proposes FlatFS+, an NVM file system that features a flat namespace architecture while providing a compatible hierarchical namespace view. FlatFS+ incorporates three novel techniques: a direct file path walk model, a range-optimized Br tree, and a compressed index key design with dual scan and write optimization, to fully exploit the flat namespace and improve file system metadata performance on ultra-fast, byte-addressable NVMs. Evaluation results demonstrate that FlatFS+ achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.
{"title":"Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-fast, Byte-addressable NVMs","authors":"Miao Cai, Junru Shen, Bin Tang, Hao Huang, Baoliu Ye","doi":"10.1145/3620673","DOIUrl":"https://doi.org/10.1145/3620673","url":null,"abstract":"The conventional file system provides a hierarchical namespace by structuring it as a directory tree. Tree-based namespace structure leads to inefficient file path walk and expensive namespace tree traversal, underutilizing ultra-low access latency and superior sequential performance provided by non-volatile memories (NVMs). This paper proposes FlatFS+, an NVM file system that features a flat namespace architecture while providing a compatible hierarchical namespace view. FlatFS+ incorporates three novel techniques: direct file path walk model, range-optimized Br tree, and compressed index key design with scan and write dual optimization, to fully exploit flat namespace to improve file system metadata performance on ultra-fast, byte-addressable NVMs. Evaluation results demonstrate that FlatFS+ achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2023-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43687749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}