
Latest Publications in ACM Transactions on Storage

gLSM: Using GPGPU to Accelerate Compactions in LSM-tree-based Key-value Stores
IF 1.7 | CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-11-24 | DOI: 10.1145/3633782
Hui Sun, Jinfeng Xu, Xiangxiang Jiang, Guanzhong Chen, Yinliang Yue, Xiao Qin

The log-structured merge tree (LSM-tree) is a technological underpinning of key-value (KV) stores that support a wide range of performance-critical applications. By reorganizing data in the background through compaction operations, KV stores can swiftly service write requests with sequential, batched disk writes and serve read requests over KV items that compaction keeps sorted. Compaction demands high I/O bandwidth and CPU speed to facilitate quality service to user read/write requests. With the emergence of high-speed SSDs, CPUs are increasingly becoming a performance bottleneck. To mitigate the bottleneck limiting the performance of the KV store and of the applications it supports, we propose a system, gLSM, that leverages GPGPUs to remarkably accelerate compaction operations. gLSM fully utilizes the parallelism and computational capability inside GPGPUs to improve compaction performance. We design a driver framework to parallelize compaction operations handled between a pair of CPU and GPGPU. We exploit data independence and a GPGPU-oriented radix-sorting algorithm to conduct compaction concurrently. A key-value separation method is devised to reduce the volume of data transferred from CPU-side memory to its GPGPU counterpart. The results reveal that gLSM improves throughput and compaction bandwidth by up to a factor of 2.9 and 26.0, respectively, compared with four state-of-the-art KV stores. gLSM also reduces write latency by 73.3%. gLSM exhibits a performance improvement of up to 45% compared with a variant that lacks the KV separation and collaboration-sort modules.
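To make the compaction pipeline described above concrete, here is a small, hypothetical Python sketch (CPU-only, not gLSM's CUDA implementation): it separates values from keys so that only fixed-width keys plus value pointers need to be sorted, radix-sorts the keys in the style of a GPGPU-friendly LSD radix sort, and keeps the newest version of each duplicate key, which is the essence of an LSM compaction merge. All names and the 8-byte key width are assumptions for illustration.

```python
# Toy model of a KV-separated, radix-sort-based compaction merge.
# Keys are fixed-width byte strings; values stay in a separate value
# log, and only (key, sequence, value_pointer) tuples get sorted
# (emulated here on the CPU with an LSD radix sort).

KEY_WIDTH = 8  # bytes per key (assumption for the sketch)

def radix_sort_entries(entries):
    """Stable LSD radix sort on the key bytes; entries are (key, seq, vptr)."""
    for byte_pos in reversed(range(KEY_WIDTH)):
        buckets = [[] for _ in range(256)]
        for e in entries:
            buckets[e[0][byte_pos]].append(e)
        entries = [e for bucket in buckets for e in bucket]
    return entries

def compact(runs, value_log):
    """Merge sorted runs; the newest sequence number wins for duplicate keys."""
    flat = [e for run in runs for e in run]          # (key, seq, vptr)
    ordered = radix_sort_entries(flat)
    merged = []
    for key, seq, vptr in ordered:
        if merged and merged[-1][0] == key:
            if seq > merged[-1][1]:                  # keep the newest version
                merged[-1] = (key, seq, vptr)
        else:
            merged.append((key, seq, vptr))
    # Values are fetched through pointers only when materializing the output.
    return [(key, value_log[vptr]) for key, seq, vptr in merged]

if __name__ == "__main__":
    value_log = {0: b"v0", 1: b"v1", 2: b"v2"}
    run_a = [(b"00000001", 1, 0), (b"00000003", 2, 1)]
    run_b = [(b"00000001", 5, 2)]                    # newer version of key 1
    print(compact([run_a, run_b], value_log))
```

In the real system the sort and merge would run as GPU kernels, and key-value separation means only the key and pointer arrays need to cross the PCIe bus, which is the transfer-volume saving the abstract describes.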

Citations: 0
Perseid: A Secondary Indexing Mechanism for LSM-based Storage Systems
IF 1.7 | CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-11-17 | DOI: 10.1145/3633285
Jing Wang, Youyou Lu, Qing Wang, Yuhao Zhang, Jiwu Shu

LSM-based storage systems are widely used for their superior write performance on block devices. However, they currently fail to efficiently support secondary indexing, since a secondary index query operation usually needs to retrieve multiple small values that are scattered across multiple LSM components. In this work, we revisit secondary indexing in LSM-based storage systems with byte-addressable persistent memory (PM). Existing PM-based indexes are not directly suited to efficient secondary indexing. We propose Perseid, an efficient PM-based secondary indexing mechanism for LSM-based storage systems that takes into account the characteristics of both PM and secondary indexing. Perseid consists of (1) a specifically designed secondary index structure that achieves high-performance insertion and query, (2) a lightweight hybrid PM-DRAM and hash-based validation approach that filters out obsolete values with small overhead, and (3) two adapted optimizations on primary table searches issued from secondary indexes to accelerate non-index-only queries. Our evaluation shows that Perseid outperforms existing PM-based indexes by 3-7x and achieves about two orders of magnitude higher performance than state-of-the-art LSM-based secondary indexing techniques, even when those techniques are placed on PM instead of disks.
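A minimal way to see why the hash-based validation in the abstract matters: a secondary index maps a secondary key to many primary keys, and some of those postings become stale when the primary record is updated. The sketch below is a hypothetical Python model (not Perseid's PM data structures; all names are illustrative): the secondary index stores (primary_key, version) postings, and a small validation table records the latest version of each primary key so stale postings can be filtered without touching the primary store.

```python
class ToySecondaryIndex:
    """Illustrative secondary index with version-based validation."""

    def __init__(self):
        self.primary = {}     # primary_key -> (version, record)
        self.secondary = {}   # secondary_key -> list of (primary_key, version)
        self.validation = {}  # primary_key -> latest version (the "hash table")
        self._clock = 0

    def put(self, primary_key, secondary_key, record):
        self._clock += 1
        self.primary[primary_key] = (self._clock, record)
        self.validation[primary_key] = self._clock
        # Old postings for primary_key are NOT eagerly deleted (that is the point).
        self.secondary.setdefault(secondary_key, []).append((primary_key, self._clock))

    def query(self, secondary_key):
        """Return current records whose secondary attribute equals secondary_key."""
        results = []
        for primary_key, version in self.secondary.get(secondary_key, []):
            if self.validation.get(primary_key) == version:   # filter obsolete postings
                results.append(self.primary[primary_key][1])
        return results

if __name__ == "__main__":
    idx = ToySecondaryIndex()
    idx.put("user:1", "city:NYC", {"name": "a"})
    idx.put("user:2", "city:NYC", {"name": "b"})
    idx.put("user:1", "city:SF", {"name": "a-moved"})   # user:1 left NYC
    print(idx.query("city:NYC"))   # only user:2's record survives validation
```

The validation table is what keeps queries from consulting the primary store just to discard stale postings; Perseid's actual design additionally places these structures carefully across PM and DRAM.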

Citations: 0
Introduction to the Special Section on USENIX FAST 2023
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-11-14 | DOI: 10.1145/3612820
Ashvin Goel, Dalit Naor
This special section of ACM Transactions on Storage presents selected papers from the 2023 USENIX Conference on File and Storage Technologies (FAST '23).
Citations: 0
Practical Design Considerations for Wide Locally Recoverable Codes (LRCs)
IF 1.7 | CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-11-14 | DOI: 10.1145/3626198
Saurabh Kadekodi, Shashwat Silas, David Clausen, Arif Merchant

Most of the data in large-scale storage clusters is erasure coded. At exascale, optimizing erasure codes for low storage overhead, efficient reconstruction, and easy deployment is of critical importance. Locally recoverable codes (LRCs) have deservedly gained central importance in this field, because they can balance many of these requirements. In our work, we study wide LRCs: LRCs with a large number of blocks per stripe and low storage overhead. These codes are a natural next step for practitioners to unlock higher storage savings, but they come with their own challenges. Of particular interest is their reliability, since wider stripes are prone to more simultaneous failures.

We conduct a practically minded analysis of several popular and novel LRCs. We find that wide LRC reliability is a subtle phenomenon that is sensitive to several design choices, some of which are overlooked by theoreticians, and others by practitioners. Based on these insights, we construct novel LRCs called Uniform Cauchy LRCs, which show excellent performance in simulations and a 33% improvement in reliability on unavailability events observed by a wide LRC deployed in a Google storage cluster. We also show that these codes are easy to deploy in a manner that improves their robustness to common maintenance events. Along the way, we also give a remarkably simple and novel construction of distance-optimal LRCs (other constructions are also known), which may be of interest to theory-minded readers.
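To make the "locally recoverable" property concrete, the sketch below is a hypothetical Python illustration (not the Uniform Cauchy construction or any code from the paper): a wide stripe is split into local groups, each protected by an XOR local parity, so a single lost block is rebuilt from its small group instead of from the entire stripe. Global parities, which a real LRC adds to tolerate multi-block failures, are omitted.

```python
import os

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def encode_local_parities(data_blocks, group_size):
    """Split the stripe into local groups and attach one XOR parity per group."""
    groups = [data_blocks[i:i + group_size]
              for i in range(0, len(data_blocks), group_size)]
    return [(group, xor_blocks(group)) for group in groups]

def repair_in_group(group, parity, lost_index):
    """Rebuild one lost block from the surviving blocks of its local group only."""
    survivors = [blk for i, blk in enumerate(group) if i != lost_index]
    return xor_blocks(survivors + [parity])

if __name__ == "__main__":
    # A "wide" toy stripe: 12 data blocks, local groups of 4 (a real LRC would
    # also carry a few global parities, which this sketch omits).
    stripe = [os.urandom(16) for _ in range(12)]
    encoded = encode_local_parities(stripe, group_size=4)
    group, parity = encoded[1]
    rebuilt = repair_in_group(group, parity, lost_index=2)
    assert rebuilt == group[2]
    print("local repair read", len(group) - 1, "data blocks + 1 parity,",
          "instead of the whole stripe of", len(stripe), "blocks")
```

The reliability subtleties the paper studies come from how blocks are placed into groups and how the global parities are chosen, which is where the Uniform Cauchy construction differs from this toy layout.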

Citations: 1
From Missteps to Milestones: A Journey to Practical Fail-Slow Detection
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-11-01 | DOI: 10.1145/3617690
Ruiming Lu, Erci Xu, Yiming Zhang, Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, Zongpeng Zhu, Guangtao Xue, Jiwu Shu, Minglu Li, Jiesheng Wu
The newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this article presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of individual drives. During 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root-cause analysis of fail-slow drives covering a variety of ill-implemented scheduling policies, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow research.
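The "regression-based model" mentioned above can be pictured with a hypothetical sketch (not the production Perseus pipeline): fit a curve to the latency-versus-throughput samples of the drives in a node, then score each drive by how often its samples sit above a prediction-based upper bound. The workload numbers and the 3-sigma threshold below are illustrative assumptions.

```python
import numpy as np

def fit_latency_model(throughput, latency, degree=2):
    """Fit latency as a polynomial of throughput across sibling drives."""
    coeffs = np.polyfit(throughput, latency, degree)
    residuals = latency - np.polyval(coeffs, throughput)
    return coeffs, residuals.std()

def fail_slow_score(coeffs, sigma, drive_tp, drive_lat, n_sigma=3.0):
    """Fraction of a drive's samples above the model's upper prediction bound."""
    upper = np.polyval(coeffs, drive_tp) + n_sigma * sigma
    return float(np.mean(drive_lat > upper))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tp = rng.uniform(50, 500, size=2000)               # MB/s, all drives in a node
    lat = 0.002 * tp + rng.normal(0, 0.05, size=2000)  # ms, healthy baseline
    coeffs, sigma = fit_latency_model(tp, lat)

    healthy_tp, healthy_lat = tp[:200], lat[:200]
    slow_tp = rng.uniform(50, 500, size=200)
    slow_lat = 0.002 * slow_tp + 0.8 + rng.normal(0, 0.05, size=200)  # degraded drive

    print("healthy drive score:", fail_slow_score(coeffs, sigma, healthy_tp, healthy_lat))
    print("suspect drive score:", fail_slow_score(coeffs, sigma, slow_tp, slow_lat))
```

A drive whose score stays near 1.0 across many monitoring windows would be a fail-slow suspect, while a healthy drive stays near 0.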
Citations: 0
A Scalable Wear Leveling Technique for Phase Change Memory
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-10-30 | DOI: 10.1145/3631146
Wang Xu, Israel Koren
Phase Change Memory (PCM), one of the recently proposed non-volatile memory technologies, suffers from low write endurance. For example, a single-level PCM cell can only be written approximately 10^8 times. This limits the lifetime of a PCM-based memory to a few days rather than years when memory-intensive applications are running. Wear leveling techniques have been proposed to improve the write endurance of a PCM. Among those techniques, the region-based start-gap (RBSG) scheme is widely cited as achieving the highest lifetime. Based on our experiments, RBSG can achieve 97% of the ideal lifetime, but only for relatively small memory sizes (e.g., 8 GB-32 GB). As the memory size goes up, RBSG becomes less effective, and its expected percentage of the ideal lifetime drops below 57% for a 2 TB PCM. In this paper, we propose a table-based wear leveling scheme called block grouping to enhance the write endurance of a PCM with negligible overhead. Our results show that, with a proper configuration and the adoption of partial writes (writing back only 64 B subblocks instead of a whole row to the PCM arrays) and internal row shift (periodically shifting the subblocks in a row so that no subblock is written repeatedly), the proposed block grouping scheme achieves 95% of the ideal lifetime on average for the Rodinia, NPB, and SPEC benchmarks, with less than 1.74% performance overhead and at most 0.18% hardware overhead. Moreover, our scheme is scalable and achieves the same percentage of the ideal lifetime for PCMs from 8 GB to 2 TB. We also show that the proposed scheme tolerates memory write attacks better than WoLFRAM (Wear Leveling and Fault Tolerance for Resistive Memories) and RBSG for a PCM of size 32 GB or larger. Finally, we integrate an error-correcting pointer technique into our block grouping scheme to make the PCM fault tolerant against hard errors.
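For readers unfamiliar with start-gap-style remapping, which the RBSG baseline above builds on, here is a simplified, self-contained toy model in Python. It is not the RBSG or block-grouping algorithm from the paper: it keeps one spare "gap" line, rotates the gap by one position every GAP_INTERVAL writes, and translates logical to physical line numbers through a start pointer, which gradually spreads writes across all physical lines. The interval of 4 writes is purely illustrative.

```python
class StartGapToy:
    """Toy start-gap-style wear leveling over N logical / N+1 physical lines."""

    GAP_INTERVAL = 4   # move the gap after this many writes (illustrative value)

    def __init__(self, num_lines):
        self.n = num_lines
        self.mem = [None] * (num_lines + 1)   # one spare physical line (the gap)
        self.start = 0
        self.gap = num_lines                  # gap initially at the last physical line
        self.writes = 0

    def _translate(self, logical):
        physical = (logical + self.start) % self.n
        if physical >= self.gap:              # skip over the gap line
            physical += 1
        return physical

    def _move_gap(self):
        if self.gap == 0:
            # Wrap: carry the line stored in the last slot into the old gap,
            # then advance the start pointer by one logical position.
            self.mem[0] = self.mem[self.n]
            self.gap = self.n
            self.start = (self.start + 1) % self.n
        else:
            self.mem[self.gap] = self.mem[self.gap - 1]
            self.gap -= 1

    def write(self, logical, value):
        self.mem[self._translate(logical)] = value
        self.writes += 1
        if self.writes % self.GAP_INTERVAL == 0:
            self._move_gap()

    def read(self, logical):
        return self.mem[self._translate(logical)]

if __name__ == "__main__":
    pcm = StartGapToy(num_lines=8)
    for i in range(8):
        pcm.write(i, f"line-{i}")
    for _ in range(100):                      # extra traffic forces gap rotation
        pcm.write(3, "hot-line")
    assert all(pcm.read(i) == f"line-{i}" for i in range(8) if i != 3)
    assert pcm.read(3) == "hot-line"
    print("data survives remapping; start =", pcm.start, "gap =", pcm.gap)
```

The abstract's observation is that this style of rotation loses effectiveness at terabyte scale, which is what motivates the table-based block-grouping scheme.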
Citations: 0
Explorations and Exploitation for Parity-based RAIDs with Ultra-fast SSDs
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-10-16 | DOI: 10.1145/3627992
Shucheng Wang, Qiang Cao, Hong Jiang, Ziyi Lu, Jie Yao, Yuxing Chen, Anqun Pan
Following the conventional design principle of trading more fast CPU cycles for fewer slow I/Os, the popular software storage architecture Linux Multiple-Disk (MD) for parity-based RAID (e.g., RAID5 and RAID6) assigns one or more centralized worker threads to efficiently process all user requests, relying on multi-stage asynchronous control and global data structures; this successfully exploits the characteristics of slow devices such as Hard Disk Drives (HDDs). However, we observe that, with high-performance NVMe-based Solid State Drives (SSDs), even the recently added multi-worker processing mode in MD achieves only a limited performance gain because of severe lock contention under intensive write workloads. In this paper, we propose a novel stripe-threaded RAID architecture, StRAID, which assigns a dedicated worker thread to each stripe write (a one-for-one model) to fully exploit the high parallelism inherent among RAID stripes, multi-core processors, and SSDs. For the notoriously performance-punishing partial-stripe writes, which induce extra read and write I/Os, StRAID presents a two-stage stripe write mechanism and a two-dimensional multi-log SSD buffer. All writes are first opportunistically batched in memory and then either written to the primary RAID as aggregated full-stripe writes or conditionally redirected to the buffer as partial-stripe writes. The buffered data are strategically reclaimed to the primary RAID. We evaluate a StRAID prototype with a variety of benchmarks and real-world traces. StRAID is demonstrated to outperform MD by up to 5.8 times in write throughput.
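The two-stage write path described above, batching writes in memory, flushing complete stripes with freshly computed parity, and redirecting leftover partial stripes to a log buffer, can be sketched as follows. This is an illustrative Python model under simplifying assumptions (single XOR parity, fixed chunk size, no recovery or reclamation path), not StRAID's implementation; all names are made up.

```python
from functools import reduce

CHUNK = 4          # bytes per chunk (toy size)
DATA_DISKS = 4     # RAID5-style: DATA_DISKS data chunks + 1 parity chunk per stripe

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ToyStripedWriter:
    def __init__(self):
        self.pending = {}      # stripe_id -> {chunk_index: data}
        self.raid = {}         # (stripe_id, disk_index) -> chunk; parity on last disk
        self.log_buffer = []   # partial-stripe writes redirected to an SSD log

    def write(self, stripe_id, chunk_index, data):
        """Stage 1: batch writes in memory per stripe."""
        assert len(data) == CHUNK
        self.pending.setdefault(stripe_id, {})[chunk_index] = data
        if len(self.pending[stripe_id]) == DATA_DISKS:
            self._flush_full_stripe(stripe_id)

    def _flush_full_stripe(self, stripe_id):
        """Stage 2a: a complete stripe gets parity computed and written in one go."""
        chunks = [self.pending[stripe_id][i] for i in range(DATA_DISKS)]
        parity = reduce(xor, chunks)
        for i, chunk in enumerate(chunks):
            self.raid[(stripe_id, i)] = chunk
        self.raid[(stripe_id, DATA_DISKS)] = parity
        del self.pending[stripe_id]

    def flush_partials(self):
        """Stage 2b: leftover partial stripes are redirected to the log buffer."""
        for stripe_id, chunks in self.pending.items():
            self.log_buffer.append((stripe_id, dict(chunks)))
        self.pending.clear()

if __name__ == "__main__":
    w = ToyStripedWriter()
    for i in range(DATA_DISKS):
        w.write(stripe_id=0, chunk_index=i, data=bytes([i]) * CHUNK)   # full stripe
    w.write(stripe_id=1, chunk_index=0, data=b"\x09" * CHUNK)          # partial stripe
    w.flush_partials()
    print("full stripes written:", 1, "| partial writes logged:", len(w.log_buffer))
```

Full-stripe writes avoid the read-modify-write penalty of updating parity in place, which is why batching toward full stripes pays off on parity RAID.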
Citations: 0
Understanding Persistent-memory-related Issues in the Linux Kernel
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-10-03 | DOI: 10.1145/3605946
Om Rameshwar Gatla, Duo Zhang, Wei Xu, Mai Zheng
Persistent memory (PM) technologies have inspired a wide range of PM-based system optimizations. However, building correct PM-based systems is difficult due to the unique characteristics of PM hardware. To better understand the challenges as well as the opportunities to address them, this article presents a comprehensive study of PM-related issues in the Linux kernel. By analyzing 1,553 PM-related kernel patches in depth and conducting experiments on reproducibility and tool extension, we derive multiple insights in terms of PM patch categories, PM bug patterns, consequences, fix strategies, triggering conditions, and remedy solutions. We hope our results could contribute to the development of robust PM-based storage systems.
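One recurring crash-consistency pattern behind PM bugs, publishing a commit record before the data it guards has been persisted, can be illustrated with a toy model. The sketch below is hypothetical Python, not code from the Linux kernel or from the study's patch corpus; it collapses cache-line write-back and fencing into an explicit flush call.

```python
class ToyPM:
    """Toy model of persistent memory with a volatile CPU cache in front of it."""

    def __init__(self):
        self.cache = {}       # dirty cache lines (lost on crash)
        self.media = {}       # what actually survives a power failure

    def store(self, addr, value):
        self.cache[addr] = value          # a plain store only reaches the cache

    def flush(self, addr):
        if addr in self.cache:            # stand-in for clwb/clflushopt + fence
            self.media[addr] = self.cache.pop(addr)

    def crash(self):
        self.cache.clear()                # unflushed lines vanish
        return dict(self.media)

def buggy_update(pm):
    pm.store("data", "new-payload")
    pm.store("commit", 1)                 # BUG: commit may persist without the data
    pm.flush("commit")

def correct_update(pm):
    pm.store("data", "new-payload")
    pm.flush("data")                      # persist the data first...
    pm.store("commit", 1)                 # ...then publish the commit record
    pm.flush("commit")

if __name__ == "__main__":
    pm = ToyPM()
    buggy_update(pm)
    print("after crash (buggy):  ", pm.crash())    # commit=1 but no data
    pm = ToyPM()
    correct_update(pm)
    print("after crash (correct):", pm.crash())
```

After a crash, the buggy variant leaves a commit mark without its data, an ordering mistake that crash-consistency fixes typically address by flushing and fencing the data before publishing the commit record.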
Citations: 0
Empowering Storage Systems Research with NVMeVirt: A Comprehensive NVMe Device Emulator
CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-09-21 | DOI: 10.1145/3625006
Sang-Hoon Kim, Jaehoon Shim, Euidong Lee, Seongyeop Jeong, Ilkueon Kang, Jin-Soo Kim
There have been drastic changes in the storage device landscape recently. At the center of this diverse storage landscape lies the NVMe interface, which allows the high-performance and flexible communication models required by these next-generation device types. However, its hardware-oriented definition and specification are bottlenecking the development and evaluation cycle for new, revolutionary storage devices. Furthermore, existing emulators lack the capability to support the advanced storage configurations that are currently in the spotlight. In this paper, we present NVMeVirt, a novel approach to facilitating software-defined NVMe devices. A user can define any NVMe device type with custom features, and NVMeVirt bridges the gap between the host I/O stack and the virtual NVMe device in software. We demonstrate the advantages and features of NVMeVirt by realizing various storage types and configurations, such as conventional SSDs, low-latency high-bandwidth NVM SSDs, zoned namespace SSDs, and key-value SSDs, with support for PCI peer-to-peer DMA and NVMe-oF target offloading. We also make the case for storage research with NVMeVirt, such as studying the performance characteristics of database engines and extending the NVMe specification for improved key-value SSD performance.
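The essence of a software-defined NVMe device, where the host posts commands to a submission queue and an emulator completes them according to a configurable performance profile, can be sketched roughly as follows. This is an illustrative Python model with made-up device profiles and latency numbers, not NVMeVirt's kernel-module implementation.

```python
import heapq

# Hypothetical per-device-type performance profiles (values are illustrative only).
PROFILES = {
    "conventional_ssd": {"read_us": 80.0, "write_us": 20.0},
    "low_latency_nvm":  {"read_us": 10.0, "write_us": 10.0},
    "zns_ssd":          {"read_us": 70.0, "write_us": 15.0},
}

class ToyNVMeDevice:
    """Toy emulator: submission queue in, completion queue out, timing from a profile."""

    def __init__(self, profile_name):
        self.profile = PROFILES[profile_name]
        self.sq = []            # submission queue: (command_id, opcode, lba)
        self.cq = []            # completion queue: (finish_time_us, command_id)
        self.now_us = 0.0

    def submit(self, command_id, opcode, lba):
        self.sq.append((command_id, opcode, lba))

    def process(self):
        """Drain the submission queue, modeling per-command service latency."""
        while self.sq:
            command_id, opcode, lba = self.sq.pop(0)
            service = self.profile["read_us"] if opcode == "read" else self.profile["write_us"]
            self.now_us += service
            heapq.heappush(self.cq, (self.now_us, command_id))

    def poll_completions(self):
        done = []
        while self.cq:
            finish, command_id = heapq.heappop(self.cq)
            done.append((command_id, finish))
        return done

if __name__ == "__main__":
    dev = ToyNVMeDevice("low_latency_nvm")
    dev.submit(1, "write", lba=0)
    dev.submit(2, "read", lba=0)
    dev.process()
    for command_id, finish in dev.poll_completions():
        print(f"command {command_id} completed at {finish:.1f} us")
```

Swapping the profile name is all it takes to emulate a different device class, which mirrors the configurability the abstract argues for.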
Citations: 0
Exploiting Flat Namespace to Improve File System Metadata Performance on Ultra-fast, Byte-addressable NVMs
IF 1.7 | CAS Tier 3 (Computer Science) | Q3 Computer Science | Pub Date: 2023-09-06 | DOI: 10.1145/3620673
Miao Cai, Junru Shen, Bin Tang, Hao Huang, Baoliu Ye
The conventional file system provides a hierarchical namespace by structuring it as a directory tree. This tree-based namespace structure leads to inefficient file path walks and expensive namespace tree traversals, underutilizing the ultra-low access latency and superior sequential performance provided by non-volatile memories (NVMs). This paper proposes FlatFS+, an NVM file system that features a flat namespace architecture while providing a compatible hierarchical namespace view. FlatFS+ incorporates three novel techniques: a direct file path walk model, a range-optimized Br tree, and a compressed index key design with dual scan and write optimization, to fully exploit the flat namespace and improve file system metadata performance on ultra-fast, byte-addressable NVMs. Evaluation results demonstrate that FlatFS+ achieves significant performance improvements for metadata-intensive benchmarks and real-world applications compared to other file systems.
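The gap between a tree-structured namespace and the flat namespace described above can be shown with a small comparison: a hierarchical lookup resolves one component at a time, while a flat index keyed by the full canonical path resolves a lookup with a single probe. This is an illustrative Python sketch, not FlatFS+'s Br tree design; class and method names are assumptions.

```python
# Hierarchical namespace: one directory lookup per path component.
class TreeNamespace:
    def __init__(self):
        self.children = {0: {}}     # inode -> {name: child inode}; 0 is the root
        self.next_ino = 1

    def create(self, path):
        ino = 0
        for name in path.strip("/").split("/"):
            if name not in self.children[ino]:
                self.children[ino][name] = self.next_ino
                self.children[self.next_ino] = {}
                self.next_ino += 1
            ino = self.children[ino][name]
        return ino

    def lookup(self, path):
        ino, probes = 0, 0
        for name in path.strip("/").split("/"):
            probes += 1                      # one directory lookup per component
            ino = self.children[ino][name]
        return ino, probes

# Flat namespace: the full canonical path is the index key.
class FlatNamespace:
    def __init__(self):
        self.index = {}                      # full path -> inode (stand-in for the Br tree)
        self.next_ino = 1

    def create(self, path):
        self.index[path] = self.next_ino
        self.next_ino += 1
        return self.index[path]

    def lookup(self, path):
        return self.index[path], 1           # a single index probe

if __name__ == "__main__":
    path = "/home/user/projects/flatfs/src/namespace.c"
    tree, flat = TreeNamespace(), FlatNamespace()
    tree.create(path)
    flat.create(path)
    print("tree walk probes:", tree.lookup(path)[1])   # 6
    print("flat walk probes:", flat.lookup(path)[1])   # 1
```

A sorted flat index additionally turns a directory scan into a range scan over a shared path prefix, which is the kind of operation the range-optimized tree in the abstract targets.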
Citations: 0