Tiered-ReRAM: A Low Latency and Energy Efficient TLC Crossbar ReRAM Architecture (MSST 2019, doi:10.1109/MSST.2019.00-13)
Yang Zhang, D. Feng, Wei Tong, Jingning Liu, Chengning Wang, Jie Xu
Resistive Memory (ReRAM) is a promising candidate for high-density storage-class memory when Triple-Level Cell (TLC) and crossbar structures are employed. However, TLC crossbar ReRAM suffers from high write latency and energy due to the IR drop issue and the iterative program-and-verify procedure. In this paper, we propose the Tiered-ReRAM architecture to overcome these challenges. Tiered-ReRAM consists of three components: the Tiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), and the Compression-based Flip Scheme (CFS). Specifically, based on the observation that the magnitude of IR drops is primarily determined by the length of bitlines in Double-Sided Ground Biasing (DSGB) crossbar arrays, the Tiered-crossbar design splits each long bitline into a near and a far segment with an isolation transistor, allowing the near segment to be accessed with decreased latency and energy. Moreover, in the near segments, CIDM dynamically selects the most appropriate IDM for each cache line according to the space saved by compression, which further reduces write latency and energy with insignificant space overhead. In addition, in the far segments, CFS dynamically selects the most appropriate flip scheme for each cache line, which ensures that more high-resistance cells are written into the crossbar arrays and effectively reduces leakage energy. For each compressed cache line, the selected IDM or flip scheme is applied only on the condition that the total encoded data size never exceeds the original cache line size. The experimental results show that, on average, Tiered-ReRAM improves system performance by 30.5%, reduces write latency by 35.2%, decreases read latency by 26.1%, and reduces energy consumption by 35.6%, compared to an aggressive baseline.
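A minimal sketch of the CIDM selection rule described above, assuming a 64-byte cache line and hypothetical IDM options (the names and expansion ratios below are illustrative stand-ins, not the paper's actual encodings); only the constraint that encoded data never exceed the original line size comes from the abstract:

```python
CACHE_LINE_BYTES = 64

# Hypothetical IDM options: (name, expansion_ratio). A stronger IDM avoids more
# slow/high-current TLC levels but inflates the encoded data more.
IDM_OPTIONS = [
    ("IDM-strong", 1.50),
    ("IDM-medium", 1.20),
    ("IDM-weak",   1.05),
]

def select_idm(compressed_size: int):
    """Return the most aggressive IDM whose encoded size still fits the line."""
    for name, ratio in IDM_OPTIONS:           # ordered strongest first
        encoded = int(compressed_size * ratio + 0.5)
        if encoded <= CACHE_LINE_BYTES:       # never exceed original line size
            return name, encoded
    return None, compressed_size              # incompressible: plain write

print(select_idm(40))   # -> ('IDM-strong', 60): enough slack for the strong IDM
print(select_idm(60))   # -> ('IDM-weak', 63):  only the weak IDM fits
print(select_idm(64))   # -> (None, 64):        no IDM applied
```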
{"title":"Tiered-ReRAM: A Low Latency and Energy Efficient TLC Crossbar ReRAM Architecture","authors":"Yang Zhang, D. Feng, Wei Tong, Jingning Liu, Chengning Wang, Jie Xu","doi":"10.1109/MSST.2019.00-13","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-13","url":null,"abstract":"Resistive Memory (ReRAM) is promising to be used as high density storage-class memory by employing Triple-Level Cell (TLC) and crossbar structures. However, TLC crossbar ReRAM suffers from high write latency and energy due to the IR drop issue and the iterative program-and-verify procedure. In this paper, we propose Tiered-ReRAM architecture to overcome the challenges of TLC crossbar ReRAM. The proposed Tiered-ReRAM consists of three components, namely Tiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), and Compression-based Flip Scheme (CFS). Specifically, based on the observation that the magnitude of IR drops is primarily determined by the long length of bitlines in Double-Sided Ground Biasing (DSGB) crossbar arrays, Tiered-crossbar design splits each long bitline into the near and far segments by an isolation transistor, allowing the near segment to be accessed with decreased latency and energy. Moreover, in the near segments, CIDM dynamically selects the most appropriate IDM for each cache line according to the saved space by compression, which further reduces the write latency and energy with insignificant space overhead. In addition, in the far segments, CFS dynamically selects the most appropriate flip scheme for each cache line, which ensures more high resistance cells written into crossbar arrays and effectively reduces the leakage energy. For each compressed cache line, the selected IDM or flip scheme is applied on the condition that the total encoded data size will never exceed the original cache line size. The experimental results show that, on average, Tiered-ReRAM can improve the system performance by 30.5%, reduce the write latency by 35.2%, decrease the read latency by 26.1%, and reduce the energy consumption by 35.6%, compared to an aggressive baseline.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128375347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication (MSST 2019, doi:10.1109/MSST.2019.00010)
Guangping Xu, Bo Tang, Hongli Lu, Quan Yu, C. Sung
In this paper, we present a learning-based data deduplication algorithm, called LIPA, which uses a reinforcement learning framework to build an adaptive indexing structure. It differs substantially from previous inline chunk-based deduplication methods in how it solves the chunk-lookup disk bottleneck problem for large-scale backup. In previous methods, a full chunk index or a sampled chunk index is often required to identify duplicate chunks, which is a critical stage of data deduplication. A full chunk index is hard to fit in RAM, while the deduplication ratio achieved with a sampled chunk index depends directly on the sampling ratio. Our learning-based method requires only a small memory overhead to store the index, yet achieves the same or even a better deduplication ratio than previous methods. In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of each segment. An incoming segment may share the same feature with previous segments, so we use a key-value structure to record the relationship between features and segments: a feature maps to a fixed number of segments. We use reinforcement learning to train scores that represent the similarity of these segments to a feature. For an incoming segment, our method adaptively prefetches a segment and its successors into the cache using a multi-armed bandit model. Our experimental results show that our method significantly reduces memory overhead and achieves effective deduplication.
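A minimal sketch of the index and bandit logic described above, with assumed parameters (the values of K, EPSILON, and ALPHA and the exact score-update form are illustrative, not taken from the paper): each feature maps to a bounded set of segments whose scores are reinforced by observed prefetch rewards.

```python
import random
from collections import defaultdict

K, EPSILON, ALPHA = 4, 0.1, 0.5      # per-feature fan-out, exploration, learning rate
index = defaultdict(dict)            # feature fingerprint -> {segment_id: score}

def choose_segment(feature):
    """Epsilon-greedy bandit: pick one stored segment to prefetch for this feature."""
    arms = index[feature]
    if not arms:
        return None                               # new feature, nothing to prefetch
    if random.random() < EPSILON:
        return random.choice(list(arms))          # explore
    return max(arms, key=arms.get)                # exploit the best-scoring segment

def reinforce(feature, segment_id, reward):
    """Move the chosen segment's score toward the observed reward
    (e.g., the number of duplicate chunks the prefetch resolved)."""
    arms = index[feature]
    if segment_id not in arms and len(arms) >= K:
        del arms[min(arms, key=arms.get)]         # evict the weakest mapping
    old = arms.get(segment_id, 0.0)
    arms[segment_id] = old + ALPHA * (reward - old)

# Usage: prefetch the chosen segment (and its successors) into the fingerprint
# cache, then reinforce with the duplicate hits observed.
reinforce("feat-a", "seg-1", 10)
reinforce("feat-a", "seg-2", 3)
print(choose_segment("feat-a"))   # usually 'seg-1'
```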
{"title":"LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication","authors":"Guangping Xu, Bo Tang, Hongli Lu, Quan Yu, C. Sung","doi":"10.1109/MSST.2019.00010","DOIUrl":"https://doi.org/10.1109/MSST.2019.00010","url":null,"abstract":"In this paper, we present a learning based data deduplication algorithm, called LIPA, which uses the reinforcement learning framework to build an adaptive indexing structure. It is rather different from previous inline chunk-based deduplication methods to solve the chunk-lookup disk bottleneck problem for large-scale backup. In previous methods, a full chunk index or a sampled chunk index often is often required to identify duplicate chunks, which is a critical stage for data deduplication. The full chunk index is hard to fit in RAM and the sampled chunk index directly affects the deduplication ratio dependent on the sampling ratio. Our learning based method only requires little memory overheads to store the index but achieves the same or even better deduplication ratio than previous methods. In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of a segment. An incoming segment may share the same feature with previous segments. Thus we use a key-value structure to record the relationship between features and segments: a feature maps to a fixed number of segments. We train the similarities of these segments to a feature represented as scores by the reinforcement learning method. For an incoming segment, our method adaptively prefetches a segment and the successive ones into cache by using multi-armed bandits model. Our experimental results show that our method significantly reduces memory overheads and achieves effective deduplication.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121431478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CDAC: Content-Driven Deduplication-Aware Storage Cache (MSST 2019, doi:10.1109/MSST.2019.00008)
Yujuan Tan, Wen Xia, Jing Xie, Congcong Xu, Zhichao Yan, Hong Jiang, Yajun Zhao, Min Fu, Xianzhang Chen, Duo Liu
Data deduplication, a proven technology for effective data reduction in backup and archive storage systems, also shows promise for increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of existing deduplication-aware caching algorithms reveals that while they do improve hit ratios compared to caching algorithms without deduplication, especially when the cache block size is set to 4KB, their hit ratios are significantly reduced when the block size is larger than 4KB, a clear trend in modern storage systems. A slight increase in hit ratios due to deduplication may not be able to improve overall storage performance because of the high overhead created by deduplication. To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting blocks' content redundancy and the intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on the LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and D-ARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X, under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4KB to 64KB.
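A toy sketch of the content-driven idea layered on LRU, under stated assumptions (the eviction window and the use of distinct source-address counts as the sharing-intensity signal are illustrative choices, not the paper's exact CDAC-LRU): recency selects a candidate set, and sharing intensity picks the victim.

```python
from collections import OrderedDict, defaultdict

class CDACLRUSketch:
    """Toy illustration in the spirit of CDAC-LRU (not the paper's code):
    plain LRU recency is tempered by how many distinct source addresses
    share each cached block's content."""

    def __init__(self, capacity, window=4):
        self.capacity, self.window = capacity, window
        self.blocks = OrderedDict()          # content fingerprint -> data (one copy)
        self.sharers = defaultdict(set)      # fingerprint -> sharing source addresses

    def access(self, addr, fingerprint, data):
        self.sharers[fingerprint].add(addr)  # track content-sharing intensity
        if fingerprint in self.blocks:
            self.blocks.move_to_end(fingerprint)   # dedup hit: refresh recency
            return True
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[fingerprint] = data
        return False

    def _evict(self):
        # Among the `window` least-recently-used blocks, drop the least shared.
        candidates = list(self.blocks)[: self.window]
        victim = min(candidates, key=lambda fp: len(self.sharers[fp]))
        del self.blocks[victim]
        del self.sharers[victim]
```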
{"title":"CDAC: Content-Driven Deduplication-Aware Storage Cache","authors":"Yujuan Tan, Wen Xia, Jing Xie, Congcong Xu, Zhichao Yan, Hong Jiang, Yajun Zhao, Min Fu, Xianzhang Chen, Duo Liu","doi":"10.1109/MSST.2019.00008","DOIUrl":"https://doi.org/10.1109/MSST.2019.00008","url":null,"abstract":"Data deduplication, as a proven technology for effective data reduction in backup and archive storage systems, also demonstrates the promise in increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of the existing deduplication-aware caching algorithms reveals that they do improve the hit ratios compared to the caching algorithms without deduplication, especially when the cache block size is set to 4KB. But when the block size is larger than 4KB, a clear trend for modern storage systems, their hit ratios are significantly reduced. A slight increase in hit ratios due to deduplicationmay not be able to improve the overall storage performance because of the high overhead created by deduplication. To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting the blocks' content redundancy and their intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDACLRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and DARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4KB to 64 KB.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116656072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When NVMe over Fabrics Meets Arm: Performance and Implications (MSST 2019, doi:10.1109/MSST.2019.000-9)
Yichen Jia, E. Anger, Feng Chen
A growing technology trend in the industry is to deploy highly capable and power-efficient storage servers based on the Arm architecture. An important driving force behind this is storage disaggregation, which separates compute and storage onto different servers, enabling independent resource allocation and optimized hardware utilization. The recently released remote storage protocol specification, NVMe-over-Fabrics (NVMeoF), makes flash disaggregation possible by reducing the remote access overhead to a minimum. It is therefore highly appealing to integrate these two promising technologies to build an efficient Arm-based storage server with NVMeoF. In this work, we have conducted a set of comprehensive experiments to understand the performance behaviors of NVMeoF on an Arm-based data center SoC and to gain insight into the implications for their design and deployment in data centers. Our experiments show that NVMeoF delivers the promised ultra-low latency. With appropriate optimizations on both hardware and software, NVMeoF can achieve even better performance than direct-attached storage. Specifically, with appropriate NIC optimizations, we have observed a throughput increase of up to 42.5% and a decrease in the 95th-percentile tail latency of up to 14.6%. Based on our measurement results, we also discuss several system implications of integrating NVMeoF on Arm-based platforms. Our studies show that this system solution balances the computation, network, and storage resources for data-center storage services well. Our findings have also been reported to Arm and Broadcom for future optimizations.
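The study's headline metrics are throughput and 95th-percentile tail latency; the snippet below shows one common way (nearest-rank) to compute the p95 tail from collected latency samples. The numbers are illustrative only, not measurements from the paper.

```python
import statistics

def tail_latency(samples_us, pct=95):
    """Nearest-rank percentile: the tail-latency metric the study reports."""
    ordered = sorted(samples_us)
    rank = max(0, int(len(ordered) * pct / 100 + 0.5) - 1)
    return ordered[rank]

# Illustrative microsecond latencies comparing two hypothetical NIC settings.
baseline  = [82, 85, 90, 88, 84, 310, 86, 83, 87, 95]
optimized = [80, 81, 84, 83, 82, 160, 85, 80, 83, 88]
for name, s in (("baseline", baseline), ("NIC-optimized", optimized)):
    print(name, "median:", statistics.median(s), "p95:", tail_latency(s))
```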
{"title":"When NVMe over Fabrics Meets Arm: Performance and Implications","authors":"Yichen Jia, E. Anger, Feng Chen","doi":"10.1109/MSST.2019.000-9","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-9","url":null,"abstract":"A growing technology trend in the industry is to deploy highly capable and power-efficient storage servers based on the Arm architecture. An important driving force behind this is storage disaggregation, which separates compute and storage to different servers, enabling independent resource allocation and optimized hardware utilization. The recently released remote storage protocol specification, NVMe-over-Fabrics (NVMeoF), makes flash disaggregation possible by reducing the remote access overhead to the minimum. It is highly appealing to integrate the two promising technologies together to build an efficient Arm based storage server with NVMeoF. In this work, we have conducted a set of comprehensive experiments to understand the performance behaviors of NVMeoF on Arm-based Data Center SoC and to gain insight into the implications of their design and deployment in data centers. Our experiments show that NVMeoF delivers the promised ultra-low latency. With appropriate optimizations on both hardware and software, NVMeoF can achieve even better performance than direct attached storage. Specifically, with appropriate NIC optimizations, we have observed a throughput increase by up to 42.5% and a decrease of the 95th percentile tail latency by up to 14.6%. Based on our measurement results, we also discuss several system implications for integrating NVMeoF on Arm based platforms. Our studies show that this system solution can well balance the computation, network, and storage resources for data-center storage services. Our findings have also been reported to Arm and Broadcom for future optimizations.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128710413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
vNVML: An Efficient User Space Library for Virtualizing and Sharing Non-Volatile Memories (MSST 2019, doi:10.1109/MSST.2019.00-12)
C. Chou, Jaemin Jung, A. Reddy, Paul V. Gratz, Doug Voigt
Emerging non-volatile memory (NVM) has attractive characteristics, such as DRAM-like low latency combined with the non-volatility of storage devices. Recently, byte-addressable, memory-bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, such as an SSD, to create the impression of a larger and faster byte-addressable NVM that can be shared across many applications. In this paper, we propose vNVML, a user-space library for virtualizing and sharing NVM. vNVML provides applications with transaction-like memory semantics that ensure write ordering and persistence guarantees across system failures. vNVML exploits DRAM for read caching, to improve performance and potentially reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both within a single OS and when Docker-like containers are employed. The results of the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to applications, allowing them to safely share the virtual NVM.
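A minimal redo-log sketch of the transaction-like semantics described above, using a file plus fsync as a stand-in for NVM flush/fence primitives; the class and method names are hypothetical, not vNVML's actual API. The key ordering point is that the log is made durable before any in-place update, so a crash can never expose partial state.

```python
import os
import pickle

class TxNVMSketch:
    """Hypothetical sketch of transaction-like NVM semantics (not vNVML's API):
    stage writes, make the redo log durable, then apply in place."""

    def __init__(self, log_path, store_path):
        self.log = open(log_path, "ab")
        self.store_path = store_path
        self.read_cache = {}                 # DRAM cache absorbs reads
        self.pending = {}

    def tx_write(self, key, value):
        self.pending[key] = value            # staged, not yet durable

    def tx_commit(self):
        rec = pickle.dumps(self.pending)
        self.log.write(len(rec).to_bytes(8, "little") + rec)
        self.log.flush()
        os.fsync(self.log.fileno())          # ordering point: log durable first
        self._apply(self.pending)            # now safe to update in place
        self.read_cache.update(self.pending) # serve future reads from DRAM
        self.pending = {}

    def _apply(self, updates):
        with open(self.store_path, "ab") as f:   # stand-in for the NVM region
            f.write(pickle.dumps(updates))
            f.flush()
            os.fsync(f.fileno())
```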
{"title":"vNVML: An Efficient User Space Library for Virtualizing and Sharing Non-Volatile Memories","authors":"C. Chou, Jaemin Jung, A. Reddy, Paul V. Gratz, Doug Voigt","doi":"10.1109/MSST.2019.00-12","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-12","url":null,"abstract":"The emerging non-volatile memory (NVM) has attractive characteristics such as DRAM-like, low-latency together with the non-volatility of storage devices. Recently, byte-addressable, memory bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, like SSD, to create the impression of a larger and faster byte-addressable NVM which can be shared across many applications. In this paper, we propose vNVML, a user space library for virtualizing and sharing NVM. vNVML provides for applications transaction like memory semantics that ensures write ordering and persistency guarantees across system failures. vNVML exploits DRAM for read caching, to enable improvements in performance and potentially to reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both in a single O/S and when docker like containers are employed. The results from the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to the applications, allowing applications to safely share the virtual NVM.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116841308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems (MSST 2019, doi:10.1109/MSST.2019.00004)
Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao
As data in modern cloud storage systems grows dramatically, it is common practice to partition data and store it across different Availability Zones (AZs). Multiple AZs not only provide high fault tolerance (e.g., rack-level tolerance or disaster tolerance) but also reduce network latency. Replication and Erasure Codes (EC) are typical data redundancy methods that provide high reliability for storage systems. Compared with the replication approach, erasure codes achieve much lower monetary cost with the same fault-tolerance capability. However, the recovery cost of EC is extremely high in multi-AZ environments, particularly because of its high bandwidth consumption across data centers. LRC is a widely used EC that reduces recovery cost, but it sacrifices storage efficiency. MSR codes are designed to decrease recovery cost while retaining high storage efficiency, but their computation is overly complex. To address this problem, in this paper we propose an erasure code for multiple availability zones (called AZ-Code), a hybrid code that takes advantage of both MSR and LRC codes. AZ-Code utilizes a specific MSR code as the local parity layout, and a typical RS code is used to generate the global parities. In this way, AZ-Code keeps recovery cost low while providing high reliability. To demonstrate the effectiveness of AZ-Code, we evaluate various erasure codes via mathematical analysis and experiments on Hadoop systems. The results show that, compared to traditional erasure coding methods, AZ-Code saves recovery bandwidth by up to 78.24%.
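A layout sketch of the AZ-level idea, with XOR as a toy stand-in for both the MSR local parity and the RS global parity (real MSR/RS encoding requires Galois-field arithmetic that is omitted here): a chunk lost inside one AZ is repaired from that AZ's chunks and local parity alone, avoiding cross-AZ recovery traffic.

```python
def xor_chunks(chunks):
    """XOR a list of equal-sized byte chunks together."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def encode(data_chunks, n_az=3):
    """Spread data over AZs; add one local parity per AZ and a global parity."""
    azs = [data_chunks[i::n_az] for i in range(n_az)]   # round-robin placement
    local = [xor_chunks(az) for az in azs]              # per-AZ repair, no cross-AZ I/O
    global_parity = xor_chunks(data_chunks)             # cross-AZ safety net
    return azs, local, global_parity

data = [bytes([i] * 8) for i in range(6)]
azs, local, gp = encode(data)
# Repairing one lost chunk inside AZ 0 uses only AZ-0 chunks plus its local parity:
lost = azs[0][0]
repaired = xor_chunks(azs[0][1:] + [local[0]])
assert repaired == lost
```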
{"title":"AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems","authors":"Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/MSST.2019.00004","DOIUrl":"https://doi.org/10.1109/MSST.2019.00004","url":null,"abstract":"As data in modern cloud storage system grows dramatically, it's a common method to partition data and store them in different Availability Zones (AZs). Multiple AZs not only provide high fault tolerance (e.g., rack level tolerance or disaster tolerance), but also reduce the network latency. Replication and Erasure Codes (EC) are typical data redundancy methods to provide high reliability for storage systems. Compared with the replication approach, erasure codes can achieve much lower monetary cost with the same fault-tolerance capability. However, the recovery cost of EC is extremely high in multiple AZ environment, especially because of its high bandwidth consumption in data centers. LRC is a widely used EC to reduce the recovery cost, but the storage efficiency is sacrificed. MSR code is designed to decrease the recovery cost with high storage efficiency, but its computation is too complex. To address this problem, in this paper, we propose an erasure code for multiple availability zones (called AZ-Code), which is a hybrid code by taking advantages of both MSR code and LRC codes. AZ-Code utilizes a specific MSR code as the local parity layout, and a typical RS code is used to generate the global parities. In this way, AZ-Code can keep low recovery cost with high reliability. To demonstrate the effectiveness of AZ-Code, we evaluate various erasure codes via mathematical analysis and experiments in Hadoop systems. The results show that, compared to the traditional erasure coding methods, AZ-Code saves the recovery bandwidth by up to 78.24%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128515655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wear-aware Memory Management Scheme for Balancing Lifetime and Performance of Multiple NVM Slots (MSST 2019, doi:10.1109/MSST.2019.000-7)
Chunhua Xiao, Linfeng Cheng, Lei Zhang, Duo Liu, Weichen Liu
Emerging Non-Volatile Memory (NVM) has many advantages, such as near-DRAM speed, byte-addressability, and persistence. Modern computer systems contain many memory slots, which are exposed as a unified storage interface through a shared address space. Since NVM has limited write endurance, many wear-leveling techniques are implemented in hardware. However, existing hardware techniques are only effective within a single NVM slot and cannot ensure wear-leveling across multiple NVM slots. This paper explores how to optimize a storage system with multiple NVM slots in terms of performance and lifetime. We show that naively integrating multiple NVMs under traditional memory policies results in poor reliability, and that existing hardware wear-leveling technologies are ineffective for a system with multiple NVM slots. In this paper, we propose a common wear-aware memory management scheme for in-memory file systems. The proposed scheme enables wear-aware control of NVM slot usage, minimizing the cost in both performance and lifetime. We implemented the proposed memory management scheme and evaluated its effectiveness. The experiments show that the proposed wear-aware memory management scheme improves the wear-leveling effect by more than 2600x, prolongs NVM lifetime by 2.5x, and improves write performance by up to 15%.
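A toy allocator in the spirit of the scheme described above (the locality bonus and cost function are assumptions, not the paper's design): new writes are steered to the least-worn slot, with a small bias toward a preferred slot so wear balancing does not destroy locality.

```python
class WearAwareAllocatorSketch:
    """Hypothetical slot chooser: balance per-slot wear against locality."""

    def __init__(self, slots, locality_bonus=0.1):
        self.writes = {s: 0 for s in slots}     # per-slot cumulative write bytes
        self.locality_bonus = locality_bonus

    def pick_slot(self, preferred=None):
        def cost(slot):
            c = self.writes[slot]
            if slot == preferred:               # e.g., the NUMA-near slot
                c *= 1.0 - self.locality_bonus
            return c
        return min(self.writes, key=cost)       # least effective wear wins

    def record_write(self, slot, nbytes):
        self.writes[slot] += nbytes

alloc = WearAwareAllocatorSketch(["slot0", "slot1", "slot2"])
for _ in range(6):
    s = alloc.pick_slot(preferred="slot0")
    alloc.record_write(s, 4096)
print(alloc.writes)   # writes spread across slots instead of hammering one
```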
{"title":"Wear-aware Memory Management Scheme for Balancing Lifetime and Performance of Multiple NVM Slots","authors":"Chunhua Xiao, Linfeng Cheng, Lei Zhang, Duo Liu, Weichen Liu","doi":"10.1109/MSST.2019.000-7","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-7","url":null,"abstract":"Emerging Non-Volatile Memory (NVM) has many advantages, such as near-DRAM speed, byte-addressability, and persistence. Modern computer systems contain many memory slots, which are exposed as a unified storage interface by shared address space. Since NVM has limited write endurance, many wear-leveling techniques are implemented in hardware. However, existing hardware techniques can only effective in a single NVM slot, which cannot ensure wear-leveling among multiple NVM slots. This paper explores how to optimize a storage system with multiple NVM slots in terms of performance and lifetime. We show that simple integration of multiple NVMs in traditional memory policies results in poor reliability. We also reveal that existing hardware wear-leveling technologies are ineffective for a system with multiple NVM slots. In this paper, we propose a common wear-aware memory management scheme for in-memory file system. The proposed memory scheme enables wear-aware control of NVM slot use which minimizes the cost of performance and lifetime. We implemented the proposed memory management scheme and evaluated their effectiveness. The experiments show that the proposed wear-aware memory management scheme can outperform wear-leveling effect by more than 2600x, and the lifetime of NVM can be prolonged by 2.5x, the write performance can be improved by up to 15%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132370584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
vPFS+: Managing I/O Performance for Diverse HPC Applications (MSST 2019, doi:10.1109/MSST.2019.00-16)
Ming Zhao, Yiqi Xu
High-performance computing (HPC) systems are increasingly shared by a variety of data- and metadata-intensive parallel applications. However, existing parallel file systems employed for HPC storage management are unable to differentiate the I/O requests of concurrent applications and meet their different performance requirements. Previous work, vPFS, provided a solution to this problem by virtualizing a parallel file system and enabling proportional-share bandwidth allocation to applications, but it cannot handle the increasingly diverse applications in today's HPC environments, including those that issue I/Os of different sizes and those that are metadata-intensive. This paper presents vPFS+, which builds upon the virtualization framework provided by vPFS but addresses its limitations in supporting diverse HPC applications. First, a new proportional-share I/O scheduler, SFQ(D)+, is created to allow applications with various I/O sizes and issue rates to share the storage with good application-level fairness and system-level utilization. Second, vPFS+ extends the scheduling to also include metadata I/Os and provides performance isolation to metadata-intensive applications. vPFS+ is prototyped on PVFS2, a widely used open-source parallel file system, and evaluated using a comprehensive set of representative HPC benchmarks and applications (IOR, NPB BTIO, WRF, and multi-md-test). The results confirm that the new SFQ(D)+ scheduler provides significantly better performance isolation for applications with small, bursty I/Os than the traditional SFQ(D) scheduler (3.35 times better) and native PVFS2 (8.25 times better) while still making efficient use of the storage. The results also show that vPFS+ can deliver near-perfect proportional sharing (>95% of the target sharing ratio) to metadata-intensive applications.
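A compact sketch of SFQ(D)-style start-time fair queuing with a depth limit, the family SFQ(D)+ builds on; the size-aware refinement is approximated here by charging each request its byte count against its application's weight, which is an assumption rather than the paper's exact cost model.

```python
import heapq

class SFQDPlusSketch:
    """Start-time fair queuing sketch with at most `depth` outstanding I/Os."""

    def __init__(self, depth, weights):
        self.depth, self.weights = depth, weights
        self.vtime = 0.0
        self.finish = {app: 0.0 for app in weights}   # last finish tag per app
        self.queue, self.inflight, self.seq = [], 0, 0

    def submit(self, app, nbytes):
        start = max(self.vtime, self.finish[app])     # SFQ start tag
        self.finish[app] = start + nbytes / self.weights[app]
        self.seq += 1
        heapq.heappush(self.queue, (start, self.seq, app, nbytes))

    def dispatch(self):
        """Issue requests in start-tag order, bounded by the depth limit."""
        issued = []
        while self.queue and self.inflight < self.depth:
            start, _, app, nbytes = heapq.heappop(self.queue)
            self.vtime = start                        # advance virtual time
            self.inflight += 1
            issued.append((app, nbytes))
        return issued

    def complete(self):
        self.inflight -= 1

sched = SFQDPlusSketch(depth=4, weights={"A": 3, "B": 1})  # 3:1 sharing target
for _ in range(3):
    sched.submit("A", 4096)
    sched.submit("B", 4096)
print(sched.dispatch())   # A's requests dominate the issue order per its 3x weight
```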
{"title":"vPFS+: Managing I/O Performance for Diverse HPC Applications","authors":"Ming Zhao, Yiqi Xu","doi":"10.1109/MSST.2019.00-16","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-16","url":null,"abstract":"High-performance computing (HPC) systems are increasingly shared by a variety of data-and metadata-intensive parallel applications. However, existing parallel file systems employed for HPC storage management are unable to differentiate the I/O requests from concurrent applications and meet their different performance requirements. Previous work, vPFS, provided a solution to this problem by virtualizing a parallel file system and enabling proportional-share bandwidth allocation to the applications; but it cannot handle the increasingly diverse applications in today's HPC environments, including those that have different sizes of I/Os and those that are metadata-intensive. This paper presents vPFS+ which builds upon the virtualization framework provided by vPFS but addresses its limitations in supporting diverse HPC applications. First, a new proportional-share I/O scheduler, SFQ(D)+, is created to allow applications with various I/O sizes and issue rates to share the storage with good application-level fairness and system-level utilization. Second, vPFS+ extends the scheduling to also include metadata I/Os and provides performance isolation to metadata-intensive applications. vPFS+ is prototyped on PVFS2, a widely used open-source parallel file system, and evaluated using a comprehensive set of representative HPC benchmarks and applications (IOR, NPB BTIO, WRF, and multi-md-test). The results confirm that the new SFQ(D)+ scheduler can provide significantly better performance isolation to applications with small, bursty I/Os than the traditional SFQ(D) scheduler (3.35 times better) and the native PVFS2 (8.25 times better) while still making efficient use of the storage. The results also show that vPFS+ can deliver near-perfect proportional sharing (>95% of the target sharing ratio) to metadata-intensive applications.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121542004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DFPE: Explaining Predictive Models for Disk Failure Prediction (MSST 2019, doi:10.1109/MSST.2019.000-3)
Yanwen Xie, D. Feng, F. Wang, Xuehai Tang, Jizhong Han, Xinyan Zhang
Recent research on disk failure prediction achieves high detection rates and low false alarm rates with complex models, at the cost of explainability. The lack of explainability is likely to hide bias or overfitting in the models, resulting in poor performance in real-world applications. To address this problem, we propose DFPE, a new explanation method designed for disk failure prediction, which explains the failure predictions made by a model and infers the prediction rules the model has learned. DFPE explains individual failure predictions by performing a series of replacement tests to identify the failure causes, and it explains models by aggregating the explanations of their failure predictions. A use case on a real-world dataset shows that, compared to current explanation methods, DFPE explains failure predictions and models more completely and more accurately. It thus helps to target and handle hidden bias and overfitting, measures feature importance from a new perspective, and enables intelligent failure handling.
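The replacement-test idea lends itself to a short sketch. Below, a feature is reported as a failure cause if substituting a healthy value for it flips the model's prediction; the threshold model and SMART attribute names are illustrative stand-ins, and DFPE's full series of tests and its aggregation step are not reproduced.

```python
def explain_failure(model, sample, healthy_baseline):
    """Single-feature replacement tests: which substitutions flip the prediction?"""
    causes = []
    for feature, healthy_value in healthy_baseline.items():
        probe = dict(sample)
        probe[feature] = healthy_value          # replace one attribute
        if not model.predict(probe):            # prediction flipped to healthy
            causes.append(feature)
    return causes

class ThresholdModelSketch:
    """Stand-in predictor: flags failure if any attribute breaches its threshold."""
    def __init__(self, thresholds):
        self.thresholds = thresholds
    def predict(self, s):
        return any(s[k] > t for k, t in self.thresholds.items())

model = ThresholdModelSketch({"reallocated_sectors": 10, "pending_sectors": 5})
sample = {"reallocated_sectors": 42, "pending_sectors": 0}
baseline = {"reallocated_sectors": 0, "pending_sectors": 0}
print(explain_failure(model, sample, baseline))  # -> ['reallocated_sectors']
```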
{"title":"DFPE: Explaining Predictive Models for Disk Failure Prediction","authors":"Yanwen Xie, D. Feng, F. Wang, Xuehai Tang, Jizhong Han, Xinyan Zhang","doi":"10.1109/MSST.2019.000-3","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-3","url":null,"abstract":"Recent research works on disk failure prediction achieve a high detection rate and a low false alarm rate with complex models at the cost of explainability. The lack of explainability is likely to hide bias or overfitting in the models, resulting in bad performance in real-world applications. To address the problem, we propose a new explanation method DFPE designed for disk failure prediction to explain failure predictions made by a model and infer prediction rules learned by a model. DFPE explains failure predictions by performing a series of replacement tests to find out the failure causes while it explains models by aggregating explanations for the failure predictions. A presented use case on a real-world dataset shows that compared to current explanation methods, DFPE can explain more about failure predictions and models with more accuracy. Thus it helps to target and handle the hidden bias and overfitting, measures feature importances from a new perspective and enables intelligent failure handling.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127954198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability (MSST 2019, doi:10.1109/MSST.2019.000-2)
Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang
This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, leading to temporary but significant speed degradation (especially in read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., reconstructing data sectors via system-level redundancy only after costly intra-HDD read retries fail. This paper shows that one could mitigate occasional HDD fail-slow much more effectively by pro-actively utilizing existing system-level data redundancy, as a complement to (or even a replacement for) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability over their internal read retry operations. Assuming a very simple form of enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such a pro-active strategy. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.
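A sketch of the pro-active policy under simple assumptions (RAID-5 XOR reconstruction and a fixed latency deadline as the fail-slow signal; the paper's enhanced controllability/observability interface is not modeled): if the direct read outlives the deadline, the unit is rebuilt from the stripe's other units and parity instead of waiting out intra-HDD retries.

```python
import concurrent.futures as futures

def xor_blocks(blocks):
    """XOR equal-sized blocks: RAID-5 style reconstruction primitive."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, v in enumerate(b):
            out[i] ^= v
    return bytes(out)

def read_stripe_unit(read_fn, disk, addr, peers, deadline_s=0.05):
    """Pro-active fail-slow mitigation sketch: race the direct read against a
    deadline; on timeout, fall back to a degraded read via redundancy."""
    pool = futures.ThreadPoolExecutor(max_workers=1)
    direct = pool.submit(read_fn, disk, addr)
    try:
        return direct.result(timeout=deadline_s)          # fast path
    except futures.TimeoutError:
        # Degraded read: XOR the remaining data units and parity from peers.
        return xor_blocks([read_fn(p, addr) for p in peers])
    finally:
        pool.shutdown(wait=False)                         # don't block on the laggard
```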
{"title":"Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability","authors":"Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang","doi":"10.1109/MSST.2019.000-2","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-2","url":null,"abstract":"This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, leading to temporarily significant degradation of speed (especially the read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., reconstruct data sectors via system-level redundancy only after the costly intra-HDD read retry fails. This paper shows that one could much more effectively mitigate occasional HDD fail-slow by more pro-actively utilizing existing system-level data redundancy, in complement to (or even replacement of) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability in terms of their internal read retry operations. Assuming a very simple form enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such pro-active strategy for mitigating occasional HDD fail-slow. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate the RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114537831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}