Tiered-ReRAM: A Low Latency and Energy Efficient TLC Crossbar ReRAM Architecture (MSST 2019, doi:10.1109/MSST.2019.00-13)
Yang Zhang, D. Feng, Wei Tong, Jingning Liu, Chengning Wang, Jie Xu
Resistive Memory (ReRAM) is a promising candidate for high-density storage-class memory when Triple-Level Cell (TLC) and crossbar structures are employed. However, TLC crossbar ReRAM suffers from high write latency and energy due to the IR drop issue and the iterative program-and-verify procedure. In this paper, we propose the Tiered-ReRAM architecture to overcome these challenges. Tiered-ReRAM consists of three components: the Tiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), and the Compression-based Flip Scheme (CFS). Specifically, based on the observation that the magnitude of IR drops is primarily determined by the length of bitlines in Double-Sided Ground Biasing (DSGB) crossbar arrays, the Tiered-crossbar design splits each long bitline into a near and a far segment with an isolation transistor, allowing the near segment to be accessed with decreased latency and energy. Moreover, in the near segments, CIDM dynamically selects the most appropriate IDM for each cache line according to the space saved by compression, which further reduces write latency and energy with insignificant space overhead. In addition, in the far segments, CFS dynamically selects the most appropriate flip scheme for each cache line, which ensures that more high-resistance cells are written into the crossbar arrays and effectively reduces leakage energy. For each compressed cache line, the selected IDM or flip scheme is applied only on the condition that the total encoded data size never exceeds the original cache line size. The experimental results show that, on average, Tiered-ReRAM improves system performance by 30.5%, reduces write latency by 35.2%, decreases read latency by 26.1%, and reduces energy consumption by 35.6%, compared to an aggressive baseline.
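A minimal sketch of the CIDM selection rule described above, assuming a 64-byte cache line and hypothetical IDM options (the names and expansion ratios below are illustrative stand-ins, not the paper's actual encodings); only the constraint that encoded data never exceed the original line size comes from the abstract:

```python
CACHE_LINE_BYTES = 64

# Hypothetical IDM options: (name, expansion_ratio). A stronger IDM avoids more
# slow/high-current TLC levels but inflates the encoded data more.
IDM_OPTIONS = [
    ("IDM-strong", 1.50),
    ("IDM-medium", 1.20),
    ("IDM-weak",   1.05),
]

def select_idm(compressed_size: int):
    """Return the most aggressive IDM whose encoded size still fits the line."""
    for name, ratio in IDM_OPTIONS:           # ordered strongest first
        encoded = int(compressed_size * ratio + 0.5)
        if encoded <= CACHE_LINE_BYTES:       # never exceed original line size
            return name, encoded
    return None, compressed_size              # incompressible: plain write

print(select_idm(40))   # -> ('IDM-strong', 60): enough slack for the strong IDM
print(select_idm(60))   # -> ('IDM-weak', 63):  only the weak IDM fits
print(select_idm(64))   # -> (None, 64):        no IDM applied
```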
{"title":"Tiered-ReRAM: A Low Latency and Energy Efficient TLC Crossbar ReRAM Architecture","authors":"Yang Zhang, D. Feng, Wei Tong, Jingning Liu, Chengning Wang, Jie Xu","doi":"10.1109/MSST.2019.00-13","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-13","url":null,"abstract":"Resistive Memory (ReRAM) is promising to be used as high density storage-class memory by employing Triple-Level Cell (TLC) and crossbar structures. However, TLC crossbar ReRAM suffers from high write latency and energy due to the IR drop issue and the iterative program-and-verify procedure. In this paper, we propose Tiered-ReRAM architecture to overcome the challenges of TLC crossbar ReRAM. The proposed Tiered-ReRAM consists of three components, namely Tiered-crossbar design, Compression-based Incomplete Data Mapping (CIDM), and Compression-based Flip Scheme (CFS). Specifically, based on the observation that the magnitude of IR drops is primarily determined by the long length of bitlines in Double-Sided Ground Biasing (DSGB) crossbar arrays, Tiered-crossbar design splits each long bitline into the near and far segments by an isolation transistor, allowing the near segment to be accessed with decreased latency and energy. Moreover, in the near segments, CIDM dynamically selects the most appropriate IDM for each cache line according to the saved space by compression, which further reduces the write latency and energy with insignificant space overhead. In addition, in the far segments, CFS dynamically selects the most appropriate flip scheme for each cache line, which ensures more high resistance cells written into crossbar arrays and effectively reduces the leakage energy. For each compressed cache line, the selected IDM or flip scheme is applied on the condition that the total encoded data size will never exceed the original cache line size. The experimental results show that, on average, Tiered-ReRAM can improve the system performance by 30.5%, reduce the write latency by 35.2%, decrease the read latency by 26.1%, and reduce the energy consumption by 35.6%, compared to an aggressive baseline.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128375347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication (MSST 2019, doi:10.1109/MSST.2019.00010)
Guangping Xu, Bo Tang, Hongli Lu, Quan Yu, C. Sung
In this paper, we present a learning-based data deduplication algorithm, called LIPA, which uses a reinforcement learning framework to build an adaptive indexing structure. It differs substantially from previous inline chunk-based deduplication methods in how it solves the chunk-lookup disk bottleneck problem for large-scale backup. In previous methods, a full chunk index or a sampled chunk index is often required to identify duplicate chunks, which is a critical stage of data deduplication. A full chunk index is hard to fit in RAM, while the deduplication ratio achieved with a sampled chunk index depends directly on the sampling ratio. Our learning-based method requires only a small memory overhead to store the index, yet achieves the same or even a better deduplication ratio than previous methods. In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of each segment. An incoming segment may share the same feature with previous segments, so we use a key-value structure to record the relationship between features and segments: a feature maps to a fixed number of segments. We use reinforcement learning to train scores that represent the similarity of these segments to a feature. For an incoming segment, our method adaptively prefetches a segment and its successors into the cache using a multi-armed bandit model. Our experimental results show that our method significantly reduces memory overhead and achieves effective deduplication.
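A minimal sketch of the index and bandit logic described above, with assumed parameters (the values of K, EPSILON, and ALPHA and the exact score-update form are illustrative, not taken from the paper): each feature maps to a bounded set of segments whose scores are reinforced by observed prefetch rewards.

```python
import random
from collections import defaultdict

K, EPSILON, ALPHA = 4, 0.1, 0.5      # per-feature fan-out, exploration, learning rate
index = defaultdict(dict)            # feature fingerprint -> {segment_id: score}

def choose_segment(feature):
    """Epsilon-greedy bandit: pick one stored segment to prefetch for this feature."""
    arms = index[feature]
    if not arms:
        return None                               # new feature, nothing to prefetch
    if random.random() < EPSILON:
        return random.choice(list(arms))          # explore
    return max(arms, key=arms.get)                # exploit the best-scoring segment

def reinforce(feature, segment_id, reward):
    """Move the chosen segment's score toward the observed reward
    (e.g., the number of duplicate chunks the prefetch resolved)."""
    arms = index[feature]
    if segment_id not in arms and len(arms) >= K:
        del arms[min(arms, key=arms.get)]         # evict the weakest mapping
    old = arms.get(segment_id, 0.0)
    arms[segment_id] = old + ALPHA * (reward - old)

# Usage: prefetch the chosen segment (and its successors) into the fingerprint
# cache, then reinforce with the duplicate hits observed.
reinforce("feat-a", "seg-1", 10)
reinforce("feat-a", "seg-2", 3)
print(choose_segment("feat-a"))   # usually 'seg-1'
```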
{"title":"LIPA: A Learning-based Indexing and Prefetching Approach for Data Deduplication","authors":"Guangping Xu, Bo Tang, Hongli Lu, Quan Yu, C. Sung","doi":"10.1109/MSST.2019.00010","DOIUrl":"https://doi.org/10.1109/MSST.2019.00010","url":null,"abstract":"In this paper, we present a learning based data deduplication algorithm, called LIPA, which uses the reinforcement learning framework to build an adaptive indexing structure. It is rather different from previous inline chunk-based deduplication methods to solve the chunk-lookup disk bottleneck problem for large-scale backup. In previous methods, a full chunk index or a sampled chunk index often is often required to identify duplicate chunks, which is a critical stage for data deduplication. The full chunk index is hard to fit in RAM and the sampled chunk index directly affects the deduplication ratio dependent on the sampling ratio. Our learning based method only requires little memory overheads to store the index but achieves the same or even better deduplication ratio than previous methods. In our method, after the data stream is broken into relatively large segments, one or more representative chunk fingerprints are chosen as the feature of a segment. An incoming segment may share the same feature with previous segments. Thus we use a key-value structure to record the relationship between features and segments: a feature maps to a fixed number of segments. We train the similarities of these segments to a feature represented as scores by the reinforcement learning method. For an incoming segment, our method adaptively prefetches a segment and the successive ones into cache by using multi-armed bandits model. Our experimental results show that our method significantly reduces memory overheads and achieves effective deduplication.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121431478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CDAC: Content-Driven Deduplication-Aware Storage Cache (MSST 2019, doi:10.1109/MSST.2019.00008)
Yujuan Tan, Wen Xia, Jing Xie, Congcong Xu, Zhichao Yan, Hong Jiang, Yajun Zhao, Min Fu, Xianzhang Chen, Duo Liu
Data deduplication, a proven technology for effective data reduction in backup and archive storage systems, also shows promise for increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of existing deduplication-aware caching algorithms reveals that while they do improve hit ratios compared to caching algorithms without deduplication, especially when the cache block size is set to 4KB, their hit ratios are significantly reduced when the block size is larger than 4KB, a clear trend in modern storage systems. A slight increase in hit ratios due to deduplication may not be able to improve overall storage performance because of the high overhead created by deduplication. To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting blocks' content redundancy and the intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on the LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDAC-LRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and D-ARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X, under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4KB to 64KB.
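A toy sketch of the content-driven idea layered on LRU, under stated assumptions (the eviction window and the use of distinct source-address counts as the sharing-intensity signal are illustrative choices, not the paper's exact CDAC-LRU): recency selects a candidate set, and sharing intensity picks the victim.

```python
from collections import OrderedDict, defaultdict

class CDACLRUSketch:
    """Toy illustration in the spirit of CDAC-LRU (not the paper's code):
    plain LRU recency is tempered by how many distinct source addresses
    share each cached block's content."""

    def __init__(self, capacity, window=4):
        self.capacity, self.window = capacity, window
        self.blocks = OrderedDict()          # content fingerprint -> data (one copy)
        self.sharers = defaultdict(set)      # fingerprint -> sharing source addresses

    def access(self, addr, fingerprint, data):
        self.sharers[fingerprint].add(addr)  # track content-sharing intensity
        if fingerprint in self.blocks:
            self.blocks.move_to_end(fingerprint)   # dedup hit: refresh recency
            return True
        if len(self.blocks) >= self.capacity:
            self._evict()
        self.blocks[fingerprint] = data
        return False

    def _evict(self):
        # Among the `window` least-recently-used blocks, drop the least shared.
        candidates = list(self.blocks)[: self.window]
        victim = min(candidates, key=lambda fp: len(self.sharers[fp]))
        del self.blocks[victim]
        del self.sharers[victim]
```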
{"title":"CDAC: Content-Driven Deduplication-Aware Storage Cache","authors":"Yujuan Tan, Wen Xia, Jing Xie, Congcong Xu, Zhichao Yan, Hong Jiang, Yajun Zhao, Min Fu, Xianzhang Chen, Duo Liu","doi":"10.1109/MSST.2019.00008","DOIUrl":"https://doi.org/10.1109/MSST.2019.00008","url":null,"abstract":"Data deduplication, as a proven technology for effective data reduction in backup and archive storage systems, also demonstrates the promise in increasing the logical space capacity of storage caches by removing redundant data. However, our in-depth evaluation of the existing deduplication-aware caching algorithms reveals that they do improve the hit ratios compared to the caching algorithms without deduplication, especially when the cache block size is set to 4KB. But when the block size is larger than 4KB, a clear trend for modern storage systems, their hit ratios are significantly reduced. A slight increase in hit ratios due to deduplicationmay not be able to improve the overall storage performance because of the high overhead created by deduplication. To address this problem, in this paper we propose CDAC, a Content-driven Deduplication-Aware Cache, which focuses on exploiting the blocks' content redundancy and their intensity of content sharing among source addresses in cache management strategies. We have implemented CDAC based on LRU and ARC algorithms, called CDAC-LRU and CDAC-ARC respectively. Our extensive experimental results show that CDACLRU and CDAC-ARC outperform the state-of-the-art deduplication-aware caching algorithms, D-LRU and DARC, by up to 19.49X in read cache hit ratio, with an average of 1.95X under real-world traces when the cache size ranges from 20% to 80% of the working set size and the block size ranges from 4KB to 64 KB.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116656072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
When NVMe over Fabrics Meets Arm: Performance and Implications (MSST 2019, doi:10.1109/MSST.2019.000-9)
Yichen Jia, E. Anger, Feng Chen
A growing technology trend in the industry is to deploy highly capable and power-efficient storage servers based on the Arm architecture. An important driving force behind this is storage disaggregation, which separates compute and storage onto different servers, enabling independent resource allocation and optimized hardware utilization. The recently released remote storage protocol specification, NVMe-over-Fabrics (NVMeoF), makes flash disaggregation possible by reducing the remote access overhead to a minimum. It is therefore highly appealing to integrate these two promising technologies to build an efficient Arm-based storage server with NVMeoF. In this work, we have conducted a set of comprehensive experiments to understand the performance behaviors of NVMeoF on an Arm-based data center SoC and to gain insight into the implications for their design and deployment in data centers. Our experiments show that NVMeoF delivers the promised ultra-low latency. With appropriate optimizations on both hardware and software, NVMeoF can achieve even better performance than direct-attached storage. Specifically, with appropriate NIC optimizations, we have observed a throughput increase of up to 42.5% and a decrease in the 95th-percentile tail latency of up to 14.6%. Based on our measurement results, we also discuss several system implications of integrating NVMeoF on Arm-based platforms. Our studies show that this system solution balances the computation, network, and storage resources for data-center storage services well. Our findings have also been reported to Arm and Broadcom for future optimizations.
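The study's headline metrics are throughput and 95th-percentile tail latency; the snippet below shows one common way (nearest-rank) to compute the p95 tail from collected latency samples. The numbers are illustrative only, not measurements from the paper.

```python
import statistics

def tail_latency(samples_us, pct=95):
    """Nearest-rank percentile: the tail-latency metric the study reports."""
    ordered = sorted(samples_us)
    rank = max(0, int(len(ordered) * pct / 100 + 0.5) - 1)
    return ordered[rank]

# Illustrative microsecond latencies comparing two hypothetical NIC settings.
baseline  = [82, 85, 90, 88, 84, 310, 86, 83, 87, 95]
optimized = [80, 81, 84, 83, 82, 160, 85, 80, 83, 88]
for name, s in (("baseline", baseline), ("NIC-optimized", optimized)):
    print(name, "median:", statistics.median(s), "p95:", tail_latency(s))
```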
{"title":"When NVMe over Fabrics Meets Arm: Performance and Implications","authors":"Yichen Jia, E. Anger, Feng Chen","doi":"10.1109/MSST.2019.000-9","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-9","url":null,"abstract":"A growing technology trend in the industry is to deploy highly capable and power-efficient storage servers based on the Arm architecture. An important driving force behind this is storage disaggregation, which separates compute and storage to different servers, enabling independent resource allocation and optimized hardware utilization. The recently released remote storage protocol specification, NVMe-over-Fabrics (NVMeoF), makes flash disaggregation possible by reducing the remote access overhead to the minimum. It is highly appealing to integrate the two promising technologies together to build an efficient Arm based storage server with NVMeoF. In this work, we have conducted a set of comprehensive experiments to understand the performance behaviors of NVMeoF on Arm-based Data Center SoC and to gain insight into the implications of their design and deployment in data centers. Our experiments show that NVMeoF delivers the promised ultra-low latency. With appropriate optimizations on both hardware and software, NVMeoF can achieve even better performance than direct attached storage. Specifically, with appropriate NIC optimizations, we have observed a throughput increase by up to 42.5% and a decrease of the 95th percentile tail latency by up to 14.6%. Based on our measurement results, we also discuss several system implications for integrating NVMeoF on Arm based platforms. Our studies show that this system solution can well balance the computation, network, and storage resources for data-center storage services. Our findings have also been reported to Arm and Broadcom for future optimizations.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128710413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
vNVML: An Efficient User Space Library for Virtualizing and Sharing Non-Volatile Memories (MSST 2019, doi:10.1109/MSST.2019.00-12)
C. Chou, Jaemin Jung, A. Reddy, Paul V. Gratz, Doug Voigt
Emerging non-volatile memory (NVM) has attractive characteristics, such as DRAM-like low latency combined with the non-volatility of storage devices. Recently, byte-addressable, memory-bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, such as an SSD, to create the impression of a larger and faster byte-addressable NVM that can be shared across many applications. In this paper, we propose vNVML, a user-space library for virtualizing and sharing NVM. vNVML provides applications with transaction-like memory semantics that ensure write ordering and persistence guarantees across system failures. vNVML exploits DRAM for read caching, to improve performance and potentially reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both within a single OS and when Docker-like containers are employed. The results of the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to applications, allowing them to safely share the virtual NVM.
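A minimal redo-log sketch of the transaction-like semantics described above, using a file plus fsync as a stand-in for NVM flush/fence primitives; the class and method names are hypothetical, not vNVML's actual API. The key ordering point is that the log is made durable before any in-place update, so a crash can never expose partial state.

```python
import os
import pickle

class TxNVMSketch:
    """Hypothetical sketch of transaction-like NVM semantics (not vNVML's API):
    stage writes, make the redo log durable, then apply in place."""

    def __init__(self, log_path, store_path):
        self.log = open(log_path, "ab")
        self.store_path = store_path
        self.read_cache = {}                 # DRAM cache absorbs reads
        self.pending = {}

    def tx_write(self, key, value):
        self.pending[key] = value            # staged, not yet durable

    def tx_commit(self):
        rec = pickle.dumps(self.pending)
        self.log.write(len(rec).to_bytes(8, "little") + rec)
        self.log.flush()
        os.fsync(self.log.fileno())          # ordering point: log durable first
        self._apply(self.pending)            # now safe to update in place
        self.read_cache.update(self.pending) # serve future reads from DRAM
        self.pending = {}

    def _apply(self, updates):
        with open(self.store_path, "ab") as f:   # stand-in for the NVM region
            f.write(pickle.dumps(updates))
            f.flush()
            os.fsync(f.fileno())
```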
{"title":"vNVML: An Efficient User Space Library for Virtualizing and Sharing Non-Volatile Memories","authors":"C. Chou, Jaemin Jung, A. Reddy, Paul V. Gratz, Doug Voigt","doi":"10.1109/MSST.2019.00-12","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-12","url":null,"abstract":"The emerging non-volatile memory (NVM) has attractive characteristics such as DRAM-like, low-latency together with the non-volatility of storage devices. Recently, byte-addressable, memory bus-attached NVM has become available. This paper addresses the problem of combining a smaller, faster byte-addressable NVM with a larger, slower storage device, like SSD, to create the impression of a larger and faster byte-addressable NVM which can be shared across many applications. In this paper, we propose vNVML, a user space library for virtualizing and sharing NVM. vNVML provides for applications transaction like memory semantics that ensures write ordering and persistency guarantees across system failures. vNVML exploits DRAM for read caching, to enable improvements in performance and potentially to reduce the number of writes to NVM, extending the NVM lifetime. vNVML is implemented and evaluated with realistic workloads to show that our library allows applications to share NVM, both in a single O/S and when docker like containers are employed. The results from the evaluation show that vNVML incurs less than 10% overhead while providing the benefits of an expanded virtualized NVM space to the applications, allowing applications to safely share the virtual NVM.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116841308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems (MSST 2019, doi:10.1109/MSST.2019.00004)
Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao
As data in modern cloud storage systems grows dramatically, it is common practice to partition data and store it across different Availability Zones (AZs). Multiple AZs not only provide high fault tolerance (e.g., rack-level tolerance or disaster tolerance) but also reduce network latency. Replication and Erasure Codes (EC) are typical data redundancy methods that provide high reliability for storage systems. Compared with the replication approach, erasure codes achieve much lower monetary cost with the same fault-tolerance capability. However, the recovery cost of EC is extremely high in multi-AZ environments, particularly because of its high bandwidth consumption across data centers. LRC is a widely used EC that reduces recovery cost, but it sacrifices storage efficiency. MSR codes are designed to decrease recovery cost while retaining high storage efficiency, but their computation is overly complex. To address this problem, in this paper we propose an erasure code for multiple availability zones (called AZ-Code), a hybrid code that takes advantage of both MSR and LRC codes. AZ-Code utilizes a specific MSR code as the local parity layout, and a typical RS code is used to generate the global parities. In this way, AZ-Code keeps recovery cost low while providing high reliability. To demonstrate the effectiveness of AZ-Code, we evaluate various erasure codes via mathematical analysis and experiments on Hadoop systems. The results show that, compared to traditional erasure coding methods, AZ-Code saves recovery bandwidth by up to 78.24%.
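A layout sketch of the AZ-level idea, with XOR as a toy stand-in for both the MSR local parity and the RS global parity (real MSR/RS encoding requires Galois-field arithmetic that is omitted here): a chunk lost inside one AZ is repaired from that AZ's chunks and local parity alone, avoiding cross-AZ recovery traffic.

```python
def xor_chunks(chunks):
    """XOR a list of equal-sized byte chunks together."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def encode(data_chunks, n_az=3):
    """Spread data over AZs; add one local parity per AZ and a global parity."""
    azs = [data_chunks[i::n_az] for i in range(n_az)]   # round-robin placement
    local = [xor_chunks(az) for az in azs]              # per-AZ repair, no cross-AZ I/O
    global_parity = xor_chunks(data_chunks)             # cross-AZ safety net
    return azs, local, global_parity

data = [bytes([i] * 8) for i in range(6)]
azs, local, gp = encode(data)
# Repairing one lost chunk inside AZ 0 uses only AZ-0 chunks plus its local parity:
lost = azs[0][0]
repaired = xor_chunks(azs[0][1:] + [local[0]])
assert repaired == lost
```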
{"title":"AZ-Code: An Efficient Availability Zone Level Erasure Code to Provide High Fault Tolerance in Cloud Storage Systems","authors":"Xin Xie, Chentao Wu, Junqing Gu, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/MSST.2019.00004","DOIUrl":"https://doi.org/10.1109/MSST.2019.00004","url":null,"abstract":"As data in modern cloud storage system grows dramatically, it's a common method to partition data and store them in different Availability Zones (AZs). Multiple AZs not only provide high fault tolerance (e.g., rack level tolerance or disaster tolerance), but also reduce the network latency. Replication and Erasure Codes (EC) are typical data redundancy methods to provide high reliability for storage systems. Compared with the replication approach, erasure codes can achieve much lower monetary cost with the same fault-tolerance capability. However, the recovery cost of EC is extremely high in multiple AZ environment, especially because of its high bandwidth consumption in data centers. LRC is a widely used EC to reduce the recovery cost, but the storage efficiency is sacrificed. MSR code is designed to decrease the recovery cost with high storage efficiency, but its computation is too complex. To address this problem, in this paper, we propose an erasure code for multiple availability zones (called AZ-Code), which is a hybrid code by taking advantages of both MSR code and LRC codes. AZ-Code utilizes a specific MSR code as the local parity layout, and a typical RS code is used to generate the global parities. In this way, AZ-Code can keep low recovery cost with high reliability. To demonstrate the effectiveness of AZ-Code, we evaluate various erasure codes via mathematical analysis and experiments in Hadoop systems. The results show that, compared to the traditional erasure coding methods, AZ-Code saves the recovery bandwidth by up to 78.24%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128515655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wear-aware Memory Management Scheme for Balancing Lifetime and Performance of Multiple NVM Slots (MSST 2019, doi:10.1109/MSST.2019.000-7)
Chunhua Xiao, Linfeng Cheng, Lei Zhang, Duo Liu, Weichen Liu
Emerging Non-Volatile Memory (NVM) has many advantages, such as near-DRAM speed, byte-addressability, and persistence. Modern computer systems contain many memory slots, which are exposed as a unified storage interface through a shared address space. Since NVM has limited write endurance, many wear-leveling techniques are implemented in hardware. However, existing hardware techniques are only effective within a single NVM slot and cannot ensure wear-leveling across multiple NVM slots. This paper explores how to optimize a storage system with multiple NVM slots in terms of performance and lifetime. We show that naively integrating multiple NVMs under traditional memory policies results in poor reliability, and that existing hardware wear-leveling technologies are ineffective for a system with multiple NVM slots. In this paper, we propose a common wear-aware memory management scheme for in-memory file systems. The proposed scheme enables wear-aware control of NVM slot usage, minimizing the cost in both performance and lifetime. We implemented the proposed memory management scheme and evaluated its effectiveness. The experiments show that the proposed wear-aware memory management scheme improves the wear-leveling effect by more than 2600x, prolongs NVM lifetime by 2.5x, and improves write performance by up to 15%.
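A toy allocator in the spirit of the scheme described above (the locality bonus and cost function are assumptions, not the paper's design): new writes are steered to the least-worn slot, with a small bias toward a preferred slot so wear balancing does not destroy locality.

```python
class WearAwareAllocatorSketch:
    """Hypothetical slot chooser: balance per-slot wear against locality."""

    def __init__(self, slots, locality_bonus=0.1):
        self.writes = {s: 0 for s in slots}     # per-slot cumulative write bytes
        self.locality_bonus = locality_bonus

    def pick_slot(self, preferred=None):
        def cost(slot):
            c = self.writes[slot]
            if slot == preferred:               # e.g., the NUMA-near slot
                c *= 1.0 - self.locality_bonus
            return c
        return min(self.writes, key=cost)       # least effective wear wins

    def record_write(self, slot, nbytes):
        self.writes[slot] += nbytes

alloc = WearAwareAllocatorSketch(["slot0", "slot1", "slot2"])
for _ in range(6):
    s = alloc.pick_slot(preferred="slot0")
    alloc.record_write(s, 4096)
print(alloc.writes)   # writes spread across slots instead of hammering one
```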
{"title":"Wear-aware Memory Management Scheme for Balancing Lifetime and Performance of Multiple NVM Slots","authors":"Chunhua Xiao, Linfeng Cheng, Lei Zhang, Duo Liu, Weichen Liu","doi":"10.1109/MSST.2019.000-7","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-7","url":null,"abstract":"Emerging Non-Volatile Memory (NVM) has many advantages, such as near-DRAM speed, byte-addressability, and persistence. Modern computer systems contain many memory slots, which are exposed as a unified storage interface by shared address space. Since NVM has limited write endurance, many wear-leveling techniques are implemented in hardware. However, existing hardware techniques can only effective in a single NVM slot, which cannot ensure wear-leveling among multiple NVM slots. This paper explores how to optimize a storage system with multiple NVM slots in terms of performance and lifetime. We show that simple integration of multiple NVMs in traditional memory policies results in poor reliability. We also reveal that existing hardware wear-leveling technologies are ineffective for a system with multiple NVM slots. In this paper, we propose a common wear-aware memory management scheme for in-memory file system. The proposed memory scheme enables wear-aware control of NVM slot use which minimizes the cost of performance and lifetime. We implemented the proposed memory management scheme and evaluated their effectiveness. The experiments show that the proposed wear-aware memory management scheme can outperform wear-leveling effect by more than 2600x, and the lifetime of NVM can be prolonged by 2.5x, the write performance can be improved by up to 15%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132370584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
vPFS+: Managing I/O Performance for Diverse HPC Applications (MSST 2019, doi:10.1109/MSST.2019.00-16)
Ming Zhao, Yiqi Xu
High-performance computing (HPC) systems are increasingly shared by a variety of data- and metadata-intensive parallel applications. However, existing parallel file systems employed for HPC storage management are unable to differentiate the I/O requests of concurrent applications and meet their different performance requirements. Previous work, vPFS, provided a solution to this problem by virtualizing a parallel file system and enabling proportional-share bandwidth allocation to applications, but it cannot handle the increasingly diverse applications in today's HPC environments, including those that issue I/Os of different sizes and those that are metadata-intensive. This paper presents vPFS+, which builds upon the virtualization framework provided by vPFS but addresses its limitations in supporting diverse HPC applications. First, a new proportional-share I/O scheduler, SFQ(D)+, is created to allow applications with various I/O sizes and issue rates to share the storage with good application-level fairness and system-level utilization. Second, vPFS+ extends the scheduling to also include metadata I/Os and provides performance isolation to metadata-intensive applications. vPFS+ is prototyped on PVFS2, a widely used open-source parallel file system, and evaluated using a comprehensive set of representative HPC benchmarks and applications (IOR, NPB BTIO, WRF, and multi-md-test). The results confirm that the new SFQ(D)+ scheduler provides significantly better performance isolation for applications with small, bursty I/Os than the traditional SFQ(D) scheduler (3.35 times better) and native PVFS2 (8.25 times better) while still making efficient use of the storage. The results also show that vPFS+ can deliver near-perfect proportional sharing (>95% of the target sharing ratio) to metadata-intensive applications.
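A compact sketch of SFQ(D)-style start-time fair queuing with a depth limit, the family SFQ(D)+ builds on; the size-aware refinement is approximated here by charging each request its byte count against its application's weight, which is an assumption rather than the paper's exact cost model.

```python
import heapq

class SFQDPlusSketch:
    """Start-time fair queuing sketch with at most `depth` outstanding I/Os."""

    def __init__(self, depth, weights):
        self.depth, self.weights = depth, weights
        self.vtime = 0.0
        self.finish = {app: 0.0 for app in weights}   # last finish tag per app
        self.queue, self.inflight, self.seq = [], 0, 0

    def submit(self, app, nbytes):
        start = max(self.vtime, self.finish[app])     # SFQ start tag
        self.finish[app] = start + nbytes / self.weights[app]
        self.seq += 1
        heapq.heappush(self.queue, (start, self.seq, app, nbytes))

    def dispatch(self):
        """Issue requests in start-tag order, bounded by the depth limit."""
        issued = []
        while self.queue and self.inflight < self.depth:
            start, _, app, nbytes = heapq.heappop(self.queue)
            self.vtime = start                        # advance virtual time
            self.inflight += 1
            issued.append((app, nbytes))
        return issued

    def complete(self):
        self.inflight -= 1

sched = SFQDPlusSketch(depth=4, weights={"A": 3, "B": 1})  # 3:1 sharing target
for _ in range(3):
    sched.submit("A", 4096)
    sched.submit("B", 4096)
print(sched.dispatch())   # A's requests dominate the issue order per its 3x weight
```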
{"title":"vPFS+: Managing I/O Performance for Diverse HPC Applications","authors":"Ming Zhao, Yiqi Xu","doi":"10.1109/MSST.2019.00-16","DOIUrl":"https://doi.org/10.1109/MSST.2019.00-16","url":null,"abstract":"High-performance computing (HPC) systems are increasingly shared by a variety of data-and metadata-intensive parallel applications. However, existing parallel file systems employed for HPC storage management are unable to differentiate the I/O requests from concurrent applications and meet their different performance requirements. Previous work, vPFS, provided a solution to this problem by virtualizing a parallel file system and enabling proportional-share bandwidth allocation to the applications; but it cannot handle the increasingly diverse applications in today's HPC environments, including those that have different sizes of I/Os and those that are metadata-intensive. This paper presents vPFS+ which builds upon the virtualization framework provided by vPFS but addresses its limitations in supporting diverse HPC applications. First, a new proportional-share I/O scheduler, SFQ(D)+, is created to allow applications with various I/O sizes and issue rates to share the storage with good application-level fairness and system-level utilization. Second, vPFS+ extends the scheduling to also include metadata I/Os and provides performance isolation to metadata-intensive applications. vPFS+ is prototyped on PVFS2, a widely used open-source parallel file system, and evaluated using a comprehensive set of representative HPC benchmarks and applications (IOR, NPB BTIO, WRF, and multi-md-test). The results confirm that the new SFQ(D)+ scheduler can provide significantly better performance isolation to applications with small, bursty I/Os than the traditional SFQ(D) scheduler (3.35 times better) and the native PVFS2 (8.25 times better) while still making efficient use of the storage. The results also show that vPFS+ can deliver near-perfect proportional sharing (>95% of the target sharing ratio) to metadata-intensive applications.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121542004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DFPE: Explaining Predictive Models for Disk Failure Prediction (MSST 2019, doi:10.1109/MSST.2019.000-3)
Yanwen Xie, D. Feng, F. Wang, Xuehai Tang, Jizhong Han, Xinyan Zhang
Recent research on disk failure prediction achieves high detection rates and low false alarm rates with complex models, at the cost of explainability. The lack of explainability is likely to hide bias or overfitting in the models, resulting in poor performance in real-world applications. To address this problem, we propose DFPE, a new explanation method designed for disk failure prediction, which explains the failure predictions made by a model and infers the prediction rules the model has learned. DFPE explains individual failure predictions by performing a series of replacement tests to identify the failure causes, and it explains models by aggregating the explanations of their failure predictions. A use case on a real-world dataset shows that, compared to current explanation methods, DFPE explains failure predictions and models more completely and more accurately. It thus helps to target and handle hidden bias and overfitting, measures feature importance from a new perspective, and enables intelligent failure handling.
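The replacement-test idea lends itself to a short sketch. Below, a feature is reported as a failure cause if substituting a healthy value for it flips the model's prediction; the threshold model and SMART attribute names are illustrative stand-ins, and DFPE's full series of tests and its aggregation step are not reproduced.

```python
def explain_failure(model, sample, healthy_baseline):
    """Single-feature replacement tests: which substitutions flip the prediction?"""
    causes = []
    for feature, healthy_value in healthy_baseline.items():
        probe = dict(sample)
        probe[feature] = healthy_value          # replace one attribute
        if not model.predict(probe):            # prediction flipped to healthy
            causes.append(feature)
    return causes

class ThresholdModelSketch:
    """Stand-in predictor: flags failure if any attribute breaches its threshold."""
    def __init__(self, thresholds):
        self.thresholds = thresholds
    def predict(self, s):
        return any(s[k] > t for k, t in self.thresholds.items())

model = ThresholdModelSketch({"reallocated_sectors": 10, "pending_sectors": 5})
sample = {"reallocated_sectors": 42, "pending_sectors": 0}
baseline = {"reallocated_sectors": 0, "pending_sectors": 0}
print(explain_failure(model, sample, baseline))  # -> ['reallocated_sectors']
```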
{"title":"DFPE: Explaining Predictive Models for Disk Failure Prediction","authors":"Yanwen Xie, D. Feng, F. Wang, Xuehai Tang, Jizhong Han, Xinyan Zhang","doi":"10.1109/MSST.2019.000-3","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-3","url":null,"abstract":"Recent research works on disk failure prediction achieve a high detection rate and a low false alarm rate with complex models at the cost of explainability. The lack of explainability is likely to hide bias or overfitting in the models, resulting in bad performance in real-world applications. To address the problem, we propose a new explanation method DFPE designed for disk failure prediction to explain failure predictions made by a model and infer prediction rules learned by a model. DFPE explains failure predictions by performing a series of replacement tests to find out the failure causes while it explains models by aggregating explanations for the failure predictions. A presented use case on a real-world dataset shows that compared to current explanation methods, DFPE can explain more about failure predictions and models with more accuracy. Thus it helps to target and handle the hidden bias and overfitting, measures feature importances from a new perspective and enables intelligent failure handling.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127954198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability (MSST 2019, doi:10.1109/MSST.2019.000-2)
Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang
This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, leading to temporary but significant speed degradation (especially in read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., reconstructing data sectors via system-level redundancy only after costly intra-HDD read retries fail. This paper shows that one could mitigate occasional HDD fail-slow much more effectively by pro-actively utilizing existing system-level data redundancy, as a complement to (or even a replacement for) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability over their internal read retry operations. Assuming a very simple form of enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such a pro-active strategy. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.
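A sketch of the pro-active policy under simple assumptions (RAID-5 XOR reconstruction and a fixed latency deadline as the fail-slow signal; the paper's enhanced controllability/observability interface is not modeled): if the direct read outlives the deadline, the unit is rebuilt from the stripe's other units and parity instead of waiting out intra-HDD retries.

```python
import concurrent.futures as futures

def xor_blocks(blocks):
    """XOR equal-sized blocks: RAID-5 style reconstruction primitive."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, v in enumerate(b):
            out[i] ^= v
    return bytes(out)

def read_stripe_unit(read_fn, disk, addr, peers, deadline_s=0.05):
    """Pro-active fail-slow mitigation sketch: race the direct read against a
    deadline; on timeout, fall back to a degraded read via redundancy."""
    pool = futures.ThreadPoolExecutor(max_workers=1)
    direct = pool.submit(read_fn, disk, addr)
    try:
        return direct.result(timeout=deadline_s)          # fast path
    except futures.TimeoutError:
        # Degraded read: XOR the remaining data units and parity from peers.
        return xor_blocks([read_fn(p, addr) for p in peers])
    finally:
        pool.shutdown(wait=False)                         # don't block on the laggard
```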
{"title":"Mitigate HDD Fail-Slow by Pro-actively Utilizing System-level Data Redundancy with Enhanced HDD Controllability and Observability","authors":"Jingpeng Hao, Yin Li, Xubin Chen, Tong Zhang","doi":"10.1109/MSST.2019.000-2","DOIUrl":"https://doi.org/10.1109/MSST.2019.000-2","url":null,"abstract":"This paper presents a design framework aiming to mitigate occasional HDD fail-slow. Due to their mechanical nature, HDDs may occasionally suffer from spikes of abnormally high internal read retry rates, leading to temporarily significant degradation of speed (especially the read latency). Intuitively, one could expect that existing system-level data redundancy (e.g., RAID or distributed erasure coding) may be opportunistically utilized to mitigate HDD fail-slow. Nevertheless, current practice tends to use system-level redundancy merely as a safety net, i.e., reconstruct data sectors via system-level redundancy only after the costly intra-HDD read retry fails. This paper shows that one could much more effectively mitigate occasional HDD fail-slow by more pro-actively utilizing existing system-level data redundancy, in complement to (or even replacement of) intra-HDD read retry. To enable this, HDDs should support a higher degree of controllability and observability in terms of their internal read retry operations. Assuming a very simple form enhanced HDD controllability and observability, this paper presents design solutions and a mathematical formulation framework to facilitate the practical implementation of such pro-active strategy for mitigating occasional HDD fail-slow. Using RAID as a test vehicle, our experimental results show that the proposed design solutions can effectively mitigate the RAID read latency degradation even when HDDs suffer from read retry rates as high as 1% or 2%.","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114537831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}