A Performance Study of Lustre File System Checker: Bottlenecks and Potentials
Dong Dai, Om Rameshwar Gatla, Mai Zheng
DOI: 10.1109/MSST.2019.00-20

Lustre, one of the most popular parallel file systems in high-performance computing (HPC), provides a POSIX interface and maintains a large set of POSIX-related metadata, which can be corrupted by hardware failures, software bugs, configuration errors, etc. The Lustre file system checker (LFSCK) is the remedy tool that detects metadata inconsistencies and restores a corrupted Lustre to a valid state, and is therefore critical for reliable HPC. Unfortunately, in practice, LFSCK runs slowly on large deployments, making system administrators reluctant to use it as a routine maintenance tool. Consequently, cascading errors may lead to unrecoverable failures, resulting in significant downtime or even data loss. Given that HPC is rapidly marching toward exascale and much larger Lustre file systems are being deployed, it is critical to understand the performance of LFSCK. In this paper, we study the performance of LFSCK to identify its bottlenecks and analyze its performance potential. Specifically, we design an aging method based on real-world HPC workloads to age Lustre to representative states, and then systematically evaluate and analyze how LFSCK runs on such an aged Lustre by monitoring the utilization of various resources. Our experiments show that the design and implementation of LFSCK are sub-optimal: it suffers from a scalability bottleneck on the metadata server (MDS), a relatively high fan-out ratio in network utilization, and unnecessary blocking among internal components. Based on these observations, we discuss potential optimizations and present some preliminary results.
SES-Dedup: a Case for Low-Cost ECC-based SSD Deduplication
Zhichao Yan, Hong Jiang, Song Jiang, Yujuan Tan, Hao Luo
DOI: 10.1109/MSST.2019.00009

Integrating a data deduplication function into Solid State Drives (SSDs) helps avoid writing duplicate content to NAND flash chips, which not only effectively reduces the number of Program/Erase (P/E) operations, extending the device's lifespan, but also proportionally enlarges the effective capacity of the SSD, improving the performance of its behind-the-scenes maintenance tasks such as wear-leveling (WL) and garbage collection (GC). However, these benefits of deduplication come at a non-trivial computational cost incurred by the embedded SSD controller to compute cryptographic hashes. To address this overhead, some researchers have suggested replacing cryptographic hashes with the error correction codes (ECCs) already embedded in SSD chips to detect duplicate content. However, all existing attempts have ignored the impact of the data randomization (scrambler) module that is widely used in modern SSDs, making it impractical to directly integrate ECC-based deduplication into commercial SSDs. In this work, we revisit the SSD's internal structure and propose the first deduplication-capable SSD that can bypass the data scrambler module to enable low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. With our approach, the SSD's built-in ECC module can be effectively exploited to calculate the hash values of stored data for deduplication. We have evaluated our SES-Dedup approach by replaying data traces in an SSD simulator and found that it can remove up to 30.8% of redundant data with up to 17.0% write performance improvement over the baseline SSD.
FastBuild: Accelerating Docker Image Building for Efficient Development and Deployment of Container
Zhuo Huang, Song Wu, Song Jiang, Hai Jin
DOI: 10.1109/MSST.2019.00-18

Docker containers have been increasingly adopted on various computing platforms to provide a lightweight virtualized execution environment. Compared to virtual machines, this technology can often reduce the launch time from a few minutes to less than 10 seconds, assuming the Docker image is locally available. However, Docker images are highly customizable and are mostly built at runtime from a remote base image by running instructions in a script (the Dockerfile). During instruction execution, a large number of input files may have to be retrieved via the Internet. Image building may be an iterative process, as one may need to repeatedly modify the Dockerfile until the desired image composition is reached. In this process, every input file required by an instruction has to be remotely retrieved, even if it has been recently downloaded. This can make building an image and launching a container unexpectedly slow. To address the issue, we propose a technique, named FastBuild, that maintains a local file cache to minimize expensive file downloading. By non-intrusively intercepting remote file requests and supplying files locally, FastBuild enables file caching in a manner transparent to image building. To further accelerate image building, FastBuild overlaps instruction execution with writing intermediate image layers to disk. We have implemented FastBuild, and experiments with images and Dockerfiles obtained from Docker Hub show that the system can improve building speed by up to 10 times and reduce downloaded data by 72%.
{"title":"Title Page iii","authors":"","doi":"10.1109/msst.2019.00002","DOIUrl":"https://doi.org/10.1109/msst.2019.00002","url":null,"abstract":"","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130958417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection
Jingwei Li, P. Lee, Yanjing Ren, Xiaosong Zhang
DOI: 10.1109/MSST.2019.00007

Encrypted deduplication combines encryption and deduplication in a seamless way to provide confidentiality guarantees for the physical data in deduplication storage, yet it incurs substantial metadata storage overhead due to the additional storage of keys. We present a new encrypted deduplication storage system called Metadedup, which suppresses metadata storage by also applying deduplication to metadata. Its idea builds on indirection, which adds another level of metadata chunks that record metadata information. We find that metadata chunks are highly redundant in real-world workloads and hence can be effectively deduplicated. In addition, metadata chunks can be protected under the same encrypted deduplication framework, thereby providing confidentiality guarantees for metadata as well. We evaluate Metadedup through microbenchmarks, prototype experiments, and trace-driven simulation. Metadedup has limited computational overhead in metadata processing, and adds only 6.19% performance overhead on average when storing files in a networked setting. For real-world backup workloads, Metadedup reduces metadata storage by up to 97.46% at the cost of at most 1.07% indexing overhead for metadata chunks.
BFO: Batch-File Operations on Massive Files for Consistent Performance Improvement
Yang Yang, Q. Cao, Hong Jiang, Li Yang, Jie Yao, Yuanyuan Dong, Puyuan Yang
DOI: 10.1109/MSST.2019.00-17

Existing local file systems, designed to support a typical single-file access pattern only, can deliver poor performance when accessing a batch of files, especially small files. This single-file pattern essentially serializes accesses to batched files one by one, resulting in a large number of non-sequential, random, and often dependent I/Os between file data and metadata at the storage end. We first experimentally analyze the root cause of this inefficiency in batch-file accesses. We then propose a novel batch-file access approach, referred to as BFO for its set of optimized Batch-File Operations, by developing novel BFOr and BFOw operations for the fundamental read and write processes, respectively, using a two-phase access for metadata and data jointly. BFO offers dedicated interfaces for batch-file accesses, and its additional processes integrate into existing file systems without modifying their structures and procedures. We implement a BFO prototype on ext4, one of the most popular file systems. Our evaluation results show that the batch-file read and write performance of BFO is consistently higher than that of the traditional approaches regardless of access patterns, data layouts, and storage media, with both synthetic and real-world file sets. BFO improves read performance by up to 22.4× and 1.8× on HDD and SSD, respectively, and boosts write performance by up to 111.4× and 2.9× on HDD and SSD, respectively. BFO also demonstrates consistent performance advantages when applied to four representative applications: Linux cp, Tar, GridFTP, and Hadoop.
Scalable QoS for Distributed Storage Clusters using Dynamic Token Allocation
Yuhan Peng, Qingyue Liu, P. Varman
DOI: 10.1109/MSST.2019.00-19

This paper addresses the problem of providing performance QoS guarantees in a clustered storage system. Multiple related storage objects are grouped into logical containers called buckets, which are distributed over the servers based on the placement policies of the storage system. QoS is provided at the level of buckets. The service credited to a bucket is the aggregate of the IOs received by its objects at all the servers, and this service depends on individual time-varying demands and congestion at the servers. We present a token-based, coarse-grained approach to providing IO reservations and limits to buckets. We propose pShift, a novel token allocation algorithm that works in conjunction with token-sensitive scheduling at each server to control the aggregate IOs received by each bucket across multiple servers. pShift determines the optimal token distribution based on the estimated bucket demands and server IOPS capacities. Compared to existing approaches, pShift has far lower overhead and can be accelerated using parallelization and approximation. Our experimental results show that pShift provides accurate QoS among buckets with different access patterns and handles runtime demand changes well.
Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart
Jialing Zhang, Xiaoyan Zhuo, Aekyeung Moon, Hang Liu, S. Son
DOI: 10.1109/MSST.2019.00-14

As the amount of data produced by HPC applications reaches the exabyte range, compression techniques are often adopted to reduce checkpoint time and volume. Since lossless techniques are limited in their ability to achieve appreciable data reduction, lossy compression becomes a preferable option. In this work, we propose a lossy compression technique with highly efficient encoding, purpose-built error control, and high compression ratios. Specifically, we apply a discrete cosine transform with a novel block decomposition strategy directly to double-precision floating-point datasets, instead of the prevailing prediction-based techniques. Further, we design an adaptive quantization with two task-oriented quantizers: one guaranteeing error bounds and one targeting higher compression ratios. Using real-world HPC datasets, our approach achieves 3x-38x compression ratios while guaranteeing specified error bounds, showing performance comparable to the state-of-the-art lossy compression methods SZ and ZFP. Moreover, our method provides viable reconstructed data for various checkpoint/restart scenarios in the FLASH application, and is thus a promising approach for lossy data compression in HPC I/O software stacks.
Parity-Only Caching for Robust Straggler Tolerance
Mi Zhang, Qiuping Wang, Zhirong Shen, P. Lee
DOI: 10.1109/MSST.2019.00006

Stragglers (i.e., nodes with slow performance) are prevalent and cause performance instability in large-scale storage systems, yet detecting stragglers in practice is challenging. We make a case by showing how erasure-coded caching provides robust straggler tolerance without relying on timely and accurate straggler detection, while incurring limited redundancy overhead in caching. We first analytically motivate that caching only parity blocks can achieve effective straggler tolerance. To this end, we present POCache, a parity-only caching design that provides robust straggler tolerance. To limit the erasure coding overhead, POCache slices blocks into smaller subblocks and parallelizes the coding operations at the subblock level. It also leverages a straggler-aware cache algorithm that takes into account both file access popularity and straggler estimation to decide which parity blocks should be cached. We implement a POCache prototype atop Hadoop 3.1 HDFS, while preserving the performance and functionality of normal HDFS operations. Our extensive experiments on both local and Amazon EC2 clusters show that, in the presence of stragglers, POCache can reduce read latency by up to 87.9% compared to vanilla HDFS.