A Performance Study of Lustre File System Checker: Bottlenecks and Potentials
Dong Dai, Om Rameshwar Gatla, Mai Zheng
DOI: 10.1109/MSST.2019.00-20

Lustre, one of the most popular parallel file systems in high-performance computing (HPC), provides a POSIX interface and maintains a large set of POSIX-related metadata, which can be corrupted by hardware failures, software bugs, configuration errors, etc. The Lustre file system checker (LFSCK) is the remedy tool that detects metadata inconsistencies and restores a corrupted Lustre to a valid state, and is therefore critical for reliable HPC. Unfortunately, in practice, LFSCK runs slowly on large deployments, making system administrators reluctant to use it as a routine maintenance tool. Consequently, cascading errors may lead to unrecoverable failures, resulting in significant downtime or even data loss. Given that HPC is rapidly marching toward exascale and much larger Lustre file systems are being deployed, it is critical to understand the performance of LFSCK. In this paper, we study the performance of LFSCK to identify its bottlenecks and analyze its performance potential. Specifically, we design an aging method based on real-world HPC workloads to age Lustre to representative states, and then systematically evaluate and analyze how LFSCK runs on such an aged Lustre by monitoring the utilization of various resources. Our experiments show that the design and implementation of LFSCK are sub-optimal: it suffers from a scalability bottleneck on the metadata server (MDS), a relatively high fan-out ratio in network utilization, and unnecessary blocking among internal components. Based on these observations, we discuss potential optimizations and present some preliminary results.
SES-Dedup: a Case for Low-Cost ECC-based SSD Deduplication
Zhichao Yan, Hong Jiang, Song Jiang, Yujuan Tan, Hao Luo
DOI: 10.1109/MSST.2019.00009

Integrating a data deduplication function into Solid State Drives (SSDs) helps avoid writing duplicate content to NAND flash chips, which not only effectively reduces the number of Program/Erase (P/E) operations, extending the device's lifespan, but also proportionally enlarges the effective capacity of the SSD, improving the performance of its behind-the-scenes maintenance tasks such as wear-leveling (WL) and garbage collection (GC). However, these benefits of deduplication come at a non-trivial computational cost incurred by the embedded SSD controller to compute cryptographic hashes. To address this overhead, some researchers have suggested replacing cryptographic hashes with the error correction codes (ECCs) already embedded in SSD chips to detect duplicate content. However, all existing attempts have ignored the impact of the data randomization (scrambler) module that is widely used in modern SSDs, making it impractical to directly integrate ECC-based deduplication into commercial SSDs. In this work, we revisit the SSD's internal structure and propose the first deduplication-capable SSD that can bypass the data scrambler module to enable low-cost ECC-based data deduplication. Specifically, we propose two design solutions, one on the host side and the other on the device side, to enable ECC-based deduplication. With our approach, the SSD's built-in ECC module can be effectively exploited to calculate the hash values of stored data for deduplication. We have evaluated our SES-Dedup approach by replaying data traces in an SSD simulator and found that it can remove up to 30.8% of redundant data with up to 17.0% write performance improvement over the baseline SSD.
FastBuild: Accelerating Docker Image Building for Efficient Development and Deployment of Container
Zhuo Huang, Song Wu, Song Jiang, Hai Jin
DOI: 10.1109/MSST.2019.00-18

Docker containers have been increasingly adopted on various computing platforms to provide a lightweight virtualized execution environment. Compared to virtual machines, this technology can often reduce the launch time from a few minutes to less than 10 seconds, assuming the Docker image is locally available. However, Docker images are highly customizable and are mostly built at runtime from a remote base image by running instructions in a script (the Dockerfile). During instruction execution, a large number of input files may have to be retrieved via the Internet. Image building may be an iterative process, as one may need to repeatedly modify the Dockerfile until the desired image composition is reached. In this process, every input file required by an instruction has to be remotely retrieved, even if it has been recently downloaded. This can make building an image and launching a container unexpectedly slow. To address the issue, we propose a technique, named FastBuild, that maintains a local file cache to minimize expensive file downloading. By non-intrusively intercepting remote file requests and supplying files locally, FastBuild enables file caching in a manner transparent to image building. To further accelerate image building, FastBuild overlaps instruction execution with writing intermediate image layers to disk. We have implemented FastBuild, and experiments with images and Dockerfiles obtained from Docker Hub show that the system can improve building speed by up to 10 times and reduce downloaded data by 72%.
{"title":"Title Page iii","authors":"","doi":"10.1109/msst.2019.00002","DOIUrl":"https://doi.org/10.1109/msst.2019.00002","url":null,"abstract":"","PeriodicalId":391517,"journal":{"name":"2019 35th Symposium on Mass Storage Systems and Technologies (MSST)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130958417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Metadedup: Deduplicating Metadata in Encrypted Deduplication via Indirection
Jingwei Li, P. Lee, Yanjing Ren, Xiaosong Zhang
DOI: 10.1109/MSST.2019.00007

Encrypted deduplication combines encryption and deduplication in a seamless way to provide confidentiality guarantees for the physical data in deduplication storage, yet it incurs substantial metadata storage overhead due to the additional storage of keys. We present a new encrypted deduplication storage system called Metadedup, which suppresses metadata storage by also applying deduplication to metadata. Its idea builds on indirection, which adds another level of metadata chunks that record metadata information. We find that metadata chunks are highly redundant in real-world workloads and hence can be effectively deduplicated. In addition, metadata chunks can be protected under the same encrypted deduplication framework, thereby providing confidentiality guarantees for metadata as well. We evaluate Metadedup through microbenchmarks, prototype experiments, and trace-driven simulation. Metadedup has limited computational overhead in metadata processing, and adds only 6.19% performance overhead on average when storing files in a networked setting. For real-world backup workloads, Metadedup reduces metadata storage by up to 97.46% at the cost of at most 1.07% indexing overhead for metadata chunks.
BFO: Batch-File Operations on Massive Files for Consistent Performance Improvement
Yang Yang, Q. Cao, Hong Jiang, Li Yang, Jie Yao, Yuanyuan Dong, Puyuan Yang
DOI: 10.1109/MSST.2019.00-17

Existing local file systems, designed to support a typical single-file access pattern only, can deliver poor performance when accessing a batch of files, especially small files. This single-file pattern essentially serializes accesses to batched files one by one, resulting in a large number of non-sequential, random, and often dependent I/Os between file data and metadata at the storage end. We first experimentally analyze the root cause of this inefficiency in batch-file accesses. We then propose a novel batch-file access approach, referred to as BFO for its set of optimized Batch-File Operations, by developing novel BFOr and BFOw operations for the fundamental read and write processes, respectively, using a two-phase access for metadata and data jointly. BFO offers dedicated interfaces for batch-file accesses, and its additional processes integrate into existing file systems without modifying their structures and procedures. We implement a BFO prototype on ext4, one of the most popular file systems. Our evaluation results show that the batch-file read and write performance of BFO is consistently higher than that of the traditional approaches regardless of access patterns, data layouts, and storage media, with both synthetic and real-world file sets. BFO improves read performance by up to 22.4× and 1.8× on HDD and SSD, respectively, and boosts write performance by up to 111.4× and 2.9× on HDD and SSD, respectively. BFO also demonstrates consistent performance advantages when applied to four representative applications: Linux cp, Tar, GridFTP, and Hadoop.
Scalable QoS for Distributed Storage Clusters using Dynamic Token Allocation
Yuhan Peng, Qingyue Liu, P. Varman
DOI: 10.1109/MSST.2019.00-19

This paper addresses the problem of providing performance QoS guarantees in a clustered storage system. Multiple related storage objects are grouped into logical containers called buckets, which are distributed over the servers based on the placement policies of the storage system. QoS is provided at the level of buckets. The service credited to a bucket is the aggregate of the IOs received by its objects at all the servers, and this service depends on individual time-varying demands and congestion at the servers. We present a token-based, coarse-grained approach to providing IO reservations and limits to buckets. We propose pShift, a novel token allocation algorithm that works in conjunction with token-sensitive scheduling at each server to control the aggregate IOs received by each bucket across multiple servers. pShift determines the optimal token distribution based on the estimated bucket demands and server IOPS capacities. Compared to existing approaches, pShift has far lower overhead and can be accelerated using parallelization and approximation. Our experimental results show that pShift provides accurate QoS among buckets with different access patterns and handles runtime demand changes well.
Efficient Encoding and Reconstruction of HPC Datasets for Checkpoint/Restart
Jialing Zhang, Xiaoyan Zhuo, Aekyeung Moon, Hang Liu, S. Son
DOI: 10.1109/MSST.2019.00-14

As the amount of data produced by HPC applications reaches the exabyte range, compression techniques are often adopted to reduce checkpoint time and volume. Since lossless techniques are limited in their ability to achieve appreciable data reduction, lossy compression becomes a preferable option. In this work, we propose a lossy compression technique with highly efficient encoding, purpose-built error control, and high compression ratios. Specifically, we apply a discrete cosine transform with a novel block decomposition strategy directly to double-precision floating-point datasets, instead of the prevailing prediction-based techniques. Further, we design an adaptive quantization with two task-oriented quantizers: one guaranteeing error bounds and one targeting higher compression ratios. Using real-world HPC datasets, our approach achieves 3x-38x compression ratios while guaranteeing specified error bounds, showing performance comparable to the state-of-the-art lossy compression methods SZ and ZFP. Moreover, our method provides viable reconstructed data for various checkpoint/restart scenarios in the FLASH application, and is thus a promising approach for lossy data compression in HPC I/O software stacks.
Parity-Only Caching for Robust Straggler Tolerance
Mi Zhang, Qiuping Wang, Zhirong Shen, P. Lee
DOI: 10.1109/MSST.2019.00006

Stragglers (i.e., nodes with slow performance) are prevalent and cause performance instability in large-scale storage systems, yet detecting stragglers in practice is challenging. We make a case by showing how erasure-coded caching provides robust straggler tolerance without relying on timely and accurate straggler detection, while incurring limited redundancy overhead in caching. We first analytically motivate that caching only parity blocks can achieve effective straggler tolerance. To this end, we present POCache, a parity-only caching design that provides robust straggler tolerance. To limit the erasure coding overhead, POCache slices blocks into smaller subblocks and parallelizes the coding operations at the subblock level. It also leverages a straggler-aware cache algorithm that takes into account both file access popularity and straggler estimation to decide which parity blocks should be cached. We implement a POCache prototype atop Hadoop 3.1 HDFS, while preserving the performance and functionality of normal HDFS operations. Our extensive experiments on both local and Amazon EC2 clusters show that, in the presence of stragglers, POCache can reduce read latency by up to 87.9% compared to vanilla HDFS.