On Data Parallelism of Erasure Coding in Distributed Storage Systems

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). Pub Date: 2017-06-05. DOI: 10.1109/ICDCS.2017.191

Jun Li, Baochun Li

Citations: 9

Abstract

Deployed in various distributed storage systems, erasure coding has demonstrated its advantages of low storage overhead and high failure tolerance. Erasure-coded distributed storage systems typically choose systematic maximum distance separable (MDS) codes, since they achieve the optimal storage overhead while still allowing data to be read directly, without decoding operations. However, the data parallelism of existing MDS codes is limited, because data can be read in parallel without decoding only from certain specific servers. In this paper, we propose Carousel codes, designed to allow data to be read from an arbitrary number of servers in parallel without decoding, while preserving the optimal storage overhead of MDS codes. Furthermore, Carousel codes can achieve the optimal network traffic to reconstruct an unavailable block. We have implemented a prototype of Carousel codes on Apache Hadoop. Our experimental results demonstrate that Carousel codes can make MapReduce jobs finish in almost 50% less time and significantly reduce data access latency, with comparable throughput in encoding and decoding operations and no additional sacrifice of failure tolerance or of the network overhead to reconstruct unavailable data.
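The limitation described in the abstract can be illustrated with a minimal sketch. It uses a toy single-parity XOR code with k = 2 data blocks and n = 3 servers as a stand-in for a systematic MDS code; it is not the Carousel construction, and all function names are hypothetical. The sketch shows that original data can be read without decoding only from the k data-holding servers, while the parity server contributes only when a block has to be reconstructed.

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Systematic (k+1, k) code: keep the k data blocks and append one XOR parity block.

    Assumption: a single-parity code is used only because it is the simplest
    systematic MDS code; it is not the code proposed in the paper.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stored, lost):
    """Rebuild the block at index `lost` by XOR-ing the k surviving blocks."""
    return xor_blocks([b for i, b in enumerate(stored) if i != lost])


def servers_readable_without_decoding(k):
    """In this systematic layout only the first k servers hold original data."""
    return list(range(k))


data = [b"abcd", b"efgh"]      # k = 2 original data blocks
stored = encode(data)          # n = 3 blocks, one per server

# Direct (decoding-free) reads are limited to the k data servers.
parallel_readers = servers_readable_without_decoding(len(data))

# Losing any single block is tolerated: the other k blocks rebuild it.
recovered = reconstruct(stored, lost=0)
```

In this toy layout, a job that wants to read original data in parallel can use at most 2 of the 3 servers, even though all 3 store an equal share of the encoded data; this is the gap the abstract says Carousel codes are designed to close.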
