Qi Yu, Lin Wang, Yuchong Hu, Yumeng Xu, D. Feng, Jie Fu, Xia Zhu, Zhen Yao, Wenjia Wei
{"title":"Boosting Multi-Block Repair in Cloud Storage Systems with Wide-Stripe Erasure Coding","authors":"Qi Yu, Lin Wang, Yuchong Hu, Yumeng Xu, D. Feng, Jie Fu, Xia Zhu, Zhen Yao, Wenjia Wei","doi":"10.1109/IPDPS54959.2023.00036","DOIUrl":null,"url":null,"abstract":"Cloud storage systems have commonly used erasure coding that encodes data in stripes of blocks as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding that increases the stripe size has been recently proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems.However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to one single new node to repair all failed blocks in a centralized way, which may cause the new node to be the bottleneck; recent multi-block repair methods follow pipelined single-block repair methods and the former are simply built on the latter in an independent way, which may cause the surviving nodes with limited bandwidth to be bottlenecks.In this paper, we first analyze the effects of both centralized and independent ways on the multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to tradeoff the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing the multi-block repair performance. We further extend HMBR for hierarchical network topology and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.","PeriodicalId":343684,"journal":{"name":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS54959.2023.00036","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Cloud storage systems have commonly used erasure coding that encodes data in stripes of blocks as a low-cost redundancy method for data reliability. Relative to traditional erasure coding, wide-stripe erasure coding that increases the stripe size has been recently proposed and explored to achieve lower redundancy. We observe that wide-stripe erasure coding makes multi-block failures occur much more frequently than traditional erasure coding in cloud storage systems.However, how to efficiently repair multiple blocks in wide-stripe erasure-coded storage systems remains unexplored. The conventional multi-block repair method sends available blocks from surviving nodes to one single new node to repair all failed blocks in a centralized way, which may cause the new node to be the bottleneck; recent multi-block repair methods follow pipelined single-block repair methods and the former are simply built on the latter in an independent way, which may cause the surviving nodes with limited bandwidth to be bottlenecks.In this paper, we first analyze the effects of both centralized and independent ways on the multi-block repair and then propose HMBR, a hybrid multi-block repair mechanism that combines centralized and independent multi-block repairs to tradeoff the bandwidth bottlenecks caused by the new and surviving nodes, thus optimizing the multi-block repair performance. We further extend HMBR for hierarchical network topology and multi-node failures. We prototype HMBR and show via Amazon EC2 that the repair time of a multi-block failure can be reduced by up to 64.8% over state-of-the-art schemes.